architxt.nlp.utils
Functions
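| split_entities(entities, sentences) | Split a list of Entity objects based on their occurrence in different sentences. |
| split_relations(relations, entities) | Split relations into sentence-specific relationships. |
| split_sentences(text) | Replace non-ASCII characters with ASCII equivalents and split the input text into sentences at line breaks. |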
- architxt.nlp.utils.split_entities(entities, sentences)
Split a list of Entity objects based on their occurrence in different sentences.
Entities are assigned to sentences based on their start and end positions. The function returns a generator of lists, where each list contains the entities corresponding to a specific sentence, with the entity positions adjusted to be relative to the sentence.
- Parameters:
- Yield:
A list of Entity objects for each sentence, with entity positions relative to that sentence.
>>> e1 = Entity(name="Entity1", start=0, end=5, id="E1")
>>> e2 = Entity(name="Entity2", start=6, end=15, id="E2")
>>> e3 = Entity(name="Entity3", start=21, end=25, id="E3")
>>> result = list(split_entities([e1, e2, e3], ["Hello world.", "This is a test."]))
>>> len(result)
2
>>> len(result[0])
1
>>> len(result[1])
2
>>> result[0][0].name == "Entity1"
True
>>> result[1][0].name == "Entity2"
True
>>> result[1][1].name == "Entity3"
True
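The offset bookkeeping described above can be illustrated with a minimal, standalone sketch. It does not use architxt and is not the library's implementation: `Span` and `split_spans` are hypothetical names, and the sketch assumes that sentences are separated by exactly one character in the original text and that a span belongs to a sentence only when it lies entirely inside it.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an entity: a name plus character offsets."""

    name: str
    start: int
    end: int


def split_spans(
    spans: Iterable[Span], sentences: list[str], sep_len: int = 1
) -> Iterator[list[Span]]:
    """Yield, for each sentence, the spans inside it with sentence-relative offsets.

    Assumption: sentences are joined by `sep_len` characters in the source text.
    """
    ordered = sorted(spans, key=lambda s: s.start)
    offset = 0  # absolute position where the current sentence starts
    for sentence in sentences:
        sentence_end = offset + len(sentence)
        yield [
            # Shift absolute offsets so they index into the sentence itself.
            replace(s, start=s.start - offset, end=s.end - offset)
            for s in ordered
            if offset <= s.start and s.end <= sentence_end
        ]
        offset = sentence_end + sep_len


sentences = ["Hello world.", "This is a test."]
spans = [Span("greeting", 0, 5), Span("object", 6, 11), Span("noun", 23, 27)]
result = list(split_spans(spans, sentences))
```

After the split, each span's offsets can be used to slice its own sentence directly, for example `sentences[1][result[1][0].start:result[1][0].end]` recovers the text covered by the third span.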
- architxt.nlp.utils.split_relations(relations, entities)
Split relations into sentence-specific relationships.
The function maps each relation's entity IDs to their indices within the corresponding sentence's entity list.
- Parameters:
- Return type:
- Returns:
A list of lists. Each inner list corresponds to a sentence and contains Relation objects for that sentence.
>>> e1 = Entity(name="Entity1", start=0, end=1, id="E1")
>>> e2 = Entity(name="Entity2", start=2, end=3, id="E2")
>>> e3 = Entity(name="Entity3", start=4, end=5, id="E3")
>>> e4 = Entity(name="Entity4", start=6, end=7, id="E4")
>>> r1 = Relation(src="E1", dst="E2", name="relates_to")
>>> r2 = Relation(src="E3", dst="E4", name="belongs_to")
>>> result = split_relations([r1, r2], [[e1, e2], [e3, e4]])
>>> len(result)
2
>>> result[0][0] == r1
True
>>> result[1][0] == r2
True
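One way to picture the grouping is a lookup from entity ID to sentence index, sketched below without architxt. `Ent`, `Rel`, and `group_relations` are hypothetical names, and dropping relations whose endpoints fall in different sentences is an assumption of this sketch, not documented behaviour of `split_relations`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ent:
    """Hypothetical entity carrying only the ID needed for the lookup."""

    id: str


@dataclass(frozen=True)
class Rel:
    """Hypothetical relation between two entity IDs."""

    src: str
    dst: str
    name: str


def group_relations(
    relations: list[Rel], entities_per_sentence: list[list[Ent]]
) -> list[list[Rel]]:
    """Return one list of relations per sentence."""
    # Map every entity ID to the index of the sentence that contains it.
    sentence_of = {
        ent.id: index
        for index, ents in enumerate(entities_per_sentence)
        for ent in ents
    }
    grouped: list[list[Rel]] = [[] for _ in entities_per_sentence]
    for rel in relations:
        src_sentence = sentence_of.get(rel.src)
        # Assumption: keep a relation only if both ends are in the same sentence.
        if src_sentence is not None and src_sentence == sentence_of.get(rel.dst):
            grouped[src_sentence].append(rel)
    return grouped


e1, e2, e3, e4 = Ent("E1"), Ent("E2"), Ent("E3"), Ent("E4")
r1 = Rel(src="E1", dst="E2", name="relates_to")
r2 = Rel(src="E3", dst="E4", name="belongs_to")
r3 = Rel(src="E1", dst="E3", name="crosses_sentences")
grouped = group_relations([r1, r2, r3], [[e1, e2], [e3, e4]])
```

Here `r3` links entities from two different sentences, so the sketch leaves it out of both groups.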
- architxt.nlp.utils.split_sentences(text)
Replace non-ASCII characters with ASCII equivalents and split the input text into sentences at line breaks.
Files in the brat annotation format commonly contain one sentence per line, so each line is treated as a sentence.
- Parameters:
text (str) – The input text to be split into sentences.
- Return type:
- Returns:
A list of sentences, one per line of the input, with non-ASCII characters replaced by ASCII equivalents.
>>> split_sentences("This is à test\nAnothér-test here")
['This is a test', 'Another-test here']
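A comparable result can be obtained with the standard library alone, as sketched below. This is an illustration, not architxt's implementation: `ascii_lines` is a hypothetical name, the sketch assumes that stripping combining marks after NFKD normalisation is an acceptable transliteration, and it skips blank lines. Characters with no ASCII decomposition are dropped entirely, which may differ from the library's behaviour.

```python
import unicodedata


def ascii_lines(text: str) -> list[str]:
    """Transliterate accented characters to ASCII and split on line breaks."""
    # NFKD separates base letters from their combining accents.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" discards the combining marks.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Assumption: blank lines carry no sentence and are skipped.
    return [line.strip() for line in ascii_text.splitlines() if line.strip()]


lines = ascii_lines("This is à test\nAnothér-test here")
```

For the input used in the doctest above, `lines` is `['This is a test', 'Another-test here']`.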