architxt.nlp.utils

Functions

split_entities(entities, sentences)

Split a list of Entity objects based on their occurrence in different sentences.

split_relations(relations, entities)

Split relations into sentence-specific relationships.

split_sentences(text)

Convert non-ASCII characters to ASCII and split the input text into sentences on line breaks.

architxt.nlp.utils.split_entities(entities, sentences)

Split a list of Entity objects based on their occurrence in different sentences.

Entities are assigned to sentences based on their start and end positions. The function returns a generator of lists, where each list contains the entities corresponding to a specific sentence, with the entity positions adjusted to be relative to the sentence.

Parameters:
  • entities (Iterable[Entity]) – An iterable of Entity objects, each representing a named entity with start and end positions relative to the entire text.

  • sentences (Sequence[str]) – A sequence of sentences corresponding to the text from which the entities are extracted.

Yields:

A list of Entity objects for each sentence, with entity positions relative to that sentence.

>>> e1 = Entity(name="Entity1", start=0, end=5, id="E1")
>>> e2 = Entity(name="Entity2", start=6, end=15, id="E2")
>>> e3 = Entity(name="Entity3", start=21, end=25, id="E3")
>>> result = list(split_entities([e1, e2, e3], ["Hello world.", "This is a test."]))
>>> len(result)
2
>>> len(result[0])
1
>>> len(result[1])
2
>>> result[0][0].name == "Entity1"
True
>>> result[1][0].name == "Entity2"
True
>>> result[1][1].name == "Entity3"
True
Return type:

Generator[list[Entity], None, None]
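
The offset adjustment described above can be illustrated with a minimal sketch. It is not the library's implementation: `Span` and `rebase_spans` are hypothetical names, and the sketch assumes that sentences are joined by a single separator character (such as a newline) and that no span straddles two sentences.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, Sequence


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an entity: a label plus character offsets."""
    name: str
    start: int
    end: int


def rebase_spans(
    spans: Iterable[Span],
    sentences: Sequence[str],
    sep_len: int = 1,  # assumption: one separator character between sentences
) -> Iterator[list[Span]]:
    """Yield, for each sentence, its spans with sentence-relative offsets."""
    ordered = sorted(spans, key=lambda s: s.start)
    i = 0
    offset = 0  # position of the current sentence within the full text
    for sentence in sentences:
        limit = offset + len(sentence)
        current = []
        # Consume every span that ends inside the current sentence.
        while i < len(ordered) and ordered[i].end <= limit:
            span = ordered[i]
            current.append(
                replace(span, start=span.start - offset, end=span.end - offset)
            )
            i += 1
        yield current
        offset = limit + sep_len


sentences = ["Hello world.", "This is a test."]
spans = [Span("greeting", 0, 5), Span("planet", 6, 11), Span("noun", 23, 27)]
per_sentence = list(rebase_spans(spans, sentences))
print(per_sentence[1])
# [Span(name='noun', start=10, end=14)]
```

With sentence-relative offsets, slicing a sentence (`sentences[1][10:14]`) gives the same text as slicing the full document with the original offsets.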

architxt.nlp.utils.split_relations(relations, entities)

Split relations into sentence-specific relationships.

The entity IDs referenced by each relation are mapped to the indices of those entities within the corresponding sentence's entity list.

Parameters:
  • relations (Iterable[Relation]) – An iterable of Relation objects.

  • entities (Sequence[Sequence[Entity]]) – A sequence of sequences, where each inner sequence contains Entity objects corresponding to entities in a sentence.

Return type:

list[list[Relation]]

Returns:

A list of lists. Each inner list corresponds to a sentence and contains Relation objects for that sentence.

>>> e1 = Entity(name="Entity1", start=0, end=1, id="E1")
>>> e2 = Entity(name="Entity2", start=2, end=3, id="E2")
>>> e3 = Entity(name="Entity3", start=4, end=5, id="E3")
>>> e4 = Entity(name="Entity4", start=6, end=7, id="E4")
>>> r1 = Relation(src="E1", dst="E2", name="relates_to")
>>> r2 = Relation(src="E3", dst="E4", name="belongs_to")
>>> result = split_relations([r1, r2], [[e1, e2], [e3, e4]])
>>> len(result)
2
>>> result[0][0] == r1
True
>>> result[1][0] == r2
True
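
One way to picture the grouping is the sketch below. It is an illustration under stated assumptions, not the library's code: `Node`, `Link`, and `group_links` are hypothetical names, and the choice to drop relations whose endpoints fall in different sentences is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an entity; only the ID matters here."""
    id: str


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for a relation between two entity IDs."""
    src: str
    dst: str
    name: str


def group_links(
    links: Iterable[Link],
    nodes_per_sentence: Sequence[Sequence[Node]],
) -> list[list[Link]]:
    """Group links by the sentence that contains both of their endpoints."""
    # Look-up table: entity ID -> index of the sentence it belongs to.
    sentence_of = {
        node.id: idx
        for idx, nodes in enumerate(nodes_per_sentence)
        for node in nodes
    }
    grouped: list[list[Link]] = [[] for _ in nodes_per_sentence]
    for link in links:
        src_idx = sentence_of.get(link.src)
        # Assumption: keep a link only when both endpoints share a sentence.
        if src_idx is not None and src_idx == sentence_of.get(link.dst):
            grouped[src_idx].append(link)
    return grouped


nodes = [[Node("E1"), Node("E2")], [Node("E3"), Node("E4")]]
links = [
    Link("E1", "E2", "relates_to"),
    Link("E3", "E4", "belongs_to"),
    Link("E1", "E4", "crosses_sentences"),
]
grouped = group_links(links, nodes)
print([[link.name for link in group] for group in grouped])
# [['relates_to'], ['belongs_to']]
```

Building the ID look-up once keeps the grouping linear in the number of entities and relations.
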
architxt.nlp.utils.split_sentences(text)

Convert non-ASCII characters to ASCII and split the input text into sentences on line breaks.

Files in the brat annotation format commonly hold one sentence per line, which is why line breaks are treated as sentence boundaries.

Parameters:

text (str) – The input text to be split into sentences.

Return type:

list[str]

Returns:

A list of sentences, one per line of the input, with non-ASCII characters converted to ASCII.

>>> split_sentences("This is à test\nAnothér-test here")
['This is a test', 'Another-test here']
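
A comparable result can be obtained with the standard library alone, as in the sketch below. The function name `fold_and_split` is hypothetical, and the NFKD-based folding is an assumption chosen for illustration; the library may rely on a different transliteration method.

```python
import unicodedata


def fold_and_split(text: str) -> list[str]:
    """Fold accented characters to ASCII, then split on line breaks."""
    # NFKD separates base letters from their combining accents ("à" -> "a" + accent).
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" drops the accents and any other non-ASCII character.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_text.splitlines()


print(fold_and_split("This is à test\nAnothér-test here"))
# ['This is a test', 'Another-test here']
```

In this sketch, characters with no ASCII decomposition (for example "ß") are dropped entirely, which shortens the text. Entity offsets computed on the original text would then no longer line up, so a transliteration that substitutes a replacement for each character may be preferable when offsets matter.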