Loading textual datas

Loading textual datas#

Preparing a Text Corpora#

ArchiTXT requires your text corpora to be in a Gzipped Tar archive using the BRAT annotation format. In this format, the text and annotations are stored in two files with the same name but different extensions (.txt for text and .ann for annotations). For example, doc-123.txt contains the text, while doc-123.ann holds the annotations (named entities, relations, events, etc.). Your archive should look similar to this :

corpus.tar.gz
├── doc-123.txt
├── doc-123.ann
├── doc-124.txt
├── doc-124.ann
├── doc-125.txt
├── doc-125.ann
└── ...

Important

One Sentence per Line: You should have one sentence per line in your text file.
Non-Overlapping Annotations: The start and end offsets for each annotated entity should not overlap with any other entity in the same sentence. Overlapping spans are not yet handle and will be ignored by ArchiTXT.

Annotations#

While ArchiTXT does not include built-in data extraction tools, it operates on text that is accompanied by annotations. These annotations can be directly provided to ArchiTXT or loaded from corpora formatted in the BRAT format, packaged as a zip or tar-archived folder.

Supported annotation types include:

Named Entities: Specific segments of text representing objects or concepts with distinct names or types. For example, the sentence “Alice likes apples”, “Alice” can be annotated as a Person entity, and “apples” as a Fruit entity.
Relations: Binary connections between two entities that express semantic relationships. In “Alice likes apples”, the word “likes” indicates a possible relation between the Person entity “Alice” and the Fruit entity “apples”.

Named entities are interpreted as entities within the ArchiTXT meta-model.

BRAT Annotation Format#

Annotations in BRAT are defined in a simple, tab-delimited format. Each annotation line follows this structure:

T<ID> <Entity_Type> <Start_Offset> <End_Offset> <Entity_Text>

T<ID>: A unique identifier for the annotation (e.g., T1, T2, etc.).
Entity_Type: The category or type of the entity (e.g., Person, Location, Animal).
Start_Offset and End_Offset: The character positions in the sentence where the entity begins and ends.
Entity_Text: The exact text span that has been annotated.

Consider the following sentence in your text file (document.txt):

The quick brown fox jumps over the lazy dog.

An accompanying annotation file (document.ann) might look like:

T1 Animal 16 19 fox
T2 Animal 35 39 dog

In this example:

T1 annotates the entity “fox” starting at character position 16 and ending at 19.
T2 annotates the entity “dog” starting at position 35 and ending at 39.
Both annotations are non-overlapping and correspond to entities in the sentence.

Text to trees#

Consider the example sentence:

The heart rate was 100 bpm

With the following named entities:

“heart rate” annotated as a Sign or Symptom (SOSY)
“100” annotated as a Value
“bpm” annotated as a Unit

ArchiTXT start by extracting the syntax tree of each text. The raw syntax tree for the example sentence is as follows:

        ---
config:
  theme: neutral
---
graph TD;
    S --> NP1["NP"]
    S --> VP
    NP1 --> DT
    NP1 --> NN1["NN"]
    NP1 --> NN2["NN"]
    DT --> The
    NN1 --> heart
    NN2 --> rate
    VP --> VBD
    VBD --> was
    VP --> NP2["NP"]
    NP2 --> CD
    CD --> 100
    NP2 --> NN3["NN"]
    NN3 --> bpm

Entities are then embedded into the tree:

        ---
config:
  theme: neutral
---
graph TD;
    S --> NP1["NP"]
    S --> VP
    NP1 --> DT
    DT --> The
    NP1 --> sosy["ENT SOSY"]
    sosy --> NN1["NN"]
    sosy --> NN2["NN"]
    NN1 --> heart
    NN2 --> rate
    VP --> VBD
    VBD --> was
    VP --> NP2["NP"]
    NP2 --> CD
    CD --> value["ENT VALUE"]
    value --> 100
    NP2 --> NN3["NN"]
    NN3 --> unit["ENT UNIT"]
    unit --> bpm

The tree is then simplified by removing unnecessary branches, focusing only on meaningful entities:

        ---
config:
  theme: neutral
---
graph TD;
    S --> NP1["NP"]
    S --> VP
    NP1 --> sosy["ENT SOSY"]
    sosy --> NN1["NN"]
    sosy --> NN2["NN"]
    NN1 --> heart
    NN2 --> rate
    VP --> NP2["NP"]
    NP2 --> CD
    CD --> value["ENT VALUE"]
    value --> 100
    NP2 --> NN3["NN"]
    NN3 --> unit["ENT UNIT"]
    unit --> bpm

Finally, the tree is reduced by eliminating unnecessary nodes:

        ---
config:
  theme: neutral
---
graph TD;
    S --> sosy["ENT SOSY"]
    sosy --> heart
    sosy --> rate
    S --> VP
    VP --> value["ENT VALUE"]
    value --> 100
    VP --> unit["ENT UNIT"]
    unit --> bpm

Parse your documents#

ArchiTXT provides multiple parsing backends to process documents and extract structured representations.

CoreNLP#

ArchiTXT can use CoreNLP to process the documents. To use this, you need to have a CoreNLP server running with the appropriate language model installed.

Note

A pre-configured Docker setup for a CoreNLP server (supporting both English and French) is available in the GitHub repository. This setup is intended for development use only. For production deployment, consult the official CoreNLP documentation.

To initialize a architxt.nlp.parser.corenlp.CoreNLPParser in ArchiTXT, use:

from architxt.nlp.parser.corenlp import CoreNLPParser

parser = CoreNLPParser(corenlp_url='http://localhost:9000')

Benepar/SpaCy#

ArchiTXT also supports the Benepar parser, which integrates with SpaCy for syntactic parsing. To initialize a architxt.nlp.parser.benepar.BeneparParser, use:

from architxt.nlp.parser.benepar import BeneparParser

parser = BeneparParser(spacy_models={
    'English': 'en_core_web_md',
    'French': 'fr_core_news_md',
})

You need to specify the SpaCy models to use for each language, and they must be installed beforehand. SpaCy provides various models (sm, md, lg) with different sizes and capabilities. You can install them using:

python -m spacy download en_core_web_md
python -m spacy download fr_core_news_md

For a full list of available models, visit the SpaCy model directory.

Entity resolution#

Note

This section is incomplete. Refer to architxt.nlp.entity_resolver for relevant implementation details.

Caching#

Note

This section is incomplete. Refer to architxt.nlp.raw_load_corpus() for relevant implementation details.

Loading textual datas

Contents

Loading textual datas#

Preparing a Text Corpora#

Annotations#

BRAT Annotation Format#

Text to trees#

Parse your documents#

CoreNLP#

Benepar/SpaCy#

Entity resolution#

Caching#