architxt.nlp

Functions

`open_archive`(archive_file)
`raw_load_corpus`(corpus_archives, languages, ...)	Asynchronously loads a set of corpora from disk or in-memory archives, parses them, and returns the enriched forest.

architxt.nlp.open_archive(archive_file)

Return type: TarFile
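
The page gives no description for open_archive beyond its TarFile return type. As a rough illustration only, the sketch below opens a tar archive from either a path or an in-memory stream using the standard library. The helper name open_tar_source and its branching are assumptions for illustration, not architxt's actual implementation.

```python
from __future__ import annotations

import io
import tarfile
from pathlib import Path
from typing import BinaryIO


def open_tar_source(source: str | Path | BinaryIO) -> tarfile.TarFile:
    """Hypothetical helper: open a tar archive from a path or a binary stream."""
    if isinstance(source, (str, Path)):
        # "r:*" lets tarfile detect the compression transparently.
        return tarfile.open(name=source, mode="r:*")
    return tarfile.open(fileobj=source, mode="r:*")


# Build a tiny gzip-compressed archive in memory to exercise the stream branch.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    payload = b"Hello corpus\n"
    info = tarfile.TarInfo(name="doc1.txt")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

with open_tar_source(buffer) as archive:
    member_names = archive.getnames()
```

The same helper accepts a str or Path pointing to an archive on disk, which mirrors the two kinds of sources that raw_load_corpus is documented to accept.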

async architxt.nlp.raw_load_corpus(corpus_archives, languages, *, parser, entities_filter=None, relations_filter=None, entities_mapping=None, relations_mapping=None, resolver=None, extractor=None, cache=True, sample=None, batch_size=BATCH_SIZE)

Asynchronously loads a set of corpora from disk or in-memory archives, parses them, and returns the enriched forest.

This function handles both local and in-memory corpus archives, processes the data according to the specified filters and mappings, and uses the provided parser (such as a CoreNLP server) to parse the sentences. Caching is enabled by default (cache=True) to avoid repeated computations. The resulting forest is not a valid database instance; it must first be passed to the automatic structuration algorithm.

Parameters:

corpus_archives (Sequence[str | Path | BytesIO | BinaryIO]) – A list of corpus archive sources. Each item is either a path to a file on disk or an in-memory file-like object. Local and in-memory sources can be mixed, and the list must have the same length as languages.
languages (Sequence[str]) – A list of languages corresponding to each corpus archive. The number of languages must match the number of archives.
parser (Parser) – The parser to use to parse the sentences.
entities_filter (set[str] | None) – A set of entity types to exclude from the output. If None, no filtering is applied.
relations_filter (set[str] | None) – A set of relation types to exclude from the output. If None, no filtering is applied.
entities_mapping (dict[str, str] | None) – A dictionary mapping entity names to new values. If None, no mapping is applied.
relations_mapping (dict[str, str] | None) – A dictionary mapping relation names to new values. If None, no mapping is applied.
extractor (EntityExtractor | None) – The entity extractor to use. If None, no extra entity extraction is performed.
resolver (EntityResolver | None) – The entity resolver to use. If None, no entity resolution is performed.
cache (bool) – A boolean flag indicating whether to cache the computed forest for faster future access.
sample (int | None) – The number of examples to take in each corpus.
batch_size (int) – The number of sentences to process in each batch. This parameter is used to control the memory usage.
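
Since corpus_archives and languages must have the same length, it can help to check the pairing before making the call. The following is a minimal sketch of that preparation step. The helper pair_archives_with_languages, the file name, the language strings, and the label names are all placeholders chosen for illustration, and how architxt itself reacts to a length mismatch is not documented here.

```python
from __future__ import annotations

import io
from pathlib import Path
from typing import BinaryIO, Sequence


def pair_archives_with_languages(
    corpus_archives: Sequence[str | Path | BinaryIO],
    languages: Sequence[str],
) -> list[tuple[str | Path | BinaryIO, str]]:
    """Hypothetical pre-check: pair each archive with its language, or fail early."""
    if len(corpus_archives) != len(languages):
        raise ValueError(
            f"{len(corpus_archives)} archives but {len(languages)} languages"
        )
    return list(zip(corpus_archives, languages))


# One on-disk source and one in-memory source (placeholder values).
corpus_archives = [Path("corpus_a.tar.gz"), io.BytesIO(b"")]
languages = ["French", "English"]
pairs = pair_archives_with_languages(corpus_archives, languages)

# Filters are sets of labels to exclude; mappings rename labels (placeholder labels).
entities_filter = {"DATE"}
entities_mapping = {"PER": "Person"}
```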

Return type:

AsyncGenerator[Tree, None]

Returns:

A forest containing the parsed and enriched trees.
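
Because the return type is an AsyncGenerator, the trees are consumed with async for inside a coroutine rather than by awaiting a single result. The sketch below shows that consumption pattern with a stand-in generator, so it runs without architxt or a parser. The names fake_load_corpus and collect are hypothetical, and the stand-in yields plain strings instead of Tree objects.

```python
from __future__ import annotations

import asyncio
from collections.abc import AsyncGenerator


async def fake_load_corpus(
    sentences: list[str], *, sample: int | None = None
) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for an async corpus loader; yields one item per sentence."""
    selected = sentences if sample is None else sentences[:sample]
    for sentence in selected:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield sentence.upper()


async def collect(stream: AsyncGenerator[str, None]) -> list[str]:
    # An async generator is iterated with `async for`; it cannot be awaited directly.
    return [item async for item in stream]


trees = asyncio.run(collect(fake_load_corpus(["a b", "c d", "e f"], sample=2)))
```

With the real function, the loop would presumably take the form async for tree in raw_load_corpus(corpus_archives, languages, parser=parser), where parser is an already constructed Parser instance. Streaming the trees this way avoids holding the whole forest in memory unless the caller chooses to collect it.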

Modules

`brat`	Dataset loader for BRAT (BRAT Rapid Annotation Tool) format.
`entity_extractor`
`entity_resolver`
`model`
`parser`
`utils`
