architxt.nlp

Functions

open_archive(archive_file)

raw_load_corpus(corpus_archives, languages, ...)

Asynchronously loads a set of corpora from disk or in-memory archives, parses them, and returns the enriched forest.

architxt.nlp.open_archive(archive_file)
Return type:

Union[ZipFile, TarFile]
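
The return type suggests that the function dispatches on the archive format. The following is a minimal sketch of that idea using only the standard library. `open_archive_sketch` is a hypothetical stand-in written for illustration, not the library's implementation; the accepted input types and the detection strategy are assumptions.

```python
import io
import tarfile
import zipfile
from pathlib import Path
from typing import BinaryIO, Union


def open_archive_sketch(
    archive_file: Union[str, Path, BinaryIO],
) -> Union[zipfile.ZipFile, tarfile.TarFile]:
    """Hypothetical stand-in: open a ZIP or TAR archive from a path or a binary stream."""
    is_path = isinstance(archive_file, (str, Path))

    # zipfile.is_zipfile accepts both paths and seekable file-like objects.
    if zipfile.is_zipfile(archive_file):
        if not is_path:
            archive_file.seek(0)  # rewind after the format probe
        return zipfile.ZipFile(archive_file)

    try:
        if is_path:
            return tarfile.open(archive_file)
        archive_file.seek(0)
        # The default mode detects gzip, bz2, and xz compression transparently.
        return tarfile.open(fileobj=archive_file)
    except tarfile.ReadError as exc:
        raise ValueError("Unsupported archive format") from exc
```

Both `ZipFile` and `TarFile` are context managers, so the result can be used in a `with` block whichever format was detected.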

async architxt.nlp.raw_load_corpus(corpus_archives, languages, *, parser, entities_filter=None, relations_filter=None, entities_mapping=None, relations_mapping=None, resolver_name=None, cache=True, sample=None)

Asynchronously loads a set of corpora from disk or in-memory archives, parses them, and returns the enriched forest.

This function handles both local and in-memory corpus archives, processes the data according to the specified filters and mappings, and parses the sentences with the provided parser (for example, one backed by a CoreNLP server). Caching can optionally be enabled to avoid repeated computations. The resulting forest is not yet a valid database instance: it needs to be passed to the automatic structuration algorithm first.

Parameters:
  • corpus_archives (Sequence[Union[str, Path, BytesIO, BinaryIO]]) – A list of corpus archive sources. Each source is either a path to a file on disk or an in-memory file-like object. The list can mix both kinds, and its length must match the length of languages.

  • languages (Sequence[str]) – A list of languages corresponding to each corpus archive. The number of languages must match the number of archives.

  • parser (Parser) – The parser to use to parse the sentences.

  • entities_filter (Optional[set[str]]) – A set of entity types to exclude from the output. If None, no filtering is applied.

  • relations_filter (Optional[set[str]]) – A set of relation types to exclude from the output. If None, no filtering is applied.

  • entities_mapping (Optional[dict[str, str]]) – A dictionary mapping entity names to new values. If None, no mapping is applied.

  • relations_mapping (Optional[dict[str, str]]) – A dictionary mapping relation names to new values. If None, no mapping is applied.

  • resolver_name (Optional[str]) – The name of the entity resolver to use. If None, no entity resolution is performed.

  • cache (bool) – A boolean flag indicating whether to cache the computed forest for faster future access.

  • sample (Optional[int]) – The number of examples to take from each corpus.

Return type:

list[Tree]

Returns:

A forest containing the parsed and enriched trees.
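
The call pattern described above can be sketched as follows. This is an illustration under stated assumptions, not an official example: `build_load_arguments` and `load_forest` are hypothetical helpers, and the `parser` argument is assumed to be an already constructed `Parser` instance, whose construction is not covered here. Only the `raw_load_corpus` signature documented above is relied upon.

```python
from typing import Any, Optional, Sequence


def build_load_arguments(
    corpus_archives: Sequence[Any],
    languages: Sequence[str],
    *,
    entities_filter: Optional[set[str]] = None,
    entities_mapping: Optional[dict[str, str]] = None,
    sample: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: check the documented length constraint up front."""
    if len(corpus_archives) != len(languages):
        raise ValueError(
            f"Got {len(corpus_archives)} archives but {len(languages)} languages"
        )
    return {
        "corpus_archives": list(corpus_archives),
        "languages": list(languages),
        "entities_filter": entities_filter,
        "entities_mapping": entities_mapping,
        "sample": sample,
    }


async def load_forest(parser: Any, arguments: dict[str, Any]) -> list:
    """Hypothetical wrapper: await raw_load_corpus with the prepared arguments."""
    # Imported lazily so the helper above stays usable without architxt installed.
    from architxt.nlp import raw_load_corpus

    options = dict(arguments)  # copy so the caller's dict is left untouched
    corpus_archives = options.pop("corpus_archives")
    languages = options.pop("languages")
    # parser is keyword-only in the documented signature.
    return await raw_load_corpus(corpus_archives, languages, parser=parser, **options)
```

Because `raw_load_corpus` is a coroutine function, synchronous code would typically run it with something like `asyncio.run(load_forest(parser, arguments))`. The archive names, language strings, and entity type names passed to the helper are placeholders to be replaced with values that match the actual corpus.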

Modules

brat

Dataset loader for BRAT (BRAT Rapid Annotation Tool) format.

entity_resolver

model

parser

utils