architxt.database.loader.documents

Functions

`parse_document_tree`(tree, *[, sample])	Parse a document tree and yield processed subtrees based on collection grouping.
`parse_file`(file)	Parse a document database file like XML, JSON, or CSV.
`read_document`(file, *[, raw_read, ...])	Read the file as a data tree.
`read_document_file`(file)	Read and parse a document file like XML, JSON, or CSV.
`read_tree`(data, *[, root_name])	Recursively converts a document nested structure into a tree.
`traverse_tree`(tree, *[, sample])	Recursively traverses and transforms a nested tree into a valid metamodel structure.

architxt.database.loader.documents.parse_document_tree(tree, *, sample=0)

Parse a document tree and yield processed subtrees based on collection grouping.

If the root node is not a collection, the entire tree is processed and a single result is yielded.
If the root node is a collection, each child subtree is individually processed and yielded.

TODO: Enhance tree decomposition for nested collections. If no collection exists at the root level, consider splitting at the closest collection and duplicating the path to the root for each collection element.

Parameters:

tree (Tree) – The nested tree to be parsed.
sample (int) – Maximum number of samples to get for each collection. If 0, all samples are returned.

Yields:

Trees representing the database.

Return type:

Generator[Tree, None, None]
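
The root-level splitting rule described above can be illustrated with a small standalone sketch. It does not use architxt: `Node` and `split_on_root_collection` are hypothetical names introduced here, the per-subtree processing step is left out, and only the "collection vs. non-collection root" branching and the `sample` limit are modelled.

```python
from dataclasses import dataclass, field
from itertools import islice
from typing import Iterator


@dataclass
class Node:
    """Hypothetical stand-in for a tree node; not the architxt Tree class."""

    label: str
    kind: str  # "collection", "group", or "entity"
    children: list["Node"] = field(default_factory=list)


def split_on_root_collection(tree: Node, *, sample: int = 0) -> Iterator[Node]:
    """Yield one subtree per collection element, or the whole tree otherwise."""
    if tree.kind != "collection":
        # Root is not a collection: the whole tree is the single result.
        yield tree
        return

    # Root is a collection: yield each child; sample=0 means "no limit".
    limit = sample if sample > 0 else None
    yield from islice(tree.children, limit)
```

Because the function is a generator, a large collection can be consumed one element at a time, and `sample` only caps how many elements are taken from the front of the collection in this sketch.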

architxt.database.loader.documents.parse_file(file)

Parse a document database file like XML, JSON, or CSV.

Parameters:

file (Union[BytesIO, BinaryIO]) – A file-like object opened for reading.

Return type:

Union[dict[str, Any], list[Any]]

Returns:

The parsed content of the file as a Python nested object.

Raises:

ValueError – If none of the available parsers are able to process the input file.
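
The "try each parser, raise `ValueError` if none succeeds" behaviour can be sketched with the standard library alone. This is an illustration under assumptions, not the architxt implementation: `parse_with_fallback`, `PARSERS`, and the two helper parsers are hypothetical names, the XML branch is deliberately simplified, and CSV is omitted because a CSV reader accepts almost any text.

```python
import io
import json
import xml.etree.ElementTree as ET
from typing import Any, BinaryIO, Callable


def _parse_json(raw: bytes) -> Any:
    # json.loads accepts bytes and raises a ValueError subclass on failure.
    return json.loads(raw)


def _parse_xml(raw: bytes) -> Any:
    # Simplified: keep only the root tag and its text content.
    root = ET.fromstring(raw)
    return {root.tag: root.text}


# Hypothetical parser registry, tried in order.
PARSERS: list[Callable[[bytes], Any]] = [_parse_json, _parse_xml]


def parse_with_fallback(file: BinaryIO) -> Any:
    """Return the result of the first parser that accepts the content."""
    raw = file.read()
    for parser in PARSERS:
        try:
            return parser(raw)
        except (ValueError, ET.ParseError):
            continue  # This parser rejected the input; try the next one.
    raise ValueError("no available parser could process the input file")
```

Reading the content once into memory keeps the sketch simple; a real loader might instead rewind the file object between attempts.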

architxt.database.loader.documents.read_document(file, *, raw_read=False, root_name='ROOT', sample=0)

Read the file as a data tree.

XML is parsed according to https://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html

Parameters:

file (Union[str, Path, BytesIO, BinaryIO]) – The document file to read.
raw_read (bool) – If enabled, the tree corresponds to the document without any transformation applied.
root_name (str) – The root node name.
sample (int) – Maximum number of samples to get for each collection. If 0, all samples are returned.

Return type:

Generator[Tree, None, None]

Returns:

A generator of trees representing the database.

architxt.database.loader.documents.read_document_file(file)

Read and parse a document file like XML, JSON, or CSV.

Parameters:

file (Union[str, Path, BytesIO, BinaryIO]) – The document database file to read.

Return type:

Union[dict[str, Any], list[Any]]

Returns:

The parsed contents of the file.

Raises:

FileNotFoundError – If the file does not exist.
OSError – If the file cannot be read.
ValueError – If the file cannot be read or is empty.

architxt.database.loader.documents.read_tree(data, *, root_name='ROOT')

Recursively converts a document nested structure into a tree.

Dictionaries are treated as groups.
Lists are treated as collections.
Leaf elements are treated as entities.

If a list contains only a single collection, the function flattens the output by returning that collection directly instead of nesting it under another collection node.

Parameters:

data (Union[dict[str, Any], list[Any]]) – The input data structure to be converted into a Tree.
root_name (str) – The label for the current node.

Return type:

Tree

Returns:

A nested tree structure corresponding to the input data.
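
The mapping rules above (dictionaries to groups, lists to collections, leaves to entities, plus the single-collection flattening) can be sketched without architxt. The tuple-based node shape and the name `to_tree` are hypothetical, and labelling list elements with the parent's name is an assumption made for this illustration only.

```python
from typing import Any

# Hypothetical node shape for illustration: (kind, label, children-or-value).
Node = tuple[str, str, Any]


def to_tree(data: Any, *, root_name: str = "ROOT") -> Node:
    """Convert nested dicts/lists into (kind, label, payload) tuples."""
    if isinstance(data, dict):
        # Each key becomes the label of the corresponding child node.
        children = [to_tree(value, root_name=key) for key, value in data.items()]
        return ("group", root_name, children)

    if isinstance(data, list):
        children = [to_tree(item, root_name=root_name) for item in data]
        # Flatten a list whose only element is itself a collection.
        if len(children) == 1 and children[0][0] == "collection":
            return children[0]
        return ("collection", root_name, children)

    # Anything else is a leaf value.
    return ("entity", root_name, data)
```

For instance, `to_tree([[1, 2]])` returns the inner collection directly rather than a collection wrapping another collection.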

architxt.database.loader.documents.traverse_tree(tree, *, sample=0)

Recursively traverses and transforms a nested tree into a valid metamodel structure.

The function extracts entity nodes and groups them under a single group node. It then establishes relations between this group and any nested subgroups.

Parameters:

tree (Tree) – The tree to traverse and transform.
sample (int) – Maximum number of samples to get for each collection. If 0, all samples are returned.

Return type:

tuple[Tree, Tree]

Returns:

A tuple containing:

- The group to anchor to for the parent relationship.
- The transformed tree, with subgroups converted to relations.
