Loading document databases

Parsing nested data structures

Parsing is performed by read_tree(). This function traverses nested Python structures and builds a corresponding Tree according to the following rules:

A dict becomes a GROUP node, where each key/value pair is parsed into a subtree.
A list becomes a COLL node, where each element is parsed into a subtree.
A scalar value (e.g., str, int, float, bool) becomes an ENT node wrapping the value.
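The three rules above can be illustrated with a minimal sketch. It is not ArchiTXT's implementation: the Node class, the label format, and the function name sketch_read_tree are hypothetical stand-ins chosen for illustration, and the real read_tree() returns the library's own Tree type.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not ArchiTXT's Tree class)."""

    label: str                                   # e.g. "GROUP ROOT", "COLL tags", "ENT name"
    children: list[Node] = field(default_factory=list)
    value: Any = None                            # only set on ENT leaves


def sketch_read_tree(data: Any, name: str = "ROOT") -> Node:
    """Recursively map nested Python data onto GROUP / COLL / ENT nodes."""
    if isinstance(data, dict):
        # A dict becomes a GROUP; each key/value pair is parsed into a subtree.
        return Node(f"GROUP {name}", [sketch_read_tree(v, str(k)) for k, v in data.items()])
    if isinstance(data, list):
        # A list becomes a COLL; each element is parsed into a subtree.
        return Node(f"COLL {name}", [sketch_read_tree(item, name) for item in data])
    # Anything else is treated as a scalar and wrapped in an ENT leaf.
    return Node(f"ENT {name}", value=data)


tree = sketch_read_tree({"name": "Ada", "tags": ["x", "y"]})
```

Here tree is a GROUP with two children: an ENT wrapping "Ada" and a COLL holding one ENT per tag.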

Transforming Raw Trees

Warning

The transformation described here is specifically designed for tree-like data. Applying it to arbitrary or improperly structured trees may result in invalid or incoherent outputs.

Once a raw tree is constructed, it can be transformed into a flattened structure aligned with the metamodel using parse_document_tree().

This transformation:

Converts nested GROUP nodes into REL nodes, establishing explicit relationships between parent and child subtrees.
Duplicates the parent node for each nested group while retaining only its direct ENT children as part of the GROUP.
Produces a forest when the root of the raw tree is a COLL, with one tree per collection element.
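One way to picture these rules is the sketch below, which works on plain Python data rather than on Tree objects. It is an interpretation for illustration only: the tuple encoding of GROUP and REL nodes and the names group_of, relations, and sketch_parse_document are assumptions, and the actual output of parse_document_tree() may be structured differently. The sketch assumes every document is a dict and ignores lists of scalars.

```python
from typing import Any


def group_of(name: str, data: dict) -> tuple:
    """Build a GROUP that keeps only the direct scalar (ENT) fields of a dict."""
    scalars = {k: v for k, v in data.items() if not isinstance(v, (dict, list))}
    return ("GROUP", name, scalars)


def relations(name: str, data: dict) -> list[tuple]:
    """Emit one REL per nested group, repeating the parent GROUP each time."""
    parent = group_of(name, data)
    rels = []
    for key, value in data.items():
        candidates = value if isinstance(value, list) else [value]
        for child in candidates:
            if isinstance(child, dict):
                rels.append(("REL", parent, group_of(key, child)))
                rels.extend(relations(key, child))  # handle deeper nesting
    return rels


def sketch_parse_document(raw: Any, name: str = "ROOT") -> list[list[tuple]]:
    """Return a forest: one list of nodes per document (assumes dict documents)."""
    documents = raw if isinstance(raw, list) else [raw]
    forest = []
    for doc in documents:
        rels = relations(name, doc)
        # A document without nested groups is represented by its GROUP alone.
        forest.append(rels if rels else [group_of(name, doc)])
    return forest


raw = [{"title": "T", "author": {"name": "Ada"}}, {"title": "U"}]
forest = sketch_parse_document(raw)
```

With this input, the first document yields a single REL linking a copy of the root GROUP (holding only title) to the author GROUP, and the second yields a lone GROUP, so the forest has two entries.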

Supported File Formats

ArchiTXT supports the following document formats through pluggable parsers. Each format is handled by a specific backend parser:

JSON: json.load()
TOML: toml.loads()
YAML: ruamel.yaml.YAML.load_all()

XML: xmltodict.parse()
CSV: pandas.read_csv()
Excel: pandas.read_excel()

Important

Parsers are applied in order; if none succeed, a ValueError is raised.
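The try-in-order behaviour can be sketched as follows. This is a generic illustration, not ArchiTXT's loader: parse_with_fallback and the toy parse_key_values parser are hypothetical names, and only json.loads is a real backend call. Catching every Exception is a simplification of this sketch.

```python
import json
from typing import Any, Callable


def parse_key_values(text: str) -> dict[str, str]:
    """Hypothetical toy parser for 'key=value' lines, used only for the demo."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if "=" not in line:
            raise ValueError(f"not a key=value line: {line!r}")
        key, value = line.split("=", 1)
        result[key.strip()] = value.strip()
    if not result:
        raise ValueError("empty input")
    return result


def parse_with_fallback(text: str, parsers: list[Callable[[str], Any]]) -> Any:
    """Try each parser in order; raise ValueError if none can read the input."""
    errors = []
    for parser in parsers:
        try:
            return parser(text)
        except Exception as exc:  # a failed parser just hands over to the next one
            errors.append(f"{parser.__name__}: {exc}")
    raise ValueError("no parser could read the input: " + "; ".join(errors))


parsers = [json.loads, parse_key_values]
from_json = parse_with_fallback('{"a": 1}', parsers)
from_pairs = parse_with_fallback("a = 1", parsers)
```

Because the order matters, the stricter parser is listed first here: an input that is valid JSON is never handed to the looser key/value parser.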