Loading document databases

See also

Fundamentals

Overview of ArchiTXT’s internal data representation.

ArchiTXT supports loading document databases stored in formats such as JSON, XML, YAML, TOML, CSV, and Excel through the architxt.database.loader.documents module. These documents are converted into a data Tree that conforms to the metamodel.

The document-to-tree conversion process involves three steps:

  1. Read: Detect and parse the input file into a native Python nested structure composed of dict and list objects.

  2. Parse: Convert the Python structure into a Tree composed of COLL, GROUP, and ENT nodes.

  3. Transform: Optionally extract relationships implied by nested groups, transforming the tree to align it with the target metamodel.

Splitting the process into three steps makes it possible to parse not only the supported data formats but also arbitrary Python data structures. The resulting raw trees are not considered valid on their own, but they can be combined with syntax trees before applying more advanced structuring algorithms.

Parsing nested data structures

The parsing process is performed via read_tree(). This function traverses nested Python structures and constructs a corresponding Tree based on the following rules:

  • A dict becomes a GROUP node, where each key/value pair is parsed into a subtree.

  • A list becomes a COLL node, where each element is parsed into a subtree.

  • A scalar value (e.g., str, int, float, bool) becomes an ENT node wrapping the value.
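The three rules above can be sketched in a few lines of plain Python. This is only an illustration of the mapping, not ArchiTXT's implementation: the Node class and the to_tree() function are hypothetical stand-ins for the library's Tree and read_tree(), and the choice to name collection elements after their collection is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """Hypothetical stand-in for a tree node (not ArchiTXT's Tree class)."""
    label: str                                   # e.g. "GROUP profile"
    children: list = field(default_factory=list)


def to_tree(value: Any, name: str = "root") -> Node:
    """Map nested Python data to COLL / GROUP / ENT nodes."""
    if isinstance(value, dict):
        # dict -> GROUP, one subtree per key/value pair
        return Node(f"GROUP {name}", [to_tree(v, str(k)) for k, v in value.items()])
    if isinstance(value, list):
        # list -> COLL, one subtree per element (elements reuse the name: assumption)
        return Node(f"COLL {name}", [to_tree(item, name) for item in value])
    # scalar -> ENT wrapping the value
    return Node(f"ENT {name}", [value])


tree = to_tree([{"userId": 1, "profile": {"firstName": "John"}}], "users")
print(tree.label)                          # COLL users
print(tree.children[0].children[1].label)  # GROUP profile
```

Because the function only inspects Python types, it works on any nested dict/list structure regardless of the file format it came from.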

Example

Consider the following JSON document:

[
    {
        "userId": 1,
        "username": "johndoe",
        "profile": {
            "firstName": "John",
            "lastName": "Doe",
            "birthDate": "1990-01-01"
        }
    }
]

This input is converted into the following tree structure:

---
config:
  theme: neutral
---
graph TD
    users["COLL users"]

    users --> user["GROUP user"]
    user --> userId["ENT userId"] --> userIdVal["1"]
    user --> username["ENT username"] --> usernameVal["johndoe"]

    user --> profile["GROUP profile"]
    profile --> firstName["ENT firstName"] --> firstNameVal["John"]
    profile --> lastName["ENT lastName"] --> lastNameVal["Doe"]
    profile --> birthDate["ENT birthDate"] --> birthDateVal["1990-01-01"]
    

Transforming Raw Trees

Warning

The transformation described here is specifically designed for tree-like data. Applying it to arbitrary or improperly structured trees may result in invalid or incoherent outputs.

Once a raw tree is constructed, it can be transformed into a flattened structure aligned with the metamodel using parse_document_tree().

This transformation:

  • Converts nested GROUP nodes into REL nodes, establishing explicit relationships between parent and child subtrees.

  • Duplicates the parent node for each nested group while retaining only its direct ENT children as part of the GROUP.

  • Produces a forest when the root of the raw tree is a COLL, constructing one tree per collection element.
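The flattening idea can be illustrated on plain dictionaries. The sketch below is not the code behind parse_document_tree(); the helper names (split_group, extract_relations, transform) and the tuple-based output are assumptions chosen for readability, and lists nested inside a record are simply ignored here.

```python
def split_group(record: dict) -> tuple[dict, dict]:
    """Separate a record's scalar fields from its nested dict fields."""
    scalars = {k: v for k, v in record.items() if not isinstance(v, (dict, list))}
    nested = {k: v for k, v in record.items() if isinstance(v, dict)}
    return scalars, nested


def extract_relations(name: str, record: dict) -> list[tuple]:
    """Emit one (relation name, parent group, child group) per nested group."""
    scalars, nested = split_group(record)
    relations = []
    for child_name, child in nested.items():
        child_scalars, _ = split_group(child)
        # The parent group is duplicated for each nested group and keeps
        # only its direct scalar (ENT-like) fields.
        relations.append(
            (f"{name}<->{child_name}", {name: dict(scalars)}, {child_name: child_scalars})
        )
        # Deeper nesting yields further relations rooted at the child.
        relations.extend(extract_relations(child_name, child))
    return relations


def transform(name: str, data) -> list[list[tuple]]:
    """A list at the root yields a forest: one result per element."""
    records = data if isinstance(data, list) else [data]
    return [extract_relations(name, record) for record in records]


doc = [{"userId": 1, "username": "johndoe",
        "profile": {"firstName": "John", "lastName": "Doe"}}]
for relations in transform("user", doc):
    for rel_name, parent, child in relations:
        print(rel_name, parent, child)
```

Each printed tuple plays the role of one REL node linking two flat groups.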

Example

Given the raw tree from the previous example, the transformation produces the following structure that conforms to the ArchiTXT metamodel:

---
config:
  theme: neutral
---
graph TD
    root["ROOT"]

    root --> coll["COLL user<->profile"]
    coll --> rel["REL user<->profile"]

    rel --> user["GROUP user"]
    user --> userId["ENT userId"] --> userIdVal["1"]
    user --> username["ENT username"] --> usernameVal["johndoe"]

    rel --> profile["GROUP profile"]
    profile --> firstName["ENT firstName"] --> firstNameVal["John"]
    profile --> lastName["ENT lastName"] --> lastNameVal["Doe"]
    profile --> birthDate["ENT birthDate"] --> birthDateVal["1990-01-01"]
    

Supported File Formats

ArchiTXT supports a wide range of document formats through pluggable parsers. Each format is handled by a specific backend parser:

  • JSON: json.load()

  • TOML: toml.loads()

  • YAML: ruamel.yaml.YAML.load_all()

Important

Parsers are applied in order; if none succeed, a ValueError is raised.
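A try-in-order strategy of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not ArchiTXT's loader: parse_with_fallback() and default_parsers() are hypothetical names, the standard-library tomllib module (Python 3.11+) stands in for the toml package listed above, and YAML is left out to keep the example dependency-free.

```python
import json
from typing import Any, Callable, Optional, Sequence

try:
    import tomllib  # standard library from Python 3.11 (stand-in for `toml`)
except ModuleNotFoundError:  # older interpreters: skip the TOML step
    tomllib = None

Parser = Callable[[str], Any]


def default_parsers() -> list:
    """Ordered list of parsers to try; JSON first, then TOML if available."""
    parsers = [json.loads]
    if tomllib is not None:
        parsers.append(tomllib.loads)
    return parsers


def parse_with_fallback(text: str, parsers: Optional[Sequence[Parser]] = None) -> Any:
    """Return the result of the first parser that accepts the input."""
    for parser in parsers or default_parsers():
        try:
            return parser(text)
        except ValueError:
            # json.JSONDecodeError and tomllib.TOMLDecodeError both
            # subclass ValueError, so one except clause covers them.
            continue
    raise ValueError("no parser could read the input")


print(parse_with_fallback('[{"userId": 1}]'))  # [{'userId': 1}]
```

Order matters in such a chain: a permissive parser placed first could accept input meant for a stricter format, so stricter parsers are usually tried first.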