architxt.schema

Classes

Schema(start, productions[, ...])

class architxt.schema.Schema(start, productions, calculate_leftcorners=True)

Bases: CFG

classmethod from_description(*, groups=None, rels=None, collections=True)

Create a Schema from a description of groups, relations, and collections.

Parameters:
  • groups (Optional[dict[str, set[str]]]) – A dictionary mapping group names to sets of entities.

  • rels (Optional[dict[str, tuple[str, str]]]) – A dictionary mapping relation names to tuples of group names.

  • collections (bool) – Whether to generate collection productions.

Return type:

Schema

Returns:

A Schema object.
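The parameter types above imply a simple input shape. The following is a minimal sketch, using hypothetical group and relation names, that builds the two dictionaries and checks that every relation refers to declared groups. The `check_description` helper is illustrative only and is not part of architxt.

```python
# Hypothetical description: all names below are illustrative only.
groups: dict[str, set[str]] = {
    "Patient": {"name", "age"},
    "Treatment": {"drug", "dose"},
}
rels: dict[str, tuple[str, str]] = {
    "Receives": ("Patient", "Treatment"),
}


def check_description(groups, rels):
    """Return the names of relations whose endpoints are not declared groups."""
    return sorted(
        name
        for name, (source, target) in rels.items()
        if source not in groups or target not in groups
    )


# An empty list means every relation endpoint is a declared group.
print(check_description(groups, rels))
```

With architxt installed, dictionaries of this shape would be passed as `Schema.from_description(groups=groups, rels=rels)`; the leading `*` in the signature means both arguments are keyword-only.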

classmethod from_forest(forest, *, keep_unlabelled=True, merge_lhs=True)

Create a Schema from a given forest of trees.

Parameters:
  • forest (Union[Collection[Tree], Iterable[Tree]]) – The input forest from which to derive the schema.

  • keep_unlabelled (bool) – Whether to keep uncategorized nodes in the schema.

  • merge_lhs (bool) – Whether to merge nodes in the schema.

Return type:

Schema

Returns:

A CFG-based schema representation.

as_cfg()

Convert the schema to a CFG representation.

Return type:

str

Returns:

The schema as a list of production rules, each terminated by a semicolon.
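To illustrate the output shape described above, here is a rough sketch that serializes productions as one rule per line, each ending in a semicolon. The `->` arrow, the symbol names, and the `productions_to_text` helper are assumptions made for illustration; the exact text produced by `as_cfg()` may differ.

```python
def productions_to_text(productions):
    """Serialize (lhs, rhs_symbols) pairs, one rule per line, each ending in ';'."""
    return "\n".join(f"{lhs} -> {' '.join(rhs)};" for lhs, rhs in productions)


# Hypothetical productions: a root that expands to a group, and a group
# that expands to two entities.
rules = [
    ("ROOT", ["Patient"]),
    ("Patient", ["name", "age"]),
]
text = productions_to_text(rules)
print(text)
```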

as_cypher()

Convert the schema to a Cypher representation.

It only defines indexes and constraints, as property graph databases do not have a fixed schema.

TODO: Implement this method.

Return type:

str

Returns:

The schema as a Cypher creation script defining constraints and indexes.

as_sql()

Convert the schema to an SQL representation.

TODO: Implement this method.

Return type:

str

Returns:

The schema as an SQL creation script.

extract_datasets(forest)

Extract datasets from a forest for each group defined in the schema.

Parameters:

forest (Collection[Tree]) – The input forest to extract datasets from.

Return type:

dict[str, DataFrame]

Returns:

A mapping from group names to datasets.

extract_valid_trees(forest)

Filter and return a valid instance (according to the schema) of the provided forest.

It removes any subtree whose label is not valid in the schema and eliminates redundant collections.

Parameters:

forest (Collection[Tree]) – The input forest to be cleaned.

Return type:

Collection[Tree]

Returns:

A list of valid trees according to the schema.
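The label-filtering step can be sketched on a toy tree representation. This is not the library's implementation: trees here are plain `(label, children)` pairs rather than architxt `Tree` objects, the `prune` helper is hypothetical, and the removal of redundant collections is not covered.

```python
def prune(tree, valid_labels):
    """Return a copy of `tree` without subtrees whose label is not valid.

    A tree is a (label, children) pair; leaves are plain strings and are kept.
    Returns None when the root itself has an invalid label.
    """
    if isinstance(tree, str):
        return tree
    label, children = tree
    if label not in valid_labels:
        return None
    kept = [prune(child, valid_labels) for child in children]
    return (label, [child for child in kept if child is not None])


# Hypothetical tree: the "Noise" subtree has a label unknown to the schema.
tree = ("ROOT", [("Patient", ["Alice"]), ("Noise", ["???"])])
cleaned = prune(tree, {"ROOT", "Patient"})
print(cleaned)
```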

verify()

Verify the schema against the meta-grammar.

Return type:

bool

Returns:

True if the schema is valid, False otherwise.

property entities

The set of entities in the schema.

property group_balance_score

Get the balance score of attributes across groups.

The balance metric (B) is derived from the dispersion of attribute counts across groups (their coefficient of variation) and indicates whether the schema is well balanced. A higher value indicates that attributes are distributed more evenly across groups, while a lower value suggests that some groups may be too large (wide) or too small (fragmented).

\[B = 1 - \frac{\sigma(A)}{\mu(A)}\]
Where:
  • \(A\): The set of attribute counts for all groups.

  • \(\mu(A)\): The mean number of attributes per group.

  • \(\sigma(A)\): The standard deviation of attribute counts across groups.

Return type:

float

Returns:

Balance metric (B), a measure of attribute dispersion.
  • \(B \approx 1\): Attributes are evenly distributed.

  • \(B \approx 0\): Significant imbalance; some groups are much larger or smaller than others.
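As a worked illustration of the formula, the sketch below computes \(B = 1 - \sigma(A)/\mu(A)\) from a list of per-group attribute counts. It assumes the population standard deviation; the source does not say whether the library uses the population or the sample variant, and the `balance_score` function is a stand-in, not the library's code.

```python
from statistics import mean, pstdev


def balance_score(attribute_counts):
    """Compute B = 1 - sigma(A) / mu(A) for per-group attribute counts."""
    mu = mean(attribute_counts)
    sigma = pstdev(attribute_counts)  # assumption: population standard deviation
    return 1 - sigma / mu


print(balance_score([3, 3, 3]))  # identical group sizes, so sigma is 0
print(balance_score([2, 4]))     # mu = 3, sigma = 1
```

Note that, with the formula as written, strongly skewed counts such as `[1, 1, 10]` give a coefficient of variation above 1 and therefore a negative value; whether the library clamps such values is not stated here.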

property group_overlap

Get the group overlap ratio as a combined Jaccard index.

The group overlap ratio is computed as the mean of the Jaccard indices over all pairs of groups.

Return type:

float

Returns:

The group overlap ratio as a float value between 0 and 1. A higher value indicates a higher degree of overlap between groups.
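The computation described above can be sketched as follows, assuming each group is represented by its set of entity names. The handling of empty sets and of schemas with fewer than two groups (both return 0.0 here) is an assumption, and the function names are illustrative rather than the library's own.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_overlap(groups):
    """Mean Jaccard index over all unordered pairs of groups."""
    pairs = list(combinations(groups.values(), 2))
    if not pairs:
        return 0.0  # assumption: no pairs means no overlap
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical groups: only "Patient" and "Doctor" share an entity.
groups = {
    "Patient": {"name", "age"},
    "Doctor": {"name", "specialty"},
    "Drug": {"dose"},
}
print(group_overlap(groups))
```

In this example only one of the three pairs overlaps, with a Jaccard index of 1/3, so the mean is 1/9.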

property groups

The set of groups in the schema.

property relations

The set of relations in the schema.