architxt.metrics#

Functions

confidence(dataframe, column)

Compute the functional-dependency (FD) confidence for X -> column in a DataFrame.

dependency_score(dataframe, attributes)

Compute the dependency score of a subset of attributes in a DataFrame.

redundancy_score(dataframe[, tau])

Compute the redundancy score for an entire DataFrame.

Classes

Metrics(forest, *, tau[, decay, metric])

A class to compute various comparison metrics between the original and modified forest states.

class architxt.metrics.Metrics(forest, *, tau, decay=DECAY, metric=DEFAULT_METRIC)[source]#

Bases: object

A class to compute various comparison metrics between the original and modified forest states.

This class is designed to track and measure changes in a forest structure that is modified in-place. It stores the initial state of the forest when instantiated and provides methods to compare the current state with the initial state using various metrics.

Parameters:
  • forest (Collection[Tree]) – The forest to analyze

  • tau (float) – Threshold for subtree similarity when clustering.

  • decay (float) – The similarity decay factor. The higher the value, the more the weight of context decreases with distance.

  • metric (Callable[[Collection[str], Collection[str]], float]) – The metric function used to compute similarity between subtrees.

>>> forest = [tree1, tree2, tree3]
>>> metrics = Metrics(forest, tau=0.7)
>>> # Modify the forest in-place
>>> simplify(forest, tau=0.7)
>>> # Update the metrics object
>>> metrics.update()
>>> # Compare with the initial state
>>> similarity = metrics.cluster_ami()
cluster_ami()[source]#

Compute the Adjusted Mutual Information (AMI) score between original and current clusters.

The AMI score measures the agreement between two clusterings while adjusting for chance. It uses sklearn.metrics.adjusted_mutual_info_score() under the hood.

Greater is better.

Return type:

float

Returns:

Score between -1 and 1, where:
  • 1 indicates perfect agreement

  • 0 indicates random label assignments

  • negative values indicate worse-than-random labeling

cluster_completeness()[source]#

Compute the completeness score between original and current clusters.

Completeness measures if all members of a given class are assigned to the same cluster. It uses sklearn.metrics.completeness_score() under the hood.

Greater is better.

Return type:

float

Returns:

Score between 0 and 1, where:
  • 1 indicates perfect completeness

  • 0 indicates the worst possible completeness

coverage()[source]#

Compute the coverage between initial and current forest states.

Coverage is measured using the jaccard() similarity between the sets of entities in the original and current states.

Greater is better.

Return type:

float

Returns:

Coverage score between 0 and 1, where 1 indicates identical entity sets
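
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal stdlib sketch, with hypothetical entity names for illustration (this is not architxt's own implementation):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets are identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Entity sets before and after an in-place simplification (illustrative names)
original = {'patient', 'dose', 'drug', 'date'}
current = {'patient', 'dose', 'drug'}
jaccard(original, current)  # 3 shared entities / 4 total = 0.75
```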

group_balance_score()[source]#

Get the group balance score.

See: architxt.schema.Schema.group_balance_score()

Return type:

float

group_balance_score_origin()[source]#

Get the origin group balance score.

See: architxt.schema.Schema.group_balance_score()

Return type:

float

group_overlap()[source]#

Get the schema group overlap ratio.

See: architxt.schema.Schema.group_overlap()

Return type:

float

group_overlap_origin()[source]#

Get the origin schema group overlap ratio.

See: architxt.schema.Schema.group_overlap()

Return type:

float

log_to_mlflow(iteration, *, debug=False)[source]#

Log various metrics related to a forest of trees and equivalent subtrees.

Parameters:
  • iteration (int) – The current iteration number for logging.

  • debug (bool) – Whether to enable debug logging.

Return type:

None

num_distinct_type(node_type)[source]#

Get the number of distinct labels in the schema that match the given node type.

Parameters:

node_type (NodeType) – The type to filter by.

Return type:

int

num_nodes()[source]#

Get the total number of nodes in the forest.

Return type:

int

num_non_terminal()[source]#

Get the number of non-terminal nodes in the schema.

Return type:

int

num_productions()[source]#

Get the number of productions in the schema.

Return type:

int

num_productions_origin()[source]#

Get the number of productions in the origin schema.

Return type:

int

num_type(node_type)[source]#

Get the total number of nodes in the forest that match the given node type.

Parameters:

node_type (NodeType) – The type to filter by.

Return type:

int

num_unlabeled_nodes()[source]#

Get the total number of unlabeled nodes in the forest.

Return type:

int

ratio_productions()[source]#

Get the ratio of productions in the schema compared to the origin schema.

Return type:

float

ratio_type(node_type)[source]#

Return the average number of nodes per distinct label for the given node type.

Parameters:

node_type (NodeType) – The type to filter by.

Return type:

float

ratio_unlabeled_nodes()[source]#

Get the ratio of unlabeled nodes in the forest.

Return type:

float

redundancy(*, tau=1.0)[source]#

Compute the redundancy score for the current forest state.

The overall redundancy score measures the fraction of rows that are redundant in at least one subset of attributes that satisfies a functional dependency above a given threshold tau.

Lower is better.

Parameters:

tau (float) – The dependency threshold to determine redundancy (default is 1.0).

Return type:

float

Returns:

Score between 0 and 1, where:
  • 0 indicates no redundancy

  • 1 indicates complete redundancy

update(forest=None)[source]#

Update the internal state of the metrics object.

Parameters:

forest (Optional[Collection[Tree]]) – The forest to compare against. If omitted, the forest given at construction time (modified in-place) is used.

Return type:

None

architxt.metrics.confidence(dataframe, column)[source]#

Compute the functional-dependency (FD) confidence for X -> column in a DataFrame.

\[\mathrm{conf}(X \to column) = \frac{\sum_{x \in \mathrm{dom}(X)} \max_{y \in \mathrm{dom}(Y)} \mathrm{count}(X = x, Y = y)}{N}\]

Where:
  • \(X\) is the set of all other attributes in the DataFrame (the antecedent)

  • \(Y = column\) is the consequent attribute

  • \(\mathrm{dom}(X)\) is the set of all unique value combinations of \(X\)

  • \(N\) is the total number of rows in the DataFrame

Intuitively, for each antecedent combination X=x take the count of the most frequent consequent value y, sum those maxima, and divide by the number of rows. This is the fraction of rows explained by choosing the per-antecedent majority consequent.

Parameters:
  • dataframe (DataFrame) – A Pandas DataFrame containing the data to analyze.

  • column (str) – Name of the consequent column (must be present in dataframe).

Return type:

float

Returns:

FD confidence in [0.0, 1.0]; returns 0.0 for empty dataframe, single-column dataframe, or when column is not present.

>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2]
... })
>>> confidence(data, 'A')
1.0
>>> confidence(data, 'B')
0.8
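
The formula can be sketched in pure Python over a list of row dicts (the library itself operates on pandas DataFrames; `fd_confidence` is a hypothetical name used here for illustration):

```python
from collections import Counter

def fd_confidence(rows: list[dict], column: str) -> float:
    """FD confidence of (all other attributes) -> column, per the formula above."""
    if not rows or column not in rows[0] or len(rows[0]) < 2:
        return 0.0
    groups: dict[tuple, Counter] = {}
    for row in rows:
        # Antecedent X: the row's values on every attribute except `column`
        key = tuple(v for k, v in sorted(row.items()) if k != column)
        groups.setdefault(key, Counter())[row[column]] += 1
    # Sum the majority-consequent count per antecedent, normalise by row count
    return sum(max(c.values()) for c in groups.values()) / len(rows)

rows = [
    {'A': 'x', 'B': 1}, {'A': 'y', 'B': 2}, {'A': 'x', 'B': 1},
    {'A': 'x', 'B': 3}, {'A': 'y', 'B': 2},
]
fd_confidence(rows, 'B')  # 0.8, matching the doctest above
```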
architxt.metrics.dependency_score(dataframe, attributes)[source]#

Compute the dependency score of a subset of attributes in a DataFrame.

The dependency score measures the strength of the functional dependency in the given subset of attributes. It is defined as the maximum confidence score among all attributes in the subset, treating each attribute as a potential consequent of a functional dependency.

Parameters:
  • dataframe (DataFrame) – A Pandas DataFrame containing the data to analyze.

  • attributes (Collection[str]) – A list of attributes to evaluate for functional dependencies.

Return type:

float

Returns:

The maximum confidence score among the given attributes.

>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2]
... })
>>> dependency_score(data, ['A', 'B'])
1.0
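
A pure-Python sketch of the "maximum confidence over candidate consequents" definition, under the assumption that the confidence of each FD is computed within the projected subset only (`dependency_score` here re-derives the documented behaviour, not the library's code):

```python
from collections import Counter

def dependency_score(rows: list[dict], attributes: list[str]) -> float:
    """Max FD confidence over the subset, trying each attribute as consequent."""
    def confidence(consequent: str) -> float:
        antecedent = [a for a in attributes if a != consequent]
        groups: dict[tuple, Counter] = {}
        for row in rows:
            key = tuple(row[a] for a in antecedent)
            groups.setdefault(key, Counter())[row[consequent]] += 1
        return sum(max(c.values()) for c in groups.values()) / len(rows)
    return max(confidence(a) for a in attributes)

rows = [
    {'A': 'x', 'B': 1}, {'A': 'y', 'B': 2}, {'A': 'x', 'B': 1},
    {'A': 'x', 'B': 3}, {'A': 'y', 'B': 2},
]
dependency_score(rows, ['A', 'B'])  # max(conf(A -> B) = 0.8, conf(B -> A) = 1.0) = 1.0
```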
architxt.metrics.redundancy_score(dataframe, tau=1.0)[source]#

Compute the redundancy score for an entire DataFrame.

The overall redundancy score measures the fraction of rows that are redundant in at least one subset of attributes that satisfies a functional dependency above a given threshold tau.

Parameters:
  • dataframe (DataFrame) – A Pandas DataFrame containing the data to analyze.

  • tau (float) – The dependency threshold to determine redundancy (default is 1.0).

Return type:

float

Returns:

The proportion of redundant rows in the dataset.

>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2]
... })
>>> redundancy_score(data)
0.8
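
Assuming attribute subsets of size two or more are enumerated and a row counts as redundant when its value combination is duplicated within a qualifying subset (consistent with the example above, though the library's actual enumeration may differ), the score can be sketched as:

```python
from collections import Counter
from itertools import combinations

def dependency_score(rows: list[dict], attributes: tuple) -> float:
    """Max FD confidence over the subset, trying each attribute as consequent."""
    def confidence(consequent) -> float:
        antecedent = [a for a in attributes if a != consequent]
        groups: dict[tuple, Counter] = {}
        for row in rows:
            key = tuple(row[a] for a in antecedent)
            groups.setdefault(key, Counter())[row[consequent]] += 1
        return sum(max(c.values()) for c in groups.values()) / len(rows)
    return max(confidence(a) for a in attributes)

def redundancy_score(rows: list[dict], tau: float = 1.0) -> float:
    """Fraction of rows duplicated in some subset with dependency score >= tau."""
    columns = sorted(rows[0]) if rows else []
    redundant: set[int] = set()
    for size in range(2, len(columns) + 1):
        for subset in combinations(columns, size):
            if dependency_score(rows, subset) < tau:
                continue
            counts = Counter(tuple(r[a] for a in subset) for r in rows)
            for i, r in enumerate(rows):
                if counts[tuple(r[a] for a in subset)] > 1:
                    redundant.add(i)
    return len(redundant) / len(rows) if rows else 0.0

rows = [
    {'A': 'x', 'B': 1}, {'A': 'y', 'B': 2}, {'A': 'x', 'B': 1},
    {'A': 'x', 'B': 3}, {'A': 'y', 'B': 2},
]
redundancy_score(rows)  # 0.8: four of five rows are exact duplicates within {A, B}
```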