architxt.metrics#

Functions

confidence(dataframe, column)

Compute the confidence score of the functional dependency X -> column in a DataFrame.

dependency_score(dataframe, attributes)

Compute the dependency score of a subset of attributes in a DataFrame.

redundancy_score(dataframe[, tau])

Compute the redundancy score for an entire DataFrame.

Classes

Metrics(source, destination)

class architxt.metrics.Metrics(source, destination)[source]#

Bases: object

cluster_ami(*, tau, metric=DEFAULT_METRIC)[source]#

Compute the Adjusted Mutual Information (AMI) score between source and destination clusters.

The AMI score measures agreement while adjusting for random chance. It uses sklearn.metrics.adjusted_mutual_info_score() under the hood.

Higher is better.

Parameters:
Return type:

float

Returns:

The AMI score between the source and destination clusters.
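Since the method defers to scikit-learn, the AMI computation itself can be illustrated directly. The labels below are illustrative stand-ins for cluster assignments; they are not produced by Metrics:

```python
from sklearn.metrics import adjusted_mutual_info_score

# Cluster labels for the same items under two clusterings.
source_labels = [0, 0, 1, 1, 2]
destination_labels = [1, 1, 0, 0, 2]

# AMI is invariant to label permutation and adjusted for chance:
# identical partitions score 1.0 even though the label names differ.
score = adjusted_mutual_info_score(source_labels, destination_labels)
# score == 1.0
```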

cluster_completeness(*, tau, metric=DEFAULT_METRIC)[source]#

Compute the completeness score between source and destination clusters.

The completeness score measures whether all members of a given class are assigned to the same cluster. It uses sklearn.metrics.completeness_score() under the hood.

Higher is better.

Parameters:
Return type:

float

Returns:

The completeness score between the source and destination clusters.

coverage()[source]#
Return type:

float

edit_distance()[source]#

Compute the total edit distance between corresponding source and destination trees.

The method calculates the edit distance for each pair of source and destination trees using the APTED algorithm. The total edit distance is obtained by summing up the individual distances across all pairs of trees.

Lower is better.

Return type:

int

Returns:

The total edit distance computed across all source and destination tree pairs.

redundancy(*, tau=1.0)[source]#

Compute the redundancy score for the entire instance.

The overall redundancy score measures the fraction of rows that are redundant in at least one subset of attributes that satisfies a functional dependency above a given threshold tau.

Lower is better.

Parameters:

tau (float) – The dependency threshold to determine redundancy (default is 1.0).

Return type:

float

Returns:

The proportion of redundant rows in the dataset.

similarity(*, metric=DEFAULT_METRIC)[source]#

Compute the similarity between the source and destination trees.

It uses the specified metric function and returns the average similarity score over all tree pairs.

Higher is better.

Parameters:

metric (Callable[[Collection[str], Collection[str]], float]) – The similarity metric function used to compute the similarity between subtrees.

Return type:

float

Returns:

The average similarity score for all tree pairs in source and destination forests.

architxt.metrics.confidence(dataframe, column)[source]#

Compute the confidence score of the functional dependency X -> column in a DataFrame.

The confidence score quantifies the strength of the association rule X -> column, where X represents the set of all other attributes in the DataFrame. It is computed as the median of the confidence scores across all instantiated association rules.

The confidence of each instantiated rule is calculated as the ratio of the consequent support (i.e., the count of each unique value in the specified column) to the antecedent support (i.e., the count of unique combinations of all other columns). A higher confidence score indicates a stronger dependency between the attributes.

Parameters:
  • dataframe (DataFrame) – A pandas DataFrame containing the data to analyze.

  • column (str) – The column for which to compute the confidence score.

Return type:

float

Returns:

The median confidence score or 0.0 if the data is empty.

>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2]
... })
>>> confidence(data, 'A')
1.0
>>> confidence(data, 'B')
0.6666666666666666
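The definition above can be sketched with pandas. This is a minimal reimplementation of the documented behaviour, not the library's actual code:

```python
import pandas as pd

def confidence_sketch(df: pd.DataFrame, column: str) -> float:
    """Median confidence of the instantiated rules X -> column."""
    if df.empty or len(df.columns) < 2:
        return 0.0
    antecedent = [c for c in df.columns if c != column]
    # value_counts(normalize=True) within each antecedent group yields the
    # confidence of each instantiated association rule.
    confidences = df.groupby(antecedent)[column].value_counts(normalize=True)
    return float(confidences.median())

data = pd.DataFrame({'A': ['x', 'y', 'x', 'x', 'y'], 'B': [1, 2, 1, 3, 2]})
confidence_sketch(data, 'A')  # 1.0
confidence_sketch(data, 'B')  # 0.6666666666666666
```

For column 'B', the instantiated rules are A=x→B=1 (2/3), A=x→B=3 (1/3), and A=y→B=2 (1.0); the median of those three confidences matches the doctest value above.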
architxt.metrics.dependency_score(dataframe, attributes)[source]#

Compute the dependency score of a subset of attributes in a DataFrame.

The dependency score measures the strength of the functional dependency in the given subset of attributes. It is defined as the maximum confidence score among all attributes in the subset, treating each attribute as a potential consequent of a functional dependency.

Parameters:
  • dataframe (DataFrame) – A pandas DataFrame containing the data to analyze.

  • attributes (Collection[str]) – A list of attributes to evaluate for functional dependencies.

Return type:

float

Returns:

The maximum confidence score among the given attributes.

>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2]
... })
>>> dependency_score(data, ['A', 'B'])
1.0
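The max-of-confidences definition can likewise be sketched with pandas; again a minimal reimplementation of the documented behaviour, not the library's actual code:

```python
import pandas as pd

def dependency_score_sketch(df: pd.DataFrame, attributes) -> float:
    """Maximum confidence over each attribute treated as the consequent."""
    def conf(column: str) -> float:
        antecedent = [c for c in attributes if c != column]
        # Per-group value proportions = confidence of each instantiated rule.
        confidences = df.groupby(antecedent)[column].value_counts(normalize=True)
        return float(confidences.median())

    return max(conf(column) for column in attributes)

data = pd.DataFrame({'A': ['x', 'y', 'x', 'x', 'y'], 'B': [1, 2, 1, 3, 2]})
dependency_score_sketch(data, ['A', 'B'])  # 1.0
```

Here B→A holds exactly (confidence 1.0) while A→B does not (confidence 2/3), so the maximum is 1.0.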
architxt.metrics.redundancy_score(dataframe, tau=1.0)[source]#

Compute the redundancy score for an entire DataFrame.

The overall redundancy score measures the fraction of rows that are redundant in at least one subset of attributes that satisfies a functional dependency above a given threshold tau.

Parameters:
  • dataframe (DataFrame) – A pandas DataFrame containing the data to analyze.

  • tau (float) – The dependency threshold to determine redundancy (default is 1.0).

Return type:

float

Returns:

The proportion of redundant rows in the dataset.

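One plausible reading of this definition can be sketched as follows: enumerate attribute subsets, keep those whose dependency score reaches tau, and count a row as redundant if its projection onto such a subset is duplicated. The exact subset enumeration and redundancy criterion used by architxt.metrics.redundancy_score may differ:

```python
from itertools import combinations

import pandas as pd

def _confidence(df: pd.DataFrame, column: str) -> float:
    antecedent = [c for c in df.columns if c != column]
    confidences = df.groupby(antecedent)[column].value_counts(normalize=True)
    return float(confidences.median())

def _dependency_score(df: pd.DataFrame, attributes) -> float:
    return max(_confidence(df[list(attributes)], col) for col in attributes)

def redundancy_sketch(df: pd.DataFrame, tau: float = 1.0) -> float:
    """Fraction of rows redundant in at least one qualifying subset."""
    redundant = pd.Series(False, index=df.index)
    for size in range(2, len(df.columns) + 1):
        for subset in combinations(df.columns, size):
            if _dependency_score(df, subset) >= tau:
                # Rows whose projection onto this subset is duplicated.
                redundant |= df.duplicated(subset=list(subset), keep=False)
    return float(redundant.mean())

data = pd.DataFrame({'A': ['x', 'y', 'x', 'x', 'y'], 'B': [1, 2, 1, 3, 2]})
redundancy_sketch(data)  # 0.8
```

With this data, the only subset {'A', 'B'} has dependency score 1.0, and four of the five rows, (x, 1) twice and (y, 2) twice, are duplicated on it, giving 0.8.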