architxt.metrics
Functions

- confidence(dataframe, column) – Compute the confidence score of the functional dependency X -> column in a DataFrame.
- dependency_score(dataframe, attributes) – Compute the dependency score of a subset of attributes in a DataFrame.
- redundancy_score(dataframe, tau=1.0) – Compute the redundancy score for an entire DataFrame.

Classes

- Metrics(forest, *, tau, metric=DEFAULT_METRIC) – A class to compute various comparison metrics between the original and modified forest states.
- class architxt.metrics.Metrics(forest, *, tau, metric=DEFAULT_METRIC)[source]
Bases: object
A class to compute various comparison metrics between the original and modified forest states.
This class is designed to track and measure changes in a forest structure that is modified in-place. It stores the initial state of the forest when instantiated and provides methods to compare the current state with the initial state using various metrics.
- Parameters:
  - forest (Collection[Tree]) – The forest to analyze.
  - tau (float) – Threshold for subtree similarity when clustering.
  - metric (Callable[[Collection[str], Collection[str]], float]) – The metric function used to compute similarity between subtrees.
>>> forest = [tree1, tree2, tree3]
>>> metrics = Metrics(forest, tau=0.7)
>>> simplify(forest, tau=0.7)  # Modify the forest in-place
>>> metrics.update()  # Update the metrics object
>>> similarity = metrics.cluster_ami()  # Compare with the initial state
- cluster_ami()[source]
Compute the Adjusted Mutual Information (AMI) score between original and current clusters.
The AMI score measures the agreement between two clusterings while adjusting for chance. It uses sklearn.metrics.adjusted_mutual_info_score() under the hood. Greater is better.
- Return type: float
- Returns:
Score between -1 and 1, where:
- 1 indicates perfect agreement
- 0 indicates random label assignments
- negative values indicate worse than random labeling
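For intuition, here is the underlying sklearn scorer on two hypothetical label vectors (not produced by Metrics): the second labeling is the same partition under renamed labels, so the AMI is exactly 1.

>>> from sklearn.metrics import adjusted_mutual_info_score
>>> adjusted_mutual_info_score([0, 0, 1, 1, 2], [1, 1, 0, 0, 2])
1.0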
- cluster_completeness()[source]
Compute the completeness score between original and current clusters.
Completeness measures whether all members of a given class are assigned to the same cluster. It uses sklearn.metrics.completeness_score() under the hood. Greater is better.
- Return type: float
- Returns:
Score between 0 and 1, where:
- 1 indicates perfect completeness
- 0 indicates worst possible completeness
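Again for intuition, the underlying sklearn scorer on hypothetical labels; these two calls mirror the examples from the sklearn documentation:

>>> from sklearn.metrics import completeness_score
>>> completeness_score([0, 0, 1, 1], [0, 0, 0, 0])  # each class stays within one cluster
1.0
>>> completeness_score([0, 0, 1, 1], [0, 1, 0, 1])  # each class is split across clusters
0.0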
- coverage()[source]
Compute the coverage between initial and current forest states.
Coverage is measured using the jaccard() similarity between the sets of entities in the original and current states. Greater is better.
- Return type: float
- Returns:
Coverage score between 0 and 1, where 1 indicates identical entity sets
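The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. A quick illustration on hypothetical entity sets:

>>> a, b = {'Patient', 'Drug', 'Dose'}, {'Patient', 'Drug'}
>>> len(a & b) / len(a | b)
0.6666666666666666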
- group_balance_score()[source]
Get the group balance score.
See: architxt.schema.Schema.group_balance_score()
- Return type: float
- group_balance_score_origin()[source]
Get the group balance score of the original schema.
See: architxt.schema.Schema.group_balance_score()
- Return type: float
- group_overlap()[source]
Get the schema group overlap ratio.
See: architxt.schema.Schema.group_overlap()
- Return type: float
- group_overlap_origin()[source]
Get the group overlap ratio of the original schema.
See: architxt.schema.Schema.group_overlap()
- Return type: float
- log_to_mlflow(iteration, *, debug=False)[source]
Log various metrics related to a forest of trees and equivalent subtrees.
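A sketch of one way this might be used; mlflow.start_run() is the standard MLflow entry point, while the loop structure and the assumption that an active run is expected are hypothetical:

import mlflow

with mlflow.start_run():
    for iteration in range(10):
        ...  # modify the forest in-place, then call metrics.update()
        metrics.log_to_mlflow(iteration)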
- num_distinct_type(node_type)[source]
Get the number of distinct labels in the schema that match the given node type.
- num_type(node_type)[source]
Get the total number of nodes in the forest that match the given node type.
- ratio_productions()[source]
Get the ratio of productions in the current schema compared to the original schema.
- Return type: float
- ratio_type(node_type)[source]
Return the average number of nodes per distinct label for the given node type.
- architxt.metrics.confidence(dataframe, column)[source]
Compute the confidence score of the functional dependency X -> column in a DataFrame.
The confidence score quantifies the strength of the association rule X -> column, where X represents the set of all other attributes in the DataFrame. It is computed as the median of the confidence scores across all instantiated association rules.
The confidence of each instantiated rule is calculated as the ratio of the consequent support (i.e., the count of each unique value in the specified column) to the antecedent support (i.e., the count of unique combinations of all other columns). A higher confidence score indicates a stronger dependency between the attributes.
- Parameters:
  - dataframe (DataFrame) – A pandas DataFrame containing the data to analyze.
  - column (str) – The column to treat as the consequent of the functional dependency.
- Return type: float
- Returns:
The median confidence score, or 0.0 if the data is empty.
>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2],
... })
>>> confidence(data, 'A')
1.0
>>> confidence(data, 'B')
0.6666666666666666
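For intuition only, a minimal pandas sketch that reproduces the doctest above; confidence_sketch is a hypothetical name, and this is one plausible reading of the definition rather than the actual implementation:

import pandas as pd

def confidence_sketch(df: pd.DataFrame, column: str) -> float:
    antecedent = [col for col in df.columns if col != column]
    if df.empty or not antecedent:
        return 0.0
    # Within each antecedent instantiation, value_counts(normalize=True)
    # yields supp(rule) / supp(antecedent) for every consequent value,
    # i.e. the confidence of each instantiated rule.
    confidences = df.groupby(antecedent)[column].value_counts(normalize=True)
    return float(confidences.median())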
- architxt.metrics.dependency_score(dataframe, attributes)[source]
Compute the dependency score of a subset of attributes in a DataFrame.
The dependency score measures the strength of the functional dependency in the given subset of attributes. It is defined as the maximum confidence score among all attributes in the subset, treating each attribute as a potential consequent of a functional dependency.
- Parameters:
dataframe (
DataFrame
) – A pandas DataFrame containing the data to analyze.attributes (
Collection
[str
]) – A list of attributes to evaluate for functional dependencies.
- Return type: float
- Returns:
The maximum confidence score among the given attributes.
>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2],
... })
>>> dependency_score(data, ['A', 'B'])
1.0
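Given the definition above, the score reduces to a maximum over per-attribute confidence scores. A sketch reusing the hypothetical confidence_sketch from the previous section:

def dependency_score_sketch(df: pd.DataFrame, attributes: list[str]) -> float:
    # Project onto the subset, then treat each attribute in turn as the consequent.
    subset = df[list(attributes)]
    return max(confidence_sketch(subset, attr) for attr in attributes)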
- architxt.metrics.redundancy_score(dataframe, tau=1.0)[source]
Compute the redundancy score for an entire DataFrame.
The overall redundancy score measures the fraction of rows that are redundant in at least one subset of attributes that satisfies a functional dependency above a given threshold tau.
- Parameters:
  - dataframe (DataFrame) – A pandas DataFrame containing the data to analyze.
  - tau (float) – Threshold above which a functional dependency is considered to hold.
- Return type: float
- Returns:
The proportion of redundant rows in the dataset.
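A heavily hedged sketch of one possible reading of this definition; enumerating attribute subsets of size two or more, and counting a row as redundant when it is duplicated within a qualifying projection, are assumptions rather than confirmed implementation details:

from itertools import combinations

import pandas as pd

def redundancy_score_sketch(df: pd.DataFrame, tau: float = 1.0) -> float:
    if df.empty:
        return 0.0
    redundant = pd.Series(False, index=df.index)
    for size in range(2, len(df.columns) + 1):  # assumed subset granularity
        for subset in combinations(df.columns, size):
            if dependency_score_sketch(df, list(subset)) >= tau:
                # Mark rows whose projection onto the subset is duplicated.
                redundant |= df.duplicated(subset=list(subset), keep=False)
    return float(redundant.mean())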