architxt.metrics
Functions
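- confidence(dataframe, column): Compute the confidence score of the functional dependency X -> column in a DataFrame.
- dependency_score(dataframe, attributes): Compute the dependency score of a subset of attributes in a DataFrame.
- redundancy_score(dataframe, tau=1.0): Compute the redundancy score for an entire DataFrame.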
Classes
- class architxt.metrics.Metrics(source, destination)
Bases: object
- cluster_ami(*, tau, metric=DEFAULT_METRIC)
Compute the Adjusted Mutual Information (AMI) score between source and destination clusters.
The AMI score measures the agreement between the two clusterings while adjusting for the agreement expected by chance. It uses sklearn.metrics.adjusted_mutual_info_score() under the hood.
Greater is better.
- Parameters:
tau (float) – The similarity threshold for clustering.
metric (Callable[[Collection[str], Collection[str]], float]) – The similarity metric function used to compute the similarity between subtrees.
- Return type:
- Returns:
The AMI score between the source and destination clusters.
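To illustrate what the underlying scikit-learn call measures, the following minimal sketch applies `adjusted_mutual_info_score` to hand-written label lists. The label lists are hypothetical stand-ins for the cluster assignments of source and destination subtrees; how `Metrics` derives those assignments from `tau` and `metric` is not shown here.

```python
from sklearn.metrics import adjusted_mutual_info_score

# Hypothetical cluster labels for six subtrees in the source forest.
source_labels = [0, 0, 1, 1, 2, 2]
# Same partition with renamed cluster ids: AMI ignores the label names.
renamed_labels = [2, 2, 0, 0, 1, 1]
# A coarser partition that merges two of the source clusters.
merged_labels = [0, 0, 0, 0, 1, 1]

ami_identical = adjusted_mutual_info_score(source_labels, renamed_labels)
ami_merged = adjusted_mutual_info_score(source_labels, merged_labels)

print(ami_identical)  # identical partitions score 1.0
print(ami_merged)     # a different partition scores lower
```

Because the score is invariant to how clusters are named, only the grouping of subtrees matters, not the identifiers assigned to each cluster.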
- cluster_completeness(*, tau, metric=DEFAULT_METRIC)
Compute the completeness score between source and destination clusters.
The completeness score measures the extent to which all members of a given reference cluster are assigned to the same cluster in the other clustering. It uses sklearn.metrics.completeness_score() under the hood.
Greater is better.
- Parameters:
tau (float) – The similarity threshold for clustering.
metric (Callable[[Collection[str], Collection[str]], float]) – The similarity metric function used to compute the similarity between subtrees.
- Return type:
- Returns:
The completeness score between the source and destination clusters.
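The behaviour of the underlying scikit-learn function can be sketched on toy labels. The lists below are hypothetical, and which of the source or destination clustering `Metrics` passes as the reference is an assumption not confirmed by this page.

```python
from sklearn.metrics import completeness_score

# Hypothetical reference clustering of four subtrees.
reference = [0, 0, 1, 1]
# Merging clusters keeps every reference cluster in one piece: still complete.
merged = [0, 0, 0, 0]
# Splitting the first reference cluster across two clusters lowers completeness.
split = [0, 1, 2, 2]

complete = completeness_score(reference, merged)
incomplete = completeness_score(reference, split)

print(complete)    # 1.0
print(incomplete)  # strictly below 1.0
```

Completeness alone does not penalise merging distinct clusters, which is one reason to read it alongside the AMI score.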
- edit_distance()
Compute the total edit distance between corresponding source and destination trees.
The method calculates the edit distance for each pair of source and destination trees using the APTED algorithm. The total edit distance is obtained by summing up the individual distances across all pairs of trees.
Lower is better.
- Return type:
- Returns:
The total edit distance computed across all source and destination tree pairs.
- redundancy(*, tau=1.0)
Compute the redundancy score for the entire instance.
The overall redundancy score measures the fraction of rows that are redundant in at least one subset of attributes that satisfies a functional dependency above a given threshold tau.
Lower is better.
- similarity(*, metric=DEFAULT_METRIC)
Compute the similarity between the source and destination trees.
It uses the specified metric function to return the average similarity score.
Higher is better.
- Parameters:
metric (Callable[[Collection[str], Collection[str]], float]) – The similarity metric function used to compute the similarity between subtrees.
- Return type:
- Returns:
The average similarity score for all tree pairs in source and destination forests.
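As a rough illustration of the expected shape of `metric`, the sketch below defines a Jaccard index with the `Callable[[Collection[str], Collection[str]], float]` signature and averages it over tree pairs. The `jaccard` function and the label collections are hypothetical; this page does not say what `DEFAULT_METRIC` is or how labels are extracted from trees.

```python
from collections.abc import Collection


def jaccard(x: Collection[str], y: Collection[str]) -> float:
    """Hypothetical metric: Jaccard index of two label collections."""
    x_set, y_set = set(x), set(y)
    if not x_set and not y_set:
        return 1.0  # two empty collections are treated as identical
    return len(x_set & y_set) / len(x_set | y_set)


# Hypothetical label collections for paired source and destination trees.
pairs = [
    (["person", "name", "age"], ["person", "name", "age"]),
    (["order", "date", "total"], ["order", "date"]),
]

# Average similarity across all pairs.
average = sum(jaccard(src, dst) for src, dst in pairs) / len(pairs)
print(average)
```

Any function with the same signature that returns a score in a consistent range could presumably be supplied in the same way.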
- architxt.metrics.confidence(dataframe, column)
Compute the confidence score of the functional dependency X -> column in a DataFrame.
The confidence score quantifies the strength of the association rule X -> column, where X represents the set of all other attributes in the DataFrame. It is computed as the median of the confidence scores across all instantiated association rules.
The confidence of each instantiated rule is calculated as the ratio of the consequent support (i.e., the count of each unique value in the specified column) to the antecedent support (i.e., the count of unique combinations of all other columns). A higher confidence score indicates a stronger dependency between the attributes.
- Parameters:
- Return type:
- Returns:
The median confidence score, or 0.0 if the data is empty.
>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2]
... })
>>> confidence(data, 'A')
1.0
>>> confidence(data, 'B')
0.6666666666666666
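The computation described above can be sketched with plain pandas. This is an illustrative re-derivation, not the architxt implementation: `confidence_sketch` and the helper column names are hypothetical, and it assumes each instantiated rule's confidence is the count of a full (X, column) combination divided by the count of its X combination. Under that assumption it reproduces the two values in the example above.

```python
import pandas as pd


def confidence_sketch(dataframe: pd.DataFrame, column: str) -> float:
    """Median confidence of X -> column (assumes at least two columns)."""
    if dataframe.empty:
        return 0.0
    antecedent = [c for c in dataframe.columns if c != column]
    # Support of each instantiated rule: rows sharing the same (X, column) values.
    counts = (
        dataframe.groupby(list(dataframe.columns))
        .size()
        .reset_index(name="rule_support")
    )
    # Support of each antecedent: rows sharing the same X values.
    counts["antecedent_support"] = (
        counts.groupby(antecedent)["rule_support"].transform("sum")
    )
    ratios = counts["rule_support"] / counts["antecedent_support"]
    return float(ratios.median())


data = pd.DataFrame({
    "A": ["x", "y", "x", "x", "y"],
    "B": [1, 2, 1, 3, 2],
})
conf_a = confidence_sketch(data, "A")  # every value of B maps to a single A
conf_b = confidence_sketch(data, "B")  # A = 'x' maps to B = 1 twice and B = 3 once
print(conf_a, conf_b)
```

For `B`, the three instantiated rules have confidences 2/3, 1/3 and 1, so the median is 2/3.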
- architxt.metrics.dependency_score(dataframe, attributes)
Compute the dependency score of a subset of attributes in a DataFrame.
The dependency score measures the strength of the functional dependency in the given subset of attributes. It is defined as the maximum confidence score among all attributes in the subset, treating each attribute as a potential consequent of a functional dependency.
- Parameters:
dataframe (DataFrame) – A pandas DataFrame containing the data to analyze.
attributes (Collection[str]) – A list of attributes to evaluate for functional dependencies.
- Return type:
- Returns:
The maximum confidence score among the given attributes.
>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2]
... })
>>> dependency_score(data, ['A', 'B'])
1.0
- architxt.metrics.redundancy_score(dataframe, tau=1.0)
Compute the redundancy score for an entire DataFrame.
The overall redundancy score measures the fraction of rows that are redundant in at least one subset of attributes that satisfies a functional dependency above a given threshold tau.
- Parameters:
- Return type:
- Returns:
The proportion of redundant rows in the dataset.
>>> data = pd.DataFrame({
...     'A': ['x', 'y', 'x', 'x', 'y'],
...     'B': [1, 2, 1, 3, 2]
... })
>>> dependency_score(data, ['A', 'B'])
1.0