architxt.similarity
Functions
- Jaro-Winkler similarity.
- compute_dist_matrix – Compute the condensed distance matrix for a collection of subtrees.
- entity_labels – Process the given forest to assign labels to entities based on clustering of their ancestor.
- equiv_cluster – Cluster subtrees of a given tree based on their similarity.
- get_equiv_of – Get the cluster containing the specified tree t based on similarity comparisons with the given set of clusters.
- jaccard – Jaccard similarity.
- Jaro-Winkler similarity.
- Levenshtein similarity.
- sim – Determine whether the similarity between two tree objects exceeds a given threshold tau.
- similarity – Compute the similarity between two tree objects based on their entity labels and context.
- architxt.similarity.compute_dist_matrix(subtrees, *, metric)
Compute the condensed distance matrix for a collection of subtrees.
This function computes pairwise distances between all subtrees and stores the results in a condensed distance matrix format (1D array), which is suitable for hierarchical clustering.
The computation is sequential.
- Parameters:
subtrees (Collection[Tree]) – A list of subtrees for which pairwise distances will be calculated.
metric (Callable[[Collection[str], Collection[str]], float]) – A callable similarity metric to compute the similarity between the two trees.
- Return type:
- Returns:
A 1D numpy array containing the condensed distance matrix (only a triangle of the full matrix).
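The condensed layout described above can be sketched without any dependency. This is an illustration only, not the library's implementation: `jaccard_sets`, `condensed_distances`, and `condensed_index` are hypothetical names, plain label lists stand in for `Tree` objects, a Python list stands in for the numpy array, and the conversion `distance = 1 - similarity` is an assumption.

```python
from itertools import combinations
from typing import Callable, Collection, List, Sequence


def jaccard_sets(x: Collection[str], y: Collection[str]) -> float:
    """Hypothetical stand-in metric: Jaccard similarity of two label collections."""
    a, b = set(x), set(y)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def condensed_distances(
    items: Sequence[Collection[str]],
    metric: Callable[[Collection[str], Collection[str]], float],
) -> List[float]:
    """Return the upper triangle of the distance matrix, row by row.

    Assumption: a distance is obtained as 1 - similarity.
    """
    return [1.0 - metric(a, b) for a, b in combinations(items, 2)]


def condensed_index(n: int, i: int, j: int) -> int:
    """Position of the pair (i, j), with i < j, in a condensed matrix of n items."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)


labels = [["person", "fruit"], ["person", "animal"], ["person", "fruit"]]
dists = condensed_distances(labels, jaccard_sets)

# Three items give 3 * 2 / 2 = 3 pairwise distances.
print(len(dists))
# Items 0 and 2 carry the same labels, so their distance is zero.
print(dists[condensed_index(len(labels), 0, 2)])
```

Storing only one triangle works because the matrix is symmetric with a zero diagonal, so n items need n * (n - 1) / 2 entries instead of n * n.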
- architxt.similarity.entity_labels(forest, *, tau, metric=DEFAULT_METRIC)
Process the given forest to assign labels to entities based on clustering of their ancestor.
- Parameters:
forest (Collection[Tree]) – The forest from which to extract and cluster entities.
tau (float) – The similarity threshold for clustering.
metric (Optional[Callable[[Collection[str], Collection[str]], float]]) – The similarity metric function used to compute the similarity between subtrees. If None, use the parent label as the equivalence class.
- Return type:
- Returns:
A dictionary mapping entities to their respective cluster IDs.
- architxt.similarity.equiv_cluster(trees, *, tau, metric=DEFAULT_METRIC, _all_subtrees=True, _step=None)
Cluster subtrees of a given tree based on their similarity.
The clusters are created by applying a distance threshold tau to the linkage matrix which is derived from pairwise subtree similarity calculations. Subtrees that are similar enough (based on tau and the metric) are grouped into clusters. Each cluster is represented as a tuple of subtrees.
- Parameters:
trees (Collection[Tree]) – The forest from which to extract and cluster subtrees.
tau (float) – The similarity threshold for clustering.
metric (Callable[[Collection[str], Collection[str]], float]) – The similarity metric function used to compute the similarity between subtrees.
- Return type:
- Returns:
A set of tuples, where each tuple represents a cluster of subtrees that meet the similarity threshold.
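The idea of grouping items under a threshold can be illustrated as follows. This sketch deliberately swaps in a simpler technique: it merges any pair whose similarity reaches tau using a union-find structure (equivalent to single-linkage grouping), instead of cutting a hierarchical linkage matrix as described above. `threshold_clusters` and `jaccard_sets` are hypothetical names, and tuples of labels stand in for subtrees.

```python
from itertools import combinations
from typing import Callable, Collection, Dict, List, Sequence, Set, Tuple

Labels = Tuple[str, ...]


def jaccard_sets(x: Collection[str], y: Collection[str]) -> float:
    """Hypothetical stand-in metric: Jaccard similarity of two label collections."""
    a, b = set(x), set(y)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def threshold_clusters(
    items: Sequence[Labels],
    tau: float,
    metric: Callable[[Collection[str], Collection[str]], float],
) -> Set[Tuple[Labels, ...]]:
    """Group items so that any pair with similarity >= tau shares a cluster."""
    parent = list(range(len(items)))

    def find(i: int) -> int:
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if metric(items[i], items[j]) >= tau:
            parent[find(j)] = find(i)

    groups: Dict[int, List[Labels]] = {}
    for idx, item in enumerate(items):
        groups.setdefault(find(idx), []).append(item)
    # Each cluster is a tuple, and the result is a set of such tuples.
    return {tuple(group) for group in groups.values()}


items = [
    ("person", "fruit"),
    ("person", "fruit", "animal"),
    ("city", "country"),
    ("city",),
]
clusters = threshold_clusters(items, tau=0.5, metric=jaccard_sets)
print(len(clusters))
```

With tau=0.5 the first two items end up together, as do the last two. A stricter threshold such as 0.9 leaves every item in its own cluster.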
- architxt.similarity.get_equiv_of(t, equiv_subtrees, *, tau, metric=DEFAULT_METRIC)
Get the cluster containing the specified tree t based on similarity comparisons with the given set of clusters.
The clusters are assessed using the provided similarity metric and threshold tau.
- Parameters:
t (Tree) – The tree for which to find the matching cluster.
equiv_subtrees (set[tuple[Tree, …]]) – The set of equivalent subtrees.
tau (float) – The similarity threshold for clustering.
metric (Callable[[Collection[str], Collection[str]], float]) – The similarity metric function used to compute the similarity between subtrees.
- Return type:
- Returns:
A tuple representing the cluster of subtrees that meet the similarity threshold.
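A cluster lookup of this kind might look like the sketch below. It is not the library's code: `find_cluster` and `jaccard_sets` are hypothetical names, tuples of labels stand in for trees, and both the matching rule (any member with similarity at or above tau) and the empty-tuple result when nothing matches are assumptions.

```python
from typing import Callable, Collection, Set, Tuple

Labels = Tuple[str, ...]


def jaccard_sets(x: Collection[str], y: Collection[str]) -> float:
    """Hypothetical stand-in metric: Jaccard similarity of two label collections."""
    a, b = set(x), set(y)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_cluster(
    item: Labels,
    clusters: Set[Tuple[Labels, ...]],
    tau: float,
    metric: Callable[[Collection[str], Collection[str]], float],
) -> Tuple[Labels, ...]:
    """Return a cluster holding a member similar enough to item, else ()."""
    for cluster in clusters:
        # Assumption: one sufficiently similar member is enough to match.
        if any(metric(item, member) >= tau for member in cluster):
            return cluster
    # Assumption: no match is reported as an empty tuple.
    return ()


people = (("person", "fruit"), ("person", "fruit", "animal"))
places = (("city", "country"),)
known_clusters = {people, places}

match = find_cluster(("person", "animal"), known_clusters, 0.5, jaccard_sets)
no_match = find_cluster(("river",), known_clusters, 0.5, jaccard_sets)
print(match == people, no_match == ())
```

Because a set has no defined iteration order, an item that is close to several clusters could match any of them in this sketch; the example avoids that case.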
- architxt.similarity.jaccard(x, y)
Jaccard similarity.
- Parameters:
x (Collection[str]) – The first sequence of strings.
y (Collection[str]) – The second sequence of strings.
- Return type:
float
- Returns:
The Jaccard similarity as a float between 0 and 1, where 1 means identical sequences.
>>> jaccard({"A", "B"}, {"A", "B", "C"})
0.6666666666666666
>>> jaccard({"apple", "banana", "cherry"}, {"apple", "cherry", "date"})
0.5
>>> jaccard(set(), set())
1.0
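For reference, the textbook Jaccard index is the size of the intersection divided by the size of the union. The sketch below follows that definition and takes the convention for two empty inputs from the example above. `jaccard_sketch` is a hypothetical name, and whether the library converts its inputs to sets in the same way is an assumption.

```python
from typing import Collection


def jaccard_sketch(x: Collection[str], y: Collection[str]) -> float:
    """Size of the intersection over size of the union, on distinct elements."""
    a, b = set(x), set(y)
    if not a and not b:
        # Two empty collections are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two shared elements out of three distinct ones.
print(jaccard_sketch({"A", "B"}, {"A", "B", "C"}))
# Two shared elements out of four distinct ones.
print(jaccard_sketch({"apple", "banana", "cherry"}, {"apple", "cherry", "date"}))
```

Since the inputs are reduced to sets here, element order and repeated elements do not affect the score.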
- architxt.similarity.sim(x, y, tau, metric=DEFAULT_METRIC)
Determine whether the similarity between two tree objects exceeds a given threshold tau.
- Parameters:
x (Tree) – The first tree object to compare.
y (Tree) – The second tree object to compare.
tau (float) – The threshold value for similarity.
metric (Callable[[Collection[str], Collection[str]], float]) – A callable similarity metric to compute the similarity between the two trees.
- Return type:
bool
- Returns:
True if the similarity between x and y is greater than or equal to tau, otherwise False.
>>> from architxt.tree import Tree
>>> t = Tree.fromstring('(S (X (ENT::person Alice) (ENT::fruit apple)) (Y (ENT::person Bob) (ENT::animal rabbit)))')
>>> sim(t[0], t[1], tau=0.5, metric=jaccard)
True
- architxt.similarity.similarity(x, y, *, metric=DEFAULT_METRIC)
Compute the similarity between two tree objects based on their entity labels and context.
The function uses a specified metric (such as Jaccard, Levenshtein, or Jaro-Winkler) to calculate the similarity between the labels of entities in the trees. The similarity is computed as a recursive weighted mean over each tree's ancestors.
- Parameters:
x (Tree) – The first tree object.
y (Tree) – The second tree object.
metric (Callable[[Collection[str], Collection[str]], float]) – A metric function to compute the similarity between the entity labels of the two trees.
- Return type:
float
- Returns:
A similarity score between 0 and 1, where 1 indicates maximum similarity.
>>> from architxt.tree import Tree
>>> t = Tree.fromstring('(S (X (ENT::person Alice) (ENT::fruit apple)) (Y (ENT::person Bob) (ENT::animal rabbit)))')
>>> similarity(t[0], t[1], metric=jaccard)
0.5555555555555555
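One way to read "weighted mean over ancestors" is sketched below. The weighting is an assumption: each level up the tree is weighted by 1 / depth, a choice made here only because it reproduces the value of the example above (about 0.5556); the library may use a different scheme. `Node`, `weighted_similarity`, and `jaccard_sets` are hypothetical names, and `Node` is a minimal stand-in for `architxt.tree.Tree`.

```python
from typing import Callable, Collection, List, Optional, Set


def jaccard_sets(x: Collection[str], y: Collection[str]) -> float:
    """Hypothetical stand-in metric: Jaccard similarity of two label collections."""
    a, b = set(x), set(y)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class Node:
    """Minimal tree node with a parent link (stand-in for a real tree class)."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None) -> None:
        self.label = label
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = list(children or [])
        for child in self.children:
            child.parent = self

    def entity_labels(self) -> Set[str]:
        """Collect the labels of all entity nodes in this subtree."""
        found = {self.label} if self.label.startswith("ENT::") else set()
        for child in self.children:
            found |= child.entity_labels()
        return found


def weighted_similarity(
    x: Optional[Node],
    y: Optional[Node],
    metric: Callable[[Collection[str], Collection[str]], float],
) -> float:
    """Average the metric over both nodes and their ancestors.

    Assumption: the level at depth d (1 for the nodes themselves)
    contributes with weight 1 / d.
    """
    total, weight_sum, depth = 0.0, 0.0, 1
    while x is not None and y is not None:
        weight = 1.0 / depth
        total += weight * metric(x.entity_labels(), y.entity_labels())
        weight_sum += weight
        x, y, depth = x.parent, y.parent, depth + 1
    return total / weight_sum


x = Node("X", [Node("ENT::person"), Node("ENT::fruit")])
y = Node("Y", [Node("ENT::person"), Node("ENT::animal")])
root = Node("S", [x, y])

score = weighted_similarity(x, y, jaccard_sets)
print(round(score, 4))
```

In this example the two subtrees share one of three entity labels (1/3), and their common parent compares equal to itself (1.0) at half the weight, giving (1/3 + 1/2) / (3/2) = 5/9.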