architxt.similarity
Functions
- Jaro-Winkler similarity.
- Process the given forest to assign labels to entities based on clustering of their ancestors.
- Jaccard similarity.
- Jaro-Winkler similarity.
- Levenshtein similarity.
- Compute the similarity between two tree objects based on their entity labels and context.
Classes
- class architxt.similarity.TreeClusterer(tau=0.7, decay=DECAY, metric=DEFAULT_METRIC, max_sim_ctx_depth=MAX_SIM_CTX_DEPTH, max_height=5, min_cluster_size=2, **kwargs)
Bases: object
- fit(forest, _all_subtrees=True)
Cluster subtrees of a given tree based on their similarity.
The clusters are created by applying a distance threshold tau to a linkage matrix derived from pairwise subtree similarity calculations. Subtrees that are similar enough (according to tau and the metric) are grouped into the same cluster. Each cluster is represented as a tuple of subtrees.
- Parameters:
forest (Union[Collection[Tree], TreeBucket]) – The forest from which to extract and cluster subtrees.
_all_subtrees (bool) – If true, compute the similarity between all subtrees, else only the given trees are compared.
- Return type:
- Returns:
A set of tuples, where each tuple represents a cluster of subtrees that meet the similarity threshold.
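The thresholding idea described for fit can be illustrated with a small, self-contained sketch. It is a simplified stand-in and not architxt's implementation: it assumes single-linkage behaviour (cutting a single-linkage dendrogram at distance 1 - tau gives the connected components of the graph whose edges are the pairs with similarity >= tau), it uses a plain union-find instead of a linkage matrix, and it assumes min_cluster_size simply discards smaller groups. The names threshold_clusters and jaccard_sets are hypothetical.

```python
from __future__ import annotations

from itertools import combinations
from typing import Callable, Hashable, Sequence


def threshold_clusters(
    items: Sequence[Hashable],
    sim: Callable[[Hashable, Hashable], float],
    tau: float = 0.7,
    min_cluster_size: int = 2,
) -> set[tuple]:
    """Group items into connected components of the 'similarity >= tau' graph."""
    parent = list(range(len(items)))  # union-find forest over item indices

    def find(i: int) -> int:
        # Follow parent links to the representative, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair whose similarity reaches the threshold.
    for i, j in combinations(range(len(items)), 2):
        if sim(items[i], items[j]) >= tau:
            parent[find(i)] = find(j)

    # Collect members per representative, keeping input order inside a cluster.
    groups: dict[int, list] = {}
    for idx, item in enumerate(items):
        groups.setdefault(find(idx), []).append(item)

    # Assumption: groups smaller than min_cluster_size are dropped.
    return {tuple(g) for g in groups.values() if len(g) >= min_cluster_size}


def jaccard_sets(a: frozenset, b: frozenset) -> float:
    # Hypothetical stand-in metric: |a & b| / |a | b|, 1.0 for two empty sets.
    return len(a & b) / len(a | b) if a or b else 1.0


labels = [
    frozenset({"person", "fruit"}),
    frozenset({"person", "fruit", "animal"}),
    frozenset({"city"}),
]
clusters = threshold_clusters(labels, jaccard_sets, tau=0.6)
```

Here the first two label sets have a Jaccard similarity of 2/3 and land in one cluster, while the third forms a singleton that is discarded.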
- get_equiv_of(t, top_k=20)
Get the cluster containing the specified tree t based on similarity comparisons with the given set of clusters.
The clusters are assessed using the provided similarity metric and threshold tau.
- Parameters:
- Return type:
Optional[str]
- Returns:
The name of the cluster that meets the similarity threshold.
- Raises:
ValueError – If clusters have not been computed yet.
- architxt.similarity.entity_labels(forest, *, tau, metric=DEFAULT_METRIC, decay=DECAY)
Process the given forest to assign labels to entities based on clustering of their ancestors.
- Parameters:
forest (Iterable[Tree]) – The forest from which to extract and cluster entities.
tau (float) – The similarity threshold for clustering.
metric (Optional[Callable[[Collection[str], Collection[str]], float]]) – The similarity metric function used to compute the similarity between subtrees. If None, use the parent label as the equivalence class.
decay (float) – The similarity decay factor. The higher the value, the more the weight of context decreases with distance.
- Return type:
- Returns:
A dictionary mapping entities to their respective cluster name.
- architxt.similarity.jaccard(x, y)
Jaccard similarity.
- Parameters:
x (Collection[str]) – The first sequence of strings.
y (Collection[str]) – The second sequence of strings.
- Return type:
- Returns:
The Jaccard similarity as a float between 0 and 1, where 1 means identical sequences.
>>> jaccard({"A", "B"}, {"A", "B", "C"})
0.6666666666666666
>>> jaccard({"apple", "banana", "cherry"}, {"apple", "cherry", "date"})
0.5
>>> jaccard(set(), set())
1.0
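The doctest values for jaccard are consistent with the usual set-based definition, the size of the intersection divided by the size of the union, with two empty collections treated as identical. A minimal sketch under that assumption follows. The name jaccard_sketch is hypothetical, and the library's actual implementation may differ, for instance in how it treats duplicate items.

```python
from collections.abc import Collection


def jaccard_sketch(x: Collection[str], y: Collection[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    x_set, y_set = set(x), set(y)
    union = x_set | y_set
    if not union:
        # Assumption: two empty collections count as identical.
        return 1.0
    return len(x_set & y_set) / len(union)
```

Converting both inputs to sets means that order and repeated items are ignored in this sketch.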
- architxt.similarity.similarity(x, y, *, metric=DEFAULT_METRIC, decay=DECAY, max_sim_ctx_depth=MAX_SIM_CTX_DEPTH)
Compute the similarity between two tree objects based on their entity labels and context.
The function uses a specified metric (such as Jaccard, Levenshtein, or Jaro-Winkler) to calculate the similarity between the labels of entities in the trees. The similarity is computed as a recursive weighted mean for each tree ancestor, where the weight decays with the distance from the tree.
\[\text{similarity}_\text{metric}(x, y) = \frac{\sum_{i=0}^{d_{\min}} \text{decay}^{-i} \cdot \text{metric}(P^x_i, P^y_i)}{\sum_{i=0}^{d_{\min}} \text{decay}^{-i}}\]
where \(P^x_i\) and \(P^y_i\) are the \(i^\text{th}\) parent nodes of \(x\) and \(y\) respectively, and \(d_{\min}\) is the depth of the shallowest tree from \(x\) and \(y\) up to the root (or a fixed maximum depth of max_sim_ctx_depth).
- Parameters:
x (Tree) – The first tree object.
y (Tree) – The second tree object.
metric (Callable[[Collection[str], Collection[str]], float]) – A metric function to compute the similarity between the entity labels of the two trees.
decay (float) – The decay factor for the weighted mean. Must be strictly greater than 0. The higher the value, the more the weight of context decreases with distance.
max_sim_ctx_depth (int) – The maximum depth of context to consider when computing similarity.
- Return type:
- Returns:
A similarity score between 0 and 1, where 1 indicates maximum similarity.
>>> from architxt.tree import Tree
>>> t = Tree.fromstring('(S (X (ENT::person Alice) (ENT::fruit apple)) (Y (ENT::person Bob) (ENT::animal rabbit)))')
>>> similarity(t[0], t[1], metric=jaccard)
0.5555555555555555
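To make the weighted mean concrete, the sketch below applies the formula to a list of per-level metric values, where index 0 holds the metric for the two trees themselves and index i holds the metric for their i-th ancestors. The helper name weighted_context_mean is hypothetical. The decay value of 2 is an assumption: it is not stated on this page, but it would reproduce the doctest value if the two subtrees score 1/3 on their own labels and 1 at their shared parent.

```python
from collections.abc import Sequence


def weighted_context_mean(level_scores: Sequence[float], decay: float = 2.0) -> float:
    """Weighted mean of per-level scores with weights decay ** -i."""
    if decay <= 0:
        raise ValueError("decay must be strictly greater than 0")
    if not level_scores:
        raise ValueError("at least one level score is required")
    # Level 0 has weight 1; each further ancestor level is divided by decay.
    weights = [decay ** -i for i in range(len(level_scores))]
    weighted_sum = sum(w * s for w, s in zip(weights, level_scores))
    return weighted_sum / sum(weights)


# Assumed inputs: 1/3 for the subtrees' own labels, 1.0 for the shared parent.
score = weighted_context_mean([1 / 3, 1.0], decay=2.0)
```

With these inputs the result is (1/3 + 0.5) / 1.5 = 5/9, roughly 0.556. A larger decay shrinks the ancestor weights, so the score moves closer to the level-0 value.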