architxt.similarity


Functions

`DEFAULT_METRIC`(x, y)	Jaro-Winkler similarity.
`compute_dist_matrix`(subtrees, *, metric)	Compute the condensed distance matrix for a collection of subtrees.
`entity_labels`(forest, *, tau[, metric])	Process the given forest to assign labels to entities based on the clustering of their ancestors.
`equiv_cluster`(trees, *, tau[, metric, ...])	Cluster the subtrees of the given trees based on their similarity.
`get_equiv_of`(t, equiv_subtrees, *, tau[, metric])	Get the cluster containing the specified tree t based on similarity comparisons with the given set of clusters.
`jaccard`(x, y)	Jaccard similarity.
`jaro`(x, y)	Jaro-Winkler similarity.
`levenshtein`(x, y)	Levenshtein similarity.
`sim`(x, y, tau[, metric])	Determine whether the similarity between two tree objects meets or exceeds a given threshold tau.
`similarity`(x, y, *[, metric, decay])	Compute the similarity between two tree objects based on their entity labels and context.

architxt.similarity.DEFAULT_METRIC(x, y)

Jaro-Winkler similarity.

Return type:

float

architxt.similarity.compute_dist_matrix(subtrees, *, metric)

Compute the condensed distance matrix for a collection of subtrees.

This function computes pairwise distances between all subtrees and stores the results in a condensed distance matrix format (1D array), which is suitable for hierarchical clustering.

The computation is sequential.

Parameters:

subtrees (Collection[Tree]) – The collection of subtrees for which pairwise distances are calculated.
metric (Callable[Collection[str], Collection[str], float]) – A callable similarity metric used to compare each pair of subtrees.

Return type:

ndarray[Any, dtype[uint16]]

Returns:

A 1D numpy array containing the condensed distance matrix (only a triangle of the full matrix).
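To illustrate the condensed layout, here is a minimal standard-library sketch that stores one triangle of the pairwise matrix row by row. It is not architxt's implementation: `condensed_distances` and `condensed_index` are hypothetical helpers, the inputs are plain label sets rather than `Tree` objects, and the distance is assumed to be `1 - metric(a, b)`.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity between two collections of labels."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def condensed_distances(items, metric):
    """Return pairwise distances (assumed to be 1 - similarity) as a flat list.

    Pairs are visited in the order (0, 1), (0, 2), ..., (n - 2, n - 1).
    """
    return [1.0 - metric(a, b) for a, b in combinations(items, 2)]


def condensed_index(i, j, n):
    """Position of the pair (i, j), with i < j, in a condensed matrix of n items."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)


labels = [{"person", "fruit"}, {"person", "animal"}, {"person", "fruit"}]
dist = condensed_distances(labels, jaccard)

# Distance between the first and third items, which carry identical labels.
same = dist[condensed_index(0, 2, len(labels))]
```

For n items the flat list holds n * (n - 1) / 2 entries, which is why this layout is preferred over a full square matrix as input to hierarchical clustering.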

architxt.similarity.entity_labels(forest, *, tau, metric=DEFAULT_METRIC)

Process the given forest to assign labels to entities based on the clustering of their ancestors.

Parameters:

forest (Iterable[Tree]) – The forest from which to extract and cluster entities.
tau (float) – The similarity threshold for clustering.
metric (Optional[Callable[Collection[str], Collection[str], float]]) – The similarity metric function used to compute the similarity between subtrees. If None, the parent label is used as the equivalence class.

Return type:

dict[UUID, str]

Returns:

A dictionary mapping entities to their respective cluster name.

architxt.similarity.equiv_cluster(trees, *, tau, metric=DEFAULT_METRIC, _all_subtrees=True, _step=None)

Cluster the subtrees of the given trees based on their similarity.

The clusters are created by applying a distance threshold tau to the linkage matrix which is derived from pairwise subtree similarity calculations. Subtrees that are similar enough (based on tau and the metric) are grouped into clusters. Each cluster is represented as a tuple of subtrees.

Parameters:

trees (Iterable[Tree]) – The forest from which to extract and cluster subtrees.
tau (float) – The similarity threshold for clustering.
metric (Callable[Collection[str], Collection[str], float]) – The similarity metric function used to compute the similarity between subtrees.
_all_subtrees (bool) – If true, compute the similarity between all subtrees; otherwise, compare only the given trees.
_step (Optional[int]) – The MLFlow step for logging.

Return type:

dict[str, Sequence[Tree]]

Returns:

A dictionary mapping each cluster name to the sequence of subtrees in that cluster, where the subtrees of a cluster meet the similarity threshold.
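The thresholding idea can be sketched without any clustering library. The example below is only an illustration under stated assumptions: it uses single linkage (the linkage method used by architxt is not given here), it works on plain label tuples instead of `Tree` objects, and `cluster_by_threshold` and the `cluster_N` names are hypothetical.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity between two collections of labels."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def cluster_by_threshold(items, tau, metric):
    """Group items whose similarity reaches tau (single linkage, an assumption)."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge the groups of every pair that is similar enough.
    for i, j in combinations(range(len(items)), 2):
        if metric(items[i], items[j]) >= tau:
            parent[find(j)] = find(i)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)

    # Cluster names are hypothetical placeholders.
    return {f"cluster_{k}": tuple(members) for k, members in enumerate(groups.values())}


items = [("person", "fruit"), ("person", "fruit", "color"), ("city", "country")]
clusters = cluster_by_threshold(items, tau=0.6, metric=jaccard)
```

Here the first two items share two of three labels (similarity 2/3) and end up in the same cluster, while the third stays alone. Raising `tau` above 2/3 would split them apart.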

architxt.similarity.get_equiv_of(t, equiv_subtrees, *, tau, metric=DEFAULT_METRIC)

Get the cluster containing the specified tree t based on similarity comparisons with the given set of clusters.

The clusters are assessed using the provided similarity metric and threshold tau.

Parameters:

t (Tree) – The tree whose cluster is looked up.
equiv_subtrees (dict[str, Sequence[Tree]]) – The clusters of equivalent subtrees, keyed by cluster name.
tau (float) – The similarity threshold for clustering.
metric (Callable[Collection[str], Collection[str], float]) – The similarity metric function used to compute the similarity between subtrees.

Return type:

Optional[str]

Returns:

The name of the cluster that meets the similarity threshold, or None if no cluster does.
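A lookup of this kind might look like the sketch below. How architxt decides membership is not described here, so the example assumes that an item belongs to the cluster containing its most similar member, provided that similarity reaches tau. `find_cluster` and the cluster names are hypothetical, and label sets stand in for `Tree` objects.

```python
def jaccard(x, y):
    """Jaccard similarity between two collections of labels."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def find_cluster(labels, clusters, tau, metric):
    """Return the name of the best-matching cluster, or None if none reaches tau."""
    best_name, best_score = None, 0.0
    for name, members in clusters.items():
        # Score a cluster by its most similar member (an assumption).
        score = max((metric(labels, member) for member in members), default=0.0)
        if score >= tau and score > best_score:
            best_name, best_score = name, score
    return best_name


clusters = {
    "fruit_owner": [{"person", "fruit"}],
    "place": [{"city", "country"}],
}
match = find_cluster({"person", "fruit", "color"}, clusters, tau=0.6, metric=jaccard)
no_match = find_cluster({"animal"}, clusters, tau=0.6, metric=jaccard)
```

The second call returns `None`, mirroring the `Optional[str]` return type documented above.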

architxt.similarity.jaccard(x, y)

Jaccard similarity.

Parameters:

x (Collection[str]) – The first sequence of strings.
y (Collection[str]) – The second sequence of strings.

Return type:

float

Returns:

The Jaccard similarity as a float between 0 and 1, where 1 means identical sequences.

>>> jaccard({"A", "B"}, {"A", "B", "C"})
0.6666666666666666

>>> jaccard({"apple", "banana", "cherry"}, {"apple", "cherry", "date"})
0.5

>>> jaccard(set(), set())
1.0
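For reference, the doctest values above follow from the usual definition of the Jaccard index, the size of the intersection divided by the size of the union. The sketch below is a plain re-implementation for illustration, not the library's source; `jaccard_similarity` is a hypothetical name.

```python
def jaccard_similarity(x, y):
    """Size of the intersection divided by the size of the union."""
    x, y = set(x), set(y)
    union = x | y
    # Two empty collections are treated as identical, as in the doctest above.
    if not union:
        return 1.0
    return len(x & y) / len(union)


two_thirds = jaccard_similarity({"A", "B"}, {"A", "B", "C"})
half = jaccard_similarity(
    {"apple", "banana", "cherry"}, {"apple", "cherry", "date"}
)
empty = jaccard_similarity(set(), set())
```

Because the inputs are converted to sets, duplicate labels and ordering have no effect on the result.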

architxt.similarity.jaro(x, y)

Jaro-Winkler similarity.

Return type:

float

architxt.similarity.levenshtein(x, y)

Levenshtein similarity.

Return type:

float

architxt.similarity.sim(x, y, tau, metric=DEFAULT_METRIC)

Determine whether the similarity between two tree objects meets or exceeds a given threshold tau.

Parameters:

x (Tree) – The first tree object to compare.
y (Tree) – The second tree object to compare.
tau (float) – The threshold value for similarity.
metric (Callable[Collection[str], Collection[str], float]) – A callable similarity metric to compute the similarity between the two trees.

Return type:

bool

Returns:

True if the similarity between x and y is greater than or equal to tau, otherwise False.

>>> from architxt.tree import Tree
>>> t = Tree.fromstring('(S (X (ENT::person Alice) (ENT::fruit apple)) (Y (ENT::person Bob) (ENT::animal rabbit)))')
>>> sim(t[0], t[1], tau=0.5, metric=jaccard)
True

architxt.similarity.similarity(x, y, *, metric=DEFAULT_METRIC, decay=DECAY)

Compute the similarity between two tree objects based on their entity labels and context.

The function uses a specified metric (such as Jaccard, Levenshtein, or Jaro-Winkler) to calculate the similarity between the labels of entities in the trees. The similarity is computed as a recursive weighted mean over the ancestors of each tree, where the weight decays with the distance from the tree.

\[\text{similarity}_\text{metric}(x, y) = \frac{\sum_{i=1}^{d_{\min}} \text{decay}^{-i} \cdot \text{metric}(P^x_i, P^y_i)} {\sum_{i=1}^{d_{\min}} \text{decay}^{-i}}\]

where \(P^x_i\) and \(P^y_i\) are the \(i^\text{th}\) parent nodes of \(x\) and \(y\) respectively, and \(d_{\min}\) is the depth of the shallower of \(x\) and \(y\), counted up to the root (or a fixed maximum depth).

Parameters:

x (Tree) – The first tree object.
y (Tree) – The second tree object.
metric (Callable[Collection[str], Collection[str], float]) – A metric function to compute the similarity between the entity labels of the two trees.
decay (float) – The decay factor for the weighted mean. The higher the value, the more the weight of context decreases with distance.

Return type:

float

Returns:

A similarity score between 0 and 1, where 1 indicates maximum similarity.

>>> from architxt.tree import Tree
>>> t = Tree.fromstring('(S (X (ENT::person Alice) (ENT::fruit apple)) (Y (ENT::person Bob) (ENT::animal rabbit)))')
>>> similarity(t[0], t[1], metric=jaccard)
0.5555555555555555
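The doctest value can be reproduced by hand from the weighted-mean formula. The sketch below is one reading of it, not the library's code: it assumes that level 1 compares the entity labels of the two subtrees themselves, that level 2 compares their shared parent `S` (metric equal to 1), and that the default `DECAY` is 2. `weighted_similarity` is a hypothetical helper.

```python
def weighted_similarity(level_scores, decay):
    """Weighted mean of per-level metric scores, level i weighted by decay ** -i."""
    weights = [decay ** -i for i in range(1, len(level_scores) + 1)]
    return sum(w * s for w, s in zip(weights, level_scores)) / sum(weights)


# Level 1: Jaccard({person, fruit}, {person, animal}) = 1/3.
# Level 2: both subtrees share the parent S, so the metric is 1.
score = weighted_similarity([1 / 3, 1.0], decay=2)

# Thresholding the score, as sim(t[0], t[1], tau=0.5, metric=jaccard) does.
is_similar = score >= 0.5
```

Under these assumptions the score is (0.5 * 1/3 + 0.25 * 1) / 0.75 = 5/9, which agrees with the value shown in the doctest, and it clears the threshold of 0.5 used in the `sim` example.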
