architxt.similarity

Functions

DEFAULT_METRIC(x, y)

Jaro-Winkler similarity.

compute_dist_matrix(subtrees, *, metric)

Compute the condensed distance matrix for a collection of subtrees.

entity_labels(forest, *, tau[, metric])

Process the given forest to assign labels to entities based on clustering of their ancestors.

equiv_cluster(trees, *, tau[, metric, ...])

Cluster subtrees of a given tree based on their similarity.

get_equiv_of(t, equiv_subtrees, *, tau[, metric])

Get the cluster containing the specified tree t based on similarity comparisons with the given set of clusters.

jaccard(x, y)

Jaccard similarity.

jaro(x, y)

Jaro-Winkler similarity.

levenshtein(x, y)

Levenshtein similarity.

sim(x, y, tau[, metric])

Determine whether the similarity between two tree objects meets or exceeds a given threshold tau.

similarity(x, y, *[, metric])

Compute the similarity between two tree objects based on their entity labels and context.

architxt.similarity.DEFAULT_METRIC(x, y)

Jaro-Winkler similarity.

Return type:

float

architxt.similarity.compute_dist_matrix(subtrees, *, metric)

Compute the condensed distance matrix for a collection of subtrees.

This function computes pairwise distances between all subtrees and stores the results in a condensed distance matrix format (1D array), which is suitable for hierarchical clustering.

The computation is sequential.

Return type:

ndarray[Any, dtype[uint16]]

Returns:

A 1D numpy array containing the condensed distance matrix (only a triangle of the full matrix).
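
To make the condensed layout concrete, here is a minimal sketch that does not call architxt. It assumes the distance is 1 minus a Jaccard similarity over plain label sets and keeps the values as floats in a Python list; how the library scales values into its uint16 array is not stated here. The helpers `condensed_distances` and `condensed_index` are hypothetical names introduced for illustration.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity of two collections; two empty collections count as identical."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def condensed_distances(items, metric):
    """Hypothetical helper: distances for every pair i < j, in row-major order."""
    return [1.0 - metric(a, b) for a, b in combinations(items, 2)]


def condensed_index(i, j, n):
    """Position of the pair (i, j) in the condensed list for n items."""
    if i == j:
        raise ValueError("the diagonal is not stored in condensed form")
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)


# Plain label sets stand in for subtrees (an assumption of this sketch).
label_sets = [
    {"person", "fruit"},
    {"person", "animal"},
    {"person", "fruit"},
    {"city"},
]
dists = condensed_distances(label_sets, jaccard)
```

For n items the condensed form holds n(n-1)/2 entries, ordered (0, 1), (0, 2), and so on up to (n-2, n-1), so only one triangle of the symmetric matrix is stored.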

architxt.similarity.entity_labels(forest, *, tau, metric=DEFAULT_METRIC)

Process the given forest to assign labels to entities based on clustering of their ancestors.

Parameters:
  • forest (Collection[Tree]) – The forest from which to extract and cluster entities.

  • tau (float) – The similarity threshold for clustering.

  • metric (Optional[Callable[Collection[str], Collection[str], float]]) – The similarity metric function used to compute the similarity between subtrees. If None, the parent label is used as the equivalence class.

Return type:

dict[str, int]

Returns:

A dictionary mapping entities to their respective cluster IDs.

architxt.similarity.equiv_cluster(trees, *, tau, metric=DEFAULT_METRIC, _all_subtrees=True, _step=None)

Cluster subtrees of a given tree based on their similarity.

The clusters are created by applying a distance threshold tau to the linkage matrix, which is derived from pairwise subtree similarity calculations. Subtrees that are similar enough (based on tau and the metric) are grouped into clusters. Each cluster is represented as a tuple of subtrees.

Return type:

set[tuple[Tree, …]]

Returns:

A set of tuples, where each tuple represents a cluster of subtrees that meet the similarity threshold.
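
The idea of grouping items by a similarity threshold can be sketched without architxt or a linkage matrix. The example below swaps in a simpler technique, single-linkage grouping with a union-find structure over plain label sets; the linkage method the library actually uses is not stated here, so treat `threshold_clusters` as a hypothetical illustration rather than the library's algorithm.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity of two collections; two empty collections count as identical."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def threshold_clusters(items, tau, metric):
    """Hypothetical helper: merge any two items whose similarity is at least tau."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the representative, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if metric(items[i], items[j]) >= tau:
            parent[find(i)] = find(j)

    # Collect item indices by representative; each cluster becomes a tuple.
    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(tuple(members) for members in groups.values())


# Plain label sets stand in for subtrees (an assumption of this sketch).
label_sets = [
    {"person", "fruit"},
    {"person", "animal"},
    {"person", "fruit"},
    {"city"},
]
strict = threshold_clusters(label_sets, tau=0.9, metric=jaccard)
loose = threshold_clusters(label_sets, tau=0.3, metric=jaccard)
```

With a strict threshold only the two identical label sets end up together; lowering tau lets the partially overlapping set join them, while the unrelated one stays alone.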

architxt.similarity.get_equiv_of(t, equiv_subtrees, *, tau, metric=DEFAULT_METRIC)

Get the cluster containing the specified tree t based on similarity comparisons with the given set of clusters.

The clusters are assessed using the provided similarity metric and threshold tau.

Parameters:
  • t (Tree) – The tree for which to find the containing cluster.

  • equiv_subtrees (set[tuple[Tree, …]]) – The set of equivalent subtrees.

  • tau (float) – The similarity threshold for clustering.

  • metric (Callable[Collection[str], Collection[str], float]) – The similarity metric function used to compute the similarity between subtrees.

Return type:

tuple[Tree, …]

Returns:

A tuple representing the cluster of subtrees that meet the similarity threshold.

architxt.similarity.jaccard(x, y)

Jaccard similarity.

Return type:

float

Returns:

The Jaccard similarity as a float between 0 and 1, where 1 means the two collections contain exactly the same elements.

>>> jaccard({"A", "B"}, {"A", "B", "C"})
0.6666666666666666
>>> jaccard({"apple", "banana", "cherry"}, {"apple", "cherry", "date"})
0.5
>>> jaccard(set(), set())
1.0
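
The values in the examples above follow from the usual definition, the size of the intersection divided by the size of the union. A minimal sketch of that definition is given below; `jaccard_sketch` is a hypothetical name, and treating two empty collections as identical is taken from the last example rather than from the library source.

```python
def jaccard_sketch(x, y):
    """Size of the intersection over size of the union, ignoring order and duplicates."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Two empty collections are treated as identical, as in the example above.
        return 1.0
    return len(x & y) / len(union)


two_thirds = jaccard_sketch({"A", "B"}, {"A", "B", "C"})
half = jaccard_sketch({"apple", "banana", "cherry"}, {"apple", "cherry", "date"})
empty = jaccard_sketch(set(), set())
```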

architxt.similarity.jaro(x, y)

Jaro-Winkler similarity.

Return type:

float

architxt.similarity.levenshtein(x, y)

Levenshtein similarity.

Return type:

float
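
The page does not say how the edit distance is turned into a similarity score. One common convention, shown here purely as an assumption, is 1 minus the distance divided by the length of the longer input. The sketch works on any two sequences (strings or lists of labels); `levenshtein_distance` and `levenshtein_similarity_sketch` are hypothetical names and may not match the library's normalization.

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    a, b = list(a), list(b)
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete item_a
                    current[j - 1] + 1,      # insert item_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity_sketch(a, b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    a, b = list(a), list(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


label_score = levenshtein_similarity_sketch(["person", "fruit"], ["person", "animal"])
```

Unlike Jaccard, this score depends on element order, so two sequences with the same labels in a different order are not scored as identical.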

architxt.similarity.sim(x, y, tau, metric=DEFAULT_METRIC)

Determine whether the similarity between two tree objects meets or exceeds a given threshold tau.

Parameters:
  • x (Tree) – The first tree object to compare.

  • y (Tree) – The second tree object to compare.

  • tau (float) – The threshold value for similarity.

  • metric (Callable[Collection[str], Collection[str], float]) – A callable similarity metric to compute the similarity between the two trees.

Return type:

bool

Returns:

True if the similarity between x and y is greater than or equal to tau, otherwise False.

>>> from architxt.tree import Tree
>>> t = Tree.fromstring('(S (X (ENT::person Alice) (ENT::fruit apple)) (Y (ENT::person Bob) (ENT::animal rabbit)))')
>>> sim(t[0], t[1], tau=0.5, metric=jaccard)
True

architxt.similarity.similarity(x, y, *, metric=DEFAULT_METRIC)

Compute the similarity between two tree objects based on their entity labels and context.

The function uses a specified metric (such as Jaccard, Levenshtein, or Jaro-Winkler) to calculate the similarity between the labels of entities in the trees. The similarity is computed as a recursive weighted mean over the ancestors of each tree.

Return type:

float

Returns:

A similarity score between 0 and 1, where 1 indicates maximum similarity.

>>> from architxt.tree import Tree
>>> t = Tree.fromstring('(S (X (ENT::person Alice) (ENT::fruit apple)) (Y (ENT::person Bob) (ENT::animal rabbit)))')
>>> similarity(t[0], t[1], metric=jaccard)
0.5555555555555555
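
To illustrate what a weighted mean over ancestors can look like, the sketch below walks both nodes up toward the root and averages the per-level metric with a weight of 1/depth. That weighting is an assumption: it happens to reproduce the value in the example above, but the library's exact scheme is not stated on this page. `Node` is a hypothetical stand-in for the library's Tree class, and `similarity_sketch` is a hypothetical name.

```python
class Node:
    """Hypothetical minimal tree node (not architxt's Tree class)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def entity_labels(self):
        """Labels of all nodes at or below this one that start with 'ENT::'."""
        labels = set()
        stack = [self]
        while stack:
            node = stack.pop()
            if node.label.startswith("ENT::"):
                labels.add(node.label)
            stack.extend(node.children)
        return labels


def jaccard(x, y):
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def similarity_sketch(x, y, metric=jaccard):
    """Weighted mean of per-level scores; weight 1/depth is an assumption."""
    weighted_sum = 0.0
    weight_total = 0.0
    depth = 1
    while x is not None and y is not None:
        weight = 1.0 / depth
        # A shared ancestor counts as a perfect match and ends the walk.
        score = 1.0 if x is y else metric(x.entity_labels(), y.entity_labels())
        weighted_sum += weight * score
        weight_total += weight
        if x is y:
            break
        x, y, depth = x.parent, y.parent, depth + 1
    return weighted_sum / weight_total


# Same shape as the tree in the example above, without the leaf words.
x = Node("X", [Node("ENT::person"), Node("ENT::fruit")])
y = Node("Y", [Node("ENT::person"), Node("ENT::animal")])
root = Node("S", [x, y])
score = similarity_sketch(x, y)
```

Here the two subtrees share one of three entity labels (Jaccard 1/3, weight 1) and meet at the same parent (score 1, weight 1/2), which gives (1/3 + 1/2) / 1.5 = 5/9, roughly 0.556.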