architxt.similarity

Functions

DEFAULT_METRIC(x, y)

Jaro-Winkler similarity.

entity_labels(forest, *, tau[, metric, decay])

Process the given forest to assign labels to entities based on a clustering of their ancestors.

jaccard(x, y)

Jaccard similarity.

jaro(x, y)

Jaro-Winkler similarity.

levenshtein(x, y)

Levenshtein similarity.

similarity(x, y, *[, metric, decay, ...])

Compute the similarity between two tree objects based on their entity labels and context.

Classes

TreeCluster(trees, probabilities)

TreeClusterView(clusterer)

TreeClusterer([tau, decay, metric, ...])

class architxt.similarity.TreeCluster(trees, probabilities)

Bases: object

probabilities

Type:    ndarray[Any, dtype[float64]]

trees

Type:    Sequence[Union[Tree, Any]]

class architxt.similarity.TreeClusterView(clusterer)

Bases: Mapping[str, Sequence[Tree]]

class architxt.similarity.TreeClusterer(tau=0.7, decay=DECAY, metric=DEFAULT_METRIC, max_sim_ctx_depth=MAX_SIM_CTX_DEPTH, max_height=5, min_cluster_size=2, **kwargs)

Bases: object

fit(forest, _all_subtrees=True)

Cluster subtrees of a given tree based on their similarity.

The clusters are created by applying a distance threshold tau to the linkage matrix which is derived from pairwise subtree similarity calculations. Subtrees that are similar enough (based on tau and the metric) are grouped into clusters. Each cluster is represented as a tuple of subtrees.

Parameters:
  • forest (Union[Collection[Tree], TreeBucket]) – The forest from which to extract and cluster subtrees.

  • _all_subtrees (bool) – If true, compute the similarity between all subtrees, else only the given trees are compared.

Return type:

None

Returns:

A set of tuples, where each tuple represents a cluster of subtrees that meet the similarity threshold.
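The thresholding idea described for fit can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it replaces the linkage matrix with a single-linkage grouping built on union-find, treats tau as a similarity threshold, and works on sets of labels instead of Tree objects. The names threshold_clusters and label_jaccard are hypothetical.

```python
from collections.abc import Callable, Sequence
from itertools import combinations


def label_jaccard(x: frozenset, y: frozenset) -> float:
    """Hypothetical stand-in for a pairwise subtree similarity."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def threshold_clusters(
    items: Sequence[frozenset],
    sim: Callable[[frozenset, frozenset], float],
    tau: float = 0.7,
    min_cluster_size: int = 2,
) -> list[tuple[frozenset, ...]]:
    """Group items whose pairwise similarity reaches tau (single linkage)."""
    parent = list(range(len(items)))

    def find(i: int) -> int:
        # Follow parent pointers to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair that is similar enough.
    for i, j in combinations(range(len(items)), 2):
        if sim(items[i], items[j]) >= tau:
            parent[find(i)] = find(j)

    groups: dict[int, list[frozenset]] = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)

    # Keep only groups large enough to count as clusters.
    return [tuple(g) for g in groups.values() if len(g) >= min_cluster_size]


subtrees = [
    frozenset({"person", "fruit"}),
    frozenset({"person", "fruit"}),
    frozenset({"person", "animal"}),
    frozenset({"city"}),
]
clusters = threshold_clusters(subtrees, label_jaccard, tau=0.7)
print(clusters)
```

Only the two identical label sets end up in a cluster here; the other two items fall below min_cluster_size and are dropped.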

fit_predict(forest, **kwargs)

Return type:

Mapping[str, Sequence[Tree]]

get_cluster(key)

Return type:

TreeCluster

get_clusters_keys()

Return type:

set[str]

get_equiv_of(t, top_k=20)

Get the cluster containing the specified tree t based on similarity comparisons with the given set of clusters.

The clusters are assessed using the provided similarity metric and threshold tau.

Parameters:
  • t (Tree) – The tree whose cluster is looked up.

  • top_k (Optional[int]) – Compute the similarity only against the top_k elements of each cluster. If None, compute it against every element of the clusters.

Return type:

Optional[str]

Returns:

The name of the cluster that meets the similarity threshold.

Raises:

ValueError – If clusters have not been computed yet.
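A lookup of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the library's code: clusters are plain mappings from a name to a list of label sets, "top_k elements" is taken to mean the first top_k members of each cluster, and the score of a cluster is the best similarity to any sampled member. The names closest_cluster and label_jaccard are hypothetical.

```python
from collections.abc import Callable, Mapping, Sequence
from typing import Optional


def label_jaccard(x: frozenset, y: frozenset) -> float:
    """Hypothetical stand-in for a pairwise tree similarity."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def closest_cluster(
    item: frozenset,
    clusters: Optional[Mapping[str, Sequence[frozenset]]],
    sim: Callable[[frozenset, frozenset], float],
    tau: float = 0.7,
    top_k: Optional[int] = 20,
) -> Optional[str]:
    """Return the name of the best cluster reaching tau, or None."""
    if clusters is None:
        raise ValueError("clusters have not been computed yet")

    best_name: Optional[str] = None
    best_score = tau  # a cluster must reach at least tau to be returned
    for name, members in clusters.items():
        # Assumption: limit the comparison to the first top_k members.
        sample = members if top_k is None else members[:top_k]
        if not sample:
            continue
        score = max(sim(item, member) for member in sample)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


known = {
    "fruit_group": [frozenset({"person", "fruit"})],
    "animal_group": [frozenset({"person", "animal"})],
}
print(closest_cluster(frozenset({"person", "fruit"}), known, label_jaccard))
print(closest_cluster(frozenset({"city"}), known, label_jaccard))
```

The second call prints None because no cluster reaches the threshold, which mirrors the Optional[str] return type documented above.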

mlflow_plot(base_path)

Plot clustering result as mlflow artifacts.

Parameters:

base_path (str) – The base path where to save the artifacts.

Return type:

None

property clusters

Return type:

Mapping[str, Sequence[Tree]]

architxt.similarity.DEFAULT_METRIC(x, y)

Jaro-Winkler similarity.

Return type:

float

architxt.similarity.entity_labels(forest, *, tau, metric=DEFAULT_METRIC, decay=DECAY)

Process the given forest to assign labels to entities based on a clustering of their ancestors.

Parameters:
  • forest (Iterable[Tree]) – The forest from which to extract and cluster entities.

  • tau (float) – The similarity threshold for clustering.

  • metric (Optional[Callable[Collection[str], Collection[str], float]]) – The similarity metric function used to compute the similarity between subtrees. If None, use the parent label as the equivalent class.

  • decay (float) – The similarity decay factor. The higher the value, the more the weight of context decreases with distance.

Return type:

dict[UUID, str]

Returns:

A dictionary mapping entities to their respective cluster name.

architxt.similarity.jaccard(x, y)

Jaccard similarity.

Return type:

float

Returns:

The Jaccard similarity as a float between 0 and 1, where 1 means identical sequences.

>>> jaccard({"A", "B"}, {"A", "B", "C"})
0.6666666666666666
>>> jaccard({"apple", "banana", "cherry"}, {"apple", "cherry", "date"})
0.5
>>> jaccard(set(), set())
1.0
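The values in the doctest above follow from the usual definition, the size of the intersection divided by the size of the union. A minimal sketch, assuming the inputs are treated as sets and that two empty inputs count as identical (as the last doctest suggests); jaccard_sketch is a hypothetical name, not the library function:

```python
from collections.abc import Collection


def jaccard_sketch(x: Collection[str], y: Collection[str]) -> float:
    """Illustrative Jaccard similarity: |x & y| / |x | y|."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    if not union:
        # Two empty collections are treated as identical.
        return 1.0
    return len(set_x & set_y) / len(union)


# 2 shared labels out of 3 distinct labels
print(jaccard_sketch({"A", "B"}, {"A", "B", "C"}))
# 2 shared labels out of 4 distinct labels
print(jaccard_sketch({"apple", "banana", "cherry"}, {"apple", "cherry", "date"}))
```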

architxt.similarity.jaro(x, y)

Jaro-Winkler similarity.

Return type:

float

architxt.similarity.levenshtein(x, y)

Levenshtein similarity.

Return type:

float

architxt.similarity.similarity(x, y, *, metric=DEFAULT_METRIC, decay=DECAY, max_sim_ctx_depth=MAX_SIM_CTX_DEPTH)

Compute the similarity between two tree objects based on their entity labels and context.

The function uses a specified metric (such as Jaccard, Levenshtein, or Jaro-Winkler) to calculate the similarity between the labels of entities in the trees. The similarity is computed as a weighted mean over the ancestors of each tree, where the weight decays with the distance from the tree.

\[\text{similarity}_\text{metric}(x, y) = \frac{\sum_{i=0}^{d_{\min}} \text{decay}^{-i} \cdot \text{metric}(P^x_i, P^y_i)} {\sum_{i=0}^{d_{\min}} \text{decay}^{-i}}\]

where \(P^x_i\) and \(P^y_i\) are the \(i^\text{th}\) parent nodes of \(x\) and \(y\) respectively, and \(d_{\min}\) is the depth of the shallowest tree from \(x\) and \(y\) up to the root (or a fixed maximum depth of max_sim_ctx_depth).

Parameters:
  • x (Tree) – The first tree object.

  • y (Tree) – The second tree object.

  • metric (Callable[Collection[str], Collection[str], float]) – A metric function to compute the similarity between the entity labels of the two trees.

  • decay (float) – The decay factor for the weighted mean. Must be strictly greater than 0. The higher the value, the more the weight of context decreases with distance.

  • max_sim_ctx_depth (int) – The maximum depth of context to consider when computing similarity.

Return type:

float

Returns:

A similarity score between 0 and 1, where 1 indicates maximum similarity.

>>> from architxt.tree import Tree
>>> t = Tree.fromstring('(S (X (ENT::person Alice) (ENT::fruit apple)) (Y (ENT::person Bob) (ENT::animal rabbit)))')
>>> similarity(t[0], t[1], metric=jaccard)
0.5555555555555555
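The weighted mean in the formula above can be sketched without the Tree class. This is an illustration under assumptions, not the library's implementation: the caller supplies the entity labels found at each ancestor level (index 0 is the node itself, index 1 its parent, and so on), the decay value 2.0 is assumed because the value of DECAY is not stated on this page, and the names weighted_context_similarity, max_depth and jaccard_sketch are hypothetical.

```python
from collections.abc import Callable, Collection, Sequence

Metric = Callable[[Collection[str], Collection[str]], float]


def jaccard_sketch(x: Collection[str], y: Collection[str]) -> float:
    """Illustrative Jaccard similarity on label collections."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    return len(set_x & set_y) / len(union) if union else 1.0


def weighted_context_similarity(
    levels_x: Sequence[Collection[str]],
    levels_y: Sequence[Collection[str]],
    *,
    metric: Metric,
    decay: float = 2.0,  # assumed value, not taken from the library
    max_depth: int = 5,  # hypothetical cap standing in for max_sim_ctx_depth
) -> float:
    """Weighted mean of metric over ancestor levels, with weight decay**-i."""
    if decay <= 0:
        raise ValueError("decay must be strictly greater than 0")

    # Only compare levels that exist for both inputs, up to the cap.
    depth = min(len(levels_x), len(levels_y), max_depth)
    numerator = 0.0
    denominator = 0.0
    for i in range(depth):
        weight = decay ** -i
        numerator += weight * metric(levels_x[i], levels_y[i])
        denominator += weight
    return numerator / denominator if denominator else 0.0


# Labels for X and Y from the doctest above, then their shared root S.
root = {"person", "fruit", "animal"}
score = weighted_context_similarity(
    [{"person", "fruit"}, root],
    [{"person", "animal"}, root],
    metric=jaccard_sketch,
)
print(score)  # roughly 0.556 with the assumed decay of 2.0
```

With these inputs the node level contributes 1/3 with weight 1 and the shared root contributes 1 with weight 1/2, giving (1/3 + 1/2) / (3/2) = 5/9, which agrees with the doctest value under the assumed decay.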