architxt.simplification.tree_rewriting.utils

Functions

distribute_evenly(trees, n)

Distribute a collection of trees into n sub-collections with approximately equal total complexity.

log_clusters(iteration, equiv_subtrees)

Log information about the clusters of equivalent subtrees.

log_instance_comparison_metrics(iteration, ...)

Log comparison metrics to see the evolution of the rewriting for a specific iteration.

log_metrics(iteration, forest[, equiv_subtrees])

Log various metrics related to a forest of trees and equivalent subtrees.

log_schema(iteration, forest)

Log the schema to MLflow.

architxt.simplification.tree_rewriting.utils.distribute_evenly(trees, n)

Distribute a collection of trees into n sub-collections with approximately equal total complexity.

Complexity is determined by the number of leaves in each tree. The function attempts to create n chunks, but if there are fewer elements than n, it will create one chunk per element.

Parameters:
  • trees (Collection[Tree]) – A collection of trees.

  • n (int) – The number of sub-collections to create.

Return type:

list[list[Tree]]

Returns:

A list of n sub-collections (or one per tree when there are fewer than n trees), with trees distributed to balance complexity.

Raises:

ValueError – If n is less than 1.
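
The balancing behaviour described above can be illustrated with a minimal sketch. It does not reproduce the library's implementation: it assumes a greedy "heaviest item into the lightest chunk" heuristic, and the names `distribute_evenly_sketch` and `weight` are hypothetical. Strings stand in for trees, with `len` standing in for the leaf count.

```python
import heapq
from collections.abc import Callable, Collection
from typing import TypeVar

T = TypeVar("T")


def distribute_evenly_sketch(
    items: Collection[T], n: int, weight: Callable[[T], int]
) -> list[list[T]]:
    """Greedy balancing sketch (an assumption, not the library's code)."""
    if n < 1:
        raise ValueError("n must be at least 1")

    # With fewer items than n, fall back to one chunk per item.
    n = min(n, len(items))
    chunks: list[list[T]] = [[] for _ in range(n)]

    # Min-heap of (current total weight, chunk index).
    heap = [(0, index) for index in range(n)]
    heapq.heapify(heap)

    # Place the heaviest items first, always into the lightest chunk.
    for item in sorted(items, key=weight, reverse=True):
        load, index = heapq.heappop(heap)
        chunks[index].append(item)
        heapq.heappush(heap, (load + weight(item), index))

    return chunks


# Strings stand in for trees; len() stands in for the number of leaves.
chunks = distribute_evenly_sketch(["aaaaa", "bbbb", "ccc", "dd", "e"], 2, len)
totals = sorted(sum(len(item) for item in chunk) for chunk in chunks)
```

With these five stand-in items and n = 2, the total weights of the two chunks differ by one. Asking for more chunks than there are items yields one chunk per item.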

architxt.simplification.tree_rewriting.utils.log_clusters(iteration, equiv_subtrees)

Log information about the clusters of equivalent subtrees.

This function processes each cluster of subtrees, extracting the entity labels, count, and maximum label length, and then logs this information using MLflow.

Parameters:
  • iteration (int) – The current iteration number.

  • equiv_subtrees (set[tuple[Tree, …]]) – The set of equivalent subtrees to process.

Return type:

None
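
As a rough illustration of the per-cluster summary described above, the sketch below computes one plausible reading of "entity labels, count, and maximum label length". Everything here is an assumption: each subtree is reduced to a plain tuple of entity label strings, "count" is taken to be the number of subtrees in the cluster, "maximum label length" is taken to be the largest number of labels in any one subtree, and `summarize_clusters` is a hypothetical name. The MLflow logging step is left out.

```python
from collections.abc import Iterable

# Hypothetical stand-in: a subtree is reduced to the tuple of its entity labels.
LabelTuple = tuple[str, ...]


def summarize_clusters(
    clusters: Iterable[tuple[LabelTuple, ...]],
) -> list[dict[str, object]]:
    """Build one summary row per cluster (assumed interpretation)."""
    rows: list[dict[str, object]] = []
    for cluster in clusters:
        # Distinct entity labels seen anywhere in the cluster.
        labels = sorted({label for subtree in cluster for label in subtree})
        rows.append(
            {
                "labels": labels,
                "count": len(cluster),
                "max_len": max((len(subtree) for subtree in cluster), default=0),
            }
        )
    return rows


rows = summarize_clusters(
    [
        (("name", "age"), ("name",)),  # cluster with two subtrees
        (("city",),),                  # cluster with one subtree
    ]
)
```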

architxt.simplification.tree_rewriting.utils.log_instance_comparison_metrics(iteration, old_forest, new_forest, tau, metric)

Log comparison metrics to see the evolution of the rewriting for a specific iteration.

Parameters:
  • iteration (int) – The current iteration number for logging.

  • old_forest (Collection[Tree]) – The initial forest to compare against.

  • new_forest (Collection[Tree]) – The updated forest to compare with.

  • tau (float) – The similarity threshold for clustering.

  • metric (Callable[[Collection[str], Collection[str]], float]) – The similarity metric function used to compute the similarity between subtrees.

Return type:

None
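
To make the expected shape of `metric` concrete, here is a minimal sketch of a function that takes two collections of strings and returns a float. It uses Jaccard similarity purely as an example. The name `jaccard_sketch`, the convention for two empty collections, and the way `tau` is compared against the score are assumptions, not the library's documented behaviour.

```python
from collections.abc import Collection


def jaccard_sketch(x: Collection[str], y: Collection[str]) -> float:
    """Jaccard similarity between two label collections (illustrative only)."""
    x_set, y_set = set(x), set(y)
    union = x_set | y_set
    if not union:
        # Assumption: two empty collections are treated as identical.
        return 1.0
    return len(x_set & y_set) / len(union)


# Assumption: two subtrees count as similar when the score reaches tau.
tau = 0.7
score = jaccard_sketch(["name", "age", "city"], ["name", "age"])
similar = score >= tau
```

Here the two label collections share two of three distinct labels, so the score is 2/3, which falls below the assumed threshold of 0.7.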

architxt.simplification.tree_rewriting.utils.log_metrics(iteration, forest, equiv_subtrees=None)

Log various metrics related to a forest of trees and equivalent subtrees.

This function calculates and logs the metrics that provide insights into the forest’s structure, including counts of production rules, labeled and unlabeled nodes, and entity/group/collection/relation statistics.

Parameters:
  • iteration (int) – The current iteration number for logging.

  • forest (Collection[Tree]) – A forest of tree objects to analyze.

  • equiv_subtrees (Optional[set[tuple[Tree, …]]]) – A set of clusters representing equivalent subtrees.

Return type:

None

architxt.simplification.tree_rewriting.utils.log_schema(iteration, forest)

Log the schema to MLflow.

Parameters:
  • iteration (int) – The current iteration number for logging.

  • forest (Collection[Tree]) – A forest of tree objects to analyze.

Return type:

None