architxt.simplification.llm#

Functions

count_tokens(llm, trees)

Count the number of tokens in the prompt for a set of trees.

estimate_tokens(trees, llm, max_tokens, *[, ...])

Estimate the total number of tokens (input/output) and queries required for a rewrite.

extract_vocab(forest, min_support, ...[, ...])

Extract a normalized set of labels that appear in GROUP or REL subtrees with at least a given support.

llm_rewrite(forest, llm, max_tokens[, tau, ...])

Rewrite a forest into a valid schema using an LLM agent.

llm_simplify(llm, max_tokens, prompt, trees, *)

Simplify parse trees using an LLM.

architxt.simplification.llm.count_tokens(llm, trees)[source]#

Count the number of tokens in the prompt for a set of trees.

Parameters:
  • llm (BaseLanguageModel) – LLM model to use.

  • trees (Iterable[Tree]) – Sequence of trees to simplify.

Return type:

int

Returns:

Number of tokens in the formatted prompt.

architxt.simplification.llm.estimate_tokens(trees, llm, max_tokens, *, prompt=DEFAULT_PROMPT, refining_steps=0, error_adjustment=1.2)[source]#

Estimate the total number of tokens (input/output) and queries required for a rewrite.

Parameters:
  • trees (Iterable[Tree]) – Sequence of trees to simplify.

  • llm (BaseLanguageModel) – LLM model to use.

  • max_tokens (int) – Maximum number of tokens to allow per prompt.

  • prompt (BasePromptTemplate) – Prompt template to use.

  • refining_steps (int) – Number of refining steps to perform after the initial rewrite.

  • error_adjustment (float) – Safety factor applied to the token estimate to account for estimation error.

Return type:

tuple[int, int, int]

Returns:

The total number of tokens (input/output) and the number of queries estimated for a rewrite.
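The estimation logic itself is not shown here, but the shape of such an estimate can be sketched in pure Python. Everything below is a hypothetical reconstruction: trees are greedily batched under the prompt token limit, each refining step re-queries every batch, and the totals are inflated by `error_adjustment`. The helper name, the `prompt_tokens` parameter, and the batching formula are all assumptions, not the library's actual algorithm.

```python
import math

def estimate_tokens_sketch(tree_tokens, max_tokens, *, prompt_tokens=200,
                           refining_steps=0, error_adjustment=1.2):
    """Hypothetical sketch: greedily batch trees under the prompt token
    limit, count the resulting queries, and scale totals by a safety factor."""
    queries = 0
    batch = prompt_tokens  # each query pays the prompt template overhead
    for n in tree_tokens:
        if batch + n > max_tokens:  # current batch is full: start a new query
            queries += 1
            batch = prompt_tokens
        batch += n
    queries += 1  # account for the final, partially filled batch
    queries *= 1 + refining_steps  # each refining step re-queries every batch
    input_tokens = math.ceil((sum(tree_tokens) + queries * prompt_tokens)
                             * error_adjustment)
    # Assume the model echoes roughly one simplified tree per input tree.
    output_tokens = math.ceil(sum(tree_tokens) * error_adjustment)
    return input_tokens, output_tokens, queries
```

With five 100-token trees and a 450-token limit, the sketch yields three queries, matching the greedy packing by hand.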

architxt.simplification.llm.extract_vocab(forest, min_support, min_similarity, close_match=3)[source]#

Extract a normalized set of labels that appear in GROUP or REL subtrees with at least a given support.

  • Normalization: Unicode NFKC, remove non-alphanumeric chars, collapse spaces, upper snake_case.

  • Aggregation: merge labels that are similar above min_similarity (SequenceMatcher ratio).

    When multiple labels match, the one with the most occurrences is selected as the canonical label.

  • Returns: Set of canonical labels.
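The normalization and aggregation steps described above can be illustrated with the standard library alone. This is a self-contained sketch of the described behavior, not the library's implementation: `difflib.get_close_matches` plays the role of the SequenceMatcher-based merging, with its `n` parameter standing in for `close_match`.

```python
import re
import unicodedata
from collections import Counter
from difflib import get_close_matches

def normalize(label: str) -> str:
    """NFKC-normalize, strip non-alphanumeric chars, collapse spaces,
    and convert to UPPER_SNAKE_CASE (sketch of the described normalization)."""
    label = unicodedata.normalize('NFKC', label)
    label = re.sub(r'[^0-9A-Za-z ]+', ' ', label)   # drop punctuation etc.
    label = re.sub(r'\s+', ' ', label).strip()      # collapse spaces
    return label.upper().replace(' ', '_')

def merge_labels(labels, min_similarity=0.6, close_match=3):
    """Merge similar labels; the most frequent one becomes canonical."""
    counts = Counter(normalize(label) for label in labels)
    canonical = set()
    # Visit labels from most to least frequent so frequent labels win.
    for label, _ in counts.most_common():
        if get_close_matches(label, canonical, n=close_match, cutoff=min_similarity):
            continue  # close to an existing canonical label: merged away
        canonical.add(label)
    return canonical
```

Here `"patient name"`, `"Patient-Name"`, and `"PATIENT_NAMES"` all collapse onto the most frequent normalized form, while a dissimilar label such as `"dosage"` survives on its own.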

Parameters:
  • forest (Collection[Tree]) – Forest to extract vocabulary from.

  • min_support (int) – Minimum support threshold for vocabulary.

  • min_similarity (float) – Similarity threshold in [0, 1] for merging labels.

  • close_match (int) – Number of close matches to consider when merging labels.

Return type:

set[str]

Returns:

Set of canonical labels.

async architxt.simplification.llm.llm_rewrite(forest, llm, max_tokens, tau=0.7, decay=DECAY, min_support=None, vocab_similarity=0.6, refining_steps=0, debug=False, intermediate_output_path=None, task_limit=1, metric=DEFAULT_METRIC, prompt=DEFAULT_PROMPT, commit=True)[source]#

Rewrite a forest into a valid schema using an LLM agent.

Parameters:
  • forest (Forest) – A forest to be rewritten in place.

  • llm (BaseChatModel) – The LLM model to interact with for rewriting and simplification tasks.

  • max_tokens (int) – The token limit of the prompt.

  • tau (float) – Threshold for subtree similarity when clustering.

  • decay (float) – The similarity decay factor. The higher the value, the more the weight of context decreases with distance.

  • min_support (int | None) – Minimum support threshold for vocabulary extraction.

  • vocab_similarity (float) – Similarity threshold in [0, 1] for merging vocabulary labels.

  • refining_steps (int) – Number of refining steps to perform after the initial rewrite.

  • debug (bool) – Whether to enable debug logging.

  • intermediate_output_path (Path | None) – Optional path to save intermediate results after each iteration.

  • task_limit (int) – Maximum number of concurrent requests to make.

  • metric (METRIC_FUNC) – The metric function used to compute similarity between subtrees.

  • prompt (ChatPromptTemplate) – The prompt template to use for the LLM during the simplification.

  • commit (bool | int) – Commit automatically when using a TreeBucket. If already in a transaction, no commit is applied.

    – If False, no commits are made; the current transaction is relied upon.

    – If True (default), commits once in batch.

    – If an integer, commits every N trees.

    To avoid memory issues, we recommend incremental commits with large iterables.

Return type:

Metrics

Returns:

A Metrics object encapsulating the results and metrics calculated for the LLM rewrite process.
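The commit semantics of the `commit` parameter can be pictured with a stand-in store. The `FakeBucket` class and `rewrite_with_commits` helper below are purely illustrative and are not part of architxt; they only demonstrate the bool/int behavior documented above.

```python
class FakeBucket:
    """Stand-in for a transactional tree store (illustration only)."""
    def __init__(self):
        self.pending = 0
        self.commits = 0

    def update(self, tree):
        self.pending += 1

    def commit(self):
        self.commits += 1
        self.pending = 0

def rewrite_with_commits(bucket, trees, commit=True):
    """Apply updates, committing per the documented bool/int semantics."""
    # bool is a subclass of int, so exclude it when reading an interval N.
    every = commit if isinstance(commit, int) and not isinstance(commit, bool) else None
    for i, tree in enumerate(trees, start=1):
        bucket.update(tree)
        if every and i % every == 0:
            bucket.commit()  # incremental commit every N trees
    if commit and bucket.pending:
        bucket.commit()  # True (or a trailing partial batch): commit at the end
```

For ten trees, `commit=4` yields three commits (two incremental plus the trailing partial batch), `commit=True` yields one, and `commit=False` leaves everything pending in the current transaction.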

async architxt.simplification.llm.llm_simplify(llm, max_tokens, prompt, trees, *, vocab=None, vocab_similarity=0.6, task_limit=4, debug=False)[source]#

Simplify parse trees using an LLM.

It uses the following flow where the tree parser falls back to the original tree in case of parsing errors:

    flowchart LR
        A[Trees] --> B[Convert to JSON] --> C[LLM]
        A & C --> E[Tree parser]
        E --> F[Simplified trees]
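The fallback in this flow can be sketched as follows. The function name and the use of raw JSON dicts are hypothetical simplifications (the real pipeline works on Tree objects); what matters is the pattern: if the model's response cannot be parsed back into a tree, the original tree is returned unchanged, flagged as not simplified.

```python
import json

def simplify_one(tree_json, call_llm):
    """Send a JSON-encoded tree to the model; fall back to the original
    tree when the response cannot be parsed (hypothetical sketch)."""
    original = json.loads(tree_json)
    try:
        simplified = json.loads(call_llm(tree_json))  # LLM output -> tree parser
        return simplified, True
    except (json.JSONDecodeError, ValueError):
        return original, False  # parsing failed: keep the original tree
```

The boolean in the returned pair mirrors the `tuple[Tree, bool]` yielded by `llm_simplify`, indicating whether simplification succeeded.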
Parameters:
  • llm (BaseChatModel) – LLM model to use.

  • max_tokens (int) – Maximum number of tokens to allow per prompt.

  • prompt (ChatPromptTemplate) – Prompt template to use.

  • trees (Iterable[Tree]) – Sequence of trees to simplify.

  • vocab (Collection[str] | None) – Optional list of vocabulary words to use in the prompt.

  • vocab_similarity (float) – Similarity threshold in [0, 1] for merging labels.

  • task_limit (int) – Maximum number of concurrent requests to make.

  • debug (bool) – Whether to enable debug logging.

Yield:

Simplified tree objects with the same oid as the input.

Return type:

AsyncGenerator[tuple[Tree, bool], None]