architxt.simplification.llm#

Functions

count_tokens(llm, trees)

Count the number of tokens in the prompt for a set of trees.

estimate_tokens(trees, llm, max_tokens, *[, ...])

Estimate the total number of tokens (input/output) and queries required for a rewrite.

extract_vocab(forest, min_support, ...[, ...])

Extract a normalized set of labels that appear in GROUP or REL subtrees with at least a given support.

llm_rewrite(forest, llm, max_tokens[, tau, ...])

Rewrite a forest into a valid schema using an LLM agent.

llm_simplify(llm, max_tokens, prompt, trees, *)

Simplify parse trees using an LLM.

architxt.simplification.llm.count_tokens(llm, trees)[source]#

Count the number of tokens in the prompt for a set of trees.

Parameters:
  • llm (BaseLanguageModel) – LLM model to use.

  • trees (Iterable[Tree]) – Sequence of trees to simplify.

Return type:

int

Returns:

Number of tokens in the formatted prompt.

architxt.simplification.llm.estimate_tokens(trees, llm, max_tokens, *, prompt=DEFAULT_PROMPT, refining_steps=0, error_adjustment=1.2)[source]#

Estimate the total number of tokens (input/output) and queries required for a rewrite.

Parameters:
  • trees (Iterable[Tree]) – Sequence of trees to simplify.

  • llm (BaseLanguageModel) – LLM model to use.

  • max_tokens (int) – Maximum number of tokens to allow per prompt.

  • prompt (BasePromptTemplate) – Prompt template to use.

  • refining_steps (int) – Number of refining steps to perform after the initial rewrite.

  • error_adjustment (float) – Safety factor applied to the token estimate to account for estimation error.

Return type:

tuple[int, int, int]

Returns:

The total number of tokens (input/output) and the number of queries estimated for a rewrite.
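The estimation logic itself is not shown here, but the shape of such an estimate can be sketched in pure Python. Everything below is a hypothetical reconstruction: trees are greedily batched under the prompt token limit, each refining step re-queries every batch, and the totals are inflated by `error_adjustment`. The helper name, the `prompt_tokens` parameter, and the batching formula are all assumptions, not the library's actual algorithm.

```python
import math

def estimate_tokens_sketch(tree_tokens, max_tokens, *, prompt_tokens=200,
                           refining_steps=0, error_adjustment=1.2):
    """Hypothetical sketch: greedily batch trees under the prompt token
    limit, count the resulting queries, and scale totals by a safety factor."""
    queries = 0
    batch = prompt_tokens  # each query pays the prompt template overhead
    for n in tree_tokens:
        if batch + n > max_tokens:  # current batch is full: start a new query
            queries += 1
            batch = prompt_tokens
        batch += n
    queries += 1  # account for the final, partially filled batch
    queries *= 1 + refining_steps  # each refining step re-queries every batch
    input_tokens = math.ceil((sum(tree_tokens) + queries * prompt_tokens)
                             * error_adjustment)
    # Assume the model echoes roughly one simplified tree per input tree.
    output_tokens = math.ceil(sum(tree_tokens) * error_adjustment)
    return input_tokens, output_tokens, queries
```

With five 100-token trees and a 450-token limit, the sketch yields three queries, matching the greedy packing by hand.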

architxt.simplification.llm.extract_vocab(forest, min_support, min_similarity, close_match=3)[source]#

Extract a normalized set of labels that appear in GROUP or REL subtrees with at least a given support.

  • Normalization: Unicode NFKC, remove non-alphanumeric chars, collapse spaces, upper snake_case.

  • Aggregation: merge labels that are similar above min_similarity (SequenceMatcher ratio).

    When multiple labels match, the one with the most occurrences is selected as the canonical label.

  • Returns: Set of canonical labels.
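The normalization and aggregation steps described above can be illustrated with the standard library alone. This is a self-contained sketch of the described behavior, not the library's implementation: `difflib.get_close_matches` plays the role of the SequenceMatcher-based merging, with its `n` parameter standing in for `close_match`.

```python
import re
import unicodedata
from collections import Counter
from difflib import get_close_matches

def normalize(label: str) -> str:
    """NFKC-normalize, strip non-alphanumeric chars, collapse spaces,
    and convert to UPPER_SNAKE_CASE (sketch of the described normalization)."""
    label = unicodedata.normalize('NFKC', label)
    label = re.sub(r'[^0-9A-Za-z ]+', ' ', label)   # drop punctuation etc.
    label = re.sub(r'\s+', ' ', label).strip()      # collapse spaces
    return label.upper().replace(' ', '_')

def merge_labels(labels, min_similarity=0.6, close_match=3):
    """Merge similar labels; the most frequent one becomes canonical."""
    counts = Counter(normalize(label) for label in labels)
    canonical = set()
    # Visit labels from most to least frequent so frequent labels win.
    for label, _ in counts.most_common():
        if get_close_matches(label, canonical, n=close_match, cutoff=min_similarity):
            continue  # close to an existing canonical label: merged away
        canonical.add(label)
    return canonical
```

Here `"patient name"`, `"Patient-Name"`, and `"PATIENT_NAMES"` all collapse onto the most frequent normalized form, while a dissimilar label such as `"dosage"` survives on its own.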

Parameters:
  • forest (Collection[Tree]) – Forest to extract vocabulary from.

  • min_support (int) – Minimum support threshold for vocabulary.

  • min_similarity (float) – Similarity threshold in [0, 1] for merging labels.

  • close_match (int) – Number of close matches to consider when merging labels.

Return type:

set[str]

Returns:

Set of canonical labels.

async architxt.simplification.llm.llm_rewrite(forest, llm, max_tokens, tau=0.7, decay=DECAY, min_support=None, vocab_similarity=0.6, refining_steps=0, debug=False, intermediate_output_path=None, task_limit=1, metric=DEFAULT_METRIC, prompt=DEFAULT_PROMPT, commit=True)[source]#

Rewrite a forest into a valid schema using an LLM agent.

Parameters:
  • forest (Forest) – A forest to be rewritten in place.

  • llm (BaseChatModel) – The LLM model to interact with for rewriting and simplification tasks.

  • max_tokens (int) – The token limit of the prompt.

  • tau (float) – Threshold for subtree similarity when clustering.

  • decay (float) – The similarity decay factor. The higher the value, the more the weight of context decreases with distance.

  • min_support (int | None) – Minimum support threshold for vocabulary extraction.

  • vocab_similarity (float) – Similarity threshold in [0, 1] for merging vocabulary labels.

  • refining_steps (int) – Number of refining steps to perform after the initial rewrite.

  • debug (bool) – Whether to enable debug logging.

  • intermediate_output_path (Path | None) – Optional path to save intermediate results after each iteration.

  • task_limit (int) – Maximum number of concurrent requests to make.

  • metric (METRIC_FUNC) – The metric function used to compute similarity between subtrees.

  • prompt (ChatPromptTemplate) – The prompt template to use for the LLM during the simplification.

  • commit (bool | int) – Commit automatically when using a TreeBucket. If already in a transaction, no commit is applied.

    – If False, no commits are made; the current transaction is relied upon.

    – If True (default), commits once in batch.

    – If an integer, commits every N trees.

    To avoid memory issues, we recommend incremental commits with large iterables.

Return type:

Metrics

Returns:

A Metrics object encapsulating the results and metrics calculated for the LLM rewrite process.
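The commit semantics of the `commit` parameter can be pictured with a stand-in store. The `FakeBucket` class and `rewrite_with_commits` helper below are purely illustrative and are not part of architxt; they only demonstrate the bool/int behavior documented above.

```python
class FakeBucket:
    """Stand-in for a transactional tree store (illustration only)."""
    def __init__(self):
        self.pending = 0
        self.commits = 0

    def update(self, tree):
        self.pending += 1

    def commit(self):
        self.commits += 1
        self.pending = 0

def rewrite_with_commits(bucket, trees, commit=True):
    """Apply updates, committing per the documented bool/int semantics."""
    # bool is a subclass of int, so exclude it when reading an interval N.
    every = commit if isinstance(commit, int) and not isinstance(commit, bool) else None
    for i, tree in enumerate(trees, start=1):
        bucket.update(tree)
        if every and i % every == 0:
            bucket.commit()  # incremental commit every N trees
    if commit and bucket.pending:
        bucket.commit()  # True (or a trailing partial batch): commit at the end
```

For ten trees, `commit=4` yields three commits (two incremental plus the trailing partial batch), `commit=True` yields one, and `commit=False` leaves everything pending in the current transaction.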

async architxt.simplification.llm.llm_simplify(llm, max_tokens, prompt, trees, *, vocab=None, vocab_similarity=0.6, task_limit=4, debug=False)[source]#

Simplify parse trees using an LLM.

It uses the following flow where the tree parser falls back to the original tree in case of parsing errors:

    flowchart LR
        A[Trees] --> B[Convert to JSON] --> C[LLM]
        A & C --> E[Tree parser]
        E --> F[Simplified trees]
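The fallback in this flow can be sketched as follows. The function name and the use of raw JSON dicts are hypothetical simplifications (the real pipeline works on Tree objects); what matters is the pattern: if the model's response cannot be parsed back into a tree, the original tree is returned unchanged, flagged as not simplified.

```python
import json

def simplify_one(tree_json, call_llm):
    """Send a JSON-encoded tree to the model; fall back to the original
    tree when the response cannot be parsed (hypothetical sketch)."""
    original = json.loads(tree_json)
    try:
        simplified = json.loads(call_llm(tree_json))  # LLM output -> tree parser
        return simplified, True
    except (json.JSONDecodeError, ValueError):
        return original, False  # parsing failed: keep the original tree
```

The boolean in the returned pair mirrors the `tuple[Tree, bool]` yielded by `llm_simplify`, indicating whether simplification succeeded.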
Parameters:
  • llm (BaseChatModel) – LLM model to use.

  • max_tokens (int) – Maximum number of tokens to allow per prompt.

  • prompt (ChatPromptTemplate) – Prompt template to use.

  • trees (Iterable[Tree]) – Sequence of trees to simplify.

  • vocab (Collection[str] | None) – Optional list of vocabulary words to use in the prompt.

  • vocab_similarity (float) – Similarity threshold in [0, 1] for merging labels.

  • task_limit (int) – Maximum number of concurrent requests to make.

  • debug (bool) – Whether to enable debug logging.

Yield:

Simplified tree objects with the same oid as the input.

Return type:

AsyncGenerator[tuple[Tree, bool], None]