architxt.simplification.llm#
Functions
- count_tokens – Count the number of tokens in the prompt for a set of trees.
- estimate_tokens – Estimate the total number of tokens (input/output) and queries required for a rewrite.
- extract_vocab – Extract a normalized set of labels that appear in GROUP or REL subtrees with at least a given support.
- llm_rewrite – Rewrite a forest into a valid schema using an LLM agent.
- llm_simplify – Simplify parse trees using an LLM.
- architxt.simplification.llm.count_tokens(llm, trees)[source]#
Count the number of tokens in the prompt for a set of trees.
- Parameters:
  - llm (BaseLanguageModel) – LLM model to use.
  - trees (Iterable[Tree]) – Sequence of trees to simplify.
- Return type:
  int
- Returns:
  Number of tokens in the formatted prompt.
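The relationship between the trees, the formatted prompt, and the returned count can be pictured with a stand-in tokenizer (a sketch only: `count_tokens_sketch`, the prompt text, and the whitespace tokenizer are invented for illustration; the real function formats the trees with the configured prompt template and lets the model tokenize):

```python
# Illustrative stand-in for count_tokens (not the library's implementation).
# The real function delegates tokenization to the LLM; a whitespace split is
# used here so the sketch runs without any model.
def count_tokens_sketch(tokenize, trees):
    prompt = "Simplify the following trees:\n" + "\n".join(map(str, trees))
    return len(tokenize(prompt))

trees = ["(SENT (GROUP patient) (REL treats))", "(SENT (GROUP drug))"]
n = count_tokens_sketch(str.split, trees)  # token count of the full prompt
```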
- architxt.simplification.llm.estimate_tokens(trees, llm, max_tokens, *, prompt=DEFAULT_PROMPT, refining_steps=0, error_adjustment=1.2)[source]#
Estimate the total number of tokens (input/output) and queries required for a rewrite.
- Parameters:
  - trees (Iterable[Tree]) – Sequence of trees to simplify.
  - llm (BaseLanguageModel) – LLM model to use.
  - max_tokens (int) – Maximum number of tokens to allow per prompt.
  - prompt (BasePromptTemplate) – Prompt template to use.
  - refining_steps (int) – Number of refining steps to perform after the initial rewrite.
  - error_adjustment (float) – Factor to adjust the estimated number of tokens for error.
- Return type:
  tuple[int, int, int]
- Returns:
  The total number of tokens (input/output) and the number of queries estimated for a rewrite.
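The shape of such an estimate can be sketched as follows. This is illustrative only: the greedy packing of trees into prompts, the `1 + refining_steps` pass count, and applying `error_adjustment` to the output side are assumptions about how the numbers combine, not the library's exact formula:

```python
def estimate_tokens_sketch(tree_token_counts, max_tokens, *,
                           refining_steps=0, error_adjustment=1.2):
    """Rough estimate of (input_tokens, output_tokens, queries).

    Assumes trees are packed greedily into prompts of at most max_tokens,
    every query is re-run once per refining step, and outputs are about the
    same size as inputs, inflated by error_adjustment as a safety margin.
    """
    queries = 0
    batch = 0
    for count in tree_token_counts:
        if batch and batch + count > max_tokens:
            queries += 1  # current prompt is full, start a new one
            batch = 0
        batch += count
    if batch:
        queries += 1  # flush the last partially filled prompt

    passes = 1 + refining_steps
    input_tokens = sum(tree_token_counts) * passes
    output_tokens = round(input_tokens * error_adjustment)
    return input_tokens, output_tokens, queries * passes

print(estimate_tokens_sketch([600, 500, 400], 1000, refining_steps=1))
# → (3000, 3600, 4)
```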
- architxt.simplification.llm.extract_vocab(forest, min_support, min_similarity, close_match=3)[source]#
Extract a normalized set of labels that appear in GROUP or REL subtrees with at least a given support.
- Normalization: Unicode NFKC, remove non-alphanumeric characters, collapse spaces, UPPER_SNAKE_CASE.
- Aggregation: merge labels whose similarity (SequenceMatcher ratio) exceeds min_similarity; if multiple labels match, the one with the most occurrences is selected as the canonical label.
- Parameters:
  - forest (Collection[Tree]) – Forest to extract vocabulary from.
  - min_support (int) – Minimum support threshold for vocabulary.
  - min_similarity (float) – Similarity threshold in [0, 1] for merging labels.
  - close_match (int) – Number of close matches to consider when merging labels.
- Return type:
  set[str]
- Returns:
  Set of canonical labels.
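The normalization and merging rules described above can be reproduced with the standard library. This is a sketch of the documented behaviour, not the package's implementation; `extract_vocab_sketch` and its inputs are made up for illustration:

```python
import re
import unicodedata
from collections import Counter
from difflib import SequenceMatcher, get_close_matches

def normalize(label):
    """Unicode NFKC, drop non-alphanumerics, collapse spaces, UPPER_SNAKE_CASE."""
    label = unicodedata.normalize("NFKC", label)
    label = re.sub(r"[^0-9A-Za-z ]+", " ", label)
    return "_".join(label.upper().split())

def extract_vocab_sketch(labels, min_support, min_similarity, close_match=3):
    counts = Counter(normalize(label) for label in labels)
    frequent = {l: c for l, c in counts.items() if c >= min_support}
    canonical = set()
    # Visit labels from most to least frequent so the most common variant
    # becomes the canonical one; later similar labels merge into it.
    for label in sorted(frequent, key=frequent.get, reverse=True):
        matches = get_close_matches(label, canonical, n=close_match,
                                    cutoff=min_similarity)
        if not any(SequenceMatcher(None, label, m).ratio() >= min_similarity
                   for m in matches):
            canonical.add(label)
    return canonical

vocab = extract_vocab_sketch(["patient", "patients", "drug"], 1, 0.8)
# → {'PATIENT', 'DRUG'} (as a set; 'patients' merged into 'PATIENT')
```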
- async architxt.simplification.llm.llm_rewrite(forest, llm, max_tokens, tau=0.7, decay=DECAY, min_support=None, vocab_similarity=0.6, refining_steps=0, debug=False, intermediate_output_path=None, task_limit=1, metric=DEFAULT_METRIC, prompt=DEFAULT_PROMPT, commit=True)[source]#
Rewrite a forest into a valid schema using an LLM agent.
- Parameters:
  - forest (Forest) – A forest to be rewritten in place.
  - llm (BaseChatModel) – The LLM model to interact with for rewriting and simplification tasks.
  - max_tokens (int) – The token limit of the prompt.
  - tau (float) – Threshold for subtree similarity when clustering.
  - decay (float) – The similarity decay factor. The higher the value, the more the weight of context decreases with distance.
  - min_support (int | None) – Minimum support for vocab.
  - vocab_similarity (float) – Similarity threshold in [0, 1] for merging vocabulary labels.
  - refining_steps (int) – Number of refining steps to perform after the initial rewrite.
  - debug (bool) – Whether to enable debug logging.
  - intermediate_output_path (Path | None) – Optional path to save intermediate results after each iteration.
  - task_limit (int) – Maximum number of concurrent requests to make.
  - metric (METRIC_FUNC) – The metric function used to compute similarity between subtrees.
  - prompt (ChatPromptTemplate) – The prompt template to use for the LLM during simplification.
  - commit (bool | int) – Commit behaviour when using TreeBucket. If already in a transaction, no commit is applied.
    - If False, no commits are made; it relies on the current transaction.
    - If True (default), commits in batch.
    - If an integer, commits every N trees. To avoid memory issues, we recommend incremental commits for large iterables.
- Return type:
  Metrics
- Returns:
  A Metrics object encapsulating the results and metrics calculated for the LLM rewrite process.
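The three modes of the commit parameter can be pictured with a generic incremental-commit loop (a sketch of the batching behaviour only; `rewrite_with_commits` is hypothetical, and TreeBucket's actual transaction handling lives in the library):

```python
def rewrite_with_commits(trees, rewrite_one, commit_fn, commit=True):
    """Sketch of the commit parameter's three modes.

    commit=False -> rely on the caller's transaction, never commit here.
    commit=True  -> one batch commit at the end.
    commit=N     -> commit every N trees (bounds memory on large iterables).
    """
    # bool is a subclass of int, so exclude it when detecting commit=N.
    every = commit if isinstance(commit, int) and not isinstance(commit, bool) else None
    pending = 0
    for tree in trees:
        rewrite_one(tree)
        pending += 1
        if every is not None and pending >= every:
            commit_fn()  # incremental commit
            pending = 0
    if commit is not False and pending:
        commit_fn()  # final flush, or the single batch commit
    # commit=False: no commits here; the enclosing transaction is in charge.
```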
- async architxt.simplification.llm.llm_simplify(llm, max_tokens, prompt, trees, *, vocab=None, vocab_similarity=0.6, task_limit=4, debug=False)[source]#
Simplify parse trees using an LLM.
It uses the following flow where the tree parser falls back to the original tree in case of parsing errors:
flowchart LR
    A[Trees] --> B[Convert to JSON] --> C[LLM]
    A & C --> E[Tree parser]
    E --> F[Simplified trees]
- Parameters:
  - llm (BaseChatModel) – LLM model to use.
  - max_tokens (int) – Maximum number of tokens to allow per prompt.
  - prompt (ChatPromptTemplate) – Prompt template to use.
  - trees (Iterable[Tree]) – Sequence of trees to simplify.
  - vocab (Collection[str] | None) – Optional list of vocabulary words to use in the prompt.
  - vocab_similarity (float) – Similarity threshold in [0, 1] for merging labels.
  - task_limit (int) – Maximum number of concurrent requests to make.
  - debug (bool) – Whether to enable debug logging.
- Yield:
Simplified tree objects with the same oid as input.
- Return type:
AsyncGenerator[tuple[Tree, bool], None]
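The fallback flow in the diagram can be sketched as follows. This is illustrative only: it is a synchronous stand-in for the async generator, `simplify_trees` and the stub callables are invented, and error handling is reduced to a try/except around parsing:

```python
import json

def simplify_trees(trees, call_llm, parse_tree):
    """Yield (tree, changed) pairs, falling back to the input on parse errors."""
    for tree in trees:
        payload = json.dumps(tree)             # Convert to JSON
        response = call_llm(payload)           # LLM
        try:
            yield parse_tree(response), True   # Tree parser succeeded
        except (ValueError, KeyError):
            yield tree, False                  # fall back to the original tree

# Stubs standing in for the chat model and the tree parser.
ok = lambda payload: json.dumps({"ROOT": ["simplified"]})
broken = lambda payload: "not json"
parse = json.loads

results = list(simplify_trees([{"ROOT": ["x"]}], ok, parse))
# → [({'ROOT': ['simplified']}, True)]
```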