Simplification#
ArchiTXT provides tools to simplify the structure of your database. Simplification iteratively rewrites the instance to discover and normalize latent structures (Groups, Relations). It can be used to discover a schema from text or to automatically merge multiple data sources (see the Usage examples for more examples).
There are two main approaches to simplification in ArchiTXT:
Rule-Based Simplification: Uses statistical analysis and predefined tree rewriting operations.
LLM-Based Simplification: Leverages Large Language Models to semantically simplify the schema.
Rule-Based Simplification#
The simplification algorithm operates iteratively (the number of iterations is set by the epoch parameter). In each iteration, it performs the following steps:
Clustering: Subtrees within the forest are clustered based on their structural and semantic similarity. The similarity is controlled by the tau (threshold) and decay parameters.
Pattern Discovery: Frequent patterns in these clusters (occurring more than min_support times) are identified as potential candidates for structural abstraction (e.g., forming a Group).
Rewriting: A sequence of Edit Operations is applied to the trees. These operations transform the tree structure based on the discovered patterns.
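The clustering and pattern-discovery steps can be sketched with a self-contained toy. This is an illustrative approximation, not the ArchiTXT implementation: the `similarity`, `cluster_subtrees`, and `frequent_patterns` names are hypothetical, and Jaccard similarity over entity-label sets stands in for ArchiTXT's structural/semantic metric.

```python
from collections import Counter


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: a stand-in for ArchiTXT's similarity metric."""
    return len(a & b) / len(a | b)


def cluster_subtrees(subtrees: list[frozenset], tau: float) -> list[list[frozenset]]:
    """Greedily assign each subtree to the first cluster whose seed is at least tau similar."""
    clusters: list[list[frozenset]] = []
    for subtree in subtrees:
        for cluster in clusters:
            if similarity(cluster[0], subtree) >= tau:
                cluster.append(subtree)
                break
        else:
            clusters.append([subtree])
    return clusters


def frequent_patterns(clusters: list[list[frozenset]], min_support: int) -> set[frozenset]:
    """Keep only the subtree shapes occurring more than min_support times."""
    counts = Counter(subtree for cluster in clusters for subtree in cluster)
    return {pattern for pattern, n in counts.items() if n > min_support}


# Toy forest: each subtree reduced to the set of entity labels it contains.
forest = [
    frozenset({"name", "age"}),
    frozenset({"name", "age"}),
    frozenset({"name", "age", "email"}),
    frozenset({"city"}),
]
clusters = cluster_subtrees(forest, tau=0.6)
patterns = frequent_patterns(clusters, min_support=1)
# {"name", "age"} occurs twice (> min_support), so it becomes a Group candidate.
```

With tau=0.6, the three name/age subtrees fall into one cluster and {city} into another; only the pattern seen more than once survives the support filter.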
$ architxt simplify my_database.fs --tau=0.7 --decay=3 --epoch=100
from architxt.bucket.zodb import ZODBTreeBucket
from architxt.simplification.tree_rewriting import rewrite

import anyio


async def main():
    with ZODBTreeBucket(storage_path="my_database.fs") as forest:
        rewrite(forest, tau=0.7, decay=3, epoch=100)


if __name__ == "__main__":
    anyio.run(main)
Key Parameters#
- Tau (τ)
The similarity threshold for clustering. A value between 0 and 1. Higher values mean subtrees must be more similar to be grouped together.
- High τ (e.g., 0.9): Conservative clustering, leads to more specific groups.
- Low τ (e.g., 0.4): Aggressive clustering, leads to more generic groups.
- Decay (δ)
Controls how much the context (surrounding nodes) influences the similarity calculation. Higher decay reduces the influence of distant nodes.
- Min Support
The minimum number of occurrences required for a pattern to be considered significant. Patterns appearing less frequently are ignored.
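The intuition behind decay can be illustrated with a small weighting function. This is only a sketch of the idea, not ArchiTXT's actual formula; `context_weight` is a hypothetical name introduced here for illustration.

```python
def context_weight(distance: int, decay: float) -> float:
    """Influence of a context node at a given distance; higher decay means faster drop-off."""
    return 1.0 / (1.0 + decay * distance)


# With decay=3, a direct neighbour (distance 1) weighs 0.25,
# while a node two hops away weighs only ~0.14.
near = context_weight(1, decay=3)
far = context_weight(2, decay=3)
```

Raising decay concentrates the similarity calculation on a subtree's immediate surroundings, so structurally identical subtrees in different distant contexts are still judged similar.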
Edit Operations#
The simplification process applies a sequence of operations. The standard operations include:
FindSubGroups: Identifies sets of entities that frequently appear together and groups them.
MergeGroups: Merges similar groups that might have been identified separately.
FindRelations: Identifies relationships between groups.
FindCollections: Identifies lists or sets of similar items (Groups or Relations).
ReduceTop/ReduceBottom: Prunes unnecessary levels of the tree hierarchy.
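A toy analogue of FindSubGroups shows what such an operation does conceptually. The `find_sub_groups` function below is hypothetical and operates on flat lists of entity labels rather than real ArchiTXT trees; a discovered group is represented as a nested `("GROUP", ...)` tuple.

```python
from collections import Counter
from itertools import combinations


def find_sub_groups(trees: list[list[str]], min_support: int) -> list[list]:
    """Replace entity pairs that co-occur more than min_support times with a Group node."""
    # Count how often each pair of entities appears together in a tree.
    pair_counts = Counter()
    for tree in trees:
        pair_counts.update(combinations(sorted(set(tree)), 2))
    frequent = {pair for pair, n in pair_counts.items() if n > min_support}

    # Rewrite each tree, folding frequent pairs into a GROUP node.
    rewritten = []
    for tree in trees:
        new_tree: list = list(tree)
        for a, b in frequent:
            if a in new_tree and b in new_tree:
                new_tree.remove(a)
                new_tree.remove(b)
                new_tree.append(("GROUP", a, b))
        rewritten.append(new_tree)
    return rewritten


trees = [
    ["name", "age", "city"],
    ["name", "age"],
    ["name", "age", "email"],
]
result = find_sub_groups(trees, min_support=2)
# ("age", "name") co-occurs three times, so it is folded into a GROUP node everywhere.
```

MergeGroups, FindRelations, and the other operations follow the same pattern: detect a recurring structure across the forest, then rewrite every occurrence uniformly.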
LLM-Based Simplification#
The simplify-llm command uses an LLM to identify and merge semantically equivalent structures that might be structurally different. This is computationally more intensive but can yield better results for complex schemas.
$ architxt simplify-llm my_database.fs --model HuggingFaceTB/SmolLM2-135M-Instruct --max-tokens=8000
from architxt.bucket.zodb import ZODBTreeBucket
from architxt.simplification.llm import estimate_tokens, llm_rewrite
from langchain.chat_models import init_chat_model

import anyio


async def main():
    llm = init_chat_model("HuggingFaceTB/SmolLM2-135M-Instruct", max_tokens=8_000)
    with ZODBTreeBucket(storage_path="my_database.fs") as forest:
        await llm_rewrite(forest, llm, max_tokens=8_000)


if __name__ == "__main__":
    anyio.run(main)
Key Parameters#
- Model
The identifier of the LLM to use (e.g., a HuggingFace model ID).
- Max tokens
Maximum number of tokens to use per prompt.
- Refining steps
Number of refinement steps to perform.
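The max tokens budget implies that the forest must be processed in prompt-sized batches. The sketch below shows the idea with a crude word-count estimator standing in for architxt.simplification.llm.estimate_tokens; both `estimate_tokens_toy` and `batch_by_tokens` are hypothetical names, not the ArchiTXT API.

```python
def estimate_tokens_toy(text: str) -> int:
    """Crude stand-in for a real token estimator: roughly one token per word."""
    return len(text.split())


def batch_by_tokens(texts: list[str], max_tokens: int) -> list[list[str]]:
    """Group serialised trees into batches that each fit within the prompt budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for text in texts:
        cost = estimate_tokens_toy(text)
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches


trees = ["name age", "name age email", "city country", "phone"]
batches = batch_by_tokens(trees, max_tokens=4)
# -> [["name age"], ["name age email"], ["city country", "phone"]]
```

A larger max tokens budget means fewer, bigger prompts and more cross-tree context per call, at the cost of longer and more expensive LLM requests.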
Note
LLM simplification requires additional dependencies. Install them with pip install architxt[llm].