MLFlow#

MLFlow is an open-source platform for managing the end-to-end machine learning lifecycle, including experiment tracking, model versioning, and deployment. With ArchiTXT’s integration, you can log experiment runs, traces, and key metrics directly to MLFlow for monitoring and analysis.

Configure MLFlow#

Tip

By default, MLFlow logs experiments to a local directory. This is the recommended setup if you just want to try MLFlow.

To connect ArchiTXT to a remote MLFlow tracking server, set the environment variable MLFLOW_TRACKING_URI:

$ export MLFLOW_TRACKING_URI=http://127.0.0.1:5000  # Replace with your remote host

You can also set the tracking URI in your Python code:

import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")  # Replace with your remote host
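If a script should honor the environment variable when it is present and otherwise fall back to a value chosen in code, the two approaches can be combined. The sketch below is only an illustration: `resolve_tracking_uri` and `configure_tracking` are hypothetical helper names, not part of ArchiTXT or MLFlow, and `mlflow` is imported lazily so the resolution logic can be checked without it.

```python
import os


def resolve_tracking_uri(fallback=None):
    """Return MLFLOW_TRACKING_URI if it is set and non-empty, else the fallback."""
    uri = os.environ.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or fallback


def configure_tracking(fallback=None):
    """Point MLFlow at the resolved URI; keep MLFlow's local default if there is none."""
    import mlflow  # imported lazily: only needed when actually configuring

    uri = resolve_tracking_uri(fallback)
    if uri is not None:
        mlflow.set_tracking_uri(uri)
    return uri
```

Calling `configure_tracking()` with no argument and no exported variable leaves MLFlow logging to its local directory, which matches the behavior described in the tip.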

Run Experiments#

ArchiTXT can log experiments to MLFlow if it is executed within an active MLFlow run. In Python, you can create a run as follows:

import mlflow

with mlflow.start_run():
    ... # <- Your code here

Once the run is started, execute your experiments as usual, and ArchiTXT will automatically handle the logging.
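Because logging depends on an active run, code that may be called both from inside an existing run and on its own can check `mlflow.active_run()` first. The following is a minimal sketch, not ArchiTXT's own mechanism: `active_or_new_run` is a hypothetical helper, and the MLFlow module is passed in as an argument purely so the logic can be exercised without a tracking backend.

```python
from contextlib import contextmanager


@contextmanager
def active_or_new_run(mlflow_module, **start_kwargs):
    """Yield the active run if there is one, otherwise start (and later end) a new one."""
    run = mlflow_module.active_run()
    if run is not None:
        # A caller already opened a run: reuse it and leave it open on exit.
        yield run
    else:
        # No run yet: start one; leaving the inner `with` block ends it.
        with mlflow_module.start_run(**start_kwargs) as new_run:
            yield new_run


# Intended usage (requires MLFlow to be installed):
#   import mlflow
#   with active_or_new_run(mlflow, run_name="simplification"):
#       ...  # run the ArchiTXT experiment here
```

Keyword arguments such as `run_name` are forwarded unchanged to `mlflow.start_run()`, and they are ignored when a run is already active.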

You can also enable MLFlow logging when using the CLI by passing the --log option.

Explore Data#

Visualize your logged data using the MLFlow web interface. If running locally, you can start the MLFlow UI by running:

$ mlflow ui

Open your browser and navigate to the default URL (usually http://127.0.0.1:5000). In the web UI, you can review your experiment details and performance metrics.

Simplification#

During simplification, the architxt.metrics.Metrics.log_to_mlflow() method logs the following metrics to MLFlow after each iteration. Additional debug artifacts may be logged when debug=True.
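To make per-iteration logging concrete, here is a hedged sketch of how dotted metric names and an iteration index map onto `mlflow.log_metrics`. It does not reproduce `log_to_mlflow()`: `flatten_metrics` and `log_iteration` are hypothetical helpers, the nested input layout is an assumption, and the values in the usage comment are made up for illustration.

```python
def flatten_metrics(nested, prefix=""):
    """Turn {'clustering': {'ami': 0.8}} into {'clustering.ami': 0.8}."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, name))
        else:
            flat[name] = float(value)  # MLFlow metrics are numeric
    return flat


def log_iteration(nested, iteration, log_metrics=None):
    """Log one iteration's metrics, using the iteration index as the MLFlow step."""
    if log_metrics is None:
        import mlflow  # lazy import; only useful inside an active run

        log_metrics = mlflow.log_metrics
    flat = flatten_metrics(nested)
    log_metrics(flat, step=iteration)
    return flat


# Intended usage inside an active run (illustrative values):
#   log_iteration({"nodes": {"count": 120}, "redundancy": 0.4}, iteration=0)
```

Passing the iteration as `step` is what lets the MLFlow UI plot each metric as a curve over iterations rather than as a single value.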

Parameters#

Parameter Name	Range / Type	Description
`nb_sentences`	`int`	Total number of trees in the forest.
`tau`	`[0, 1]`	The threshold for subtree similarity.
`epoch`	`int`	The maximum number of iterations.
`min_support`	`int`	The minimum support for structures to be considered frequent.
`metric`	`str`	The name of the metric used for the tree similarity.
`edit_ops`	`str`	The list of operations that will be applied on the trees.

General Metrics#

Metric Name	Range / Type	Description
`nodes.count`	`int`	Total number of nodes in the forest.
`unlabeled.count`	`int`	Number of nodes that have no associated label.
`redundancy`	`[0, 1]`	Median redundancy score of attribute groups exceeding a functional dependency threshold.

Clustering#

Metric Name	Range / Type	Description
`clustering.cluster_count`	`int`	Number of distinct clusters in the current forest.
`clustering.ami`	`[-1, 1]`	Adjusted Mutual Information between original and current clustering.
`clustering.completeness`	`[0, 1]`	Measures whether all members of a class are assigned to the same cluster.
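For readers unfamiliar with completeness, the sketch below implements the standard entropy-based definition, 1 - H(K|C) / H(K), where C is the class labeling and K the cluster labeling. This is the same definition used by scikit-learn's `completeness_score`. Whether ArchiTXT computes it this way, and which clustering plays the role of the classes, are assumptions here and not confirmed by this page.

```python
from collections import Counter
from math import log


def completeness(classes, clusters):
    """Completeness of `clusters` with respect to `classes`, in [0, 1]."""
    n = len(classes)
    if n != len(clusters):
        raise ValueError("both labelings must have the same length")
    if n == 0:
        return 1.0

    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    joint_counts = Counter(zip(classes, clusters))

    # Entropy of the cluster assignment, H(K).
    h_k = -sum(c / n * log(c / n) for c in cluster_counts.values())
    if h_k == 0.0:
        return 1.0  # a single cluster is trivially complete

    # Conditional entropy of clusters given classes, H(K|C).
    h_k_given_c = -sum(
        n_ck / n * log(n_ck / class_counts[c])
        for (c, _k), n_ck in joint_counts.items()
    )
    return 1.0 - h_k_given_c / h_k
```

A score of 1 means every class falls entirely within one cluster. A score near 0 means the members of each class are spread across clusters roughly as if assigned at random.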

Entities#

Metric Name	Range / Type	Description
`entities.coverage`	`[0, 1]`	Jaccard similarity between original and current entity sets.
`entities.count`	`int`	Total number of entity-type nodes.
`entities.distinct_count`	`int`	Number of distinct entity labels.
`entities.ratio`	`[0, 1]`	Average number of entity nodes per distinct entity label.
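The Jaccard similarity behind `entities.coverage` is the size of the intersection of two sets divided by the size of their union. The function below is a generic sketch of that formula, not ArchiTXT's implementation. The name `jaccard`, the example labels in the usage note, and the convention for two empty sets are all assumptions.

```python
def jaccard(original, current):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of labels."""
    original, current = set(original), set(current)
    union = original | current
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(original & current) / len(union)
```

For example, `jaccard({"PERSON", "DATE", "DRUG"}, {"DATE", "DRUG", "DOSE"})` shares two labels out of four distinct ones, giving 0.5.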

Groups#

Metric Name	Range / Type	Description
`groups.count`	`int`	Total number of group-type nodes.
`groups.distinct_count`	`int`	Number of distinct group labels.
`groups.ratio`	`[0, 1]`	Average number of group nodes per distinct group label.

Relations#

Metric Name	Range / Type	Description
`relations.count`	`int`	Total number of relation-type nodes.
`relations.distinct_count`	`int`	Number of distinct relation labels.
`relations.ratio`	`[0, 1]`	Average number of relation nodes per distinct relation label.

Collections#

Metric Name	Range / Type	Description
`collections.count`	`int`	Total number of collection-type nodes.
`collections.distinct_count`	`int`	Number of distinct collection labels.
`collections.ratio`	`[0, 1]`	Average number of collection nodes per distinct collection label.

Schema#

Metric Name	Range / Type	Description
`schema.overlap`	`[0, 1]`	Overlap ratio of attribute groups in the current schema.
`schema.balance`	`[0, 1]`	Balance score of group sizes in the current schema.
`schema.productions`	`int`	Number of productions (grammar rules) in the current schema.
`schema.non_terminal`	`int`	Number of non-terminal symbols (labels) in the current schema.