Integrating multiple databases#

In this example, we use ArchiTXT to integrate two databases (Northwind and Chinook).

Load databases#

Northwind#

The Northwind sample database was provided with Microsoft Access as a tutorial schema for managing small business customers, orders, inventory, purchasing, suppliers, shipping, and employees. Northwind is an excellent tutorial schema for a small-business ERP, with customers, orders, inventory, purchasing, suppliers, shipping, employees, and single-entry accounting.

            ---
    config:
        theme: neutral
    ---
    erDiagram
        CustomerCustomerDemo }o--|| CustomerDemographics : "has"
        CustomerCustomerDemo }o--|| Customers : "relates to"

        Employees }|--o| Employees : "reports to"
        Employees ||--o{ EmployeeTerritories : "assigned to"

        Orders }o--|| Shippers : "ships via"
        Orders }o--|| Customers : "placed by"
        Orders }o--|| Employees : "sold by"

        "Order Details" }o--|| Orders : "contain"
        "Order Details" }o--|| Products : "refer to"

        Products }o--|| Categories : "categorized in"
        Products }o--|| Suppliers : "supplied by"

        Territories }|--|| Regions : "located in"
        EmployeeTerritories }o--|| Territories : "assigned to"

        Categories {
            int     CategoryID      PK
            string  CategoryName
            string  Description
            blob    Picture
        }

        CustomerCustomerDemo {
            string  CustomerID       PK, FK
            string  CustomerTypeID   PK, FK
        }

        CustomerDemographics {
            string  CustomerTypeID   PK
            string  CustomerDesc
        }

        Customers {
            string  CustomerID       PK
            string  CompanyName
            string  ContactName
            string  ContactTitle
            string  Address
            string  City
            string  Region
            string  PostalCode
            string  Country
            string  Phone
            string  Fax
        }

        Employees {
            int     EmployeeID       PK
            string  LastName
            string  FirstName
            string  Title
            string  TitleOfCourtesy
            date    BirthDate
            date    HireDate
            string  Address
            string  City
            string  Region
            string  PostalCode
            string  Country
            string  HomePhone
            string  Extension
            blob    Photo
            string  Notes
            int     ReportsTo        FK
            string  PhotoPath
        }

        EmployeeTerritories {
            int     EmployeeID       PK, FK
            int     TerritoryID      PK, FK
        }

        "Order Details" {
            int     OrderID          PK, FK
            int     ProductID        PK, FK
            float   UnitPrice
            int     Quantity
            real    Discount
        }

        Orders {
            int     OrderID          PK
            string  CustomerID       FK
            int     EmployeeID       FK
            datetime OrderDate
            datetime RequiredDate
            datetime ShippedDate
            int     ShipVia          FK
            numeric Freight
            string  ShipName
            string  ShipAddress
            string  ShipCity
            string  ShipRegion
            string  ShipPostalCode
            string  ShipCountry
        }

        Products {
            int     ProductID        PK
            string  ProductName
            int     SupplierID       FK
            int     CategoryID       FK
            int     QuantityPerUnit
            float   UnitPrice
            int     UnitsInStock
            int     UnitsOnOrder
            int     ReorderLevel
            string  Discontinued
        }

        Regions {
            int     RegionID         PK
            string  RegionDescription
        }

        Shippers {
            int     ShipperID        PK
            string  CompanyName
            string  Phone
        }

        Suppliers {
            int     SupplierID       PK
            string  CompanyName
            string  ContactName
            string  ContactTitle
            string  Address
            string  City
            string  Region
            string  PostalCode
            string  Country
            string  Phone
            string  Fax
            string  HomePage
        }

        Territories {
            string  TerritoryID      PK
            string  TerritoryDescription
            int     RegionID         FK
        }
    
import urllib.request
from pathlib import Path

db_file = Path('northwind.db')

if not db_file.exists():
    urllib.request.urlretrieve(
        'https://github.com/jpwhite3/northwind-SQLite3/raw/refs/heads/main/dist/northwind.db',
        filename=db_file,
    )
from architxt.database.loader import read_database
from sqlalchemy import create_engine

with create_engine('sqlite:///northwind.db').connect() as connection:
    northwind_forest = list(read_database(connection, sample=20))
/home/runner/.cache/pypoetry/virtualenvs/architxt-r2jGmZOI-py3.12/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
CustomerCustomerDemo: 0it [00:00, ?it/s]
CustomerCustomerDemo: 0it [00:00, ?it/s]
EmployeeTerritories: 0it [00:00, ?it/s]
EmployeeTerritories: 20it [00:00, 609.61it/s]
Employees: 0it [00:00, ?it/s]
Employees: 9it [00:00, 970.40it/s]
Order Details: 0it [00:00, ?it/s]
Order Details: 20it [00:00, 259.83it/s]

Chinook#

The Chinook Sample Database is a fictional digital media store database used for learning and testing SQL queries. It is modeled after real-world online music stores like iTunes and includes data related to artists, albums, tracks, customers, invoices, employees, and genres.

            ---
    config:
        theme: neutral
    ---
    erDiagram
        Artist ||--|{ Album : "hasAuthor"

        Employee ||--|{ Customer : "supports"
        Employee ||--|{ Employee : "reportsTo"

        Customer ||--|{ Invoice : "makeOrder"
        Invoice ||--|{ InvoiceLine : "contains"
        Track ||--|{ InvoiceLine : "soldIn"

        Playlist ||--|{ PlaylistTrack : "includes"
        Track ||--|{ PlaylistTrack : "appearsIn"

        Album ||--|{ Track : "has"
        Genre ||--|{ Track : "hasGenre"
        MediaType ||--|{ Track : "encodedAs"

        Album {
            int     AlbumId   PK
            string  Title
            int     ArtistId  FK
        }
        Artist {
            int     ArtistId  PK
            string  Name
        }
        Customer {
            int     CustomerId    PK
            string  FirstName
            string  LastName
            string  Company
            string  Address
            string  City
            string  State
            string  Country
            string  PostalCode
            string  Phone
            string  Fax
            string  Email
            int     SupportRepId  FK
        }
        Employee {
            int     EmployeeId  PK
            string  LastName
            string  FirstName
            string  Title
            int     ReportsTo   FK
            date    BirthDate
            date    HireDate
            string  Address
            string  City
            string  State
            string  Country
            string  PostalCode
            string  Phone
            string  Fax
            string  Email
        }
        Genre {
            int     GenreId     PK
            string  Name
        }
        Invoice {
            int     InvoiceId           PK
            int     CustomerId          FK
            date    InvoiceDate
            string  BillingAddress
            string  BillingCity
            string  BillingState
            string  BillingCountry
            string  BillingPostalCode
            float   Total
        }
        InvoiceLine {
            int     InvoiceLineId   PK
            int     InvoiceId       FK
            int     TrackId         FK
            float   UnitPrice
            int     Quantity
        }
        MediaType {
            int     MediaTypeId PK
            string  Name
        }
        Playlist {
            int     PlaylistId  PK
            string  Name
        }
        PlaylistTrack {
            int PlaylistId  PK, FK
            int TrackId     PK, FK
        }
        Track {
            int     TrackId     PK
            string  Name
            int     AlbumId     FK
            int     MediaTypeId FK
            int     GenreId     FK
            string  Composer
            int     Milliseconds
            int     Bytes
            float   UnitPrice
        }
    
db_file = Path('chinook.db')

if not db_file.exists():
    urllib.request.urlretrieve(
        'https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite',
        filename=db_file,
    )
with create_engine('sqlite:///chinook.db').connect() as connection:
    chinook_forest = list(read_database(connection, sample=10))

Integration#

We combine the two forests representing the databases and inspect the schema.

from architxt.schema import Schema

forest = northwind_forest + chinook_forest
schema = Schema.from_forest(forest)
print(schema.as_cfg())
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType ROOT;
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType;
REL::EmployeeTerritories -> GROUP::Employees GROUP::Territories;
REL::Employees<->Employees -> GROUP::Employees GROUP::Employees;
REL::Territories<->Regions -> GROUP::Regions GROUP::Territories;
REL::OrderDetails<->Orders -> GROUP::OrderDetails GROUP::Orders;
REL::Orders<->Employees -> GROUP::Employees GROUP::Orders;
REL::Orders<->Shippers -> GROUP::Orders GROUP::Shippers;
REL::Orders<->Customers -> GROUP::Customers GROUP::Orders;
REL::OrderDetails<->Products -> GROUP::OrderDetails GROUP::Products;
REL::Products<->Categories -> GROUP::Categories GROUP::Products;
REL::Products<->Suppliers -> GROUP::Products GROUP::Suppliers;
REL::PlaylistTrack -> GROUP::Playlist GROUP::Track;
REL::Track<->Genre -> GROUP::Genre GROUP::Track;
REL::Track<->Album -> GROUP::Album GROUP::Track;
REL::Album<->Artist -> GROUP::Album GROUP::Artist;
REL::Track<->MediaType -> GROUP::MediaType GROUP::Track;
REL::Customer<->Employee -> GROUP::Customer GROUP::Employee;
REL::Employee<->Employee -> GROUP::Employee GROUP::Employee;
REL::InvoiceLine<->Track -> GROUP::InvoiceLine GROUP::Track;
REL::InvoiceLine<->Invoice -> GROUP::Invoice GROUP::InvoiceLine;
REL::Invoice<->Customer -> GROUP::Customer GROUP::Invoice;
GROUP::Employees -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::EmployeeID ENT::Extension ENT::FirstName ENT::HireDate ENT::HomePhone ENT::LastName ENT::Notes ENT::Photo ENT::PhotoPath ENT::PostalCode ENT::Region ENT::ReportsTo ENT::Title ENT::TitleOfCourtesy;
GROUP::Territories -> ENT::RegionID ENT::TerritoryDescription ENT::TerritoryID;
GROUP::Regions -> ENT::RegionDescription ENT::RegionID;
GROUP::OrderDetails -> ENT::Discount ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Orders -> ENT::CustomerID ENT::EmployeeID ENT::Freight ENT::OrderDate ENT::OrderID ENT::RequiredDate ENT::ShipAddress ENT::ShipCity ENT::ShipCountry ENT::ShipName ENT::ShipPostalCode ENT::ShipRegion ENT::ShipVia ENT::ShippedDate;
GROUP::Shippers -> ENT::CompanyName ENT::Phone ENT::ShipperID;
GROUP::Customers -> ENT::Address ENT::City ENT::CompanyName ENT::ContactName ENT::ContactTitle ENT::Country ENT::CustomerID ENT::Fax ENT::Phone ENT::PostalCode ENT::Region;
GROUP::Products -> ENT::CategoryID ENT::Discontinued ENT::ProductID ENT::ProductName ENT::QuantityPerUnit ENT::ReorderLevel ENT::SupplierID ENT::UnitPrice ENT::UnitsInStock ENT::UnitsOnOrder;
GROUP::Categories -> ENT::CategoryID ENT::CategoryName ENT::Description ENT::Picture;
GROUP::Suppliers -> ENT::Address ENT::City ENT::CompanyName ENT::ContactName ENT::ContactTitle ENT::Country ENT::Fax ENT::HomePage ENT::Phone ENT::PostalCode ENT::Region ENT::SupplierID;
GROUP::Track -> ENT::AlbumId ENT::Bytes ENT::Composer ENT::GenreId ENT::MediaTypeId ENT::Milliseconds ENT::Name ENT::TrackId ENT::UnitPrice;
GROUP::Playlist -> ENT::Name ENT::PlaylistId;
GROUP::Genre -> ENT::GenreId ENT::Name;
GROUP::Album -> ENT::AlbumId ENT::ArtistId ENT::Title;
GROUP::Artist -> ENT::ArtistId ENT::Name;
GROUP::MediaType -> ENT::MediaTypeId ENT::Name;
GROUP::Customer -> ENT::Address ENT::City ENT::Company ENT::Country ENT::CustomerId ENT::Email ENT::Fax ENT::FirstName ENT::LastName ENT::Phone ENT::PostalCode ENT::State ENT::SupportRepId;
GROUP::Employee -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::Email ENT::EmployeeId ENT::Fax ENT::FirstName ENT::HireDate ENT::LastName ENT::Phone ENT::PostalCode ENT::ReportsTo ENT::State ENT::Title;
GROUP::InvoiceLine -> ENT::InvoiceId ENT::InvoiceLineId ENT::Quantity ENT::TrackId ENT::UnitPrice;
GROUP::Invoice -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerId ENT::InvoiceDate ENT::InvoiceId ENT::Total;

Explore the schema#

The schema is complex, and some groups like Orders and Invoice may be merged to simplify the instance. We first need to rename semantically equivalent properties to make them uniform across both databases.

We’ll use a mapping dictionary to rename entities and standardize them.

from architxt.tree import NodeLabel, NodeType, has_type

entity_mapping = {
    # Track / Product
    'TrackId': 'ProductID',
    # Invoice / Order
    'InvoiceId': 'OrderID',
    'InvoiceDate': 'OrderDate',
    'ShipAddress': 'BillingAddress',
    'ShipCity': 'BillingCity',
    'ShipCountry': 'BillingCountry',
    'ShipPostalCode': 'BillingPostalCode',
    'BillingState': 'State',
    'ShipRegion': 'Region',
    # Employees / Customers
    'CompanyName': 'Company',
    'HomePhone': 'Phone',
    'CustomerId': 'CustomerID',
    'EmployeeId': 'EmployeeID',
}

for tree in forest:
    for subtree in tree.subtrees():
        if has_type(subtree, NodeType.ENT) and subtree.label().name in entity_mapping:
            subtree.set_label(NodeLabel(NodeType.ENT, entity_mapping[subtree.label().name]))

schema = Schema.from_forest(forest)
print(schema.as_cfg())
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType ROOT;
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType;
REL::EmployeeTerritories -> GROUP::Employees GROUP::Territories;
REL::Employees<->Employees -> GROUP::Employees GROUP::Employees;
REL::Territories<->Regions -> GROUP::Regions GROUP::Territories;
REL::OrderDetails<->Orders -> GROUP::OrderDetails GROUP::Orders;
REL::Orders<->Employees -> GROUP::Employees GROUP::Orders;
REL::Orders<->Shippers -> GROUP::Orders GROUP::Shippers;
REL::Orders<->Customers -> GROUP::Customers GROUP::Orders;
REL::OrderDetails<->Products -> GROUP::OrderDetails GROUP::Products;
REL::Products<->Categories -> GROUP::Categories GROUP::Products;
REL::Products<->Suppliers -> GROUP::Products GROUP::Suppliers;
REL::PlaylistTrack -> GROUP::Playlist GROUP::Track;
REL::Track<->Genre -> GROUP::Genre GROUP::Track;
REL::Track<->Album -> GROUP::Album GROUP::Track;
REL::Album<->Artist -> GROUP::Album GROUP::Artist;
REL::Track<->MediaType -> GROUP::MediaType GROUP::Track;
REL::Customer<->Employee -> GROUP::Customer GROUP::Employee;
REL::Employee<->Employee -> GROUP::Employee GROUP::Employee;
REL::InvoiceLine<->Track -> GROUP::InvoiceLine GROUP::Track;
REL::InvoiceLine<->Invoice -> GROUP::Invoice GROUP::InvoiceLine;
REL::Invoice<->Customer -> GROUP::Customer GROUP::Invoice;
GROUP::Employees -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::EmployeeID ENT::Extension ENT::FirstName ENT::HireDate ENT::LastName ENT::Notes ENT::Phone ENT::Photo ENT::PhotoPath ENT::PostalCode ENT::Region ENT::ReportsTo ENT::Title ENT::TitleOfCourtesy;
GROUP::Territories -> ENT::RegionID ENT::TerritoryDescription ENT::TerritoryID;
GROUP::Regions -> ENT::RegionDescription ENT::RegionID;
GROUP::OrderDetails -> ENT::Discount ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Orders -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerID ENT::EmployeeID ENT::Freight ENT::OrderDate ENT::OrderID ENT::Region ENT::RequiredDate ENT::ShipName ENT::ShipVia ENT::ShippedDate;
GROUP::Shippers -> ENT::Company ENT::Phone ENT::ShipperID;
GROUP::Customers -> ENT::Address ENT::City ENT::Company ENT::ContactName ENT::ContactTitle ENT::Country ENT::CustomerID ENT::Fax ENT::Phone ENT::PostalCode ENT::Region;
GROUP::Products -> ENT::CategoryID ENT::Discontinued ENT::ProductID ENT::ProductName ENT::QuantityPerUnit ENT::ReorderLevel ENT::SupplierID ENT::UnitPrice ENT::UnitsInStock ENT::UnitsOnOrder;
GROUP::Categories -> ENT::CategoryID ENT::CategoryName ENT::Description ENT::Picture;
GROUP::Suppliers -> ENT::Address ENT::City ENT::Company ENT::ContactName ENT::ContactTitle ENT::Country ENT::Fax ENT::HomePage ENT::Phone ENT::PostalCode ENT::Region ENT::SupplierID;
GROUP::Track -> ENT::AlbumId ENT::Bytes ENT::Composer ENT::GenreId ENT::MediaTypeId ENT::Milliseconds ENT::Name ENT::ProductID ENT::UnitPrice;
GROUP::Playlist -> ENT::Name ENT::PlaylistId;
GROUP::Genre -> ENT::GenreId ENT::Name;
GROUP::Album -> ENT::AlbumId ENT::ArtistId ENT::Title;
GROUP::Artist -> ENT::ArtistId ENT::Name;
GROUP::MediaType -> ENT::MediaTypeId ENT::Name;
GROUP::Customer -> ENT::Address ENT::City ENT::Company ENT::Country ENT::CustomerID ENT::Email ENT::Fax ENT::FirstName ENT::LastName ENT::Phone ENT::PostalCode ENT::State ENT::SupportRepId;
GROUP::Employee -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::Email ENT::EmployeeID ENT::Fax ENT::FirstName ENT::HireDate ENT::LastName ENT::Phone ENT::PostalCode ENT::ReportsTo ENT::State ENT::Title;
GROUP::InvoiceLine -> ENT::InvoiceLineId ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Invoice -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerID ENT::OrderDate ENT::OrderID ENT::Total;

Now that we’ve standardized the property names, we can visualize the initial distribution of groups in the schema.

To do this, let’s count how many instances of each group exist in the schema.

from collections import Counter

import pandas as pd
from architxt.tree import NodeType, has_type

group_counter = Counter(
    subtree.label().name for tree in forest for subtree in tree.subtrees() if has_type(subtree, NodeType.GROUP)
)

df_initial = pd.DataFrame.from_dict(group_counter, orient='index', columns=['Count'])
df_initial = df_initial.sort_values('Count', ascending=False)
df_initial.loc['Total'] = group_counter.total()
df_initial.index.name = 'Group'

df_initial
Count
Group
Loading ITables v2.3.0 from the internet... (need help?)

Next, we can use ArchiTXT’s tree clustering algorithm to identify which groups can be merged.

from architxt.similarity import equiv_cluster

clusters = equiv_cluster(forest, tau=0.85)
similarity:   0%|          | 0.00/797k [00:00<?, ?it/s]
similarity:   3%|▎         | 21.6k/797k [00:00<00:03, 216kit/s]
similarity:   7%|▋         | 56.8k/797k [00:00<00:02, 296kit/s]
similarity:  11%|█         | 88.2k/797k [00:00<00:02, 304kit/s]
similarity:  15%|█▍        | 119k/797k [00:00<00:02, 305kit/s]
similarity:  19%|█▊        | 149k/797k [00:00<00:02, 300kit/s]
similarity:  22%|██▏       | 179k/797k [00:00<00:02, 289kit/s]
similarity:  26%|██▌       | 209k/797k [00:00<00:02, 292kit/s]
similarity:  30%|██▉       | 238k/797k [00:00<00:01, 287kit/s]
similarity:  34%|███▎      | 267k/797k [00:00<00:02, 254kit/s]
similarity:  37%|███▋      | 293k/797k [00:01<00:02, 234kit/s]
similarity:  40%|███▉      | 317k/797k [00:01<00:02, 223kit/s]
similarity:  43%|████▎     | 340k/797k [00:01<00:02, 212kit/s]
similarity:  45%|████▌     | 361k/797k [00:01<00:02, 206kit/s]
similarity:  48%|████▊     | 382k/797k [00:01<00:02, 204kit/s]
similarity:  51%|█████     | 403k/797k [00:01<00:01, 203kit/s]
similarity:  53%|█████▎    | 423k/797k [00:01<00:01, 202kit/s]
similarity:  56%|█████▌    | 443k/797k [00:01<00:01, 202kit/s]
similarity:  58%|█████▊    | 464k/797k [00:01<00:01, 202kit/s]
similarity:  61%|██████    | 484k/797k [00:02<00:01, 203kit/s]
similarity:  63%|██████▎   | 505k/797k [00:02<00:01, 203kit/s]
similarity:  66%|██████▌   | 525k/797k [00:02<00:01, 203kit/s]
similarity:  68%|██████▊   | 545k/797k [00:02<00:01, 203kit/s]
similarity:  71%|███████   | 566k/797k [00:02<00:01, 204kit/s]
similarity:  74%|███████▎  | 586k/797k [00:02<00:01, 205kit/s]
similarity:  76%|███████▋  | 609k/797k [00:02<00:00, 210kit/s]
similarity:  79%|███████▉  | 630k/797k [00:02<00:00, 210kit/s]
similarity:  82%|████████▏ | 652k/797k [00:02<00:00, 215kit/s]
similarity:  85%|████████▍ | 675k/797k [00:02<00:00, 220kit/s]
similarity:  88%|████████▊ | 700k/797k [00:03<00:00, 228kit/s]
similarity:  91%|█████████ | 726k/797k [00:03<00:00, 236kit/s]
similarity:  95%|█████████▍| 754k/797k [00:03<00:00, 250kit/s]
similarity:  98%|█████████▊| 780k/797k [00:03<00:00, 252kit/s]

Let’s visualize the clustering result in a new table.

new_groups = {
    str(klass[0].label()): Counter(subtree.label().name for subtree in klass if has_type(subtree, NodeType.GROUP))
    for klass in clusters
}

df_new = pd.DataFrame.from_dict(new_groups, orient='index').fillna(0).astype(int)
df_new['Total'] = df_new.sum(axis=1)
df_new = df_new.sort_values('Total', ascending=False)
df_initial.index.name = 'Class'

df_new
Suppliers Customers MediaType Genre Shippers Employees Employee Customer Playlist Categories Territories OrderDetails InvoiceLine Artist Products Track Album Orders Invoice Regions Total
Loading ITables v2.3.0 from the internet... (need help?)

Let’s visualize the clustering result as a bar chart to better understand the distribution of groups across equivalent classes.

import plotly.express as px

fig = px.bar(
    df_new.drop(columns=['Total']),
    labels={'value': 'Count', 'variable': 'Group', 'index': 'Class'},
    title='Groups repartition per Equivalent Class',
)

fig.update_layout(barmode='stack', xaxis_title='Equivalent Class', yaxis_title='Count', xaxis_tickangle=-45)

Simplification#

Now that we saw potential groupings, we can use ArchiTXT to perform simplification and rewrite the forest into a condensed version.

from architxt.simplification.tree_rewriting import rewrite

new_forest = rewrite(forest, tau=0.85, min_support=20, epoch=5, edit_ops=[])
new_schema = Schema.from_forest(new_forest)
print(new_schema.as_cfg())
ROOT -> COLL::Employees GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType ROOT;
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType;
COLL::Employees -> GROUP::Employees;
REL::OrderDetails<->Orders -> GROUP::OrderDetails GROUP::Orders;
REL::Orders<->Employees -> GROUP::Employees GROUP::Orders;
REL::Employees<->Employees -> GROUP::Employees GROUP::Employees;
REL::Orders<->Shippers -> GROUP::Orders GROUP::Shippers;
REL::Orders<->Customers -> GROUP::Customers GROUP::Orders;
REL::OrderDetails<->Products -> GROUP::OrderDetails GROUP::Products;
REL::Products<->Categories -> GROUP::Categories GROUP::Products;
REL::Products<->Suppliers -> GROUP::Products GROUP::Suppliers;
REL::InvoiceLine<->Track -> GROUP::InvoiceLine GROUP::Track;
REL::Track<->Genre -> GROUP::Genre GROUP::Track;
REL::Track<->Album -> GROUP::Album GROUP::Track;
REL::Album<->Artist -> GROUP::Album GROUP::Artist;
REL::Track<->MediaType -> GROUP::MediaType GROUP::Track;
REL::InvoiceLine<->Invoice -> GROUP::Invoice GROUP::InvoiceLine;
REL::Invoice<->Customer -> GROUP::Customer GROUP::Invoice;
REL::Customer<->Employee -> GROUP::Customer GROUP::Employee;
REL::Employee<->Employee -> GROUP::Employee GROUP::Employee;
REL::EmployeeTerritories -> GROUP::Employees GROUP::Territories;
REL::Territories<->Regions -> GROUP::Regions GROUP::Territories;
REL::PlaylistTrack -> GROUP::Playlist GROUP::Track;
GROUP::OrderDetails -> ENT::Discount ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Orders -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerID ENT::EmployeeID ENT::Freight ENT::OrderDate ENT::OrderID ENT::Region ENT::RequiredDate ENT::ShipName ENT::ShipVia ENT::ShippedDate;
GROUP::Employees -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::EmployeeID ENT::Extension ENT::FirstName ENT::HireDate ENT::LastName ENT::Notes ENT::Phone ENT::Photo ENT::PhotoPath ENT::PostalCode ENT::Region ENT::ReportsTo ENT::Title ENT::TitleOfCourtesy;
GROUP::Shippers -> ENT::Company ENT::Phone ENT::ShipperID;
GROUP::Customers -> ENT::Address ENT::City ENT::Company ENT::ContactName ENT::ContactTitle ENT::Country ENT::CustomerID ENT::Fax ENT::Phone ENT::PostalCode ENT::Region;
GROUP::Products -> ENT::CategoryID ENT::Discontinued ENT::ProductID ENT::ProductName ENT::QuantityPerUnit ENT::ReorderLevel ENT::SupplierID ENT::UnitPrice ENT::UnitsInStock ENT::UnitsOnOrder;
GROUP::Categories -> ENT::CategoryID ENT::CategoryName ENT::Description ENT::Picture;
GROUP::Suppliers -> ENT::Address ENT::City ENT::Company ENT::ContactName ENT::ContactTitle ENT::Country ENT::Fax ENT::HomePage ENT::Phone ENT::PostalCode ENT::Region ENT::SupplierID;
GROUP::InvoiceLine -> ENT::InvoiceLineId ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Track -> ENT::AlbumId ENT::Bytes ENT::Composer ENT::GenreId ENT::MediaTypeId ENT::Milliseconds ENT::Name ENT::ProductID ENT::UnitPrice;
GROUP::Genre -> ENT::GenreId ENT::Name;
GROUP::Album -> ENT::AlbumId ENT::ArtistId ENT::Title;
GROUP::Artist -> ENT::ArtistId ENT::Name;
GROUP::MediaType -> ENT::MediaTypeId ENT::Name;
GROUP::Invoice -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerID ENT::OrderDate ENT::OrderID ENT::Total;
GROUP::Customer -> ENT::Address ENT::City ENT::Company ENT::Country ENT::CustomerID ENT::Email ENT::Fax ENT::FirstName ENT::LastName ENT::Phone ENT::PostalCode ENT::State ENT::SupportRepId;
GROUP::Employee -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::Email ENT::EmployeeID ENT::Fax ENT::FirstName ENT::HireDate ENT::LastName ENT::Phone ENT::PostalCode ENT::ReportsTo ENT::State ENT::Title;
GROUP::Territories -> ENT::RegionID ENT::TerritoryDescription ENT::TerritoryID;
GROUP::Regions -> ENT::RegionDescription ENT::RegionID;
GROUP::Playlist -> ENT::Name ENT::PlaylistId;
datasets = new_schema.extract_datasets(new_forest)
group = set(datasets.keys()).pop()

datasets[group]
CategoryID CategoryName Description Picture
Loading ITables v2.3.0 from the internet... (need help?)