Integrating multiple databases#
In this example, we use ArchiTXT to integrate two databases (Northwind and Chinook).
Load databases#
Northwind#
The Northwind sample database was provided with Microsoft Access as a tutorial schema for managing small business customers, orders, inventory, purchasing, suppliers, shipping, and employees. Northwind is an excellent tutorial schema for a small-business ERP, with customers, orders, inventory, purchasing, suppliers, shipping, employees, and single-entry accounting.
--- config: theme: neutral --- erDiagram CustomerCustomerDemo }o--|| CustomerDemographics : "has" CustomerCustomerDemo }o--|| Customers : "relates to" Employees }|--o| Employees : "reports to" Employees ||--o{ EmployeeTerritories : "assigned to" Orders }o--|| Shippers : "ships via" Orders }o--|| Customers : "placed by" Orders }o--|| Employees : "sold by" "Order Details" }o--|| Orders : "contain" "Order Details" }o--|| Products : "refer to" Products }o--|| Categories : "categorized in" Products }o--|| Suppliers : "supplied by" Territories }|--|| Regions : "located in" EmployeeTerritories }o--|| Territories : "assigned to" Categories { int CategoryID PK string CategoryName string Description blob Picture } CustomerCustomerDemo { string CustomerID PK, FK string CustomerTypeID PK, FK } CustomerDemographics { string CustomerTypeID PK string CustomerDesc } Customers { string CustomerID PK string CompanyName string ContactName string ContactTitle string Address string City string Region string PostalCode string Country string Phone string Fax } Employees { int EmployeeID PK string LastName string FirstName string Title string TitleOfCourtesy date BirthDate date HireDate string Address string City string Region string PostalCode string Country string HomePhone string Extension blob Photo string Notes int ReportsTo FK string PhotoPath } EmployeeTerritories { int EmployeeID PK, FK int TerritoryID PK, FK } "Order Details" { int OrderID PK, FK int ProductID PK, FK float UnitPrice int Quantity real Discount } Orders { int OrderID PK string CustomerID FK int EmployeeID FK datetime OrderDate datetime RequiredDate datetime ShippedDate int ShipVia FK numeric Freight string ShipName string ShipAddress string ShipCity string ShipRegion string ShipPostalCode string ShipCountry } Products { int ProductID PK string ProductName int SupplierID FK int CategoryID FK int QuantityPerUnit float UnitPrice int UnitsInStock int UnitsOnOrder int ReorderLevel string Discontinued } Regions { int RegionID PK string RegionDescription } Shippers { int ShipperID PK string CompanyName string Phone } Suppliers { int SupplierID PK string CompanyName string ContactName string ContactTitle string Address string City string Region string PostalCode string Country string Phone string Fax string HomePage } Territories { string TerritoryID PK string TerritoryDescription int RegionID FK }
import urllib.request
from pathlib import Path
db_file = Path('northwind.db')
if not db_file.exists():
urllib.request.urlretrieve(
'https://github.com/jpwhite3/northwind-SQLite3/raw/refs/heads/main/dist/northwind.db',
filename=db_file,
)
from architxt.database.loader import read_database
from sqlalchemy import create_engine
with create_engine('sqlite:///northwind.db').connect() as connection:
northwind_forest = list(read_database(connection, sample=20))
/home/runner/.cache/pypoetry/virtualenvs/architxt-r2jGmZOI-py3.12/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
CustomerCustomerDemo: 0it [00:00, ?it/s]
CustomerCustomerDemo: 0it [00:00, ?it/s]
EmployeeTerritories: 0it [00:00, ?it/s]
EmployeeTerritories: 20it [00:00, 609.61it/s]
Employees: 0it [00:00, ?it/s]
Employees: 9it [00:00, 970.40it/s]
Order Details: 0it [00:00, ?it/s]
Order Details: 20it [00:00, 259.83it/s]
Chinook#
The Chinook Sample Database is a fictional digital media store database used for learning and testing SQL queries. It is modeled after real-world online music stores like iTunes and includes data related to artists, albums, tracks, customers, invoices, employees, and genres.
--- config: theme: neutral --- erDiagram Artist ||--|{ Album : "hasAuthor" Employee ||--|{ Customer : "supports" Employee ||--|{ Employee : "reportsTo" Customer ||--|{ Invoice : "makeOrder" Invoice ||--|{ InvoiceLine : "contains" Track ||--|{ InvoiceLine : "soldIn" Playlist ||--|{ PlaylistTrack : "includes" Track ||--|{ PlaylistTrack : "appearsIn" Album ||--|{ Track : "has" Genre ||--|{ Track : "hasGenre" MediaType ||--|{ Track : "encodedAs" Album { int AlbumId PK string Title int ArtistId FK } Artist { int ArtistId PK string Name } Customer { int CustomerId PK string FirstName string LastName string Company string Address string City string State string Country string PostalCode string Phone string Fax string Email int SupportRepId FK } Employee { int EmployeeId PK string LastName string FirstName string Title int ReportsTo FK date BirthDate date HireDate string Address string City string State string Country string PostalCode string Phone string Fax string Email } Genre { int GenreId PK string Name } Invoice { int InvoiceId PK int CustomerId FK date InvoiceDate string BillingAddress string BillingCity string BillingState string BillingCountry string BillingPostalCode float Total } InvoiceLine { int InvoiceLineId PK int InvoiceId FK int TrackId FK float UnitPrice int Quantity } MediaType { int MediaTypeId PK string Name } Playlist { int PlaylistId PK string Name } PlaylistTrack { int PlaylistId PK, FK int TrackId PK, FK } Track { int TrackId PK string Name int AlbumId FK int MediaTypeId FK int GenreId FK string Composer int Milliseconds int Bytes float UnitPrice }
db_file = Path('chinook.db')
if not db_file.exists():
urllib.request.urlretrieve(
'https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite',
filename=db_file,
)
with create_engine('sqlite:///chinook.db').connect() as connection:
chinook_forest = list(read_database(connection, sample=10))
Integration#
We combine the two forests representing the databases and inspect the schema.
from architxt.schema import Schema
forest = northwind_forest + chinook_forest
schema = Schema.from_forest(forest)
print(schema.as_cfg())
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType ROOT;
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType;
REL::EmployeeTerritories -> GROUP::Employees GROUP::Territories;
REL::Employees<->Employees -> GROUP::Employees GROUP::Employees;
REL::Territories<->Regions -> GROUP::Regions GROUP::Territories;
REL::OrderDetails<->Orders -> GROUP::OrderDetails GROUP::Orders;
REL::Orders<->Employees -> GROUP::Employees GROUP::Orders;
REL::Orders<->Shippers -> GROUP::Orders GROUP::Shippers;
REL::Orders<->Customers -> GROUP::Customers GROUP::Orders;
REL::OrderDetails<->Products -> GROUP::OrderDetails GROUP::Products;
REL::Products<->Categories -> GROUP::Categories GROUP::Products;
REL::Products<->Suppliers -> GROUP::Products GROUP::Suppliers;
REL::PlaylistTrack -> GROUP::Playlist GROUP::Track;
REL::Track<->Genre -> GROUP::Genre GROUP::Track;
REL::Track<->Album -> GROUP::Album GROUP::Track;
REL::Album<->Artist -> GROUP::Album GROUP::Artist;
REL::Track<->MediaType -> GROUP::MediaType GROUP::Track;
REL::Customer<->Employee -> GROUP::Customer GROUP::Employee;
REL::Employee<->Employee -> GROUP::Employee GROUP::Employee;
REL::InvoiceLine<->Track -> GROUP::InvoiceLine GROUP::Track;
REL::InvoiceLine<->Invoice -> GROUP::Invoice GROUP::InvoiceLine;
REL::Invoice<->Customer -> GROUP::Customer GROUP::Invoice;
GROUP::Employees -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::EmployeeID ENT::Extension ENT::FirstName ENT::HireDate ENT::HomePhone ENT::LastName ENT::Notes ENT::Photo ENT::PhotoPath ENT::PostalCode ENT::Region ENT::ReportsTo ENT::Title ENT::TitleOfCourtesy;
GROUP::Territories -> ENT::RegionID ENT::TerritoryDescription ENT::TerritoryID;
GROUP::Regions -> ENT::RegionDescription ENT::RegionID;
GROUP::OrderDetails -> ENT::Discount ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Orders -> ENT::CustomerID ENT::EmployeeID ENT::Freight ENT::OrderDate ENT::OrderID ENT::RequiredDate ENT::ShipAddress ENT::ShipCity ENT::ShipCountry ENT::ShipName ENT::ShipPostalCode ENT::ShipRegion ENT::ShipVia ENT::ShippedDate;
GROUP::Shippers -> ENT::CompanyName ENT::Phone ENT::ShipperID;
GROUP::Customers -> ENT::Address ENT::City ENT::CompanyName ENT::ContactName ENT::ContactTitle ENT::Country ENT::CustomerID ENT::Fax ENT::Phone ENT::PostalCode ENT::Region;
GROUP::Products -> ENT::CategoryID ENT::Discontinued ENT::ProductID ENT::ProductName ENT::QuantityPerUnit ENT::ReorderLevel ENT::SupplierID ENT::UnitPrice ENT::UnitsInStock ENT::UnitsOnOrder;
GROUP::Categories -> ENT::CategoryID ENT::CategoryName ENT::Description ENT::Picture;
GROUP::Suppliers -> ENT::Address ENT::City ENT::CompanyName ENT::ContactName ENT::ContactTitle ENT::Country ENT::Fax ENT::HomePage ENT::Phone ENT::PostalCode ENT::Region ENT::SupplierID;
GROUP::Track -> ENT::AlbumId ENT::Bytes ENT::Composer ENT::GenreId ENT::MediaTypeId ENT::Milliseconds ENT::Name ENT::TrackId ENT::UnitPrice;
GROUP::Playlist -> ENT::Name ENT::PlaylistId;
GROUP::Genre -> ENT::GenreId ENT::Name;
GROUP::Album -> ENT::AlbumId ENT::ArtistId ENT::Title;
GROUP::Artist -> ENT::ArtistId ENT::Name;
GROUP::MediaType -> ENT::MediaTypeId ENT::Name;
GROUP::Customer -> ENT::Address ENT::City ENT::Company ENT::Country ENT::CustomerId ENT::Email ENT::Fax ENT::FirstName ENT::LastName ENT::Phone ENT::PostalCode ENT::State ENT::SupportRepId;
GROUP::Employee -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::Email ENT::EmployeeId ENT::Fax ENT::FirstName ENT::HireDate ENT::LastName ENT::Phone ENT::PostalCode ENT::ReportsTo ENT::State ENT::Title;
GROUP::InvoiceLine -> ENT::InvoiceId ENT::InvoiceLineId ENT::Quantity ENT::TrackId ENT::UnitPrice;
GROUP::Invoice -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerId ENT::InvoiceDate ENT::InvoiceId ENT::Total;
Explore the schema#
The schema is complex, and some groups like Orders
and Invoice
may be merged to simplify the instance.
We first need to rename semantically equivalent properties to make them uniform across both databases.
We’ll use a mapping dictionary to rename entities and standardize them.
from architxt.tree import NodeLabel, NodeType, has_type
entity_mapping = {
# Track / Product
'TrackId': 'ProductID',
# Invoice / Order
'InvoiceId': 'OrderID',
'InvoiceDate': 'OrderDate',
'ShipAddress': 'BillingAddress',
'ShipCity': 'BillingCity',
'ShipCountry': 'BillingCountry',
'ShipPostalCode': 'BillingPostalCode',
'BillingState': 'State',
'ShipRegion': 'Region',
# Employees / Customers
'CompanyName': 'Company',
'HomePhone': 'Phone',
'CustomerId': 'CustomerID',
'EmployeeId': 'EmployeeID',
}
for tree in forest:
for subtree in tree.subtrees():
if has_type(subtree, NodeType.ENT) and subtree.label().name in entity_mapping:
subtree.set_label(NodeLabel(NodeType.ENT, entity_mapping[subtree.label().name]))
schema = Schema.from_forest(forest)
print(schema.as_cfg())
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType ROOT;
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType;
REL::EmployeeTerritories -> GROUP::Employees GROUP::Territories;
REL::Employees<->Employees -> GROUP::Employees GROUP::Employees;
REL::Territories<->Regions -> GROUP::Regions GROUP::Territories;
REL::OrderDetails<->Orders -> GROUP::OrderDetails GROUP::Orders;
REL::Orders<->Employees -> GROUP::Employees GROUP::Orders;
REL::Orders<->Shippers -> GROUP::Orders GROUP::Shippers;
REL::Orders<->Customers -> GROUP::Customers GROUP::Orders;
REL::OrderDetails<->Products -> GROUP::OrderDetails GROUP::Products;
REL::Products<->Categories -> GROUP::Categories GROUP::Products;
REL::Products<->Suppliers -> GROUP::Products GROUP::Suppliers;
REL::PlaylistTrack -> GROUP::Playlist GROUP::Track;
REL::Track<->Genre -> GROUP::Genre GROUP::Track;
REL::Track<->Album -> GROUP::Album GROUP::Track;
REL::Album<->Artist -> GROUP::Album GROUP::Artist;
REL::Track<->MediaType -> GROUP::MediaType GROUP::Track;
REL::Customer<->Employee -> GROUP::Customer GROUP::Employee;
REL::Employee<->Employee -> GROUP::Employee GROUP::Employee;
REL::InvoiceLine<->Track -> GROUP::InvoiceLine GROUP::Track;
REL::InvoiceLine<->Invoice -> GROUP::Invoice GROUP::InvoiceLine;
REL::Invoice<->Customer -> GROUP::Customer GROUP::Invoice;
GROUP::Employees -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::EmployeeID ENT::Extension ENT::FirstName ENT::HireDate ENT::LastName ENT::Notes ENT::Phone ENT::Photo ENT::PhotoPath ENT::PostalCode ENT::Region ENT::ReportsTo ENT::Title ENT::TitleOfCourtesy;
GROUP::Territories -> ENT::RegionID ENT::TerritoryDescription ENT::TerritoryID;
GROUP::Regions -> ENT::RegionDescription ENT::RegionID;
GROUP::OrderDetails -> ENT::Discount ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Orders -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerID ENT::EmployeeID ENT::Freight ENT::OrderDate ENT::OrderID ENT::Region ENT::RequiredDate ENT::ShipName ENT::ShipVia ENT::ShippedDate;
GROUP::Shippers -> ENT::Company ENT::Phone ENT::ShipperID;
GROUP::Customers -> ENT::Address ENT::City ENT::Company ENT::ContactName ENT::ContactTitle ENT::Country ENT::CustomerID ENT::Fax ENT::Phone ENT::PostalCode ENT::Region;
GROUP::Products -> ENT::CategoryID ENT::Discontinued ENT::ProductID ENT::ProductName ENT::QuantityPerUnit ENT::ReorderLevel ENT::SupplierID ENT::UnitPrice ENT::UnitsInStock ENT::UnitsOnOrder;
GROUP::Categories -> ENT::CategoryID ENT::CategoryName ENT::Description ENT::Picture;
GROUP::Suppliers -> ENT::Address ENT::City ENT::Company ENT::ContactName ENT::ContactTitle ENT::Country ENT::Fax ENT::HomePage ENT::Phone ENT::PostalCode ENT::Region ENT::SupplierID;
GROUP::Track -> ENT::AlbumId ENT::Bytes ENT::Composer ENT::GenreId ENT::MediaTypeId ENT::Milliseconds ENT::Name ENT::ProductID ENT::UnitPrice;
GROUP::Playlist -> ENT::Name ENT::PlaylistId;
GROUP::Genre -> ENT::GenreId ENT::Name;
GROUP::Album -> ENT::AlbumId ENT::ArtistId ENT::Title;
GROUP::Artist -> ENT::ArtistId ENT::Name;
GROUP::MediaType -> ENT::MediaTypeId ENT::Name;
GROUP::Customer -> ENT::Address ENT::City ENT::Company ENT::Country ENT::CustomerID ENT::Email ENT::Fax ENT::FirstName ENT::LastName ENT::Phone ENT::PostalCode ENT::State ENT::SupportRepId;
GROUP::Employee -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::Email ENT::EmployeeID ENT::Fax ENT::FirstName ENT::HireDate ENT::LastName ENT::Phone ENT::PostalCode ENT::ReportsTo ENT::State ENT::Title;
GROUP::InvoiceLine -> ENT::InvoiceLineId ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Invoice -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerID ENT::OrderDate ENT::OrderID ENT::Total;
Now that we’ve standardized the property names, we can visualize the initial distribution of groups in the schema.
To do this, let’s count how many instances of each group exist in the schema.
from collections import Counter
import pandas as pd
from architxt.tree import NodeType, has_type
group_counter = Counter(
subtree.label().name for tree in forest for subtree in tree.subtrees() if has_type(subtree, NodeType.GROUP)
)
df_initial = pd.DataFrame.from_dict(group_counter, orient='index', columns=['Count'])
df_initial = df_initial.sort_values('Count', ascending=False)
df_initial.loc['Total'] = group_counter.total()
df_initial.index.name = 'Group'
df_initial
Count | |
---|---|
Group | |
Loading ITables v2.3.0 from the internet... (need help?) |
Next, we can use ArchiTXT’s tree clustering algorithm to identify which groups can be merged.
from architxt.similarity import equiv_cluster
clusters = equiv_cluster(forest, tau=0.85)
similarity: 0%| | 0.00/797k [00:00<?, ?it/s]
similarity: 3%|▎ | 21.6k/797k [00:00<00:03, 216kit/s]
similarity: 7%|▋ | 56.8k/797k [00:00<00:02, 296kit/s]
similarity: 11%|█ | 88.2k/797k [00:00<00:02, 304kit/s]
similarity: 15%|█▍ | 119k/797k [00:00<00:02, 305kit/s]
similarity: 19%|█▊ | 149k/797k [00:00<00:02, 300kit/s]
similarity: 22%|██▏ | 179k/797k [00:00<00:02, 289kit/s]
similarity: 26%|██▌ | 209k/797k [00:00<00:02, 292kit/s]
similarity: 30%|██▉ | 238k/797k [00:00<00:01, 287kit/s]
similarity: 34%|███▎ | 267k/797k [00:00<00:02, 254kit/s]
similarity: 37%|███▋ | 293k/797k [00:01<00:02, 234kit/s]
similarity: 40%|███▉ | 317k/797k [00:01<00:02, 223kit/s]
similarity: 43%|████▎ | 340k/797k [00:01<00:02, 212kit/s]
similarity: 45%|████▌ | 361k/797k [00:01<00:02, 206kit/s]
similarity: 48%|████▊ | 382k/797k [00:01<00:02, 204kit/s]
similarity: 51%|█████ | 403k/797k [00:01<00:01, 203kit/s]
similarity: 53%|█████▎ | 423k/797k [00:01<00:01, 202kit/s]
similarity: 56%|█████▌ | 443k/797k [00:01<00:01, 202kit/s]
similarity: 58%|█████▊ | 464k/797k [00:01<00:01, 202kit/s]
similarity: 61%|██████ | 484k/797k [00:02<00:01, 203kit/s]
similarity: 63%|██████▎ | 505k/797k [00:02<00:01, 203kit/s]
similarity: 66%|██████▌ | 525k/797k [00:02<00:01, 203kit/s]
similarity: 68%|██████▊ | 545k/797k [00:02<00:01, 203kit/s]
similarity: 71%|███████ | 566k/797k [00:02<00:01, 204kit/s]
similarity: 74%|███████▎ | 586k/797k [00:02<00:01, 205kit/s]
similarity: 76%|███████▋ | 609k/797k [00:02<00:00, 210kit/s]
similarity: 79%|███████▉ | 630k/797k [00:02<00:00, 210kit/s]
similarity: 82%|████████▏ | 652k/797k [00:02<00:00, 215kit/s]
similarity: 85%|████████▍ | 675k/797k [00:02<00:00, 220kit/s]
similarity: 88%|████████▊ | 700k/797k [00:03<00:00, 228kit/s]
similarity: 91%|█████████ | 726k/797k [00:03<00:00, 236kit/s]
similarity: 95%|█████████▍| 754k/797k [00:03<00:00, 250kit/s]
similarity: 98%|█████████▊| 780k/797k [00:03<00:00, 252kit/s]
Let’s visualize the clustering result in a new table.
new_groups = {
str(klass[0].label()): Counter(subtree.label().name for subtree in klass if has_type(subtree, NodeType.GROUP))
for klass in clusters
}
df_new = pd.DataFrame.from_dict(new_groups, orient='index').fillna(0).astype(int)
df_new['Total'] = df_new.sum(axis=1)
df_new = df_new.sort_values('Total', ascending=False)
df_initial.index.name = 'Class'
df_new
Suppliers | Customers | MediaType | Genre | Shippers | Employees | Employee | Customer | Playlist | Categories | Territories | OrderDetails | InvoiceLine | Artist | Products | Track | Album | Orders | Invoice | Regions | Total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.3.0 from the internet... (need help?) |
Let’s visualize the clustering result as a bar chart to better understand the distribution of groups across equivalent classes.
import plotly.express as px
fig = px.bar(
df_new.drop(columns=['Total']),
labels={'value': 'Count', 'variable': 'Group', 'index': 'Class'},
title='Groups repartition per Equivalent Class',
)
fig.update_layout(barmode='stack', xaxis_title='Equivalent Class', yaxis_title='Count', xaxis_tickangle=-45)
Simplification#
Now that we saw potential groupings, we can use ArchiTXT to perform simplification and rewrite the forest into a condensed version.
from architxt.simplification.tree_rewriting import rewrite
new_forest = rewrite(forest, tau=0.85, min_support=20, epoch=5, edit_ops=[])
new_schema = Schema.from_forest(new_forest)
print(new_schema.as_cfg())
ROOT -> COLL::Employees GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType ROOT;
ROOT -> GROUP::Album GROUP::Artist GROUP::Categories GROUP::Customer GROUP::Customers GROUP::Employee GROUP::Employees GROUP::Genre GROUP::Invoice GROUP::InvoiceLine GROUP::MediaType GROUP::OrderDetails GROUP::Orders GROUP::Playlist GROUP::Products GROUP::Regions GROUP::Shippers GROUP::Suppliers GROUP::Territories GROUP::Track REL::Album<->Artist REL::Customer<->Employee REL::Employee<->Employee REL::EmployeeTerritories REL::Employees<->Employees REL::Invoice<->Customer REL::InvoiceLine<->Invoice REL::InvoiceLine<->Track REL::OrderDetails<->Orders REL::OrderDetails<->Products REL::Orders<->Customers REL::Orders<->Employees REL::Orders<->Shippers REL::PlaylistTrack REL::Products<->Categories REL::Products<->Suppliers REL::Territories<->Regions REL::Track<->Album REL::Track<->Genre REL::Track<->MediaType;
COLL::Employees -> GROUP::Employees;
REL::OrderDetails<->Orders -> GROUP::OrderDetails GROUP::Orders;
REL::Orders<->Employees -> GROUP::Employees GROUP::Orders;
REL::Employees<->Employees -> GROUP::Employees GROUP::Employees;
REL::Orders<->Shippers -> GROUP::Orders GROUP::Shippers;
REL::Orders<->Customers -> GROUP::Customers GROUP::Orders;
REL::OrderDetails<->Products -> GROUP::OrderDetails GROUP::Products;
REL::Products<->Categories -> GROUP::Categories GROUP::Products;
REL::Products<->Suppliers -> GROUP::Products GROUP::Suppliers;
REL::InvoiceLine<->Track -> GROUP::InvoiceLine GROUP::Track;
REL::Track<->Genre -> GROUP::Genre GROUP::Track;
REL::Track<->Album -> GROUP::Album GROUP::Track;
REL::Album<->Artist -> GROUP::Album GROUP::Artist;
REL::Track<->MediaType -> GROUP::MediaType GROUP::Track;
REL::InvoiceLine<->Invoice -> GROUP::Invoice GROUP::InvoiceLine;
REL::Invoice<->Customer -> GROUP::Customer GROUP::Invoice;
REL::Customer<->Employee -> GROUP::Customer GROUP::Employee;
REL::Employee<->Employee -> GROUP::Employee GROUP::Employee;
REL::EmployeeTerritories -> GROUP::Employees GROUP::Territories;
REL::Territories<->Regions -> GROUP::Regions GROUP::Territories;
REL::PlaylistTrack -> GROUP::Playlist GROUP::Track;
GROUP::OrderDetails -> ENT::Discount ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Orders -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerID ENT::EmployeeID ENT::Freight ENT::OrderDate ENT::OrderID ENT::Region ENT::RequiredDate ENT::ShipName ENT::ShipVia ENT::ShippedDate;
GROUP::Employees -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::EmployeeID ENT::Extension ENT::FirstName ENT::HireDate ENT::LastName ENT::Notes ENT::Phone ENT::Photo ENT::PhotoPath ENT::PostalCode ENT::Region ENT::ReportsTo ENT::Title ENT::TitleOfCourtesy;
GROUP::Shippers -> ENT::Company ENT::Phone ENT::ShipperID;
GROUP::Customers -> ENT::Address ENT::City ENT::Company ENT::ContactName ENT::ContactTitle ENT::Country ENT::CustomerID ENT::Fax ENT::Phone ENT::PostalCode ENT::Region;
GROUP::Products -> ENT::CategoryID ENT::Discontinued ENT::ProductID ENT::ProductName ENT::QuantityPerUnit ENT::ReorderLevel ENT::SupplierID ENT::UnitPrice ENT::UnitsInStock ENT::UnitsOnOrder;
GROUP::Categories -> ENT::CategoryID ENT::CategoryName ENT::Description ENT::Picture;
GROUP::Suppliers -> ENT::Address ENT::City ENT::Company ENT::ContactName ENT::ContactTitle ENT::Country ENT::Fax ENT::HomePage ENT::Phone ENT::PostalCode ENT::Region ENT::SupplierID;
GROUP::InvoiceLine -> ENT::InvoiceLineId ENT::OrderID ENT::ProductID ENT::Quantity ENT::UnitPrice;
GROUP::Track -> ENT::AlbumId ENT::Bytes ENT::Composer ENT::GenreId ENT::MediaTypeId ENT::Milliseconds ENT::Name ENT::ProductID ENT::UnitPrice;
GROUP::Genre -> ENT::GenreId ENT::Name;
GROUP::Album -> ENT::AlbumId ENT::ArtistId ENT::Title;
GROUP::Artist -> ENT::ArtistId ENT::Name;
GROUP::MediaType -> ENT::MediaTypeId ENT::Name;
GROUP::Invoice -> ENT::BillingAddress ENT::BillingCity ENT::BillingCountry ENT::BillingPostalCode ENT::CustomerID ENT::OrderDate ENT::OrderID ENT::Total;
GROUP::Customer -> ENT::Address ENT::City ENT::Company ENT::Country ENT::CustomerID ENT::Email ENT::Fax ENT::FirstName ENT::LastName ENT::Phone ENT::PostalCode ENT::State ENT::SupportRepId;
GROUP::Employee -> ENT::Address ENT::BirthDate ENT::City ENT::Country ENT::Email ENT::EmployeeID ENT::Fax ENT::FirstName ENT::HireDate ENT::LastName ENT::Phone ENT::PostalCode ENT::ReportsTo ENT::State ENT::Title;
GROUP::Territories -> ENT::RegionID ENT::TerritoryDescription ENT::TerritoryID;
GROUP::Regions -> ENT::RegionDescription ENT::RegionID;
GROUP::Playlist -> ENT::Name ENT::PlaylistId;
datasets = new_schema.extract_datasets(new_forest)
group = set(datasets.keys()).pop()
datasets[group]
CategoryID | CategoryName | Description | Picture | |
---|---|---|---|---|
Loading ITables v2.3.0 from the internet... (need help?) |