Exploring a Textual Corpus with ArchiTXT

This tutorial provides a step-by-step guide on how to use ArchiTXT to efficiently process and analyze textual corpora.

ArchiTXT allows loading a corpus as a set of syntax trees, where each tree is enriched by incorporating named entities. These enriched trees form a forest, which can then be automatically structured into a valid database instance for further analysis.

By following this tutorial, you’ll learn how to:

  • Load a corpus

  • Parse textual data with Berkeley Neural Parser (Benepar)

  • Extract structured data using ArchiTXT

Downloading the MACCROBAT Corpus

The MACCROBAT corpus is a collection of 200 annotated medical documents, specifically clinical case reports, extracted from PubMed Central. The annotations focus on key medical concepts such as diseases, treatments, medications, and symptoms, making it a valuable resource for biomedical text analysis.

The MACCROBAT corpus is available for download at Figshare.

Let’s download the corpus.

import io
import urllib.request
import zipfile

with urllib.request.urlopen('https://figshare.com/ndownloader/articles/9764942/versions/2') as response:
    archive_file = io.BytesIO(response.read())

with zipfile.ZipFile(archive_file) as archive:
    archive.extract('MACCROBAT2020.zip')
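Before parsing, it can help to check what the extracted archive contains. The sketch below counts file extensions among the members of a zip archive. It runs on a small in-memory stand-in archive so that it needs no network access; the helper name `count_suffixes` and the assumed layout (paired `.txt` and `.ann` files) are illustrative assumptions, not something this tutorial confirms.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_suffixes(archive_bytes: bytes) -> Counter:
    """Count file extensions among the members of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return Counter(
            PurePosixPath(name).suffix
            for name in archive.namelist()
            if not name.endswith('/')  # skip directory entries
        )


# Build a tiny stand-in archive (assumed layout: one text file plus one annotation file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as demo:
    demo.writestr('doc1.txt', 'A 10-year-old patient ...')
    demo.writestr('doc1.ann', 'T1\tAge 2 13\t10-year-old')

suffix_counts = count_suffixes(buffer.getvalue())
print(suffix_counts)
```

To inspect the real download, the same helper could be called with the bytes of `MACCROBAT2020.zip`, for example `count_suffixes(open('MACCROBAT2020.zip', 'rb').read())`.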

Installing and Configuring NLP Models

ArchiTXT can parse sentences using either Benepar with SpaCy or a CoreNLP server. In this tutorial, we will use the SpaCy parser with the default model, but you can use any other model, such as one from SciSpaCy, a collection of models designed for biomedical text processing by AllenAI.

To download the default SpaCy model, run the following command (the leading ! runs it as a shell command from a notebook cell):

!spacy download en_core_web_sm

We also need to download the Benepar model for English:

import benepar

benepar.download('benepar_en3')

Parsing the Corpus with ArchiTXT

Before processing the corpus, we need to configure the BeneparParser, specifying which SpaCy model to use for each language.

import warnings

from architxt.nlp.parser.benepar import BeneparParser

# Initialize the parser
parser = BeneparParser(
    spacy_models={
        'English': 'en_core_web_sm',
    }
)

# Suppress all warnings, including those raised for unsupported annotations
warnings.filterwarnings("ignore")

To ensure everything is working correctly, we first parse a small set of sentences from the corpus.

from architxt.nlp import raw_load_corpus

forest = await raw_load_corpus(
    ['MACCROBAT2020.zip'],
    ['English'],
    parser=parser,
    sample=10,
    cache=False,
)
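The cell above relies on top-level `await`, which works inside a Jupyter notebook because an event loop is already running. In a plain Python script, a coroutine has to be started explicitly, for example with `asyncio.run`. The sketch below shows that pattern with `fake_load_corpus`, a hypothetical stand-in for an asynchronous loader (it is not part of ArchiTXT and only returns placeholder strings).

```python
import asyncio


async def fake_load_corpus(paths, languages, sample):
    """Hypothetical stand-in for an async corpus loader; returns placeholder 'trees'."""
    await asyncio.sleep(0)  # yield control once, as an I/O-bound loader would
    return [f'tree-{i}' for i in range(sample)]


async def main():
    # Inside a coroutine, `await` is always allowed.
    return await fake_load_corpus(['MACCROBAT2020.zip'], ['English'], sample=3)


# In a plain script there is no running event loop, so start one explicitly.
demo_forest = asyncio.run(main())
print(demo_forest)
```

In a script, the real `raw_load_corpus` call would presumably take the place of `fake_load_corpus` inside `main`.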

We can look at our enriched tree using the pretty_print method.

# Look at the highest tree
max(forest, key=lambda tree: tree.height()).pretty_print()
                                                                                                                          ROOT                                                                                                                                                                                                                                                                                                                             
                                                ___________________________________________________________________________|_____________________________________________________________________________________                                                                                                                                                                                                                                           
                                               |                                                                   |                                                                                      UNDEF_7fc008201f                                                                                                                                                                                                                                 
                                               |                                                                   |                                                                                      624df494ee2d68d8                                                                                                                                                                                                                                 
                                               |                                                                   |                                                                                           dd27d4                                                                                                                                                                                                                                      
                                               |                                                                   |                   __________________________________________________________________________|___________________________________________________________________________________________________________________________________                                                                                                       
                                               |                                                                   |                  |                                                                                                                                                                                                       UNDEF_7245682256                                                                                             
                                               |                                                                   |                  |                                                                                                                                                                                                       014f7182f946119b                                                                                             
                                               |                                                                   |                  |                                                                                                                                                                                                            fea938                                                                                                  
                                               |                                                                   |                  |                                                                           ___________________________________________________________________________________________________________________________________|______________________________________________________                                                
                                               |                                                                   |                  |                                                                   UNDEF_d0fddb9f77                                                                                                                                                                                  |                                              
                                               |                                                                   |                  |                                                                   9e4dd2a06644e56f                                                                                                                                                                                  |                                              
                                               |                                                                   |                  |                                                                        108c50                                                                                                                                                                                       |                                              
                                               |                                                                   |                  |                                         _________________________________|_________________________________________________________________________________________                                                                                                 |                                               
                                        UNDEF_145b57abc9                                                           |                  |                                 UNDEF_b6db25c0d0                                                                                                                   |                                                                                                |                                              
                                        5f455b8d5936b277                                                           |                  |                                 274c179e29bf80c7                                                                                                                   |                                                                                                |                                              
                                             f8d835                                                                |                  |                                      d7ec96                                                                                                                        |                                                                                                |                                              
                  _____________________________|_________________________________                                  |                  |                          ______________|_________________________________                                                                                          |                                                                                                |                                               
                 |                                                        UNDEF_9bcc29e037                         |                  |                         |                                         UNDEF_4e627a8f45                                                                          UNDEF_52fe2c9cb5                                                                                 UNDEF_f7d6f41748                                      
                 |                                                        044498867d3cdb14                         |                  |                         |                                         784abca1689f23de                                                                          54436bbe53192f0a                                                                                 ea4d42991a727482                                      
                 |                                                             771392                              |                  |                         |                                              8b4b37                                                                                    2da7d4                                                                                           22ff5f                                           
                 |                                               ________________|________________                 |                  |                         |                                ________________|________________                                        _________________________________|____________________                                                              ______________|_______________________________________        
            ENT::HISTORY                                 ENT::DETAILED_DE                  ENT::DISEASE_DIS     ENT::AGE       ENT::SIGN_SYMPTO            ENT::HISTORY                  ENT::DETAILED_DE                  ENT::DISEASE_DIS                       ENT::DIAGNOSTIC_                                       ENT::DIAGNOSTIC_                                               ENT::HISTORY                                          ENT::SEVERITY
                 |                                          SCRIPTION                           ORDER              |                  M                         |                           SCRIPTION                           ORDER                                PROCEDURE                                              PROCEDURE                                                        |                                                      |      
      ___________|_____________                                 |                                 |           _____|_______           |              ___________|______________                 |                 ________________|____________            ______________|________________                             _________|__________           _______________________________________|_________________________________________             |       
nonischemic              cardiomyopathy                    nonischemic                      cardiomyopathy  age    10    years     symptoms     congestive    heart         failure         congestive         heart                        failure transthoracic                  echocardiography                   (                   TTE      severe      left ventricular  (   LV      )         dysfunction     (   LV  )  dysfunction     severe
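The selection `max(forest, key=lambda tree: tree.height())` picks the tree with the greatest depth. As a rough illustration of that idea, the sketch below computes heights on toy trees encoded as `(label, children)` tuples. The encoding and the convention that a leaf has height 1 are assumptions made for this example; they are not ArchiTXT's own tree class.

```python
def height(node) -> int:
    """Height of a toy tree: a leaf string counts as 1, a node as 1 + its tallest child."""
    if isinstance(node, str):
        return 1
    _label, children = node
    return 1 + max((height(child) for child in children), default=0)


# Two toy trees: leaves are plain strings, inner nodes are (label, children) pairs.
toy_forest = [
    ('ROOT', [('ENT::AGE', ['10', 'years'])]),
    ('ROOT', [
        ('UNDEF', [
            ('ENT::SEVERITY', ['severe']),
            ('ENT::HISTORY', ['heart', 'failure']),
        ]),
    ]),
]

# Same selection pattern as in the tutorial, applied to the toy forest.
tallest = max(toy_forest, key=height)
print(height(tallest), tallest[1][0][0])
```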

Named entity resolution maps each entity mention to a canonical entry in a knowledge base, which standardizes the named entities and helps build a database instance. To enable it, we pass the name of the knowledge base to use through the resolver_name argument. For this tutorial, we will use the UMLS (Unified Medical Language System) resolver.

Let’s now parse more sentences from the corpus.
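As a rough illustration of what resolution does, the sketch below normalizes mentions against a tiny dictionary. The `resolve` helper and the entries of `toy_kb` are hypothetical and chosen only for this example; a real resolver backed by UMLS is far richer, and this is not how ArchiTXT implements it.

```python
def resolve(mention: str, knowledge_base: dict[str, str]) -> str:
    """Return the canonical name for a mention, or the mention itself if unknown."""
    key = ' '.join(mention.lower().split())  # lowercase and collapse whitespace
    return knowledge_base.get(key, mention)


# Hypothetical toy knowledge base: surface form -> canonical name.
toy_kb = {
    'severe': 'severe (severity modifier)',
    'tte': 'transthoracic echocardiography',
    'transthoracic  echocardiography'.replace('  ', ' '): 'transthoracic echocardiography',
}

mentions = ['Severe', 'TTE', 'transthoracic   echocardiography', 'fever']
resolved = [resolve(mention, toy_kb) for mention in mentions]
print(resolved)
```

Different surface forms of the same concept end up with one shared name, which is what makes the later grouping into a database instance more consistent.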

forest = await raw_load_corpus(
    ['MACCROBAT2020.zip'],
    ['English'],
    parser=parser,
    sample=800,
    cache=False,
    resolver_name='umls',
)
# Look at the highest tree
max(forest, key=lambda tree: tree.height()).pretty_print()
                                                        ROOT                                                                                                                                                                                                                                 
        _________________________________________________|__________________________________________________________________                                                                                                                                                                  
       |                                                                                                             UNDEF_558e5aad26                                                                                                                                                        
       |                                                                                                             904516a2182d927b                                                                                                                                                        
       |                                                                                                                  6c866e                                                                                                                                                             
       |                  __________________________________________________________________________________________________|________________                                                                                                                                                 
       |                 |                                                                                                            UNDEF_5ad54bddfd                                                                                                                                       
       |                 |                                                                                                            1340888a3f20e369                                                                                                                                       
       |                 |                                                                                                                 67a1a9                                                                                                                                            
       |                 |                                ___________________________________________________________________________________|______________________________________________________________________________________________________________________________________          
       |                 |                        UNDEF_a21d216a5a                                                                                                                                                                                                 |                |        
       |                 |                        364c2999b1fe80c5                                                                                                                                                                                                 |                |        
       |                 |                             f8baef                                                                                                                                                                                                      |                |        
       |                 |                _______________|__________________________________                                                                                                                                                                       |                |         
       |                 |               |               |                           UNDEF_9d933b0a18                                                                                                                                                              |                |        
       |                 |               |               |                           f94e468e906265f3                                                                                                                                                              |                |        
       |                 |               |               |                                a62b75                                                                                                                                                                   |                |        
       |                 |               |               |                  ________________|_______________________________                                                                                                                                       |                |         
       |                 |               |               |                 |                |                        UNDEF_338ddec0b1                                                                                                                              |                |        
       |                 |               |               |                 |                |                        214bd3abe362eeac                                                                                                                              |                |        
       |                 |               |               |                 |                |                             6b6405                                                                                                                                   |                |        
       |                 |               |               |                 |                |                _______________|_________________________________                                                                                                     |                |         
       |                 |               |               |                 |                |               |                                          UNDEF_114b854648                                                                                            |                |        
       |                 |               |               |                 |                |               |                                          a94925a9e11789a4                                                                                            |                |        
       |                 |               |               |                 |                |               |                                               9ee1c3                                                                                                 |                |        
       |                 |               |               |                 |                |               |                _________________________________|________________                                                                                    |                |         
       |                 |               |               |                 |                |               |               |                                           UNDEF_4491e39e1f                                                                           |                |        
       |                 |               |               |                 |                |               |               |                                           0a4c019cd2725c99                                                                           |                |        
       |                 |               |               |                 |                |               |               |                                                e22d29                                                                                |                |        
       |                 |               |               |                 |                |               |               |                 _________________________________|________________                                                                   |                |         
       |                 |               |               |                 |                |               |               |                |                                           UNDEF_05541df006                                                          |                |        
       |                 |               |               |                 |                |               |               |                |                                           2740829f4b6b52f8                                                          |                |        
       |                 |               |               |                 |                |               |               |                |                                                642793                                                               |                |        
       |                 |               |               |                 |                |               |               |                |                 _________________________________|________________                                                  |                |         
       |                 |               |               |                 |                |               |               |                |                |                                           UNDEF_6ce6501a84                                         |                |        
       |                 |               |               |                 |                |               |               |                |                |                                           4a42c99f441c878f                                         |                |        
       |                 |               |               |                 |                |               |               |                |                |                                                0ac1eb                                              |                |        
       |                 |               |               |                 |                |               |               |                |                |                 _________________________________|________________                                 |                |         
       |                 |               |               |                 |                |               |               |                |                |                |                                           UNDEF_baf0f548cd                        |                |        
       |                 |               |               |                 |                |               |               |                |                |                |                                           6441e7ba8c64a708                        |                |        
       |                 |               |               |                 |                |               |               |                |                |                |                                                3e2b55                             |                |        
       |                 |               |               |                 |                |               |               |                |                |                |                                  ________________|________________                |                |         
ENT::DIAGNOSTIC_  ENT::DIAGNOSTIC_ ENT::LAB_VALUE ENT::DIAGNOSTIC_   ENT::LAB_VALUE  ENT::DIAGNOSTIC_ ENT::LAB_VALUE ENT::DIAGNOSTIC_  ENT::LAB_VALUE  ENT::DIAGNOSTIC_  ENT::LAB_VALUE                   ENT::DIAGNOSTIC_                   ENT::LAB_VALUE ENT::DIAGNOSTIC_  ENT::LAB_VALUE 
   PROCEDURE         PROCEDURE           |           PROCEDURE             |            PROCEDURE           |           PROCEDURE            |            PROCEDURE            |                             PROCEDURE                             |           PROCEDURE            |        
       |                 |               |               |                 |                |               |               |                |                |                |                                 |                                 |               |                |         
  biochemical    cardiac troponin   was 0.714 mg     nesiritide    what 466,530 pg /  ), serum d -     was 8.14 mg   blood urea nitro  was 11.69 mmol  creatinine measu was 144.00 mmol                     serum urate                     was 611.40 mmol and endogenous   what 57.90 ml / 
                    measurement                                                           dimer                      gen measurement                   rement, serum (                                      measurement                                     creatinine clear       min       
                                                                                                                                                          procedure)                                                                                              ing

ArchiTXT can then automatically structure parsed text into a database-friendly format.

from architxt.simplification.tree_rewriting import rewrite

new_forest = rewrite(forest, epoch=20, min_support=10, tau=0.8)
# Look at the highest tree
max(new_forest, key=lambda tree: tree.height()).pretty_print()
                                                                                                                                                              ROOT                                                                                                                                                                 
                                                _______________________________________________________________________________________________________________|___________________________________________________                                                                                                                 
                                               |                                                  |                                                                                                         UNDEF_c8ba4b30f2                                                                                                       
                                               |                                                  |                                                                                                         6c4f158914c60bf2                                                                                                       
                                               |                                                  |                                                                                                              764d02                                                                                                            
                                               |                                                  |                               _________________________________________________________________________________|_______________________________________________________________________________________________________         
                                            COLL::0                                               |                              |                                                             COLL::0                                                                   |                |                |               |       
                  _____________________________|________________                                  |                              |                                                ________________|_________________________________                                     |                |                |               |        
              GROUP::0                                       GROUP::0                             |                           GROUP::0                                        GROUP::0                                           GROUP::0                                |                |                |               |       
       __________|____________                  ________________|________________                 |                  ____________|____________                  _________________|________________                  ________________|_________________                   |                |                |               |        
ENT::DURATION          ENT::CLINICAL_EV ENT::CLINICAL_EV ENT::NONBIOLOGIC ENT::SIGN_SYMPTO ENT::DIAGNOSTIC_   ENT::SEVERITY            ENT::SIGN_SYMPTO ENT::SIGN_SYMPTO  ENT::BIOLOGICAL_  ENT::LAB_VALUE  ENT::DIAGNOSTIC_ ENT::SIGN_SYMPTO   ENT::LAB_VALUE    ENT::DIAGNOSTIC_ ENT::DIAGNOSTIC_  ENT::LAB_VALUE  ENT::LAB_VALUE
      |                      ENT              ENT          AL_LOCATION           M            PROCEDURE             |                         M                M             STRUCTURE            |            PROCEDURE            M                 |              PROCEDURE        PROCEDURE            |               |       
      |                       |                |                |                |                |                 |                         |                |                 |                |                |                |                 |                  |                |                |               |        
     year              follow-up status   presentation      hospitals          fever       echocardiography severe (severity            aortic valve        rupture      one of the cusps  worsening patter cardiac contract ectasia of left  ( telediastolic /  left ventricular       ( lvef      mental depressio  ( 40 - 45 %  
                                                                                                                modifier)               insufficiency                     of the homograft        n         ility alteration atrial appendage   telesystolic      ejection fractio                         n                       
                                                                                                                                                                                                                                               diameters 60/42           n                                                         
                                                                                                                                                                                                                                                      mm

Now that we have a structured instance, we can extract its schema. The schema is a formal description of the structure of the extracted data; `as_cfg()` prints it as a context-free grammar with one production per line.

from architxt.schema import Schema

schema = Schema.from_forest(new_forest, keep_unlabelled=False)
print(schema.as_cfg())
ROOT -> COLL::0 GROUP::0;
COLL::0 -> GROUP::0;
GROUP::0 -> ENT::ACTIVITY ENT::ADMINISTRATION ENT::AGE ENT::BIOLOGICAL_ATTRIBUTE ENT::BIOLOGICAL_STRUCTURE ENT::CLINICAL_EVENT ENT::COLOR ENT::COREFERENCE ENT::DATE ENT::DETAILED_DESCRIPTION ENT::DIAGNOSTIC_PROCEDURE ENT::DISEASE_DISORDER ENT::DISTANCE ENT::DOSAGE ENT::DURATION ENT::FAMILY_HISTORY ENT::FREQUENCY ENT::HISTORY ENT::LAB_VALUE ENT::MASS ENT::MEDICATION ENT::NONBIOLOGICAL_LOCATION ENT::OCCUPATION ENT::OTHER_EVENT ENT::OUTCOME ENT::PERSONAL_BACKGROUND ENT::QUALITATIVE_CONCEPT ENT::QUANTITATIVE_CONCEPT ENT::SEVERITY ENT::SEX ENT::SHAPE ENT::SIGN_SYMPTOM ENT::SUBJECT ENT::TEXTURE ENT::THERAPEUTIC_PROCEDURE ENT::TIME ENT::VOLUME;
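
The printed grammar is plain text, so it can be inspected with ordinary string handling. The sketch below is not part of ArchiTXT: `parse_productions` is a hypothetical helper, and the `GROUP::0` rule is abbreviated to three entities for readability. It assumes the `LHS -> RHS;` layout shown in the output above.

```python
def parse_productions(cfg_text: str) -> dict[str, list[str]]:
    """Map each left-hand symbol to the list of symbols on its right-hand side."""
    productions: dict[str, list[str]] = {}
    for rule in cfg_text.split(";"):
        rule = rule.strip()
        if not rule:
            continue  # skip the empty chunk after the final ';'
        lhs, rhs = rule.split("->", 1)
        productions[lhs.strip()] = rhs.split()
    return productions


# Abbreviated copy of the grammar printed above (hypothetical sample input).
cfg_text = """
ROOT -> COLL::0 GROUP::0;
COLL::0 -> GROUP::0;
GROUP::0 -> ENT::AGE ENT::SEX ENT::SIGN_SYMPTOM;
"""

productions = parse_productions(cfg_text)

# Strip the 'ENT::' prefix to get the bare entity names of the group.
entity_names = [symbol.split("::", 1)[1] for symbol in productions["GROUP::0"]]
print(entity_names)
```

Reading the grammar this way makes it easy to see that the instance has a single group, `GROUP::0`, whose attributes are the entity types found in the corpus.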

Not all extracted trees contribute to meaningful insights. We can filter our structured instance to retain only valid trees:

cleaned_forest = schema.extract_valid_trees(new_forest)
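
The tutorial does not spell out the exact criteria `extract_valid_trees` applies. To build intuition for what filtering against a schema can mean, here is a toy check that is independent of ArchiTXT. It assumes that a tree is valid when every internal node only has children that its production allows; the tuple-based trees, `ALLOWED`, and `is_valid` are all hypothetical names introduced for this illustration.

```python
# Toy schema: each label maps to the set of child labels it may contain.
ALLOWED = {
    "ROOT": {"COLL::0", "GROUP::0"},
    "COLL::0": {"GROUP::0"},
    "GROUP::0": {"ENT::AGE", "ENT::SEX", "ENT::SIGN_SYMPTOM"},
}


def is_valid(tree) -> bool:
    """Return True if every internal node only has children its rule allows."""
    label, children = tree
    if label.startswith("ENT::"):
        return True  # entities are treated as leaves in this sketch
    allowed = ALLOWED.get(label)
    if allowed is None:
        return False  # label is not described by the toy schema
    return all(child[0] in allowed and is_valid(child) for child in children)


# Trees are (label, children) pairs; 'NP' is a leftover syntactic node.
good = ("ROOT", [("GROUP::0", [("ENT::AGE", []), ("ENT::SEX", [])])])
bad = ("ROOT", [("GROUP::0", [("NP", [])])])

toy_forest = [good, bad]
valid_trees = [tree for tree in toy_forest if is_valid(tree)]
print(len(valid_trees), "of", len(toy_forest), "toy trees kept")
```

The second tree is rejected because it still contains an unlabelled syntactic node under a group, which is the kind of residue a schema-based filter would be expected to discard.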

Now that we have a structured dataset, we can explore the different semantic groups. Groups represent common patterns across the corpus.

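# Extract the datasets, keyed by group
datasets = schema.extract_datasets(new_forest)
# Pick any one group to inspect (the schema above defines a single group)
group = next(iter(datasets))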

datasets[group]
DISEASE_DISORDER SIGN_SYMPTOM SEVERITY TEXTURE BIOLOGICAL_STRUCTURE AGE SEX PERSONAL_BACKGROUND HISTORY ACTIVITY CLINICAL_EVENT NONBIOLOGICAL_LOCATION DATE DETAILED_DESCRIPTION LAB_VALUE DURATION DISTANCE DIAGNOSTIC_PROCEDURE BIOLOGICAL_ATTRIBUTE COLOR MEDICATION DOSAGE ADMINISTRATION THERAPEUTIC_PROCEDURE FREQUENCY SHAPE COREFERENCE QUANTITATIVE_CONCEPT FAMILY_HISTORY SUBJECT MASS QUALITATIVE_CONCEPT OUTCOME TIME OCCUPATION VOLUME OTHER_EVENT