Exploring a Textual Corpus with ArchiTXT#
This tutorial provides a step-by-step guide on how to use ArchiTXT to efficiently process and analyze textual corpora.
ArchiTXT allows loading a corpus as a set of syntax trees, where each tree is enriched by incorporating named entities. These enriched trees form a forest, which can then be automatically structured into a valid database instance for further analysis.
By following this tutorial, you’ll learn how to:
Load a corpus
Parse textual data with Berkeley Neural Parser (Benepar)
Extract structured data using ArchiTXT
Downloading the MACCROBAT Corpus#
The MACCROBAT corpus is a collection of 200 annotated medical documents, specifically clinical case reports, extracted from PubMed Central. The annotations focus on key medical concepts such as diseases, treatments, medications, and symptoms, making it a valuable resource for biomedical text analysis.
The MACCROBAT corpus is available for download at Figshare.
Let’s download the corpus.
import io
import urllib.request
import zipfile

with urllib.request.urlopen('https://figshare.com/ndownloader/articles/9764942/versions/2') as response:
    archive_file = io.BytesIO(response.read())

with zipfile.ZipFile(archive_file) as archive:
    archive.extract('MACCROBAT2020.zip')
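The Figshare download is a nested archive: the outer zip contains MACCROBAT2020.zip, which is what the code above extracts. The nested-extraction pattern can be exercised end-to-end on an in-memory archive (a self-contained sketch; the file names inside are illustrative):

```python
import io
import zipfile

# Build an outer zip that contains an inner zip, mimicking the Figshare layout.
inner = io.BytesIO()
with zipfile.ZipFile(inner, 'w') as zf:
    zf.writestr('doc1.txt', 'A 10-year-old patient presented with fever.')

outer = io.BytesIO()
with zipfile.ZipFile(outer, 'w') as zf:
    zf.writestr('MACCROBAT2020.zip', inner.getvalue())

# Read the inner archive back without touching the filesystem.
with zipfile.ZipFile(outer) as archive:
    inner_bytes = io.BytesIO(archive.read('MACCROBAT2020.zip'))

with zipfile.ZipFile(inner_bytes) as corpus:
    print(corpus.namelist())  # ['doc1.txt']
```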
Installing and Configuring NLP Models#
ArchiTXT can parse sentences using either Benepar with spaCy or a CoreNLP server. In this tutorial, we will use the spaCy parser with its default English model, but you can substitute any compatible model, such as one from SciSpaCy, a collection of spaCy models for biomedical text processing developed by AllenAI.
Let’s download the default spaCy model:
!spacy download en_core_web_sm
We also need to download the Benepar model for English:
import benepar
benepar.download('benepar_en3')
Parsing the Corpus with ArchiTXT#
Before processing the corpus, we need to configure the BeneparParser, specifying which spaCy model to use for each language.
import warnings

from architxt.nlp.parser.benepar import BeneparParser

# Initialize the parser with one spaCy model per language
parser = BeneparParser(
    spacy_models={
        'English': 'en_core_web_sm',
    }
)

# Suppress warnings for unsupported annotations
warnings.filterwarnings("ignore")
To ensure everything is working correctly, we first parse a small set of sentences from the corpus.
from architxt.nlp import raw_load_corpus

forest = await raw_load_corpus(
    ['MACCROBAT2020.zip'],
    ['English'],
    parser=parser,
    sample=10,
    cache=False,
)
We can look at an enriched tree using its pretty_print method.
# Look at the highest tree
max(forest, key=lambda tree: tree.height()).pretty_print()
ROOT
___________________________________________________________________________|_____________________________________________________________________________________
| | UNDEF_7fc008201f
| | 624df494ee2d68d8
| | dd27d4
| | __________________________________________________________________________|___________________________________________________________________________________________________________________________________
| | | UNDEF_7245682256
| | | 014f7182f946119b
| | | fea938
| | | ___________________________________________________________________________________________________________________________________|______________________________________________________
| | | UNDEF_d0fddb9f77 |
| | | 9e4dd2a06644e56f |
| | | 108c50 |
| | | _________________________________|_________________________________________________________________________________________ |
UNDEF_145b57abc9 | | UNDEF_b6db25c0d0 | |
5f455b8d5936b277 | | 274c179e29bf80c7 | |
f8d835 | | d7ec96 | |
_____________________________|_________________________________ | | ______________|_________________________________ | |
| UNDEF_9bcc29e037 | | | UNDEF_4e627a8f45 UNDEF_52fe2c9cb5 UNDEF_f7d6f41748
| 044498867d3cdb14 | | | 784abca1689f23de 54436bbe53192f0a ea4d42991a727482
| 771392 | | | 8b4b37 2da7d4 22ff5f
| ________________|________________ | | | ________________|________________ _________________________________|____________________ ______________|_______________________________________
ENT::HISTORY ENT::DETAILED_DE ENT::DISEASE_DIS ENT::AGE ENT::SIGN_SYMPTO ENT::HISTORY ENT::DETAILED_DE ENT::DISEASE_DIS ENT::DIAGNOSTIC_ ENT::DIAGNOSTIC_ ENT::HISTORY ENT::SEVERITY
| SCRIPTION ORDER | M | SCRIPTION ORDER PROCEDURE PROCEDURE | |
___________|_____________ | | _____|_______ | ___________|______________ | ________________|____________ ______________|________________ _________|__________ _______________________________________|_________________________________________ |
nonischemic cardiomyopathy nonischemic cardiomyopathy age 10 years symptoms congestive heart failure congestive heart failure transthoracic echocardiography ( TTE severe left ventricular ( LV ) dysfunction ( LV ) dysfunction severe
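The height() used above to select the deepest tree follows the usual NLTK convention: a leaf counts as height 1 and each enclosing node adds one level. This can be illustrated without ArchiTXT on toy trees (a self-contained sketch; the labels mimic those in the output above):

```python
# A node is a (label, children) pair; a leaf is a plain string.
def height(tree) -> int:
    """Height in the NLTK sense: a leaf counts 1, each level above adds 1."""
    if isinstance(tree, str):
        return 1
    _label, children = tree
    return 1 + max(height(child) for child in children)

forest = [
    ('ROOT', [('ENT::AGE', ['age 10 years'])]),
    ('ROOT', [('UNDEF', [('ENT::SIGN_SYMPTOM', ['fever'])])]),
]

# The "highest" tree is the most deeply nested one.
deepest = max(forest, key=height)
print(height(deepest))  # 4
```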
Named Entity Resolution (NER) standardizes the recognized entities against a knowledge base, which is needed to build a consistent database instance. To enable resolution, we must provide the knowledge base to use; for this tutorial, we will use the UMLS (Unified Medical Language System) resolver.
Let’s now parse more sentences from the corpus.
forest = await raw_load_corpus(
    ['MACCROBAT2020.zip'],
    ['English'],
    parser=parser,
    sample=800,
    cache=False,
    resolver_name='umls',
)
# Look at the highest tree
max(forest, key=lambda tree: tree.height()).pretty_print()
ROOT
_________________________________________________|__________________________________________________________________
| UNDEF_558e5aad26
| 904516a2182d927b
| 6c866e
| __________________________________________________________________________________________________|________________
| | UNDEF_5ad54bddfd
| | 1340888a3f20e369
| | 67a1a9
| | ___________________________________________________________________________________|______________________________________________________________________________________________________________________________________
| | UNDEF_a21d216a5a | |
| | 364c2999b1fe80c5 | |
| | f8baef | |
| | _______________|__________________________________ | |
| | | | UNDEF_9d933b0a18 | |
| | | | f94e468e906265f3 | |
| | | | a62b75 | |
| | | | ________________|_______________________________ | |
| | | | | | UNDEF_338ddec0b1 | |
| | | | | | 214bd3abe362eeac | |
| | | | | | 6b6405 | |
| | | | | | _______________|_________________________________ | |
| | | | | | | UNDEF_114b854648 | |
| | | | | | | a94925a9e11789a4 | |
| | | | | | | 9ee1c3 | |
| | | | | | | _________________________________|________________ | |
| | | | | | | | UNDEF_4491e39e1f | |
| | | | | | | | 0a4c019cd2725c99 | |
| | | | | | | | e22d29 | |
| | | | | | | | _________________________________|________________ | |
| | | | | | | | | UNDEF_05541df006 | |
| | | | | | | | | 2740829f4b6b52f8 | |
| | | | | | | | | 642793 | |
| | | | | | | | | _________________________________|________________ | |
| | | | | | | | | | UNDEF_6ce6501a84 | |
| | | | | | | | | | 4a42c99f441c878f | |
| | | | | | | | | | 0ac1eb | |
| | | | | | | | | | _________________________________|________________ | |
| | | | | | | | | | | UNDEF_baf0f548cd | |
| | | | | | | | | | | 6441e7ba8c64a708 | |
| | | | | | | | | | | 3e2b55 | |
| | | | | | | | | | | ________________|________________ | |
ENT::DIAGNOSTIC_ ENT::DIAGNOSTIC_ ENT::LAB_VALUE ENT::DIAGNOSTIC_ ENT::LAB_VALUE ENT::DIAGNOSTIC_ ENT::LAB_VALUE ENT::DIAGNOSTIC_ ENT::LAB_VALUE ENT::DIAGNOSTIC_ ENT::LAB_VALUE ENT::DIAGNOSTIC_ ENT::LAB_VALUE ENT::DIAGNOSTIC_ ENT::LAB_VALUE
PROCEDURE PROCEDURE | PROCEDURE | PROCEDURE | PROCEDURE | PROCEDURE | PROCEDURE | PROCEDURE |
| | | | | | | | | | | | | | |
biochemical cardiac troponin was 0.714 mg nesiritide what 466,530 pg / ), serum d - was 8.14 mg blood urea nitro was 11.69 mmol creatinine measu was 144.00 mmol serum urate was 611.40 mmol and endogenous what 57.90 ml /
measurement dimer gen measurement rement, serum ( measurement creatinine clear min
procedure) ing
ArchiTXT can then automatically structure parsed text into a database-friendly format.
from architxt.simplification.tree_rewriting import rewrite
new_forest = rewrite(forest, epoch=20, min_support=10, tau=0.8)
# Look at the highest tree
max(new_forest, key=lambda tree: tree.height()).pretty_print()
ROOT
_______________________________________________________________________________________________________________|___________________________________________________
| | UNDEF_c8ba4b30f2
| | 6c4f158914c60bf2
| | 764d02
| | _________________________________________________________________________________|_______________________________________________________________________________________________________
COLL::0 | | COLL::0 | | | |
_____________________________|________________ | | ________________|_________________________________ | | | |
GROUP::0 GROUP::0 | GROUP::0 GROUP::0 GROUP::0 | | | |
__________|____________ ________________|________________ | ____________|____________ _________________|________________ ________________|_________________ | | | |
ENT::DURATION ENT::CLINICAL_EV ENT::CLINICAL_EV ENT::NONBIOLOGIC ENT::SIGN_SYMPTO ENT::DIAGNOSTIC_ ENT::SEVERITY ENT::SIGN_SYMPTO ENT::SIGN_SYMPTO ENT::BIOLOGICAL_ ENT::LAB_VALUE ENT::DIAGNOSTIC_ ENT::SIGN_SYMPTO ENT::LAB_VALUE ENT::DIAGNOSTIC_ ENT::DIAGNOSTIC_ ENT::LAB_VALUE ENT::LAB_VALUE
| ENT ENT AL_LOCATION M PROCEDURE | M M STRUCTURE | PROCEDURE M | PROCEDURE PROCEDURE | |
| | | | | | | | | | | | | | | | | |
year follow-up status presentation hospitals fever echocardiography severe (severity aortic valve rupture one of the cusps worsening patter cardiac contract ectasia of left ( telediastolic / left ventricular ( lvef mental depressio ( 40 - 45 %
modifier) insufficiency of the homograft n ility alteration atrial appendage telesystolic ejection fractio n
diameters 60/42 n
mm
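The rewrite parameters control the search: epoch bounds the number of rewriting iterations, min_support is (as its name suggests) the minimum number of occurrences a subtree pattern needs before it is promoted into a group, and tau is a similarity threshold between 0 and 1. The exact algorithm is ArchiTXT's, but the support filter is the classic frequent-pattern idea, which can be sketched on its own (the pattern tuples below are purely illustrative):

```python
from collections import Counter

# Hypothetical subtree patterns observed across the forest (illustrative only).
occurrences = [
    ('ENT::DIAGNOSTIC_PROCEDURE', 'ENT::LAB_VALUE'),
    ('ENT::DIAGNOSTIC_PROCEDURE', 'ENT::LAB_VALUE'),
    ('ENT::DIAGNOSTIC_PROCEDURE', 'ENT::LAB_VALUE'),
    ('ENT::SIGN_SYMPTOM', 'ENT::SEVERITY'),
    ('ENT::SIGN_SYMPTOM', 'ENT::SEVERITY'),
    ('ENT::AGE', 'ENT::SEX'),
]

def frequent_patterns(occurrences, min_support):
    """Keep only the patterns seen at least min_support times."""
    counts = Counter(occurrences)
    return {pattern for pattern, count in counts.items() if count >= min_support}

# With min_support=2, the one-off ('ENT::AGE', 'ENT::SEX') pattern is dropped.
print(frequent_patterns(occurrences, min_support=2))
```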
Now that we have a structured instance, we can extract its schema. The schema provides a formal representation of the extracted data.
from architxt.schema import Schema
schema = Schema.from_forest(new_forest, keep_unlabelled=False)
print(schema.as_cfg())
ROOT -> COLL::0 GROUP::0;
COLL::0 -> GROUP::0;
GROUP::0 -> ENT::ACTIVITY ENT::ADMINISTRATION ENT::AGE ENT::BIOLOGICAL_ATTRIBUTE ENT::BIOLOGICAL_STRUCTURE ENT::CLINICAL_EVENT ENT::COLOR ENT::COREFERENCE ENT::DATE ENT::DETAILED_DESCRIPTION ENT::DIAGNOSTIC_PROCEDURE ENT::DISEASE_DISORDER ENT::DISTANCE ENT::DOSAGE ENT::DURATION ENT::FAMILY_HISTORY ENT::FREQUENCY ENT::HISTORY ENT::LAB_VALUE ENT::MASS ENT::MEDICATION ENT::NONBIOLOGICAL_LOCATION ENT::OCCUPATION ENT::OTHER_EVENT ENT::OUTCOME ENT::PERSONAL_BACKGROUND ENT::QUALITATIVE_CONCEPT ENT::QUANTITATIVE_CONCEPT ENT::SEVERITY ENT::SEX ENT::SHAPE ENT::SIGN_SYMPTOM ENT::SUBJECT ENT::TEXTURE ENT::THERAPEUTIC_PROCEDURE ENT::TIME ENT::VOLUME;
Not all extracted trees contribute to meaningful insights. We can filter our structured instance to retain only valid trees:
cleaned_forest = schema.extract_valid_trees(new_forest)
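Conceptually, extract_valid_trees is a membership check against the schema's grammar: a tree is kept only if every node uses productions the schema allows. A simplified stand-alone sketch (assuming a flat production table mapping each label to its allowed child labels; this is not ArchiTXT's actual implementation):

```python
# Simplified productions: each label maps to the child labels it may contain.
productions = {
    'ROOT': {'COLL::0', 'GROUP::0'},
    'COLL::0': {'GROUP::0'},
    'GROUP::0': {'ENT::AGE', 'ENT::SEX', 'ENT::SIGN_SYMPTOM'},
}

def is_valid(tree, productions):
    """A (label, children) tree is valid if every node only uses allowed child labels."""
    label, children = tree
    for child in children:
        if isinstance(child, str):  # leaf token, always allowed
            continue
        if child[0] not in productions.get(label, set()):
            return False
        if not is_valid(child, productions):
            return False
    return True

ok = ('ROOT', [('GROUP::0', [('ENT::AGE', ['10 years'])])])
bad = ('ROOT', [('ENT::AGE', ['10 years'])])  # entity directly under ROOT
print(is_valid(ok, productions), is_valid(bad, productions))  # True False
```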
Now that we have a structured dataset, we can explore the different semantic groups. Groups represent common patterns across the corpus.
datasets = schema.extract_datasets(new_forest)

# Pick one group arbitrarily and display its dataset
group = set(datasets.keys()).pop()
datasets[group]
DISEASE_DISORDER | SIGN_SYMPTOM | SEVERITY | TEXTURE | BIOLOGICAL_STRUCTURE | AGE | SEX | PERSONAL_BACKGROUND | HISTORY | ACTIVITY | CLINICAL_EVENT | NONBIOLOGICAL_LOCATION | DATE | DETAILED_DESCRIPTION | LAB_VALUE | DURATION | DISTANCE | DIAGNOSTIC_PROCEDURE | BIOLOGICAL_ATTRIBUTE | COLOR | MEDICATION | DOSAGE | ADMINISTRATION | THERAPEUTIC_PROCEDURE | FREQUENCY | SHAPE | COREFERENCE | QUANTITATIVE_CONCEPT | FAMILY_HISTORY | SUBJECT | MASS | QUALITATIVE_CONCEPT | OUTCOME | TIME | OCCUPATION | VOLUME | OTHER_EVENT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
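Since each extracted dataset is tabular, it can be persisted with standard tooling once converted to rows. A minimal sketch using the standard library's csv module (the rows below are made up for illustration; they mimic the columns shown above):

```python
import csv
import io

# Illustrative rows with a few of the entity-type columns from the dataset.
rows = [
    {'DIAGNOSTIC_PROCEDURE': 'echocardiography', 'SEVERITY': 'severe', 'SIGN_SYMPTOM': 'fever'},
    {'DIAGNOSTIC_PROCEDURE': 'cardiac troponin', 'SEVERITY': '', 'SIGN_SYMPTOM': ''},
]

# Write the rows as CSV into an in-memory buffer; swap in a file path to persist.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['DIAGNOSTIC_PROCEDURE', 'SEVERITY', 'SIGN_SYMPTOM'])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue().splitlines()[0])  # DIAGNOSTIC_PROCEDURE,SEVERITY,SIGN_SYMPTOM
```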