Exploring a Textual Corpus with ArchiTXT
========================================

This tutorial provides a **step-by-step guide** on how to use **ArchiTXT** to efficiently process and analyze textual corpora.

ArchiTXT allows loading a corpus as a set of syntax trees, where each tree is enriched by incorporating named entities.
These enriched trees form a **forest**, which can then be automatically structured into a valid **database instance** for further analysis.

By following this tutorial, you'll learn how to:
- Load a corpus
- Parse textual data with **Berkeley Neural Parser (Benepar)**
- Extract structured data using **ArchiTXT**

In [None]:
!pip install git+https://github.com/Neplex/ArchiTXT.git#egg=architxt

In [None]:
import itables

itables.init_notebook_mode(connected=True)

## Downloading the MACCROBAT Corpus
The **MACCROBAT** corpus is a collection of **200 annotated medical documents**, specifically **clinical case reports**, extracted from **PubMed Central**.
The annotations focus on key medical concepts such as **diseases, treatments, medications, and symptoms**, making it a valuable resource for biomedical text analysis.

The **MACCROBAT** corpus is available for download at [Figshare](https://figshare.com/articles/dataset/MACCROBAT2018/9764942).

Let's download the corpus.

In [None]:
import io
import urllib.request
import zipfile

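# Download the Figshare archive into memory
with urllib.request.urlopen('https://figshare.com/ndownloader/articles/9764942/versions/2') as response:
    archive_file = io.BytesIO(response.read())

# Extract the MACCROBAT2020.zip file it contains into the working directory
with zipfile.ZipFile(archive_file) as archive:
    archive.extract('MACCROBAT2020.zip')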
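
Before parsing, it can help to check what an archive contains. The sketch below counts archive members by file extension. `count_by_extension` is a hypothetical helper, not part of ArchiTXT, and the small in-memory archive with made-up file names only stands in for `MACCROBAT2020.zip`.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_by_extension(archive_source) -> Counter:
    """Count the files in a zip archive by extension (hypothetical helper)."""
    with zipfile.ZipFile(archive_source) as archive:
        return Counter(
            PurePosixPath(info.filename).suffix or '<none>'
            for info in archive.infolist()
            if not info.is_dir()
        )


# Demo on a small in-memory archive; the file names and contents are made up.
demo = io.BytesIO()
with zipfile.ZipFile(demo, 'w') as archive:
    archive.writestr('doc1.txt', 'A 34-year-old woman presented with fever.')
    archive.writestr('doc1.ann', 'T1\tAge 2 13\t34-year-old')
    archive.writestr('doc2.txt', 'The patient recovered.')

demo.seek(0)
counts = count_by_extension(demo)
print(counts)
```

On the downloaded corpus, the same helper could be called as `count_by_extension('MACCROBAT2020.zip')`, since `zipfile.ZipFile` accepts a path as well as a file-like object.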

## Installing and Configuring NLP Models

ArchiTXT can parse the sentences using either **Benepar** with SpaCy or a **CoreNLP** server.
In this tutorial, we will use the **SpaCy parser** with the default model, but any SpaCy model can be used, such as one from **SciSpaCy**, a collection of models for biomedical text processing developed by **AllenAI**.

To download the default SpaCy model (`en_core_web_sm`), run:

In [None]:
!spacy download en_core_web_sm

We also need to download the Benepar model for English:

In [None]:
import benepar

benepar.download('benepar_en3')

## Parsing the Corpus with ArchiTXT

Before processing the corpus, we need to configure the **BeneparParser**, specifying which SpaCy model to use for each language.

In [None]:
import warnings

from architxt.nlp.parser.benepar import BeneparParser

# Initialize the parser
parser = BeneparParser(
    spacy_models={
        'English': 'en_core_web_sm',
    }
)

# Suppress all warnings for the rest of the session
# (this hides the warnings raised for unsupported annotations)
warnings.filterwarnings("ignore")
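
Calling `warnings.filterwarnings("ignore")` silences every warning for the rest of the session. If that is too broad, the standard library's `warnings.catch_warnings` context manager can limit the suppression to a single block. The sketch below uses a hypothetical `noisy_step` function in place of the parser, and the `UserWarning` category is an assumption made for the demo.

```python
import warnings


def noisy_step() -> str:
    """Hypothetical stand-in for a call that emits a warning."""
    warnings.warn('unsupported annotation', UserWarning)
    return 'parsed'


# Warnings are ignored only inside this block; the previous filters
# are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=UserWarning)
    result = noisy_step()

print(result)
```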

To ensure everything is working correctly, we first parse a small set of sentences from the corpus.

In [None]:
from architxt.nlp import raw_load_corpus

forest = await raw_load_corpus(
    ['MACCROBAT2020.zip'],
    ['English'],
    parser=parser,
    sample=10,
    cache=False,
)

We can look at our enriched tree using the `pretty_print` method.

In [None]:
# Display the tallest tree in the forest
max(forest, key=lambda tree: tree.height()).pretty_print()

Named entity resolution maps each entity mention to a canonical entry in a knowledge base, which standardizes the named entities and helps build a database instance.
To enable it, we need to specify which knowledge base to use.
For this tutorial, we will use the **UMLS (Unified Medical Language System)** resolver.
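
To make the idea of resolution concrete, here is a minimal sketch of dictionary-based entity resolution. It is not how ArchiTXT or the UMLS resolver works internally. The `KNOWLEDGE_BASE` mapping, the `resolve` helper, and the concept identifiers are all invented for illustration.

```python
from typing import Optional

# Toy knowledge base: surface forms mapped to placeholder concept identifiers.
# These identifiers are invented for illustration and are not real UMLS codes.
KNOWLEDGE_BASE = {
    'heart attack': 'CONCEPT-001',
    'myocardial infarction': 'CONCEPT-001',
    'fever': 'CONCEPT-002',
    'pyrexia': 'CONCEPT-002',
}


def resolve(mention: str) -> Optional[str]:
    """Return the canonical identifier for a mention, or None if unknown."""
    return KNOWLEDGE_BASE.get(mention.strip().lower())


mentions = ['Heart attack', 'myocardial infarction', 'Pyrexia', 'cough']
resolved = {mention: resolve(mention) for mention in mentions}
print(resolved)
```

Different surface forms of the same concept end up with the same identifier, which is what lets them be treated as one value in a database instance.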

Let's now parse more sentences from the corpus.

In [None]:
forest = await raw_load_corpus(
    ['MACCROBAT2020.zip'],
    ['English'],
    parser=parser,
    sample=800,
    cache=False,
    resolver_name='umls',
)

In [None]:
# Display the tallest tree in the forest
max(forest, key=lambda tree: tree.height()).pretty_print()

**ArchiTXT** can then automatically structure parsed text into a **database-friendly format**.

In [None]:
from architxt.simplification.tree_rewriting import rewrite

new_forest = rewrite(forest, epoch=20, min_support=10, tau=0.8)
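
The tutorial does not explain the `tau` and `min_support` arguments. Assuming they act as a similarity threshold and a minimum frequency, the sketch below gives a loose illustration of how two such settings can interact. It is not ArchiTXT's algorithm: the `jaccard` and `frequent_clusters` helpers, the greedy clustering, and the entity labels are all made up for the example.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two sets of labels."""
    return len(a & b) / len(a | b) if a | b else 1.0


def frequent_clusters(items, tau, min_support):
    """Greedily cluster label sets, keeping clusters with at least min_support members."""
    clusters = []  # each cluster is a list of label sets; the first is its representative
    for item in items:
        for cluster in clusters:
            if jaccard(item, cluster[0]) >= tau:
                cluster.append(item)
                break
        else:
            # No sufficiently similar cluster was found: start a new one
            clusters.append([item])
    return [cluster for cluster in clusters if len(cluster) >= min_support]


# Each set holds the (made-up) entity labels found under one subtree
subtrees = [
    frozenset({'AGE', 'SEX'}),
    frozenset({'AGE', 'SEX'}),
    frozenset({'AGE', 'SEX', 'HISTORY'}),
    frozenset({'DRUG', 'DOSAGE'}),
]

kept = frequent_clusters(subtrees, tau=0.6, min_support=3)
print(len(kept), [sorted(labels) for labels in kept[0]])
```

In this toy setting, raising `tau` to 0.8 separates `{'AGE', 'SEX', 'HISTORY'}` from the other two sets (their similarity is 2/3), so no cluster reaches the minimum support and nothing is kept.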

In [None]:
# Display the tallest tree in the rewritten forest
max(new_forest, key=lambda tree: tree.height()).pretty_print()

Now that we have a structured instance, we can extract its schema.
The schema provides a **formal representation** of the extracted data.

In [None]:
from architxt.schema import Schema

schema = Schema.from_forest(new_forest, keep_unlabelled=False)
print(schema.as_cfg())
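
To give an intuition for what a schema expressed as a grammar captures, the sketch below collects the distinct parent-to-children rules found in a toy forest. It does not use ArchiTXT's `Schema` class. The nested-tuple tree encoding, the `productions` helper, and the labels are assumptions made for illustration.

```python
def productions(tree):
    """Yield (parent, children) rules from a (label, child, ...) nested tuple.

    Plain strings are leaves (raw text) and do not produce rules.
    """
    label, *children = tree
    child_labels = tuple(child[0] for child in children if not isinstance(child, str))
    if child_labels:
        yield label, child_labels
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


# Toy forest with made-up labels and values
toy_forest = [
    ('ROOT', ('Patient', ('Age', '34'), ('Sex', 'female'))),
    ('ROOT', ('Patient', ('Age', '61'), ('Sex', 'male')), ('Treatment', ('Drug', 'aspirin'))),
]

# The set of distinct rules plays the role of a grammar for the forest
rules = {rule for tree in toy_forest for rule in productions(tree)}

for parent, children in sorted(rules):
    print(f"{parent} -> {' '.join(children)}")
```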

Not all trees in the structured instance are necessarily valid with respect to the schema.
We can filter the instance to retain only the **valid trees**:

In [None]:
cleaned_forest = schema.extract_valid_trees(new_forest)

Now that we have a structured dataset, we can explore the different **semantic groups**.
Groups represent common patterns across the corpus.

In [None]:
# Extract one dataset per group
datasets = schema.extract_datasets(new_forest)

# Pick an arbitrary group to display
group = set(datasets.keys()).pop()

datasets[group]
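
To get an overview of every group rather than a single arbitrary one, one option is to loop over the mapping. The sketch below assumes only that `datasets` behaves like a dictionary keyed by group name. A toy dictionary of row lists, with made-up groups and values, stands in for it here.

```python
# Toy stand-in for the extracted datasets: group name -> list of rows
toy_datasets = {
    'Patient': [{'Age': '34', 'Sex': 'female'}, {'Age': '61', 'Sex': 'male'}],
    'Treatment': [{'Drug': 'aspirin'}],
}

# Summarize each group: number of rows and sorted column names
summary = {
    name: (len(rows), sorted({column for row in rows for column in row}))
    for name, rows in toy_datasets.items()
}

for name, (row_count, columns) in sorted(summary.items()):
    print(f'{name}: {row_count} rows, columns = {columns}')
```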