Using `spaCy` NER with pyDeid

Basic `spaCy` NER pipeline

[1]:

import spacy

Custom spaCy NER pipelines can be added to pyDeid. spaCy provides four pretrained English pipelines, each downloaded separately:

en_core_web_sm
en_core_web_md
en_core_web_lg
en_core_web_trf

Let’s use the large pipeline, en_core_web_lg, in conjunction with pyDeid. We begin by downloading it.

!python -m spacy download en_core_web_lg

Now we load the pipeline.

[2]:

nlp = spacy.load("en_core_web_lg")

[3]:

nlp.pipe_names

[3]:

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

We can now pass the pipeline as-is to pyDeid through the ner_pipeline argument, which adds it to the set of de-identification steps.

[4]:

from pyDeid import pyDeid

pyDeid(
    original_file = "./../../tests/test.csv",
    note_id_varname = "note_id",
    encounter_id_varname = "genc_id",
    note_varname = "note_text",
    ner_pipeline = nlp,
)

Processing encounter 3, note Record 3: : 3it [00:00,  3.02it/s]

Diagnostics:
                - chars/s = 131.73949687826678
                - s/note = 0.33146222432454425
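The call above names three columns in the input CSV. As a rough illustration of that layout, the sketch below builds a small CSV with the same column names using only the standard library. The row values are invented (the contents of tests/test.csv are not shown here), and the id formats are guessed from the progress message "Processing encounter 3, note Record 3".

```python
import csv
import io

# Column names taken from the pyDeid call above.
FIELDNAMES = ["genc_id", "note_id", "note_text"]

# Hypothetical rows for illustration only; not the real test data.
rows = [
    {"genc_id": "1", "note_id": "Record 1", "note_text": "Pt seen by Dr. Smith."},
    {"genc_id": "2", "note_id": "Record 2", "note_text": "Follow-up booked for Jan 5."},
    {"genc_id": "3", "note_id": "Record 3", "note_text": "Call 555-0100 with results."},
]

# Write to an in-memory buffer; swap in open("notes.csv", "w", newline="")
# to produce a file that could be passed as original_file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Each row holds one note, keyed by an encounter id and a note id, with the free text to de-identify in the note column.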

Add `medspaCy` components

Now let’s instead use a medspaCy tokenizer and sentence parser.

We begin by replacing the default tokenizer with the medspacy_tokenizer.

[5]:

from medspacy.custom_tokenizer import create_medspacy_tokenizer

[6]:

medspacy_tokenizer = create_medspacy_tokenizer(nlp)

[7]:

nlp.tokenizer = medspacy_tokenizer

Next we add PyRuSH sentence splitting. Importing PyRuSHSentencizer registers the medspacy_pyrush factory that add_pipe refers to.

[8]:

from medspacy.sentence_splitting import PyRuSHSentencizer

[9]:

nlp.add_pipe("medspacy_pyrush", before="parser")
nlp.pipe_names

[9]:

['tok2vec',
 'tagger',
 'medspacy_pyrush',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

We can pass this pipeline to pyDeid similarly.

[16]:

pyDeid(
    original_file = "./../../tests/test.csv",
    note_id_varname = "note_id",
    encounter_id_varname = "genc_id",
    note_varname = "note_text",
    ner_pipeline = nlp,
)

Processing encounter 3, note Record 3: : 3it [00:01,  2.77it/s]

Diagnostics:
                - chars/s = 120.9813422064844
                - s/note = 0.3609371980031331
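The two diagnostics blocks can be compared with a little arithmetic. The sketch below assumes that chars/s is total characters divided by elapsed time, that s/note is elapsed time divided by the number of notes, and that both runs processed the 3 notes shown in the progress bar. None of these definitions is stated in the output itself.

```python
N_NOTES = 3  # assumption: read off "3it" in the progress bar

# Values copied from the two Diagnostics blocks above.
runs = {
    "spaCy pipeline": {
        "chars_per_s": 131.73949687826678,
        "s_per_note": 0.33146222432454425,
    },
    "with medspaCy components": {
        "chars_per_s": 120.9813422064844,
        "s_per_note": 0.3609371980031331,
    },
}

totals = {}
for name, d in runs.items():
    elapsed = d["s_per_note"] * N_NOTES       # seconds for the whole file
    total_chars = d["chars_per_s"] * elapsed  # characters processed
    totals[name] = total_chars
    print(f"{name}: {elapsed:.3f} s elapsed, about {total_chars:.0f} characters")

# Relative increase in time per note after adding the medspaCy components.
slowdown = (
    runs["with medspaCy components"]["s_per_note"]
    / runs["spaCy pipeline"]["s_per_note"]
    - 1
)
print(f"relative slowdown: {slowdown:.1%}")
```

Under those assumptions both runs work out to the same total character count, which is a useful sanity check on the interpretation, and the medspaCy variant is somewhat slower per note. With only three notes, the timing difference is too small a sample to generalize from.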

Create a custom NER pipeline for `pyDeid`

Given that pyDeid can accept any spaCy NER pipeline, let’s create a custom pipeline with:

A Med7 base model.
The medspaCy tokenizer.
The medspaCy sentence parser.
A preprocessing component that truecases the input text.

Note that instead of Med7 we could just as easily have used any other spaCy-compatible model, such as one from scispaCy.

!pip install https://huggingface.co/kormilitzin/en_core_med7_lg/resolve/main/en_core_med7_lg-any-py3-none-any.whl

[10]:

med7_model = "en_core_med7_lg" # the model downloaded above

med7_nlp = spacy.load(med7_model)

[11]:

from spacy.language import Language
from spacy.tokens import Doc
import truecase

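@Language.component("truecaser")
def truecaser(doc: Doc) -> Doc:
    """Apply truecasing to the document text."""
    truecased_text = truecase.get_true_case(doc.text)
    # Rebuild the Doc from whitespace-split words. This discards the original
    # tokenization and spacing (including newlines), so character offsets in
    # the returned Doc may not line up with the input text.
    return Doc(doc.vocab, words=truecased_text.split())

# Run the truecaser before every other component.
med7_nlp.add_pipe("truecaser", first=True)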

[11]:

<function __main__.truecaser(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>

[12]:

base_spacy_nlp = spacy.load("en_core_web_lg")

# add medspaCy tokenizer and sentence parser
med7_nlp.tokenizer = create_medspacy_tokenizer(med7_nlp)
med7_nlp.add_pipe("medspacy_pyrush", before="ner")

# replace the Med7 NER with the general-purpose spaCy NER
med7_nlp.remove_pipe("ner")
med7_nlp.add_pipe("ner", source = base_spacy_nlp)

[12]:

<spacy.pipeline.ner.EntityRecognizer at 0x2bcd7cc8270>

[13]:

med7_nlp.pipe_names

[13]:

['truecaser', 'tok2vec', 'medspacy_pyrush', 'ner']

Below we create a sample messy unstructured clinical note (produced by Claude 3.5 Sonnet). We will de-identify this note with our custom NER pipeline.

[14]:

note = """
Pt: J. Smith, 45yo M
CC: SOB, chest pain x 2 days
Hx: HTN, T2DM, +smoker (1ppd x 20yrs)
Meds: metformin, lisinopril
VS: BP 145/92, HR 88, RR 20, T 37.2, SpO2 97% RA
S) pt reports gradual onset SOB w/ exertion, worse w/ lying flat. denies fever, cough, leg swelling. + intermittent L sided chest pain, non-radiating, 6/10, worse w/ deep breath
O) appears mildly SOB, speaking full sentences. Lungs: crackles bil bases. Heart: RRR, no m/r/g. Ext: no edema
ECG: NSR, no ST changes
Labs:
CBC - WBC 9.2, Hgb 13.5, Plt 220
BMP - Na 138, K 4.2, Cr 1.1, Glu 162
Troponin neg
CXR: cardiomegaly, no infiltrates
A/P:

Acute SOB - ?early CHF vs pneumonia

start lasix 40 IV
repeat CXR in AM
trend troponin


HTN - hold lisinopril, recheck BP in AM
DM - cont home meds, check A1c
Smoking - advised cessation, pt interested in patch

Dispo: admit obs, f/u w/ cards
"""

Below we use spaCy’s built-in displaCy visualizer to display the captured entities.

[15]:

from spacy import displacy
doc = med7_nlp(note)
displacy.serve(doc, style="ent")

displaCy

Pt: J. Smith, 45Yo M CC: Sob, chest pain X 2 days DATE Hx: Htn, T2Dm, +Smoker (1Ppd X 20Yrs) Meds: Metformin, WORK_OF_ART Lisinopril DATE vs: BP ORG 145/92, hr 88, CARDINAL Rr 20, t 37.2, CARDINAL Spo2 97% RA ORG s) PT reports gradual Onset Sob W/ exertion, worse W/ lying flat . DENIES fever, cough, leg swelling . + intermittent L sided chest pain, Non-Radiating, 6/10, PRODUCT worse W/ deep breath O) appears mildly Sob, PERSON speaking full sentences . lungs: crackles BIL ORG bases . heart: Rrr, no M/R/G . Ext: PERSON no edema Ecg: Nsr, no St changes LABS: CBC - Wbc 9.2, Hgb 13.5, PLT 220 CARDINAL BMP - NA 138, K 4.2, Cr 1.1, Glu 162 Troponin Neg CXR: cardiomegaly, no Infiltrates A/P: acute Sob PERSON -? early Chf vs pneumonia start Lasix PERSON 40 IV repeat CXR in am trend Troponin Htn PERSON - hold Lisinopril, ORG recheck BP ORG in am DM - CONT ORG home Meds, check A1C smoking - advised cessation, PT ORG interested in patch Dispo: GPE admit Obs, F/U W/ cards


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.

We can pass this custom pipeline to deid_string, similarly to how we did for pyDeid.

[20]:

from pyDeid import deid_string, display_deid

surrogates, new_note = deid_string(note, ner_pipeline = med7_nlp)

display_deid(note, surrogates)

Pt: J. PHI Smith NAME , 45yo M
CC: SOB, chest pain x 2 days
Hx: HTN, T2DM, +smoker (1ppd x 20yrs)
Meds: metformin, lisinopril
VS: BP 145/92, HR 88, RR 20, T 37.2, SpO2 97% RA
S) pt reports gradual onset SOB w/ exertion, worse w/ lying flat. denies fever, cough, leg swelling. + intermittent L sided chest pain, non-radiating, 6/10 DATE , worse w/ deep breath
O) appears mildly S OB, NAME speaking full sentences. Lungs: crackles bil bases. Heart: RRR, no m/r/g. Ext: no NAME edema
ECG: NSR, no ST changes
Labs:
CBC - WBC 9.2, Hgb 13.5, Plt 220
BMP - Na 138, K 4.2, Cr 1.1, Glu 162
Troponin neg
CXR: cardiomegaly, no infiltrates
A/P:

Acute SOB - NAME ?early CHF vs pneumonia

start la six 4 NAME 0 IV
repeat CXR in AM
trend tr oponin NAME
HTN NAME - hold lisinopril, recheck BP in AM
DM - cont home meds, check A1c
Smoking - advised cessation, pt interested in patch

Dispo: admit obs, f/u w/ cards

Fine-tuning this NER pipeline may reduce these false positives; we leave that to future work.
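Some highlights in the output above also look shifted relative to the words they target (for example "la six 4 NAME 0 IV"). One plausible cause, offered here as an assumption rather than a confirmed diagnosis, is that the truecaser rebuilds the document from whitespace-split words, so entity offsets refer to a normalized text rather than to the original note. The standard-library sketch below shows how collapsing whitespace shifts character offsets, and how an index map can translate them back. The helper normalize_with_offsets is hypothetical and is not part of pyDeid or spaCy.

```python
def normalize_with_offsets(text):
    """Collapse whitespace runs to single spaces.

    Returns (normalized, index_map), where index_map[i] is the position
    in `text` of the character normalized[i].
    """
    chars, index_map = [], []
    for i, ch in enumerate(text):
        if ch.isspace():
            # Emit one space per whitespace run, and none at the start.
            if chars and chars[-1] != " ":
                chars.append(" ")
                index_map.append(i)
        else:
            chars.append(ch)
            index_map.append(i)
    # Drop a trailing space left by whitespace at the end of the text.
    if chars and chars[-1] == " ":
        chars.pop()
        index_map.pop()
    return "".join(chars), index_map


note = "\nPt: J. Smith\n\nstart lasix 40 IV\n"
normalized, index_map = normalize_with_offsets(note)

# Offsets of a span found in the normalized text.
start = normalized.index("lasix")
end = start + len("lasix")

# Applying those offsets directly to the original note misses the word.
naive_slice = note[start:end]

# Mapping them back through index_map recovers it.
orig_start = index_map[start]
orig_end = index_map[end - 1] + 1
mapped_slice = note[orig_start:orig_end]

print(repr(naive_slice), repr(mapped_slice))
```

If the shifted spans do come from this kind of normalization, keeping an index map like this, or having the preprocessing component preserve the original text, would be one way to report spans against the original note.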
