Using spaCy NER with pyDeid

Basic spaCy NER pipeline

[1]:
import spacy

Custom spaCy NER pipelines can be added to pyDeid. spaCy provides four pretrained English pipelines:

  • en_core_web_sm

  • en_core_web_md

  • en_core_web_lg

  • en_core_web_trf

Let’s use the large pipeline (en_core_web_lg) in conjunction with pyDeid. We begin by downloading it.

!python -m spacy download en_core_web_lg

Now we load the pipeline.

[2]:
nlp = spacy.load("en_core_web_lg")
[3]:
nlp.pipe_names
[3]:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
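Before handing the pipeline to pyDeid, it can help to confirm that it contains an `ner` component and to see which labels that component can emit. The following is a minimal sketch: `ner_labels` is a hypothetical helper name (not part of pyDeid or spaCy) that relies only on spaCy's `pipe_names` and `get_pipe`.

```python
def ner_labels(nlp):
    """Return the sorted labels of the pipeline's "ner" component.

    `nlp` is expected to be a loaded spaCy pipeline. Returns an empty list
    when the pipeline has no component named "ner".
    """
    if "ner" not in nlp.pipe_names:
        return []
    return sorted(nlp.get_pipe("ner").labels)


# Example usage with the pipeline loaded above:
# ner_labels(nlp)
```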

We can now pass the pipeline as-is to pyDeid, which adds it to its set of de-identification steps.

[4]:
from pyDeid import pyDeid

pyDeid(
    original_file = "./../../tests/test.csv",
    note_id_varname = "note_id",
    encounter_id_varname = "genc_id",
    note_varname = "note_text",
    ner_pipeline = nlp,
)
Processing encounter 3, note Record 3: : 3it [00:00,  3.02it/s]
Diagnostics:
                - chars/s = 131.73949687826678
                - s/note = 0.33146222432454425

Add medspaCy components

Now let’s instead use a medspaCy tokenizer and sentence parser.

We begin by replacing the default tokenizer with the medspacy_tokenizer.

[5]:
from medspacy.custom_tokenizer import create_medspacy_tokenizer
[6]:
medspacy_tokenizer = create_medspacy_tokenizer(nlp)
[7]:
nlp.tokenizer = medspacy_tokenizer
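To see what the tokenizer swap changes, one can compare the tokens two tokenizers produce for the same text. This is a sketch only: `token_texts` and `tokenization_diff` are hypothetical helper names, and they assume each tokenizer is a callable that takes a string and returns an iterable of tokens with a `.text` attribute, as spaCy tokenizers do.

```python
def token_texts(tokenizer, text):
    """Return the token strings produced by a spaCy-style tokenizer."""
    return [token.text for token in tokenizer(text)]


def tokenization_diff(tokenizer_a, tokenizer_b, text):
    """Return the tokens that only one of the two tokenizers produces."""
    a = token_texts(tokenizer_a, text)
    b = token_texts(tokenizer_b, text)
    return {
        "only_a": [t for t in a if t not in b],
        "only_b": [t for t in b if t not in a],
    }


# Example usage (load a fresh pipeline to obtain the default tokenizer):
# default_tokenizer = spacy.load("en_core_web_lg").tokenizer
# tokenization_diff(default_tokenizer, medspacy_tokenizer, "BP 145/92, HR 88")
```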

Next we use PyRuSH sentence parsing.

[8]:
from medspacy.sentence_splitting import PyRuSHSentencizer
[9]:
nlp.add_pipe("medspacy_pyrush", before="parser")
nlp.pipe_names
[9]:
['tok2vec',
 'tagger',
 'medspacy_pyrush',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

We can pass this pipeline to pyDeid similarly.

[16]:
pyDeid(
    original_file = "./../../tests/test.csv",
    note_id_varname = "note_id",
    encounter_id_varname = "genc_id",
    note_varname = "note_text",
    ner_pipeline = nlp,
)
Processing encounter 3, note Record 3: : 3it [00:01,  2.77it/s]
Diagnostics:
                - chars/s = 120.9813422064844
                - s/note = 0.3609371980031331
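For a rough sense of the cost of the medspaCy components, the per-note times of the two runs can be compared. The figures are copied from the diagnostics printed above. With only three notes this is a very noisy estimate, so treat it as indicative at best.

```python
# s/note figures copied from the two runs above.
baseline_s_per_note = 0.33146222432454425  # en_core_web_lg as-is
medspacy_s_per_note = 0.3609371980031331   # with medspaCy tokenizer and PyRuSH

# Relative change in time per note between the two runs.
relative_change = medspacy_s_per_note / baseline_s_per_note - 1.0
print(f"relative change in s/note: {relative_change:+.1%}")
```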

Create a custom NER pipeline for pyDeid

Given that pyDeid can accept any spaCy NER pipeline, let’s create a custom pipeline with:

  • A Med7 base model.

  • The medspaCy tokenizer.

  • The medspaCy sentence parser.

  • A preprocessing component that truecases the input text.

Note that instead of the Med7 model, we could just as easily have used any other spaCy-compatible model, such as one from scispaCy.

!pip install https://huggingface.co/kormilitzin/en_core_med7_lg/resolve/main/en_core_med7_lg-any-py3-none-any.whl
[10]:
med7_model = "en_core_med7_lg" # the model downloaded above

med7_nlp = spacy.load(med7_model)
[11]:
from spacy.language import Language
from spacy.tokens import Doc
import truecase

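@Language.component("truecaser")
def truecaser(doc: Doc) -> Doc:
    """Apply truecasing to the document text."""
    truecased_text = truecase.get_true_case(doc.text)
    # Rebuild the Doc from whitespace-split words. Note that this discards the
    # original tokenization and whitespace, so character offsets may shift.
    return Doc(doc.vocab, words=truecased_text.split())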

med7_nlp.add_pipe("truecaser", first=True)
[11]:
<function __main__.truecaser(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>
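One thing to watch for: `truecase.get_true_case` may return text whose spacing differs from the input, and the component above rebuilds the Doc from whitespace-split words. If downstream code maps entity character offsets back onto the original string (an assumption about pyDeid, not something confirmed here), any change in length would misalign the spans. Below is a minimal sketch of a guard. `recase_preserving_offsets` is a hypothetical helper that accepts the recased text only when it differs from the original in letter case alone.

```python
def recase_preserving_offsets(original: str, recased: str) -> str:
    """Return `recased` only if it differs from `original` by letter case.

    Falls back to `original` whenever the two strings differ in length or in
    any respect other than case, so character offsets stay valid.
    """
    same_length = len(original) == len(recased)
    if same_length and original.casefold() == recased.casefold():
        return recased
    return original


# Example usage inside a component (truecase is the library imported above):
# safe_text = recase_preserving_offsets(doc.text, truecase.get_true_case(doc.text))
```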
[12]:
base_spacy_nlp = spacy.load("en_core_web_lg")

# add medspaCy tokenizer and sentence parser
med7_nlp.tokenizer = create_medspacy_tokenizer(med7_nlp)
med7_nlp.add_pipe("medspacy_pyrush", before="ner")

# replace the Med7 NER with the spaCy NER from en_core_web_lg
med7_nlp.remove_pipe("ner")
med7_nlp.add_pipe("ner", source = base_spacy_nlp)
[12]:
<spacy.pipeline.ner.EntityRecognizer at 0x2bcd7cc8270>
[13]:
med7_nlp.pipe_names
[13]:
['truecaser', 'tok2vec', 'medspacy_pyrush', 'ner']

Below we create a sample messy unstructured clinical note (produced by Claude 3.5 Sonnet). We will de-identify this note with our custom NER pipeline.

[14]:
note = """
Pt: J. Smith, 45yo M
CC: SOB, chest pain x 2 days
Hx: HTN, T2DM, +smoker (1ppd x 20yrs)
Meds: metformin, lisinopril
VS: BP 145/92, HR 88, RR 20, T 37.2, SpO2 97% RA
S) pt reports gradual onset SOB w/ exertion, worse w/ lying flat. denies fever, cough, leg swelling. + intermittent L sided chest pain, non-radiating, 6/10, worse w/ deep breath
O) appears mildly SOB, speaking full sentences. Lungs: crackles bil bases. Heart: RRR, no m/r/g. Ext: no edema
ECG: NSR, no ST changes
Labs:
CBC - WBC 9.2, Hgb 13.5, Plt 220
BMP - Na 138, K 4.2, Cr 1.1, Glu 162
Troponin neg
CXR: cardiomegaly, no infiltrates
A/P:

Acute SOB - ?early CHF vs pneumonia

start lasix 40 IV
repeat CXR in AM
trend troponin


HTN - hold lisinopril, recheck BP in AM
DM - cont home meds, check A1c
Smoking - advised cessation, pt interested in patch

Dispo: admit obs, f/u w/ cards
"""

Below we use spaCy's displaCy visualizer to display the captured entities.

[15]:
from spacy import displacy
doc = med7_nlp(note)
displacy.serve(doc, style="ent")
displaCy
Pt: J. Smith, 45Yo M CC: Sob, chest pain X 2 days DATE Hx: Htn, T2Dm, +Smoker (1Ppd X 20Yrs) Meds: Metformin, WORK_OF_ART Lisinopril DATE vs: BP ORG 145/92, hr 88, CARDINAL Rr 20, t 37.2, CARDINAL Spo2 97% RA ORG s) PT reports gradual Onset Sob W/ exertion, worse W/ lying flat . DENIES fever, cough, leg swelling . + intermittent L sided chest pain, Non-Radiating, 6/10, PRODUCT worse W/ deep breath O) appears mildly Sob, PERSON speaking full sentences . lungs: crackles BIL ORG bases . heart: Rrr, no M/R/G . Ext: PERSON no edema Ecg: Nsr, no St changes LABS: CBC - Wbc 9.2, Hgb 13.5, PLT 220 CARDINAL BMP - NA 138, K 4.2, Cr 1.1, Glu 162 Troponin Neg CXR: cardiomegaly, no Infiltrates A/P: acute Sob PERSON -? early Chf vs pneumonia start Lasix PERSON 40 IV repeat CXR in am trend Troponin Htn PERSON - hold Lisinopril, ORG recheck BP ORG in am DM - CONT ORG home Meds, check A1C smoking - advised cessation, PT ORG interested in patch Dispo: GPE admit Obs, F/U W/ cards

Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
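The rendered output is hard to scan for patterns, so a per-label tally can be a useful complement. The sketch below uses a hypothetical helper, `entity_label_counts`, which only assumes that the processed Doc exposes `ents` with a `label_` attribute, as spaCy Docs do.

```python
from collections import Counter


def entity_label_counts(doc):
    """Count the entities in a processed spaCy Doc, grouped by label."""
    return Counter(ent.label_ for ent in doc.ents)


# Example usage with the Doc created above:
# entity_label_counts(doc).most_common()
```

In a notebook, `displacy.render(doc, style="ent")` renders the entities inline rather than starting a server.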

We can pass this custom pipeline to deid_string, just as we did with pyDeid.

[20]:
from pyDeid import deid_string, display_deid

surrogates, new_note = deid_string(note, ner_pipeline = med7_nlp)

display_deid(note, surrogates)

Pt: J. PHI Smith NAME , 45yo M
CC: SOB, chest pain x 2 days
Hx: HTN, T2DM, +smoker (1ppd x 20yrs)
Meds: metformin, lisinopril
VS: BP 145/92, HR 88, RR 20, T 37.2, SpO2 97% RA
S) pt reports gradual onset SOB w/ exertion, worse w/ lying flat. denies fever, cough, leg swelling. + intermittent L sided chest pain, non-radiating, 6/10 DATE , worse w/ deep breath
O) appears mildly S OB, NAME speaking full sentences. Lungs: crackles bil bases. Heart: RRR, no m/r/g. Ext: no NAME edema
ECG: NSR, no ST changes
Labs:
CBC - WBC 9.2, Hgb 13.5, Plt 220
BMP - Na 138, K 4.2, Cr 1.1, Glu 162
Troponin neg
CXR: cardiomegaly, no infiltrates
A/P:

Acute SOB - NAME ?early CHF vs pneumonia

start la six 4 NAME 0 IV
repeat CXR in AM
trend tr oponin NAME
HTN NAME - hold lisinopril, recheck BP in AM
DM - cont home meds, check A1c
Smoking - advised cessation, pt interested in patch

Dispo: admit obs, f/u w/ cards

The output above contains several false positives (for example, "SOB" and "HTN" tagged as NAME). Fine-tuning the NER pipeline might reduce these, which we leave as possible future work.
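As a lighter-weight alternative to fine-tuning, one could filter out entities whose text matches known clinical terms before they reach pyDeid. The following is only a sketch under stated assumptions: `CLINICAL_TERMS` is an illustrative stoplist drawn from the note above, `drop_clinical_term_entities` is a hypothetical function name, and whether pyDeid honours the filtered `doc.ents` has not been verified here.

```python
# Illustrative stoplist (an assumption), taken from terms in the sample note.
CLINICAL_TERMS = {"sob", "htn", "lasix", "troponin"}


def drop_clinical_term_entities(doc, terms=CLINICAL_TERMS):
    """Remove entities whose normalized text is a known clinical term."""
    doc.ents = [
        ent for ent in doc.ents
        # Strip surrounding punctuation so "SOB," still matches "sob".
        if ent.text.strip(" ,.-").lower() not in terms
    ]
    return doc


# To use it in a spaCy pipeline, register it and add it after "ner":
# from spacy.language import Language
# Language.component("drop_clinical_term_entities", func=drop_clinical_term_entities)
# med7_nlp.add_pipe("drop_clinical_term_entities", after="ner")
```

A fixed stoplist will not generalize the way a fine-tuned model could, but it is quick to try and easy to audit.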