Using spaCy NER with pyDeid
Basic spaCy NER pipeline
[1]:
import spacy
Custom spaCy NER pipelines can be added to pyDeid. spaCy provides four pretrained English pipelines, each installed as a separate package:
en_core_web_sm
en_core_web_md
en_core_web_lg
en_core_web_trf
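
Since each of these pipelines is installed as an ordinary Python package, one way to see which of them are already available before downloading anything is to ask the import system. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of spaCy or pyDeid.

```python
import importlib.util

# The four pretrained English pipelines listed above.
PIPELINES = ["en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf"]


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Report which pipelines are already available in the current environment.
status = {name: is_installed(name) for name in PIPELINES}
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'not installed'}")
```

A pipeline reported as not installed would need to be downloaded before spacy.load can find it.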
Let’s use the large pipeline, en_core_web_lg, in conjunction with pyDeid. This begins with downloading the pipeline.
!python -m spacy download en_core_web_lg
Now we load the pipeline.
[2]:
nlp = spacy.load("en_core_web_lg")
[3]:
nlp.pipe_names
[3]:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
We can now pass the pipeline as-is to pyDeid to add it to the set of de-identification steps.
[4]:
from pyDeid import pyDeid
pyDeid(
    original_file = "./../../tests/test.csv",
    note_id_varname = "note_id",
    encounter_id_varname = "genc_id",
    note_varname = "note_text",
    ner_pipeline = nlp,
)
Processing encounter 3, note Record 3: : 3it [00:00, 3.02it/s]
Diagnostics:
- chars/s = 131.73949687826678
- s/note = 0.33146222432454425
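
The call above reads a CSV file and is told which columns hold the note ID, the encounter ID, and the note text. The contents of tests/test.csv are not shown here, so the following is only a sketch of what a compatible input file might look like: the column names are taken from the call above, while the rows and the file name are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Column names taken from the pyDeid call above; the rows below are invented.
FIELDNAMES = ["genc_id", "note_id", "note_text"]
rows = [
    {"genc_id": 1, "note_id": "Record 1", "note_text": "Seen by Dr. Jane Doe on 2020-01-05."},
    {"genc_id": 2, "note_id": "Record 2", "note_text": "Patient John Roe called from 416-555-0100."},
]

# Write the example file into a temporary directory.
csv_path = Path(tempfile.mkdtemp()) / "example_notes.csv"
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Read the header back to confirm the column layout.
with csv_path.open(newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))
print(header)
```

A file like this could then be given as original_file, with the three *_varname arguments naming its columns.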
Add medspaCy components
Now let’s instead use a medspaCy tokenizer and sentence parser.
We begin by replacing the default tokenizer with the medspacy_tokenizer.
[5]:
from medspacy.custom_tokenizer import create_medspacy_tokenizer
[6]:
medspacy_tokenizer = create_medspacy_tokenizer(nlp)
[7]:
nlp.tokenizer = medspacy_tokenizer
Next we use PyRuSH for sentence parsing.
[8]:
# Importing PyRuSHSentencizer registers the "medspacy_pyrush" factory used below.
from medspacy.sentence_splitting import PyRuSHSentencizer
[9]:
nlp.add_pipe("medspacy_pyrush", before="parser")
nlp.pipe_names
[9]:
['tok2vec',
'tagger',
'medspacy_pyrush',
'parser',
'attribute_ruler',
'lemmatizer',
'ner']
We can pass this pipeline to pyDeid in the same way.
[16]:
pyDeid(
    original_file = "./../../tests/test.csv",
    note_id_varname = "note_id",
    encounter_id_varname = "genc_id",
    note_varname = "note_text",
    ner_pipeline = nlp,
)
Processing encounter 3, note Record 3: : 3it [00:01, 2.77it/s]
Diagnostics:
- chars/s = 120.9813422064844
- s/note = 0.3609371980031331
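
The two Diagnostics outputs can be compared directly to get a rough sense of what the medspaCy components cost. The sketch below just divides the two reported s/note values. Both come from a single run over three notes, so the result (on the order of 9% more time per note) should be read as indicative only, not as a benchmark.

```python
# Values copied from the two "Diagnostics" outputs above.
baseline_s_per_note = 0.33146222432454425  # en_core_web_lg as-is
medspacy_s_per_note = 0.3609371980031331   # with medspaCy tokenizer and PyRuSH

# Ratio of per-note processing times and the relative overhead in percent.
slowdown = medspacy_s_per_note / baseline_s_per_note
extra_pct = (slowdown - 1) * 100
print(f"{slowdown:.3f}x, about {extra_pct:.1f}% more time per note")
```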
Create a custom NER pipeline for pyDeid
Given that pyDeid can accept any spaCy NER pipeline, let’s create a custom pipeline with:
A Med7 base model.
The medspaCy tokenizer.
The medspaCy sentence parser.
A preprocessing component that truecases the input text.
Note that rather than the Med7 model, we could just as easily have used any other spaCy model, such as one from SciSpaCy.
[10]:
med7_model = "en_core_med7_lg" # assumes the Med7 model is already installed
med7_nlp = spacy.load(med7_model)
[11]:
from spacy.language import Language
from spacy.tokens import Doc
import truecase

@Language.component("truecaser")
def truecaser(doc: Doc) -> Doc:
    """Apply truecasing to the document text."""
    truecased_text = truecase.get_true_case(doc.text)
    # Rebuild the Doc from the whitespace-split words of the truecased text.
    return Doc(doc.vocab, words=truecased_text.split())

# Run the truecaser before every other component in the pipeline.
med7_nlp.add_pipe("truecaser", first=True)
[11]:
<function __main__.truecaser(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>
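
One property of the truecaser component worth keeping in mind is that it rebuilds the Doc from whitespace-split words. When no `spaces` argument is passed, spaCy's Doc treats every word as followed by a single space, so newlines and repeated spaces from the original note do not survive in the rebuilt text. The standard-library sketch below imitates only that split-and-rejoin step (no spaCy or truecasing involved) on a made-up snippet, to show how character offsets can shift. Whether this matters depends on how downstream steps use offsets, which is not covered here.

```python
# A short multi-line snippet of clinical shorthand, made up for illustration.
original = "Pt: J. Smith, 45yo M\nCC: SOB,  chest pain x 2 days\n"

# Imitate the component: split on whitespace, then rejoin with single spaces.
words = original.split()
rebuilt = " ".join(words)

print(repr(rebuilt))
print(len(original), len(rebuilt))

# The same word can start at a different character offset after rebuilding.
original_offset = original.index("chest")
rebuilt_offset = rebuilt.index("chest")
print(original_offset, rebuilt_offset)
```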
[12]:
base_spacy_nlp = spacy.load("en_core_web_lg")
# add medspaCy tokenizer and sentence parser
med7_nlp.tokenizer = create_medspacy_tokenizer(med7_nlp)
med7_nlp.add_pipe("medspacy_pyrush", before="ner")
# replace the Med7 NER component with the spaCy NER component from en_core_web_lg
med7_nlp.remove_pipe("ner")
med7_nlp.add_pipe("ner", source = base_spacy_nlp)
[12]:
<spacy.pipeline.ner.EntityRecognizer at 0x2bcd7cc8270>
[13]:
med7_nlp.pipe_names
[13]:
['truecaser', 'tok2vec', 'medspacy_pyrush', 'ner']
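
The order of these components matters: the truecaser should run first, and the sentence splitter should come before ner. As a small sanity check, the relative order can be verified on the list of names alone. The helper `appears_in_order` below is hypothetical (it is not part of spaCy or pyDeid) and works on any list of strings.

```python
from typing import Sequence


def appears_in_order(pipe_names: Sequence[str], expected: Sequence[str]) -> bool:
    """Return True if every name in `expected` occurs in `pipe_names`, in that relative order."""
    names = list(pipe_names)
    position = -1
    for name in expected:
        if name not in names:
            return False
        index = names.index(name)
        if index <= position:
            return False
        position = index
    return True


# Component names copied from the output above.
pipe_names = ["truecaser", "tok2vec", "medspacy_pyrush", "ner"]

print(appears_in_order(pipe_names, ["truecaser", "medspacy_pyrush", "ner"]))  # True
print(appears_in_order(pipe_names, ["ner", "truecaser"]))  # False
```

With a loaded pipeline, the same check could be applied to med7_nlp.pipe_names.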
Below we create a sample messy unstructured clinical note (produced by Claude 3.5 Sonnet). We will de-identify this note with our custom NER pipeline.
[14]:
note = """
Pt: J. Smith, 45yo M
CC: SOB, chest pain x 2 days
Hx: HTN, T2DM, +smoker (1ppd x 20yrs)
Meds: metformin, lisinopril
VS: BP 145/92, HR 88, RR 20, T 37.2, SpO2 97% RA
S) pt reports gradual onset SOB w/ exertion, worse w/ lying flat. denies fever, cough, leg swelling. + intermittent L sided chest pain, non-radiating, 6/10, worse w/ deep breath
O) appears mildly SOB, speaking full sentences. Lungs: crackles bil bases. Heart: RRR, no m/r/g. Ext: no edema
ECG: NSR, no ST changes
Labs:
CBC - WBC 9.2, Hgb 13.5, Plt 220
BMP - Na 138, K 4.2, Cr 1.1, Glu 162
Troponin neg
CXR: cardiomegaly, no infiltrates
A/P:
Acute SOB - ?early CHF vs pneumonia
start lasix 40 IV
repeat CXR in AM
trend troponin
HTN - hold lisinopril, recheck BP in AM
DM - cont home meds, check A1c
Smoking - advised cessation, pt interested in patch
Dispo: admit obs, f/u w/ cards
"""
Below we use spaCy’s built-in displaCy visualizer to display the captured entities.
[15]:
from spacy import displacy
doc = med7_nlp(note)
displacy.serve(doc, style="ent")
Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...
Shutting down server on port 5000.
We can pass this custom pipeline to deid_string, just as we did for pyDeid.
[20]:
from pyDeid import deid_string, display_deid
surrogates, new_note = deid_string(note, ner_pipeline = med7_nlp)
display_deid(note, surrogates)
These false positives could perhaps be reduced by fine-tuning this NER pipeline, which may be part of future work.