{ "cells": [ { "cell_type": "markdown", "id": "395d7f62-3dd8-458c-b42a-d766c761eb34", "metadata": {}, "source": [ "# Using `spaCy` NER with pyDeid" ] }, { "cell_type": "markdown", "id": "6fa12c0f-7895-4d70-9b85-24ee900c072b", "metadata": {}, "source": [ "## Basic `spaCy` NER pipeline" ] }, { "cell_type": "code", "execution_count": 1, "id": "ae982bee-72f6-4472-afc2-6fefcd5134d9", "metadata": {}, "outputs": [], "source": [ "import spacy" ] }, { "cell_type": "markdown", "id": "3051062e-3912-4650-b906-f64b92718aef", "metadata": {}, "source": [ "Custom `spaCy` NER pipelines can be added to `pyDeid`. `spaCy` offers four pretrained English pipelines, each downloaded separately:\n", "\n", "* `en_core_web_sm`\n", "* `en_core_web_md`\n", "* `en_core_web_lg`\n", "* `en_core_web_trf`\n", "\n", "Let's use the large pipeline, `en_core_web_lg`, in conjunction with `pyDeid`. This begins with downloading the pipeline." ] }, { "cell_type": "raw", "id": "6a102900-209a-4f7e-8cb7-8b79ee1ef7c9", "metadata": {}, "source": [ "!python -m spacy download en_core_web_lg" ] }, { "cell_type": "markdown", "id": "929d3179-ae6a-4bd7-bd59-230ef5230e40", "metadata": {}, "source": [ "Now we load the pipeline." ] }, { "cell_type": "code", "execution_count": 2, "id": "4490dbb1-d7f3-4efc-90f6-a6a1dd95a8db", "metadata": {}, "outputs": [], "source": [ "nlp = spacy.load(\"en_core_web_lg\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "f0fac81e-6dc9-423d-b57d-cb4ed53c0787", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.pipe_names" ] }, { "cell_type": "markdown", "id": "2e8bb68d-18ac-47e1-b79e-7f3ea7b24f54", "metadata": {}, "source": [ "We can now pass the pipeline as-is to `pyDeid`, which adds it to its set of de-identification steps."
] }, { "cell_type": "code", "execution_count": 4, "id": "b8a2691d-910c-482d-91c3-0687c2258f1e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Processing encounter 3, note Record 3: : 3it [00:00, 3.02it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Diagnostics:\n", " - chars/s = 131.73949687826678\n", " - s/note = 0.33146222432454425\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "from pyDeid import pyDeid\n", "\n", "pyDeid(\n", " original_file = \"./../../tests/test.csv\",\n", " note_id_varname = \"note_id\",\n", " encounter_id_varname = \"genc_id\",\n", " note_varname = \"note_text\",\n", " ner_pipeline = nlp,\n", ")" ] }, { "cell_type": "markdown", "id": "4171cee2-9c4d-461a-a64f-23ff068dd6de", "metadata": {}, "source": [ "## Add `medspaCy` components" ] }, { "cell_type": "markdown", "id": "c04131f5-599b-478f-880f-694059179c69", "metadata": {}, "source": [ "Now let's instead use a `medspaCy` tokenizer and sentence parser.\n", "\n", "We begin by replacing the default tokenizer with the `medspacy_tokenizer`." ] }, { "cell_type": "code", "execution_count": 5, "id": "6b2dfc4d-ce52-44d4-9cff-fa04b95de9f2", "metadata": {}, "outputs": [], "source": [ "from medspacy.custom_tokenizer import create_medspacy_tokenizer" ] }, { "cell_type": "code", "execution_count": 6, "id": "0338b77f-af00-486c-bdc8-61e2791772ec", "metadata": {}, "outputs": [], "source": [ "medspacy_tokenizer = create_medspacy_tokenizer(nlp)" ] }, { "cell_type": "code", "execution_count": 7, "id": "d54246f9-b098-4a74-b816-4425de26834c", "metadata": {}, "outputs": [], "source": [ "nlp.tokenizer = medspacy_tokenizer" ] }, { "cell_type": "markdown", "id": "3f5a6488-67c9-4f8c-b593-b02ddd041b5c", "metadata": {}, "source": [ "Next we use `PyRuSH` sentence parsing." 
] }, { "cell_type": "code", "execution_count": 8, "id": "1a3eda74-1a54-4c00-8052-0c8a1456d604", "metadata": {}, "outputs": [], "source": [ "from medspacy.sentence_splitting import PyRuSHSentencizer" ] }, { "cell_type": "code", "execution_count": 9, "id": "eed92698-06f3-46da-b8be-3ba8654bbc09", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['tok2vec',\n", " 'tagger',\n", " 'medspacy_pyrush',\n", " 'parser',\n", " 'attribute_ruler',\n", " 'lemmatizer',\n", " 'ner']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.add_pipe(\"medspacy_pyrush\", before=\"parser\")\n", "nlp.pipe_names" ] }, { "cell_type": "markdown", "id": "3d7d3ce3-afa9-467d-972f-3e49b0752aa7", "metadata": {}, "source": [ "We can pass this pipeline to `pyDeid` in the same way." ] }, { "cell_type": "code", "execution_count": 16, "id": "38202984-f35b-4612-a4e5-4a06798d6221", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Processing encounter 3, note Record 3: : 3it [00:01, 2.77it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Diagnostics:\n", " - chars/s = 120.9813422064844\n", " - s/note = 0.3609371980031331\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "pyDeid(\n", " original_file = \"./../../tests/test.csv\",\n", " note_id_varname = \"note_id\",\n", " encounter_id_varname = \"genc_id\",\n", " note_varname = \"note_text\",\n", " ner_pipeline = nlp,\n", ")" ] }, { "cell_type": "markdown", "id": "38ca0589-9e60-4a56-9ab7-55368d35c2b0", "metadata": {}, "source": [ "## Create a custom NER pipeline for `pyDeid`" ] }, { "cell_type": "markdown", "id": "d101896e-d732-42fe-87cb-379bccc7dde7", "metadata": {}, "source": [ "Given that `pyDeid` can accept any `spaCy` NER pipeline, let's create a custom pipeline with:\n", "\n", "* A `Med7` base model.\n", "* The `medspaCy` tokenizer.\n", "* The `medspaCy` sentence parser.\n", "* A preprocessing component that truecases the input text with the `truecase` package.\n", "\n", "Note that instead of the `Med7` model, we could just as easily have used any other `spaCy` model, such as one from `scispaCy`." ] }, { "cell_type": "raw", "id": "c7013e7a-3960-4845-a30f-cf16c29a67f6", "metadata": {}, "source": [ "!pip install https://huggingface.co/kormilitzin/en_core_med7_lg/resolve/main/en_core_med7_lg-any-py3-none-any.whl" ] }, { "cell_type": "code", "execution_count": 10, "id": "b3acf73a-3805-49ad-b88e-ad78a15b1f93", "metadata": {}, "outputs": [], "source": [ "med7_model = \"en_core_med7_lg\" # the model installed above\n", "\n", "med7_nlp = spacy.load(med7_model)" ] }, { "cell_type": "code", "execution_count": 11, "id": "dc6c5ffe-0508-4097-9cfc-50d2b93f9715", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<function __main__.truecaser(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from spacy.language import Language\n", "from spacy.tokens import Doc\n", "import truecase\n", "\n", "@Language.component(\"truecaser\")\n", "def truecaser(doc: Doc) -> Doc:\n", "    \"\"\"Apply truecasing to the document text.\"\"\"\n", "    truecased_text = truecase.get_true_case(doc.text)\n", "    # Rebuild the Doc from whitespace-split truecased tokens; the original spacing is not preserved.\n", "    return Doc(doc.vocab, words=truecased_text.split())\n", "\n", "med7_nlp.add_pipe(\"truecaser\", first=True)" ] }, { "cell_type": "code", "execution_count": 12, "id": "fdddb515-a362-48ad-b68f-afd78ebad0f8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base_spacy_nlp = spacy.load(\"en_core_web_lg\")\n", "\n", "# add medspaCy tokenizer and sentence parser\n", "med7_nlp.tokenizer = create_medspacy_tokenizer(med7_nlp)\n", "med7_nlp.add_pipe(\"medspacy_pyrush\", before=\"ner\")\n", "\n", "# replace Med7 NER with spaCy NER\n", "med7_nlp.remove_pipe(\"ner\")\n", "med7_nlp.add_pipe(\"ner\", source = base_spacy_nlp)" ] }, { "cell_type": "code", "execution_count": 13, "id":
"17533838-8d3e-471e-8407-d6b171fb9071", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['truecaser', 'tok2vec', 'medspacy_pyrush', 'ner']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "med7_nlp.pipe_names" ] }, { "cell_type": "markdown", "id": "2875f693-4e8d-4662-b437-bfdb42e6bfc3", "metadata": {}, "source": [ "Below we create a sample of a messy, unstructured clinical note (produced by Claude 3.5 Sonnet). We will de-identify this note with our custom NER pipeline." ] }, { "cell_type": "code", "execution_count": 14, "id": "090892b2-5183-4ebe-b4a7-ebf2600dba10", "metadata": {}, "outputs": [], "source": [ "note = \"\"\"\n", "Pt: J. Smith, 45yo M\n", "CC: SOB, chest pain x 2 days\n", "Hx: HTN, T2DM, +smoker (1ppd x 20yrs)\n", "Meds: metformin, lisinopril\n", "VS: BP 145/92, HR 88, RR 20, T 37.2, SpO2 97% RA\n", "S) pt reports gradual onset SOB w/ exertion, worse w/ lying flat. denies fever, cough, leg swelling. + intermittent L sided chest pain, non-radiating, 6/10, worse w/ deep breath\n", "O) appears mildly SOB, speaking full sentences. Lungs: crackles bil bases. Heart: RRR, no m/r/g. Ext: no edema\n", "ECG: NSR, no ST changes\n", "Labs:\n", "CBC - WBC 9.2, Hgb 13.5, Plt 220\n", "BMP - Na 138, K 4.2, Cr 1.1, Glu 162\n", "Troponin neg\n", "CXR: cardiomegaly, no infiltrates\n", "A/P:\n", "\n", "Acute SOB - ?early CHF vs pneumonia\n", "\n", "start lasix 40 IV\n", "repeat CXR in AM\n", "trend troponin\n", "\n", "\n", "HTN - hold lisinopril, recheck BP in AM\n", "DM - cont home meds, check A1c\n", "Smoking - advised cessation, pt interested in patch\n", "\n", "Dispo: admit obs, f/u w/ cards\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "7ec01f63-0230-44c2-9644-ae55ddb0ae0e", "metadata": {}, "source": [ "Below we use `spaCy`'s built-in `displaCy` visualizer to display the captured entities."
] }, { "cell_type": "code", "execution_count": 15, "id": "c9452f83-0c6c-4cd5-b159-68196a4efc9f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " displaCy\n", " \n", "\n", " \n", "
\n", "
Pt: J. Smith, 45Yo M CC: Sob, chest pain X \n", "\n", " 2 days\n", " DATE\n", "\n", " Hx: Htn, T2Dm, +Smoker (1Ppd X 20Yrs) \n", "\n", " Meds: Metformin,\n", " WORK_OF_ART\n", "\n", " \n", "\n", " Lisinopril\n", " DATE\n", "\n", " vs: \n", "\n", " BP\n", " ORG\n", "\n", " 145/92, hr \n", "\n", " 88,\n", " CARDINAL\n", "\n", " Rr 20, t \n", "\n", " 37.2,\n", " CARDINAL\n", "\n", " Spo2 97% \n", "\n", " RA\n", " ORG\n", "\n", " s) PT reports gradual Onset Sob W/ exertion, worse W/ lying flat . DENIES fever, cough, leg swelling . + intermittent L sided chest pain, \n", "\n", " Non-Radiating, 6/10,\n", " PRODUCT\n", "\n", " worse W/ deep breath O) appears mildly \n", "\n", " Sob,\n", " PERSON\n", "\n", " speaking full sentences . lungs: crackles \n", "\n", " BIL\n", " ORG\n", "\n", " bases . heart: Rrr, no M/R/G . \n", "\n", " Ext:\n", " PERSON\n", "\n", " no edema Ecg: Nsr, no St changes LABS: CBC - Wbc 9.2, Hgb 13.5, PLT \n", "\n", " 220\n", " CARDINAL\n", "\n", " BMP - NA 138, K 4.2, Cr 1.1, Glu 162 Troponin Neg CXR: cardiomegaly, no Infiltrates A/P: acute \n", "\n", " Sob\n", " PERSON\n", "\n", " -? early Chf vs pneumonia start \n", "\n", " Lasix\n", " PERSON\n", "\n", " 40 IV repeat CXR in am trend \n", "\n", " Troponin Htn\n", " PERSON\n", "\n", " - hold \n", "\n", " Lisinopril,\n", " ORG\n", "\n", " recheck \n", "\n", " BP\n", " ORG\n", "\n", " in am \n", "\n", " DM - CONT\n", " ORG\n", "\n", " home Meds, check A1C smoking - advised \n", "\n", " cessation, PT\n", " ORG\n", "\n", " interested in patch \n", "\n", " Dispo:\n", " GPE\n", "\n", " admit Obs, F/U W/ cards
\n", "
\n", "\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Using the 'ent' visualizer\n", "Serving on http://0.0.0.0:5000 ...\n", "\n", "Shutting down server on port 5000.\n" ] } ], "source": [ "from spacy import displacy\n", "doc = med7_nlp(note)\n", "displacy.serve(doc, style=\"ent\")" ] }, { "cell_type": "markdown", "id": "77200312-7be3-4acc-b10e-9962ed16fa97", "metadata": {}, "source": [ "We can pass this custom pipeline to `deid_string`, similarly to how we did for `pyDeid`." ] }, { "cell_type": "code", "execution_count": 20, "id": "910fcfd8-ccc1-4754-9521-656e170bf0cb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "

Pt: \n", "\n", " J.\n", " PHI\n", "\n", " \n", "\n", " Smith\n", " NAME\n", "\n", ", 45yo M
CC: SOB, chest pain x 2 days
Hx: HTN, T2DM, +smoker (1ppd x 20yrs)
Meds: metformin, lisinopril
VS: BP 145/92, HR 88, RR 20, T 37.2, SpO2 97% RA
S) pt reports gradual onset SOB w/ exertion, worse w/ lying flat. denies fever, cough, leg swelling. + intermittent L sided chest pain, non-radiating, \n", "\n", " 6/10\n", " DATE\n", "\n", ", worse w/ deep breath
O) appears mildly S\n", "\n", " OB, \n", " NAME\n", "\n", "speaking full sentences. Lungs: crackles bil bases. Heart: RRR, no m/r/g. Ext:\n", "\n", " no \n", " NAME\n", "\n", "edema
ECG: NSR, no ST changes
Labs:
CBC - WBC 9.2, Hgb 13.5, Plt 220
BMP - Na 138, K 4.2, Cr 1.1, Glu 162
Troponin neg
CXR: cardiomegaly, no infiltrates
A/P:

Acute SOB\n", "\n", " - \n", " NAME\n", "\n", "?early CHF vs pneumonia

start la\n", "\n", " six 4\n", " NAME\n", "\n", "0 IV
repeat CXR in AM
trend tr\n", "\n", " oponin\n", "\n", "\n", " NAME\n", "\n", "
\n", "\n", " HTN\n", " NAME\n", "\n", " - hold lisinopril, recheck BP in AM
DM - cont home meds, check A1c
Smoking - advised cessation, pt interested in patch

Dispo: admit obs, f/u w/ cards
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from pyDeid import deid_string, display_deid\n", "\n", "surrogates, new_note = deid_string(note, ner_pipeline = med7_nlp)\n", "\n", "display_deid(note, surrogates)" ] }, { "cell_type": "markdown", "id": "a63c4917-f110-48c5-bf8a-22e6cbe57774", "metadata": {}, "source": [ "The output above contains several false positives, for example `HTN` and fragments of `lasix` and `troponin` tagged as `NAME`. Some highlighted spans are also shifted relative to the original text (`six 4` within `lasix 40`), which suggests that entity offsets computed on the truecased `Doc` may not line up with the original note.\n", "\n", "Fine-tuning this NER pipeline may reduce these false positives, and is left for future work." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }