Getting Started With pyDeid

Follow the installation instructions in the README.md before running this notebook.

Running a basic example

If you are running this notebook locally without a virtual environment, run the block of code below (the value for '/path/from/pip/show/pydeid' is the Location reported by !pip show pydeid):

!pip show pydeid
import sys
sys.path.insert(0, '/path/from/pip/show/pydeid')

For this demo, import the following functions from pyDeid.

deid_string(), reid_string(), and display_deid() are shortcut functions for exploring pyDeid features, testing, and debugging. They can also help investigate errors that occur during the bulk de-identification process described later in the tutorial.

[3]:
from pyDeid import pyDeid, deid_string, reid_string, display_deid

Test out the installation using the following example:

[4]:
original_string = 'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'
phi, new_string = deid_string(original_string)

deid_string() takes a string as input and returns two values: phi, a list with details about each piece of PHI found, and new_string, a copy of the original string with the PHI replaced by surrogates:

[5]:
phi
[5]:
[{'phi_start': 0,
  'phi_end': 6,
  'phi': 'Elijah',
  'surrogate_start': 0,
  'surrogate_end': 4,
  'surrogate': 'Mary',
  'types': ['Male First Name (un)',
   'Last Name (un)',
   'First Name (followed by last name)']},
 {'phi_start': 7,
  'phi_end': 11,
  'phi': 'Wood',
  'surrogate_start': 5,
  'surrogate_end': 12,
  'surrogate': 'Edwards',
  'types': ['Last Name (ambig)', 'Last Name (follows first name)']},
 {'phi_start': 58,
  'phi_end': 75,
  'phi': Date(date_string='December 10, 2001', day='10', month='December', year='2001'),
  'surrogate_start': 59,
  'surrogate_end': 71,
  'surrogate': '2006-6th-Aug',
  'types': ['Month Day Year [Month-dd-yy(yy)]',
   'Month Day Year [Month dd, yy(yy)]',
   'Month Day Year [Month dd, yy(yy)]']}]
[6]:
new_string
[6]:
'Mary Edwards starred in The Lord of the Rings, released on 2006-6th-Aug'
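The two pairs of offsets in each phi entry index into different strings: phi_start and phi_end point into the original string, while surrogate_start and surrogate_end point into the de-identified string. The sketch below checks this using a hand-copied, simplified version of the output above. The name phi_example is illustrative, the 'types' field is omitted, and the date entry is written as a plain string, whereas pyDeid returns a Date object there.

```python
original_string = 'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'
new_string = 'Mary Edwards starred in The Lord of the Rings, released on 2006-6th-Aug'

# Hand-copied subset of the phi list shown above (assumption: the date
# entry is simplified to a plain string; pyDeid returns a Date object).
phi_example = [
    {'phi_start': 0, 'phi_end': 6, 'phi': 'Elijah',
     'surrogate_start': 0, 'surrogate_end': 4, 'surrogate': 'Mary'},
    {'phi_start': 7, 'phi_end': 11, 'phi': 'Wood',
     'surrogate_start': 5, 'surrogate_end': 12, 'surrogate': 'Edwards'},
    {'phi_start': 58, 'phi_end': 75, 'phi': 'December 10, 2001',
     'surrogate_start': 59, 'surrogate_end': 71, 'surrogate': '2006-6th-Aug'},
]

for entry in phi_example:
    # phi offsets slice the original string
    found = original_string[entry['phi_start']:entry['phi_end']]
    # surrogate offsets slice the de-identified string
    replaced = new_string[entry['surrogate_start']:entry['surrogate_end']]
    assert found == entry['phi']
    assert replaced == entry['surrogate']
    print(f'{found!r} -> {replaced!r}')
```

The surrogate offsets drift away from the phi offsets as soon as a surrogate has a different length than the text it replaces, which is why both pairs are recorded.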

pyDeid Features

display_deid() allows for visualization of the de-identification in interactive settings such as in Jupyter notebooks. This can be useful for demonstration and debugging:

[7]:
display_deid(original_string, phi)
Elijah NAME Wood NAME starred in The Lord of the Rings, released on December 10, 2001 DATE

We can also re-identify the de-identified string to recover the original:

[8]:
reid_string(new_string, phi)
[8]:
'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'
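To see why the phi list is enough to undo the de-identification, here is a minimal sketch of one way the reversal could be done by hand. It is not pyDeid's implementation: the function reidentify and the list phi_example are hypothetical, and the date entry is again simplified to a plain string.

```python
def reidentify(deid_text, phi_entries):
    """Swap each surrogate span back to its original text (illustrative only)."""
    restored = deid_text
    # Work from right to left so that earlier offsets remain valid
    # after each replacement changes the length of the string.
    ordered = sorted(phi_entries, key=lambda e: e['surrogate_start'], reverse=True)
    for entry in ordered:
        start, end = entry['surrogate_start'], entry['surrogate_end']
        restored = restored[:start] + entry['phi'] + restored[end:]
    return restored


deid_text = 'Mary Edwards starred in The Lord of the Rings, released on 2006-6th-Aug'

# Simplified copy of the phi list from the first example (assumption:
# plain strings only; pyDeid stores dates as Date objects).
phi_example = [
    {'phi': 'Elijah', 'surrogate_start': 0, 'surrogate_end': 4},
    {'phi': 'Wood', 'surrogate_start': 5, 'surrogate_end': 12},
    {'phi': 'December 10, 2001', 'surrogate_start': 59, 'surrogate_end': 71},
]

restored = reidentify(deid_text, phi_example)
print(restored)
```

In practice reid_string() should be preferred, since it works directly with the phi list that deid_string() returns.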

Some contexts may have known custom patterns for PHI that we’d like to also de-identify.

We can do this by supplying a custom regular expression as a named argument to deid_string() as follows:

[9]:
original_string = 'Elves use 1 yén to refer to 144 of our years'

phi, new_string = deid_string(original_string, solar_years=r'\d\syén')
Supplied custom regexes through **kwargs (see custom_regexes in docstring):

- solar_years : \d\syén

These custom patterns will be replaced with <PHI>.

[10]:
new_string
[10]:
'Elves use <PHI> to refer to 144 of our years'

Note that the name of the argument that was supplied is used to identify the PHI type:

[11]:
phi
[11]:
[{'phi_start': 10,
  'phi_end': 15,
  'phi': '1 yén',
  'surrogate_start': 10,
  'surrogate_end': 15,
  'surrogate': '<PHI>',
  'types': ['solar_years']}]

Currently these “custom” regular expressions are replaced with <PHI> placeholders, but in the future the user will be able to supply a custom replacement string generator function.
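It can help to check a custom pattern with Python's built-in re module before handing it to deid_string(). The sketch below uses the same pattern and sentence as the example above. The re.sub() call is only a stand-in for the placeholder replacement and makes no claim about how pyDeid applies the pattern internally.

```python
import re

# Same pattern as in the example, written as a raw string so the
# backslashes reach the regex engine unchanged.
solar_years = r'\d\syén'
text = 'Elves use 1 yén to refer to 144 of our years'

match = re.search(solar_years, text)
print(match.span(), match.group())  # (10, 15) 1 yén

# Stand-in for the <PHI> placeholder replacement.
redacted = re.sub(solar_years, '<PHI>', text)
print(redacted)  # Elves use <PHI> to refer to 144 of our years
```

The span (10, 15) agrees with the phi_start and phi_end values reported in the phi list above.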

We also have the ability to use a spaCy named entity recognition pass on the string to identify any missed names.

See how rare names are treated without named entity recognition:

[12]:
original_string = 'Frodo Baggins was born in Middle Earth.'

phi, new_string = deid_string(original_string)

display_deid(original_string, phi)
Frodo Baggins was born in Middle Earth.

And with named entity recognition:

[13]:
import spacy

nlp = spacy.load("en_core_web_lg")
[14]:
phi, new_string = deid_string(
    original_string,
    ner_pipeline=nlp
)

display_deid(original_string, phi)
Frodo NAME Baggins NAME was born in Middle Earth.

However, if we have access to patient and doctor names, we will not need to use named entity recognition in our workflow.

We can supply a list of:

  1. Patient first names (through custom_patient_first_names)

  2. Patient last names (through custom_patient_last_names)

  3. Doctor first names (through custom_dr_first_names)

  4. Doctor last names (through custom_dr_last_names)

See the example usage below:

[15]:
phi, new_string = deid_string(
    original_string,
    custom_patient_first_names={'Frodo'},
    custom_patient_last_names={'Baggins'}
)

display_deid(original_string, phi)
Frodo NAME Baggins NAME was born in Middle Earth.

Note that these custom namelists are supplied as Python sets for fast lookup.
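If the names start out in a list, for instance one read from a file, they can be converted to a set first. A minimal sketch, assuming a hypothetical raw_first_names list that contains duplicates and missing entries:

```python
# Hypothetical raw input with a duplicate, a missing value, and a blank entry.
raw_first_names = ['Frodo', 'Samwise', 'Frodo', None, '  ']

# Keep only non-empty names, stripped of surrounding whitespace.
first_names = {
    name.strip()
    for name in raw_first_names
    if name and name.strip()
}

print(sorted(first_names))       # ['Frodo', 'Samwise']
print('Frodo' in first_names)    # True
```

Membership tests on a set take roughly constant time regardless of its size, which matters when a namelist holds many thousands of names.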

Details for all of the above options are available through the function docstring:

[ ]:
deid_string?

Bulk De-identification

Many workflows require bulk de-identification of large CSVs containing notes with PHI. For this purpose we use pyDeid().

The most basic usage of the function only requires the user to supply the name of the file to be de-identified (original_file), the name of the column containing a unique identifier for the encounter (encounter_id_varname; in our applications this will generally be the genc_id), and the name of the column containing the note text to be de-identified (note_varname).

Note that if the original_file to de-identify contains multiple notes per encounter, an encounter_id_varname is not enough to uniquely identify a note. In these cases, please supply a note_id_varname in addition to the encounter_id_varname.
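One way to find out whether a note_id_varname is needed is to count the notes per encounter before running pyDeid(). The sketch below does this with the standard library on a small in-memory CSV. The sample rows are made up, and the column names genc_id, note_id, and note_text follow the example in this tutorial.

```python
import csv
import io
from collections import Counter

# Made-up sample data; in practice, open the real file instead.
sample_csv = io.StringIO(
    'genc_id,note_id,note_text\n'
    '1,Record 1,First note for encounter 1\n'
    '1,Record 2,Second note for encounter 1\n'
    '2,Record 1,Only note for encounter 2\n'
)
rows = list(csv.DictReader(sample_csv))

# If any encounter has more than one note, the encounter ID alone
# does not identify a note.
notes_per_encounter = Counter(row['genc_id'] for row in rows)
needs_note_id = any(count > 1 for count in notes_per_encounter.values())

# Check that encounter ID plus note ID is unique for every row.
pair_counts = Counter((row['genc_id'], row['note_id']) for row in rows)
pairs_unique = all(count == 1 for count in pair_counts.values())

print(needs_note_id, pairs_unique)  # True True
```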

[17]:
pyDeid(
    original_file='../../tests/test.csv',
    encounter_id_varname='genc_id',
    note_id_varname='note_id',
    note_varname='note_text'
)
Processing encounter 3, note Record 3: : 3it [00:01,  2.71it/s]
Diagnostics:
                - chars/s = 147.39932995948666
                - s/note = 0.36861316363016766

Note that pyDeid accepts many other arguments, which can be seen in the function docstring:

[ ]:
pyDeid?

Putting It All Together

Note that just as in deid_string(), custom regular expressions can be supplied with named arguments through **custom_regexes, named entity recognition can be used through ner_pipeline, and custom patient and doctor names can be supplied through custom_{dr/patient}_{first/last}_names.

In the example below, we’ll combine all of pyDeid’s features:

  1. Custom namelists for patient first and last names.

  2. Named entity recognition with spaCy.

  3. Custom regular expressions.

By default, phi_output_file is saved as a CSV. In this example, however, we will output to JSON.

We will also specify a custom output filename for the de-identified file through new_file, and for the found PHI details through phi_output_file. If these names are not specified, they will default to {original filename without extension}__DEID.csv and {original filename without extension}__PHI.csv respectively.
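The default naming rule can be written out as a short helper. This is a sketch of the rule as described here, not code taken from pyDeid: the function default_output_names is hypothetical, and keeping the outputs in the same directory as the input file is an assumption.

```python
import os


def default_output_names(original_file):
    """Build the default output names described above (hypothetical helper)."""
    # Assumption: the directory part of the path is kept as-is.
    stem, _extension = os.path.splitext(original_file)
    return f'{stem}__DEID.csv', f'{stem}__PHI.csv'


deid_name, phi_name = default_output_names('../../tests/test.csv')
print(deid_name)  # ../../tests/test__DEID.csv
print(phi_name)   # ../../tests/test__PHI.csv
```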

First we read the “MLLs” into Python sets.

[19]:
import pandas as pd

MLL_filepath='../../tests/namelist.csv'
MLL = pd.read_csv(MLL_filepath)

MLL.head()
[19]:
  first_name last_name
0      Frodo   Baggins
1    Samwise       NaN
[20]:
# drop missing values (such as the NaN last name above) before building the sets
first_names = set(MLL.first_name.dropna())
last_names = set(MLL.last_name.dropna())
[21]:
pyDeid(
    # specify the name and format of the input file
    original_file='../../tests/basic_usage_tutorial.csv',
    encounter_id_varname='id',
    note_id_varname='note_id',
    note_varname='text',

    # specify the name of the de-identified result file
    new_file='../../tests/test_deid',

    # specify the name and format of the found PHI output file
    phi_output_file='../../tests/phi.json',
    phi_output_file_type='json',

    # use custom namelists since they are available
    custom_patient_first_names=first_names,
    custom_patient_last_names=last_names,

    # pass a spaCy NER pipeline to catch names outside our namelist
    ner_pipeline=nlp,

    # use custom regex for elven year format "yén"
    solar_years=r"\d\syén"
)
Supplied custom regexes through **kwargs (see custom_regexes in docstring):

- solar_years : \d\syén

These custom patterns will be replaced with <PHI>.

Processing encounter 3, note Record 1: : 4it [00:01,  3.47it/s]
Diagnostics:
                - chars/s = 139.8456593424458
                - s/note = 0.2878172993659973