Getting Started With pyDeid¶
Follow the installation instructions in the README.md before running this notebook.
Running a basic example¶
If you are running this notebook locally without a virtual environment, first run the block of code below (the value of '/path/from/pip/show/pydeid' can be found with !pip show pydeid):
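A minimal sketch of such a block, assuming the goal is to make the installed package importable by adding its install location to sys.path. The path is the placeholder from the text above and must be replaced with the "Location" value that pip show prints; the variable name pydeid_location is chosen here for illustration only.

```python
import sys

# Placeholder: replace with the "Location" value printed by `pip show pydeid`.
pydeid_location = '/path/from/pip/show/pydeid'

# Make the package importable from this notebook session (assumed intent).
if pydeid_location not in sys.path:
    sys.path.append(pydeid_location)
```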
For this demo, import the following functions from pyDeid. deid_string(), reid_string(), and display_deid() are useful shortcut functions for exploring pyDeid features and for testing and debugging. They may also help to investigate errors that occur during the bulk de-identification process described later in the tutorial.
[3]:
from pyDeid import pyDeid, deid_string, reid_string, display_deid
Test out the installation using the following example:
[4]:
original_string = 'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'
phi, new_string = deid_string(original_string)
deid_string() takes a string as input and returns new_string, in which the PHI found in the original string is replaced with surrogates, as well as phi, a list of information about the PHI that was found:
[5]:
phi
[5]:
[{'phi_start': 0,
'phi_end': 6,
'phi': 'Elijah',
'surrogate_start': 0,
'surrogate_end': 4,
'surrogate': 'Mary',
'types': ['Male First Name (un)',
'Last Name (un)',
'First Name (followed by last name)']},
{'phi_start': 7,
'phi_end': 11,
'phi': 'Wood',
'surrogate_start': 5,
'surrogate_end': 12,
'surrogate': 'Edwards',
'types': ['Last Name (ambig)', 'Last Name (follows first name)']},
{'phi_start': 58,
'phi_end': 75,
'phi': Date(date_string='December 10, 2001', day='10', month='December', year='2001'),
'surrogate_start': 59,
'surrogate_end': 71,
'surrogate': '2006-6th-Aug',
'types': ['Month Day Year [Month-dd-yy(yy)]',
'Month Day Year [Month dd, yy(yy)]',
'Month Day Year [Month dd, yy(yy)]']}]
[6]:
new_string
[6]:
'Mary Edwards starred in The Lord of the Rings, released on 2006-6th-Aug'
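The offsets in the phi list can be checked by hand with plain string slicing. The sketch below does not call pyDeid; it copies the two strings and the offsets from the output above, so it only illustrates how phi_start/phi_end index into the original string and surrogate_start/surrogate_end index into the de-identified string.

```python
original_string = 'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'
new_string = 'Mary Edwards starred in The Lord of the Rings, released on 2006-6th-Aug'

# (phi_start, phi_end, surrogate_start, surrogate_end, surrogate),
# copied from the phi output shown above.
spans = [
    (0, 6, 0, 4, 'Mary'),
    (7, 11, 5, 12, 'Edwards'),
    (58, 75, 59, 71, '2006-6th-Aug'),
]

found = []
for phi_start, phi_end, sur_start, sur_end, surrogate in spans:
    # PHI offsets index into the original string.
    original_text = original_string[phi_start:phi_end]
    # Surrogate offsets index into the de-identified string.
    assert new_string[sur_start:sur_end] == surrogate
    found.append((original_text, surrogate))

print(found)
```

The surrogate offsets for the date are shifted by one character relative to the PHI offsets because "Mary Edwards" is one character longer than "Elijah Wood".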
pyDeid Features¶
display_deid() visualizes the de-identification in interactive settings such as Jupyter notebooks. This can be useful for demonstration and debugging:
[7]:
display_deid(original_string, phi)
We can also re-identify the de-identified string to recover the original string:
[8]:
reid_string(new_string, phi)
[8]:
'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'
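Conceptually, re-identification can be pictured as swapping each surrogate span back to the text it replaced. The sketch below illustrates that idea only; it is not pyDeid's implementation. The function reid_sketch is a hypothetical helper, and the records are simplified copies of the phi output above, with the date stored as a plain string rather than a Date object.

```python
def reid_sketch(new_string, phi):
    """Illustrative only: swap each surrogate span back to its original text."""
    restored = new_string
    # Work right to left so earlier offsets stay valid after each replacement.
    for entry in sorted(phi, key=lambda e: e['surrogate_start'], reverse=True):
        start, end = entry['surrogate_start'], entry['surrogate_end']
        restored = restored[:start] + entry['phi'] + restored[end:]
    return restored


# Simplified records copied from the output above (assumption: plain strings only).
phi_records = [
    {'phi': 'Elijah', 'surrogate_start': 0, 'surrogate_end': 4},
    {'phi': 'Wood', 'surrogate_start': 5, 'surrogate_end': 12},
    {'phi': 'December 10, 2001', 'surrogate_start': 59, 'surrogate_end': 71},
]

restored = reid_sketch(
    'Mary Edwards starred in The Lord of the Rings, released on 2006-6th-Aug',
    phi_records,
)
print(restored)
```

Processing the spans from the end of the string backwards avoids having to recompute offsets when a surrogate and its original text differ in length.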
Some contexts may have known custom patterns for PHI that we’d also like to de-identify. We can do this by supplying a custom regular expression as a named argument to deid_string() as follows:
[9]:
original_string = 'Elves use 1 yén to refer to 144 of our years'
phi, new_string = deid_string(original_string, solar_years=r'\d\syén')
Supplied custom regexes through **kwargs (see custom_regexes in docstring):
- solar_years : \d\syén
These custom patterns will be replaced with <PHI>.
[10]:
new_string
[10]:
'Elves use <PHI> to refer to 144 of our years'
Note that the name of the argument that was supplied is used to identify the PHI type:
[11]:
phi
[11]:
[{'phi_start': 10,
'phi_end': 15,
'phi': '1 yén',
'surrogate_start': 10,
'surrogate_end': 15,
'surrogate': '<PHI>',
'types': ['solar_years']}]
Currently, matches of these “custom” regular expressions are replaced with <PHI> placeholders, but in the future the user will be able to supply a custom replacement string generator function.
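To preview what a custom pattern will match before passing it to deid_string(), it can be tried with Python's built-in re module. This is a sketch under the assumption that pyDeid applies the pattern as an ordinary Python regular expression without extra flags; it does not call pyDeid itself.

```python
import re

original_string = 'Elves use 1 yén to refer to 144 of our years'
# Raw string: one digit, one whitespace character, then "yén".
pattern = r'\d\syén'

# Collect (start, end, text) for every match of the pattern.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, original_string)]
# Preview the placeholder substitution described above.
replaced = re.sub(pattern, '<PHI>', original_string)

print(matches)
print(replaced)
```

Note that \d matches a single digit, so in a string such as '12 yén' only '2 yén' would match; a pattern like r'\d+\syén' would cover multi-digit values.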
We can also run a spaCy named entity recognition pass on the string to identify any missed names.
See how rare names are treated without named entity recognition:
[12]:
original_string = 'Frodo Baggins was born in Middle Earth.'
phi, new_string = deid_string(original_string)
display_deid(original_string, phi)
And with named entity recognition:
[13]:
import spacy
nlp = spacy.load("en_core_web_lg")
[14]:
phi, new_string = deid_string(
original_string,
ner_pipeline=nlp
)
display_deid(original_string, phi)
However, if we have access to patient and doctor names, we will not need to use named entity recognition in our workflow.
We can supply sets of:
Patient first names (through custom_patient_first_names)
Patient last names (through custom_patient_last_names)
Doctor first names (through custom_dr_first_names)
Doctor last names (through custom_dr_last_names)
See the example usage below:
[15]:
phi, new_string = deid_string(
original_string,
custom_patient_first_names={'Frodo'},
custom_patient_last_names={'Baggins'}
)
display_deid(original_string, phi)
Note that these custom namelists are supplied as Python sets for fast lookup.
Details for all of the above options are available in the function docstring:
[ ]:
deid_string?
Bulk De-identification¶
Many workflows require bulk de-identification of large CSVs containing notes with PHI. For this purpose we use pyDeid().
The most basic usage of the function only requires the user to supply the name of the file to be de-identified (original_file), the name of the column containing a unique identifier for the encounter (encounter_id_varname; in our applications this will generally be the genc_id), and the name of the column containing the note text to be de-identified (note_varname).
Note that if the original_file to de-identify contains multiple notes per encounter, an encounter_id_varname is not enough to uniquely identify a note. In these cases, please supply a note_id_varname in addition to the encounter_id_varname.
[17]:
pyDeid(
original_file='../../tests/test.csv',
encounter_id_varname='genc_id',
note_id_varname='note_id',
note_varname='note_text'
)
Processing encounter 3, note Record 3: : 3it [00:01, 2.71it/s]
Diagnostics:
- chars/s = 147.39932995948666
- s/note = 0.36861316363016766
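For reference, the sketch below writes a small CSV with the three columns named in the call above, using only the standard library. The rows, the file name example_notes.csv, and the note ID format are invented for illustration and are not the contents of the test file used in this tutorial.

```python
import csv
import os
import tempfile

# Hypothetical rows, invented for illustration; column names follow the call above.
rows = [
    {'genc_id': 1, 'note_id': 'Record 1', 'note_text': 'Frodo Baggins was admitted on December 10, 2001.'},
    {'genc_id': 1, 'note_id': 'Record 2', 'note_text': 'Follow-up note for the same encounter.'},
    {'genc_id': 2, 'note_id': 'Record 1', 'note_text': 'A note from a different encounter.'},
]

# Write the rows to a temporary CSV file.
path = os.path.join(tempfile.mkdtemp(), 'example_notes.csv')
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['genc_id', 'note_id', 'note_text'])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to inspect its structure.
with open(path, newline='', encoding='utf-8') as f:
    loaded = list(csv.DictReader(f))

# genc_id alone repeats, but (genc_id, note_id) pairs identify each note uniquely.
keys = [(row['genc_id'], row['note_id']) for row in loaded]
print(keys)
```

Encounter 1 has two notes here, which is the situation in which a note_id_varname is needed alongside the encounter_id_varname.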
Note that pyDeid accepts many other arguments, which can be seen in the function docstring:
[ ]:
pyDeid?
Putting It All Together¶
Note that just as in deid_string(), custom regular expressions can be supplied with named arguments through **custom_regexes, named entity recognition can be used through ner_pipeline, and custom patient and doctor names can be supplied through custom_{dr/patient}_{first/last}_names.
In the example below, we’ll combine all of pyDeid’s features:
Custom namelists for patient first and last names.
Named entity recognition with spaCy.
Custom regular expressions.
By default, phi_output_file is saved as a csv. In this example, however, we will output to json.
We will also specify a custom output filename for the de-identified file through new_file, and for the found PHI details through phi_output_file. If these names are not specified, they will default to {original filename without extension}__DEID.csv and {original filename without extension}__PHI.csv respectively.
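The default naming pattern described above can be sketched as follows. The helper default_output_names is hypothetical and written only to illustrate the pattern; in particular, keeping the output in the same directory as the input is an assumption, not something confirmed by the text.

```python
import os


def default_output_names(original_file):
    """Illustrative only: build the default names described in the text above."""
    # Strip the extension; the directory part is kept (an assumption).
    stem, _extension = os.path.splitext(original_file)
    return stem + '__DEID.csv', stem + '__PHI.csv'


deid_name, phi_name = default_output_names('../../tests/test.csv')
print(deid_name)
print(phi_name)
```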
First we read the “MLLs” into Python sets.
[19]:
import pandas as pd
MLL_filepath='../../tests/namelist.csv'
MLL = pd.read_csv(MLL_filepath)
MLL.head()
[19]:
| first_name | last_name |
---|---|---|
0 | Frodo | Baggins |
1 | Samwise | NaN |
[20]:
# drop missing values (NaN) so that only name strings end up in the sets
first_names = set(MLL.first_name.dropna())
last_names = set(MLL.last_name.dropna())
[21]:
pyDeid(
# specify the name and format of the input file
original_file='../../tests/basic_usage_tutorial.csv',
encounter_id_varname='id',
note_id_varname='note_id',
note_varname='text',
# specify the name of the de-identified result file
new_file='../../tests/test_deid',
# specify the name and format of the found PHI output file
phi_output_file='../../tests/phi.json',
phi_output_file_type='json',
# use custom namelists since they are available
custom_patient_first_names=first_names,
custom_patient_last_names=last_names,
# pass a spaCy NER pipeline to catch names outside our namelist
ner_pipeline=nlp,
# use custom regex for elven year format "yén"
solar_years=r"\d\syén"
)
Supplied custom regexes through **kwargs (see custom_regexes in docstring):
- solar_years : \d\syén
These custom patterns will be replaced with <PHI>.
Processing encounter 3, note Record 1: : 4it [00:01, 3.47it/s]
Diagnostics:
- chars/s = 139.8456593424458
- s/note = 0.2878172993659973