PyDeid
See the Read the Docs documentation for extensive examples and a detailed feature breakdown.
Note that pyDeid was tested with Python 3.10.4.
1 Objective
PyDeid is a Python-based refactor of the Physionet Perl-based De-identification Software, which uses regular expressions and lookup dictionaries to identify and replace PHI in free text.
The purpose was to create a faster and easier-to-use tool to de-identify large volumes of (large) clinical notes. PyDeid is up to 5.4x faster than the Perl-based software, and requires no pre- or post-processing. Additional enhancements include the ability to re-identify text, supply custom patterns, supply custom doctor and patient namelists without having to save them to persistent storage, and incorporate spaCy NER pipelines.
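To make the regular-expression-plus-dictionary idea concrete, here is a minimal sketch using only the standard library. It is not PyDeid's implementation: the name list, the date pattern, and the helper names `find_phi` and `redact` are all hypothetical, and it substitutes bracketed labels where PyDeid produces replacement values.

```python
import re

# Hypothetical lookup dictionary of known first names (illustrative only).
KNOWN_FIRST_NAMES = {"elijah", "frodo"}

# Hypothetical pattern for dates written like "10 December 2001".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def find_phi(text):
    """Return sorted (start, end, label) spans for dates and dictionary names."""
    spans = [(m.start(), m.end(), "DATE") for m in DATE_PATTERN.finditer(text)]
    for m in re.finditer(r"\b[A-Za-z]+\b", text):
        if m.group().lower() in KNOWN_FIRST_NAMES:
            spans.append((m.start(), m.end(), "NAME"))
    return sorted(spans)


def redact(text):
    """Replace each span with a bracketed label, working right to left
    so that earlier offsets stay valid."""
    for start, end, label in reversed(find_phi(text)):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


print(redact("Elijah Wood starred in The Lord of the Rings, released on 10 December 2001."))
```

The dictionary catches tokens that no pattern could describe (names), while the regular expression catches structured identifiers (dates) that no finite list could enumerate.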
2 Local Installation and Setup
If you have downloaded (and unzipped) this package to a local folder /path/to/package/, simply install via
pip3 install --user /path/to/package/
Note that if you are connected through VPN to your institution's network, you may need to run the command with the following options:
pip3 install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org --user /path/to/package/
Dependencies can be installed with
pip3 install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org -r /path/to/requirements.txt
Note: You may encounter a deprecation warning stating that a future version of pip will build local packages in place, without first copying them to a temporary directory. pyDeid has been tested with this change, and the installation will continue to work when this behaviour becomes the default.
To import from this package, you may need to add the install location (found via pip show pyDeid) to your $PYTHONPATH, or append it to sys.path at runtime, so that Python knows where to look for it:
import sys
sys.path.append('/path/from/pip_show_pydeid/')
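If you would rather not modify sys.path unconditionally, one option is to check first whether the package is already importable. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `ensure_on_path` is hypothetical and not part of pyDeid.

```python
import importlib.util
import sys


def ensure_on_path(module_name, install_location):
    """Append install_location to sys.path only if module_name is not importable.

    Returns True if sys.path handling was needed, False otherwise.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False  # already importable; leave sys.path untouched
    if install_location not in sys.path:
        sys.path.append(install_location)
    return True


# '/path/from/pip_show_pydeid/' is the placeholder used above; substitute the
# "Location" value reported by `pip show pyDeid`.
ensure_on_path("pyDeid", "/path/from/pip_show_pydeid/")
```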
Test your installation by running the following:
from pyDeid import deid_string
deid_string('Elijah Wood starred in The Lord of the Rings, released on 10 December 2001.')
3 CSV File De-identification
pyDeid only requires that the free text data be stored in a CSV file with named column headers.
Specifying a column of encounter IDs (and optionally note IDs) with encounter_id_varname and note_id_varname respectively is relevant only for assigning the found PHI to a particular encounter and note identifier for further analysis.
There may be multiple columns containing free text, but PyDeid will only de-identify a single column at a time. To de-identify multiple columns of text, make multiple passes over the same file with pyDeid, one per column.
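The multi-pass approach can be sketched as a small loop. This is an illustration under stated assumptions: the helper `deidentify_columns`, the column names, and the per-column output naming scheme are hypothetical, and each pass here reads the same original file and writes its own output file. The de-identification function is passed in as `run_pass` so the sketch runs without pyDeid installed; in practice you would pass the pyDeid function itself.

```python
def deidentify_columns(run_pass, original_file, note_columns, **extra):
    """Call run_pass once per free-text column and return the output file names.

    run_pass is expected to accept original_file, note_varname and new_file
    keyword arguments; any extra keyword arguments are forwarded unchanged.
    """
    outputs = []
    for column in note_columns:
        new_file = f"{column}__DE-IDENTIFIED.csv"  # hypothetical naming scheme
        run_pass(
            original_file=original_file,
            note_varname=column,
            new_file=new_file,
            **extra,
        )
        outputs.append(new_file)
    return outputs


# Stand-in for the real function so the sketch is runnable on its own.
calls = []


def fake_pass(**kwargs):
    calls.append(kwargs)


outputs = deidentify_columns(
    fake_pass, "notes.csv", ["admission_note", "discharge_note"]
)
print(outputs)
```

With pyDeid installed, the call would look like `deidentify_columns(pyDeid, "notes.csv", [...], encounter_id_varname="genc_id")`, where the file and column names are placeholders for your own data.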
Consider a test.csv that looks like this (and is provided under the tests/ directory of the package):
genc_id,note_id,note_text
1,Record 1,"Elijah Wood starred in The Lord of the Rings, released on 10 December 2001."
2,Record 2,"St.Michael's hospital is located at 30 Bond St, Toronto, ON, M5B 1W8
"
3,Record 3,Test MRN: 011-0111
Note that each note (in the note_text field) is uniquely identified by a genc_id and a note_id.
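Before running a de-identification pass, it can be useful to confirm that the CSV parses as expected and that the column names you intend to pass are present in the header. The following sketch does this with the standard `csv` module on the sample above; the helper names `missing_columns` and `read_rows` are hypothetical and not part of pyDeid.

```python
import csv
import io

# The sample file from above, including the quoted field that spans two lines.
SAMPLE_CSV = '''genc_id,note_id,note_text
1,Record 1,"Elijah Wood starred in The Lord of the Rings, released on 10 December 2001."
2,Record 2,"St.Michael's hospital is located at 30 Bond St, Toronto, ON, M5B 1W8
"
3,Record 3,Test MRN: 011-0111
'''


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


def read_rows(csv_text):
    """Parse the CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text, newline="")))


rows = read_rows(SAMPLE_CSV)
print(missing_columns(SAMPLE_CSV, ["genc_id", "note_id", "note_text"]))
print(len(rows))
```

Note that the second record's note_text contains an embedded newline inside the quotes, so the file has more physical lines than records; a CSV-aware parser still reads it as a single row.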
To de-identify this file with default settings, simply supply:
- The filename as original_file
- The name of the column containing the note as note_varname
- The name of the encounter identifying column as encounter_id_varname
- And the name of the note (within encounter) identifying column as note_id_varname
from pyDeid import pyDeid
pyDeid(
    original_file = 'test.csv',
    note_varname = 'note_text',
    encounter_id_varname = 'genc_id',
    note_id_varname = 'note_id'
)
4 Additional Arguments
Additional settings are described in the API documentation, and covered extensively in the tutorial notebooks.
Some arguments are described below:
- A file name for the de-identified CSV can be supplied with new_file, or will take the form <original file name>__DE-IDENTIFIED.csv.
- The PHI output can be stored as a json or csv file (specified by the phi_output_file_type parameter), and optionally the file name can be provided using the phi_output_file parameter. The json data structure is more efficient on space, but is much heavier on memory. The csv data structure is inefficient on space but much lighter on memory.
- There is a verbose mode which displays a progress bar and prints speed diagnostics at the end of the run.
One of the main advantages of PyDeid over Physionet De-identification Software v1.1 is the ability to supply a set of doctor and patient first and last names to be recognized as PHI, without having to write this sensitive information to persistent storage. These custom blacklists can be read into a Python set and supplied to PyDeid through the custom_dr_first_names, custom_dr_last_names, custom_patient_first_names, and custom_patient_last_names parameters for better recall and precision.
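As a rough illustration of building such a set in memory, the sketch below reads names from any iterable of lines (for example, a database cursor or a network response) without touching the disk. The helper `load_names` and the example names are hypothetical; whether names need further normalization (such as case folding) before being passed to PyDeid is not covered here.

```python
import io


def load_names(lines):
    """Build a set of names from an iterable of lines, skipping blank lines
    and trimming surrounding whitespace."""
    return {line.strip() for line in lines if line.strip()}


# io.StringIO stands in for a secure in-memory source of names.
dr_last_names = load_names(io.StringIO("Baggins\nGamgee\n\nBaggins\n"))
print(sorted(dr_last_names))
```

The resulting set could then be supplied through the parameters named above, for example `custom_dr_last_names = dr_last_names`.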
Additionally, PyDeid allows custom regexes for site-specific PHI to be supplied through keyword arguments. Simply supply a named regex argument to PyDeid or deid_string like so:
deid_string(
    'The site-specific identifier at your hospital is NH12345',
    site_identifier = r'NH\d{5}'
)
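It may help to try a custom pattern against sample text with the standard `re` module before passing it in. The sketch below is independent of pyDeid; the helper `check_pattern` is hypothetical, and the word-boundary variant is only a suggestion for avoiding partial matches inside longer identifiers.

```python
import re


def check_pattern(pattern, sample):
    """Compile pattern and return every match found in sample."""
    return [m.group() for m in re.compile(pattern).finditer(sample)]


sample = "The site-specific identifier at your hospital is NH12345"
print(check_pattern(r"NH\d{5}", sample))

# Without word boundaries, a longer token is partially matched.
print(check_pattern(r"NH\d{5}", "Reference NH123456"))
# With word boundaries, the longer token is not matched at all.
print(check_pattern(r"\bNH\d{5}\b", "Reference NH123456"))
```

Writing the pattern as a raw string (the `r` prefix) keeps Python from interpreting `\d` as a string escape.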
5 Additional Functions
deid_string is a simplified function to de-identify a single string at a time, which can be used for debugging or written into a custom wrapper if PyDeid fails to meet the requirements of the problem. It can optionally be combined with the reid_string function, which takes the output of deid_string and reconstructs the original input to deid_string.
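from pyDeid import deid_string, reid_string
original_string = 'Elijah Wood starred in The Lord of the Rings, released on 10 December 2001.'
# deid_string returns the PHI found along with the de-identified text
phi, new_string = deid_string(original_string)
# reid_string uses the PHI output to restore the original text
reidentified_string = reid_string(new_string, phi)
print(original_string == reidentified_string)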
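To see why a round trip like this is possible in principle, consider the following conceptual sketch. It does not reflect the structure of pyDeid's phi output; the functions `replace_spans` and `restore_spans`, the log format, and the surrogate values are all hypothetical. The idea is simply that recording where each replacement landed, together with the text it replaced, is enough to reverse the substitution.

```python
def replace_spans(text, spans):
    """Replace (start, end, surrogate) spans in text.

    Returns the new text and a log of (new_start, new_end, original) entries
    describing where each surrogate sits in the new text.
    """
    pieces, log, cursor, offset = [], [], 0, 0
    for start, end, surrogate in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(surrogate)
        new_start = start + offset
        log.append((new_start, new_start + len(surrogate), text[start:end]))
        offset += len(surrogate) - (end - start)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), log


def restore_spans(new_text, log):
    """Undo replace_spans by reinserting originals, working right to left
    so that earlier offsets stay valid."""
    for start, end, original in reversed(log):
        new_text = new_text[:start] + original + new_text[end:]
    return new_text


text = "Elijah Wood starred in The Lord of the Rings, released on 10 December 2001."
date = "10 December 2001"
date_start = text.index(date)
spans = [
    (0, len("Elijah Wood"), "John Smith"),
    (date_start, date_start + len(date), "3 March 1999"),
]
new_text, log = replace_spans(text, spans)
print(new_text)
print(restore_spans(new_text, log) == text)
```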
Additionally, the de-identification results can be visualized with spaCy.
from pyDeid import display_deid
display_deid(original_string, phi)
6 Reporting Issues
Please report any bugs or feature requests as issues to the PyDeid repository. For any bugs, please supply a minimal reproducible example to help ensure a quicker resolution.