{
"cells": [
{
"cell_type": "markdown",
"id": "89a878db",
"metadata": {},
"source": [
"# Getting Started With `pyDeid`"
]
},
{
"cell_type": "markdown",
"id": "4312f594",
"metadata": {},
"source": [
"Follow the installation instructions in the [README.md](https://gitlab.smh.smhroot.net/geminidata/pydeid) before running this notebook."
]
},
{
"cell_type": "markdown",
"id": "3bb62534",
"metadata": {},
"source": [
"## Running a basic example"
]
},
{
"cell_type": "markdown",
"id": "a85e1fcb",
"metadata": {},
"source": [
"If running this notebook locally, without using a virtual environment- run the block of code below (note that `'/path/from/pip/show/pydeid'` can be found with `!pip show pydeid`):"
]
},
{
"cell_type": "raw",
"id": "55f01df0-2596-4388-b1a2-e7b50e73ebc2",
"metadata": {},
"source": [
"!pip show pydeid"
]
},
{
"cell_type": "raw",
"id": "29ffda4d-967c-4d74-8ba2-afa1bfc18a10",
"metadata": {},
"source": [
"import sys\n",
"sys.path.insert(0,'/path/from/pip/show/pydeid')"
]
},
{
"cell_type": "markdown",
"id": "f189e074",
"metadata": {},
"source": [
"For this demo, import the following functions from `pyDeid`.\n",
"\n",
"`deid_string()`, `reid_string()`, and `display_deid()` are useful shortcut functions for exploring `pyDeid` features, to test and debug, and may be useful to investigate errors if they occur during the bulk de-identification process described later in the tutorial."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "aea3ce8b",
"metadata": {},
"outputs": [],
"source": [
"from pyDeid import pyDeid, deid_string, reid_string, display_deid"
]
},
{
"cell_type": "markdown",
"id": "b01d71fc",
"metadata": {},
"source": [
"Test out the installation using the following example:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b2417be7",
"metadata": {},
"outputs": [],
"source": [
"original_string = 'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'\n",
"phi, new_string = deid_string(original_string)"
]
},
{
"cell_type": "markdown",
"id": "2f2953d4",
"metadata": {},
"source": [
"`deid_string()` takes as input a string, and outputs a `new_string()` with the PHI found in the original string replaced with surrogates, as well as a `phi` list of information regarding the found PHI:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6d068d0a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'phi_start': 0,\n",
" 'phi_end': 6,\n",
" 'phi': 'Elijah',\n",
" 'surrogate_start': 0,\n",
" 'surrogate_end': 4,\n",
" 'surrogate': 'Mary',\n",
" 'types': ['Male First Name (un)',\n",
" 'Last Name (un)',\n",
" 'First Name (followed by last name)']},\n",
" {'phi_start': 7,\n",
" 'phi_end': 11,\n",
" 'phi': 'Wood',\n",
" 'surrogate_start': 5,\n",
" 'surrogate_end': 12,\n",
" 'surrogate': 'Edwards',\n",
" 'types': ['Last Name (ambig)', 'Last Name (follows first name)']},\n",
" {'phi_start': 58,\n",
" 'phi_end': 75,\n",
" 'phi': Date(date_string='December 10, 2001', day='10', month='December', year='2001'),\n",
" 'surrogate_start': 59,\n",
" 'surrogate_end': 71,\n",
" 'surrogate': '2006-6th-Aug',\n",
" 'types': ['Month Day Year [Month-dd-yy(yy)]',\n",
" 'Month Day Year [Month dd, yy(yy)]',\n",
" 'Month Day Year [Month dd, yy(yy)]']}]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"phi"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "23f48672",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Mary Edwards starred in The Lord of the Rings, released on 2006-6th-Aug'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_string"
]
},
{
"cell_type": "markdown",
"id": "4ea6dbae",
"metadata": {},
"source": [
"## `pyDeid` Features"
]
},
{
"cell_type": "markdown",
"id": "c6293156",
"metadata": {},
"source": [
"`display_deid()` allows for visualization of the de-identification in interactive settings such as in Jupyter notebooks. This can be useful for demonstration and debugging:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "b6c64dba",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
" Elijah\n",
" NAME\n",
"\n",
" \n",
"\n",
" Wood\n",
" NAME\n",
"\n",
" starred in The Lord of the Rings, released on \n",
"\n",
" December 10, 2001\n",
" DATE\n",
"\n",
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_deid(original_string, phi)"
]
},
{
"cell_type": "markdown",
"id": "517f83b3",
"metadata": {},
"source": [
"We can also re-identify the string to return back the original string:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b9fa216c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reid_string(new_string, phi)"
]
},
{
"cell_type": "markdown",
"id": "5f55979e",
"metadata": {},
"source": [
"Some contexts may have known custom patterns for PHI that we'd like to also de-identify. \n",
"\n",
"We can do this by supplying a custom regular expression as a named argument to `deid_string()` as follows:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "0a60e1fd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Supplied custom regexes through **kwargs (see custom_regexes in docstring):\n",
"\n",
"- solar_years : \\d\\syén\n",
"\n",
"These custom patterns will be replaced with .\n",
"\n"
]
}
],
"source": [
"original_string = 'Elves use 1 yén to refer to 144 of our years'\n",
"\n",
"phi, new_string = deid_string(original_string, solar_years = '\\\\d\\syén')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "762ee3a7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Elves use to refer to 144 of our years'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_string"
]
},
{
"cell_type": "markdown",
"id": "4e6a8dad",
"metadata": {},
"source": [
"Note that the name of the argument that was supplied is used to identify the PHI type:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "7325f42c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'phi_start': 10,\n",
" 'phi_end': 15,\n",
" 'phi': '1 yén',\n",
" 'surrogate_start': 10,\n",
" 'surrogate_end': 15,\n",
" 'surrogate': '',\n",
" 'types': ['solar_years']}]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"phi"
]
},
{
"cell_type": "markdown",
"id": "22ab9daf",
"metadata": {},
"source": [
"Currently these \"custom\" regular expressions are replaced with `` placeholders, but in the future the user will be able to supply a custom replacement string generator function."
]
},
{
"cell_type": "markdown",
"id": "60d48532",
"metadata": {},
"source": [
"We also have the ability to use a `spaCy` named entity recognition pass on the string to identify any missed names. \n",
"\n",
"See how rare names are treated *without* named entity recognition:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "60b9a5a5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
" Frodo\n",
" NAME\n",
"\n",
" \n",
"\n",
" Baggins\n",
" NAME\n",
"\n",
" was born in Middle Earth.
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"phi, new_string = deid_string(\n",
" original_string, \n",
" ner_pipeline=nlp\n",
")\n",
"\n",
"display_deid(original_string, phi)"
]
},
{
"cell_type": "markdown",
"id": "68967b47",
"metadata": {},
"source": [
"However, if we have access to patient and doctor names, we will not need to use named entity recognition in our workflow.\n",
"\n",
"We can supply a list of:\n",
"\n",
"1. Patient first names (through `custom_patient_first_names`)\n",
"2. Patient last names (through `custom_patient_last_names`)\n",
"3. Doctor first names (through `custom_dr_first_names`)\n",
"4. Doctor last names (through `custom_dr_last_names`)\n",
"\n",
"See the example usage below:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "407237a4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
" Frodo\n",
" NAME\n",
"\n",
" \n",
"\n",
" Baggins\n",
" NAME\n",
"\n",
" was born in Middle Earth.
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"phi, new_string = deid_string(\n",
" original_string, \n",
" custom_patient_first_names={'Frodo'}, \n",
" custom_patient_last_names={'Baggins'}\n",
")\n",
"\n",
"display_deid(original_string, phi)"
]
},
{
"cell_type": "markdown",
"id": "30bfbd32",
"metadata": {},
"source": [
"Note that these custom namelists are supplied as Python `Sets` for fast lookup."
]
},
{
"cell_type": "markdown",
"id": "b31412cb",
"metadata": {},
"source": [
"That details for all the above options are available through the function docstring:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6515c57",
"metadata": {},
"outputs": [],
"source": [
"deid_string?"
]
},
{
"cell_type": "markdown",
"id": "6df8bc33",
"metadata": {},
"source": [
"## Bulk De-identification"
]
},
{
"cell_type": "markdown",
"id": "1593a3fa",
"metadata": {},
"source": [
"Many workflows require bulk de-identification of large CSVs containing notes with PHI. For this purpose we use `pyDeid()`."
]
},
{
"cell_type": "markdown",
"id": "113993b9",
"metadata": {},
"source": [
"The most basic usage of the function only requires the user to supply the name of the file to be identified (`original_file`), the name of the column containing a unique identifier for the encounter (`encounter_id_varname`- in our applications this will generally be the `genc_id`), and the name of the column containing the note text to be de-identified (`note_varname`).\n",
"\n",
"**Note** that if the `original_file` to de-identify contains multiple notes per encounter, an `encounter_id_varname` is not enough to uniquely identify a note. In these cases, please supply a `note_id_varname` in addition to the `encounter_id_varname`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "20183349",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processing encounter 3, note Record 3: : 3it [00:01, 2.71it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Diagnostics:\n",
" - chars/s = 147.39932995948666\n",
" - s/note = 0.36861316363016766\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"pyDeid(\n",
" original_file='../../tests/test.csv',\n",
" encounter_id_varname='genc_id',\n",
" note_id_varname='note_id',\n",
" note_varname='note_text'\n",
")"
]
},
{
"cell_type": "markdown",
"id": "51ecbfd0",
"metadata": {},
"source": [
"Note that `pyDeid` accepts many other arguments, which can be seen in the function docstring:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4dc5d7b",
"metadata": {},
"outputs": [],
"source": [
"pyDeid?"
]
},
{
"cell_type": "markdown",
"id": "1d1cd1dc",
"metadata": {},
"source": [
"## Putting It All Together"
]
},
{
"cell_type": "markdown",
"id": "d660f01a",
"metadata": {},
"source": [
"Note that just as in `deid_string()`, custom regular expressions can be supplied with named arguments through `**custom_regexes`, named entity recognition can be used through `ner_pipeline`, and custom patient and doctor names can be supplied through `custom_{dr/patient}_{first/last}_names`."
]
},
{
"cell_type": "markdown",
"id": "04d691c1",
"metadata": {},
"source": [
"In the below example, we'll combine all of `pyDeid`'s features:\n",
"\n",
"1. Custom namelists for patient first and last names.\n",
"2. Named entity recognition with `spaCy`.\n",
"3. Custom regular expressions."
]
},
{
"cell_type": "markdown",
"id": "9b696f5d-f306-4e25-9f3b-2782bf73234b",
"metadata": {},
"source": [
"By default, `phi_output_file` is saved as a `csv`. In this example however, we will output to `json`.\n",
"\n",
"We will also specify a custom output filename for the de-identified file through `new_file`, and for the found PHI details through `phi_output_file`. If these names are not specified, they will default to `{original filename without extension}__DEID.csv` and `{original filename without extension}__PHI.csv` respectively."
]
},
{
"cell_type": "markdown",
"id": "aba7ee4e",
"metadata": {},
"source": [
"First we read the \"MLLs\" into python `Sets`."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "5dd47c62",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
first_name
\n",
"
last_name
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Frodo
\n",
"
Baggins
\n",
"
\n",
"
\n",
"
1
\n",
"
Samwise
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" first_name last_name\n",
"0 Frodo Baggins\n",
"1 Samwise NaN"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"MLL_filepath='../../tests/namelist.csv'\n",
"MLL = pd.read_csv(MLL_filepath)\n",
"\n",
"MLL.head()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "e96756be",
"metadata": {},
"outputs": [],
"source": [
"first_names = set(MLL.first_name)\n",
"last_names = set(MLL.last_name)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "a1c77dc6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Supplied custom regexes through **kwargs (see custom_regexes in docstring):\n",
"\n",
"- solar_years : \\d\\syén\n",
"\n",
"These custom patterns will be replaced with .\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processing encounter 3, note Record 1: : 4it [00:01, 3.47it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Diagnostics:\n",
" - chars/s = 139.8456593424458\n",
" - s/note = 0.2878172993659973\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"pyDeid(\n",
" # specify the name and format of the input file\n",
" original_file='../../tests/basic_usage_tutorial.csv',\n",
" encounter_id_varname='id',\n",
" note_id_varname='note_id',\n",
" note_varname='text',\n",
" \n",
" # specify the name of the de-identified result file\n",
" new_file='../../tests/test_deid',\n",
" \n",
" # specify the name and format of the found PHI output file\n",
" phi_output_file='../../tests/phi.json',\n",
" phi_output_file_type='json',\n",
" \n",
" # use custom namelists since they are available\n",
" custom_patient_first_names=first_names,\n",
" custom_patient_last_names=last_names,\n",
" \n",
" # pass a spaCy NER pipeline to catch names outside our namelist\n",
" ner_pipeline=nlp,\n",
" \n",
" # use custom regex for elven year format \"yén\"\n",
" solar_years=\"\\\\d\\syén\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}