{ "cells": [ { "cell_type": "markdown", "id": "89a878db", "metadata": {}, "source": [ "# Getting Started With `pyDeid`" ] }, { "cell_type": "markdown", "id": "4312f594", "metadata": {}, "source": [ "Follow the installation instructions in the [README.md](https://gitlab.smh.smhroot.net/geminidata/pydeid) before running this notebook." ] }, { "cell_type": "markdown", "id": "3bb62534", "metadata": {}, "source": [ "## Running a Basic Example" ] }, { "cell_type": "markdown", "id": "a85e1fcb", "metadata": {}, "source": [ "If you are running this notebook locally without a virtual environment, run the code below (replace `'/path/from/pip/show/pydeid'` with the `Location` reported by `!pip show pydeid`):" ] }, { "cell_type": "raw", "id": "55f01df0-2596-4388-b1a2-e7b50e73ebc2", "metadata": {}, "source": [ "!pip show pydeid" ] }, { "cell_type": "raw", "id": "29ffda4d-967c-4d74-8ba2-afa1bfc18a10", "metadata": {}, "source": [ "import sys\n", "sys.path.insert(0, '/path/from/pip/show/pydeid')" ] }, { "cell_type": "markdown", "id": "f189e074", "metadata": {}, "source": [ "For this demo, import the following functions from `pyDeid`.\n", "\n", "`deid_string()`, `reid_string()`, and `display_deid()` are shortcut functions for exploring `pyDeid` features, testing, and debugging. They are also useful for investigating errors that occur during the bulk de-identification process described later in the tutorial." 
] }, { "cell_type": "code", "execution_count": 3, "id": "aea3ce8b", "metadata": {}, "outputs": [], "source": [ "from pyDeid import pyDeid, deid_string, reid_string, display_deid" ] }, { "cell_type": "markdown", "id": "b01d71fc", "metadata": {}, "source": [ "Test the installation with the following example:" ] }, { "cell_type": "code", "execution_count": 4, "id": "b2417be7", "metadata": {}, "outputs": [], "source": [ "original_string = 'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'\n", "phi, new_string = deid_string(original_string)" ] }, { "cell_type": "markdown", "id": "2f2953d4", "metadata": {}, "source": [ "`deid_string()` takes a string as input and returns two values: a `phi` list describing each piece of PHI found, and a `new_string` in which that PHI is replaced with surrogates:" ] }, { "cell_type": "code", "execution_count": 5, "id": "6d068d0a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'phi_start': 0,\n", " 'phi_end': 6,\n", " 'phi': 'Elijah',\n", " 'surrogate_start': 0,\n", " 'surrogate_end': 4,\n", " 'surrogate': 'Mary',\n", " 'types': ['Male First Name (un)',\n", " 'Last Name (un)',\n", " 'First Name (followed by last name)']},\n", " {'phi_start': 7,\n", " 'phi_end': 11,\n", " 'phi': 'Wood',\n", " 'surrogate_start': 5,\n", " 'surrogate_end': 12,\n", " 'surrogate': 'Edwards',\n", " 'types': ['Last Name (ambig)', 'Last Name (follows first name)']},\n", " {'phi_start': 58,\n", " 'phi_end': 75,\n", " 'phi': Date(date_string='December 10, 2001', day='10', month='December', year='2001'),\n", " 'surrogate_start': 59,\n", " 'surrogate_end': 71,\n", " 'surrogate': '2006-6th-Aug',\n", " 'types': ['Month Day Year [Month-dd-yy(yy)]',\n", " 'Month Day Year [Month dd, yy(yy)]',\n", " 'Month Day Year [Month dd, yy(yy)]']}]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "phi" ] }, { "cell_type": "code", "execution_count": 6, "id": "23f48672", 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Mary Edwards starred in The Lord of the Rings, released on 2006-6th-Aug'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_string" ] }, { "cell_type": "markdown", "id": "4ea6dbae", "metadata": {}, "source": [ "## `pyDeid` Features" ] }, { "cell_type": "markdown", "id": "c6293156", "metadata": {}, "source": [ "`display_deid()` allows for visualization of the de-identification in interactive settings such as in Jupyter notebooks. This can be useful for demonstration and debugging:" ] }, { "cell_type": "code", "execution_count": 7, "id": "b6c64dba", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[Elijah | NAME] [Wood | NAME] starred in The Lord of the Rings, released on [December 10, 2001 | DATE]
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_deid(original_string, phi)" ] }, { "cell_type": "markdown", "id": "517f83b3", "metadata": {}, "source": [ "We can also re-identify the string to return back the original string:" ] }, { "cell_type": "code", "execution_count": 8, "id": "b9fa216c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reid_string(new_string, phi)" ] }, { "cell_type": "markdown", "id": "5f55979e", "metadata": {}, "source": [ "Some contexts may have known custom patterns for PHI that we'd like to also de-identify. \n", "\n", "We can do this by supplying a custom regular expression as a named argument to `deid_string()` as follows:" ] }, { "cell_type": "code", "execution_count": 9, "id": "0a60e1fd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Supplied custom regexes through **kwargs (see custom_regexes in docstring):\n", "\n", "- solar_years : \\d\\syén\n", "\n", "These custom patterns will be replaced with .\n", "\n" ] } ], "source": [ "original_string = 'Elves use 1 yén to refer to 144 of our years'\n", "\n", "phi, new_string = deid_string(original_string, solar_years = '\\\\d\\syén')" ] }, { "cell_type": "code", "execution_count": 10, "id": "762ee3a7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Elves use to refer to 144 of our years'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_string" ] }, { "cell_type": "markdown", "id": "4e6a8dad", "metadata": {}, "source": [ "Note that the name of the argument that was supplied is used to identify the PHI type:" ] }, { "cell_type": "code", "execution_count": 11, "id": "7325f42c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'phi_start': 10,\n", " 
'phi_end': 15,\n", " 'phi': '1 yén',\n", " 'surrogate_start': 10,\n", " 'surrogate_end': 15,\n", " 'surrogate': '<PHI>',\n", " 'types': ['solar_years']}]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "phi" ] }, { "cell_type": "markdown", "id": "22ab9daf", "metadata": {}, "source": [ "Currently, text matching a \"custom\" regular expression is replaced with a `<PHI>` placeholder; in the future, the user will be able to supply a custom replacement string generator function." ] }, { "cell_type": "markdown", "id": "60d48532", "metadata": {}, "source": [ "`pyDeid` can also run a `spaCy` named entity recognition pass over the string to identify any names that were otherwise missed. \n", "\n", "See how rare names are treated *without* named entity recognition:" ] }, { "cell_type": "code", "execution_count": 12, "id": "60b9a5a5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Frodo Baggins was born in Middle Earth.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "original_string = 'Frodo Baggins was born in Middle Earth.'\n", "\n", "phi, new_string = deid_string(original_string)\n", "\n", "display_deid(original_string, phi)" ] }, { "cell_type": "markdown", "id": "28fe56e6", "metadata": {}, "source": [ "And *with* named entity recognition:" ] }, { "cell_type": "code", "execution_count": 13, "id": "6bf7e72d-c08b-4b08-b8a8-bd4858004948", "metadata": {}, "outputs": [], "source": [ "import spacy\n", "\n", "nlp = spacy.load(\"en_core_web_lg\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "30156f62", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[Frodo | NAME] [Baggins | NAME] was born in Middle Earth.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "phi, new_string = deid_string(\n", " original_string, \n", " ner_pipeline=nlp\n", ")\n", "\n", "display_deid(original_string, phi)" ] }, { "cell_type": "markdown", "id": "68967b47", "metadata": {}, "source": [ "However, if we have access to patient and doctor names, we will not need to use named entity recognition in our workflow.\n", "\n", "We can supply a list of:\n", "\n", "1. Patient first names (through `custom_patient_first_names`)\n", "2. Patient last names (through `custom_patient_last_names`)\n", "3. Doctor first names (through `custom_dr_first_names`)\n", "4. Doctor last names (through `custom_dr_last_names`)\n", "\n", "See the example usage below:" ] }, { "cell_type": "code", "execution_count": 15, "id": "407237a4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[Frodo | NAME] [Baggins | NAME] was born in Middle Earth.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "phi, new_string = deid_string(\n", " original_string, \n", " custom_patient_first_names={'Frodo'}, \n", " custom_patient_last_names={'Baggins'}\n", ")\n", "\n", "display_deid(original_string, phi)" ] }, { "cell_type": "markdown", "id": "30bfbd32", "metadata": {}, "source": [ "Note that these custom namelists are supplied as Python `Sets` for fast lookup." ] }, { "cell_type": "markdown", "id": "b31412cb", "metadata": {}, "source": [ "That details for all the above options are available through the function docstring:" ] }, { "cell_type": "code", "execution_count": null, "id": "f6515c57", "metadata": {}, "outputs": [], "source": [ "deid_string?" ] }, { "cell_type": "markdown", "id": "6df8bc33", "metadata": {}, "source": [ "## Bulk De-identification" ] }, { "cell_type": "markdown", "id": "1593a3fa", "metadata": {}, "source": [ "Many workflows require bulk de-identification of large CSVs containing notes with PHI. For this purpose we use `pyDeid()`." ] }, { "cell_type": "markdown", "id": "113993b9", "metadata": {}, "source": [ "The most basic usage of the function only requires the user to supply the name of the file to be identified (`original_file`), the name of the column containing a unique identifier for the encounter (`encounter_id_varname`- in our applications this will generally be the `genc_id`), and the name of the column containing the note text to be de-identified (`note_varname`).\n", "\n", "**Note** that if the `original_file` to de-identify contains multiple notes per encounter, an `encounter_id_varname` is not enough to uniquely identify a note. In these cases, please supply a `note_id_varname` in addition to the `encounter_id_varname`." 
] }, { "cell_type": "code", "execution_count": 17, "id": "20183349", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Processing encounter 3, note Record 3: : 3it [00:01, 2.71it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Diagnostics:\n", " - chars/s = 147.39932995948666\n", " - s/note = 0.36861316363016766\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "pyDeid(\n", " original_file='../../tests/test.csv',\n", " encounter_id_varname='genc_id',\n", " note_id_varname='note_id',\n", " note_varname='note_text'\n", ")" ] }, { "cell_type": "markdown", "id": "51ecbfd0", "metadata": {}, "source": [ "Note that `pyDeid` accepts many other arguments, which can be seen in the function docstring:" ] }, { "cell_type": "code", "execution_count": null, "id": "c4dc5d7b", "metadata": {}, "outputs": [], "source": [ "pyDeid?" ] }, { "cell_type": "markdown", "id": "1d1cd1dc", "metadata": {}, "source": [ "## Putting It All Together" ] }, { "cell_type": "markdown", "id": "d660f01a", "metadata": {}, "source": [ "Note that just as in `deid_string()`, custom regular expressions can be supplied with named arguments through `**custom_regexes`, named entity recognition can be used through `ner_pipeline`, and custom patient and doctor names can be supplied through `custom_{dr/patient}_{first/last}_names`." ] }, { "cell_type": "markdown", "id": "04d691c1", "metadata": {}, "source": [ "In the below example, we'll combine all of `pyDeid`'s features:\n", "\n", "1. Custom namelists for patient first and last names.\n", "2. Named entity recognition with `spaCy`.\n", "3. Custom regular expressions." ] }, { "cell_type": "markdown", "id": "9b696f5d-f306-4e25-9f3b-2782bf73234b", "metadata": {}, "source": [ "By default, `phi_output_file` is saved as a `csv`. 
In this example, however, we will write it as `json`.\n", "\n", "We will also specify a custom output filename for the de-identified file through `new_file`, and for the found PHI details through `phi_output_file`. If these names are not specified, they will default to `{original filename without extension}__DEID.csv` and `{original filename without extension}__PHI.csv` respectively." ] }, { "cell_type": "markdown", "id": "aba7ee4e", "metadata": {}, "source": [ "First, we read the \"MLLs\" into Python `set`s." ] }, { "cell_type": "code", "execution_count": 19, "id": "5dd47c62", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table><thead><tr><th></th><th>first_name</th><th>last_name</th></tr></thead><tbody><tr><th>0</th><td>Frodo</td><td>Baggins</td></tr><tr><th>1</th><td>Samwise</td><td>NaN</td></tr></tbody></table>
" ], "text/plain": [ " first_name last_name\n", "0 Frodo Baggins\n", "1 Samwise NaN" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "MLL_filepath='../../tests/namelist.csv'\n", "MLL = pd.read_csv(MLL_filepath)\n", "\n", "MLL.head()" ] }, { "cell_type": "code", "execution_count": 20, "id": "e96756be", "metadata": {}, "outputs": [], "source": [ "first_names = set(MLL.first_name)\n", "last_names = set(MLL.last_name)" ] }, { "cell_type": "code", "execution_count": 21, "id": "a1c77dc6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Supplied custom regexes through **kwargs (see custom_regexes in docstring):\n", "\n", "- solar_years : \\d\\syén\n", "\n", "These custom patterns will be replaced with .\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Processing encounter 3, note Record 1: : 4it [00:01, 3.47it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Diagnostics:\n", " - chars/s = 139.8456593424458\n", " - s/note = 0.2878172993659973\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "pyDeid(\n", " # specify the name and format of the input file\n", " original_file='../../tests/basic_usage_tutorial.csv',\n", " encounter_id_varname='id',\n", " note_id_varname='note_id',\n", " note_varname='text',\n", " \n", " # specify the name of the de-identified result file\n", " new_file='../../tests/test_deid',\n", " \n", " # specify the name and format of the found PHI output file\n", " phi_output_file='../../tests/phi.json',\n", " phi_output_file_type='json',\n", " \n", " # use custom namelists since they are available\n", " custom_patient_first_names=first_names,\n", " custom_patient_last_names=last_names,\n", " \n", " # pass a spaCy NER pipeline to catch names outside our namelist\n", " ner_pipeline=nlp,\n", " \n", " # use custom regex for elven year format \"yén\"\n", " 
solar_years=\"\\\\d\\syén\"\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }