{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "89a878db",
   "metadata": {},
   "source": [
    "# Getting Started With `pyDeid`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4312f594",
   "metadata": {},
   "source": [
    "Follow the installation instructions in the [README.md](https://gitlab.smh.smhroot.net/geminidata/pydeid) before running this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bb62534",
   "metadata": {},
   "source": [
    "## Running a basic example"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a85e1fcb",
   "metadata": {},
   "source": [
    "If you are running this notebook locally without a virtual environment, run the code below, replacing `'/path/from/pip/show/pydeid'` with the `Location` value printed by `!pip show pydeid`:"
   ]
  },
  {
   "cell_type": "raw",
   "id": "55f01df0-2596-4388-b1a2-e7b50e73ebc2",
   "metadata": {},
   "source": [
    "!pip show pydeid"
   ]
  },
  {
   "cell_type": "raw",
   "id": "29ffda4d-967c-4d74-8ba2-afa1bfc18a10",
   "metadata": {},
   "source": [
    "import sys\n",
    "sys.path.insert(0, '/path/from/pip/show/pydeid')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f189e074",
   "metadata": {},
   "source": [
    "For this demo, import the following functions from `pyDeid`.\n",
    "\n",
    "`deid_string()`, `reid_string()`, and `display_deid()` are shortcut functions for exploring `pyDeid` features, testing, and debugging. They can also help investigate errors that occur during the bulk de-identification process described later in this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "aea3ce8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyDeid import pyDeid, deid_string, reid_string, display_deid"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b01d71fc",
   "metadata": {},
   "source": [
    "Test out the installation using the following example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b2417be7",
   "metadata": {},
   "outputs": [],
   "source": [
    "original_string = 'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'\n",
    "phi, new_string = deid_string(original_string)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f2953d4",
   "metadata": {},
   "source": [
    "`deid_string()` takes a string as input and returns two values: `phi`, a list of details about each piece of protected health information (PHI) found, and `new_string`, a copy of the original string with the PHI replaced by surrogates:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6d068d0a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'phi_start': 0,\n",
       "  'phi_end': 6,\n",
       "  'phi': 'Elijah',\n",
       "  'surrogate_start': 0,\n",
       "  'surrogate_end': 4,\n",
       "  'surrogate': 'Mary',\n",
       "  'types': ['Male First Name (un)',\n",
       "   'Last Name (un)',\n",
       "   'First Name (followed by last name)']},\n",
       " {'phi_start': 7,\n",
       "  'phi_end': 11,\n",
       "  'phi': 'Wood',\n",
       "  'surrogate_start': 5,\n",
       "  'surrogate_end': 12,\n",
       "  'surrogate': 'Edwards',\n",
       "  'types': ['Last Name (ambig)', 'Last Name (follows first name)']},\n",
       " {'phi_start': 58,\n",
       "  'phi_end': 75,\n",
       "  'phi': Date(date_string='December 10, 2001', day='10', month='December', year='2001'),\n",
       "  'surrogate_start': 59,\n",
       "  'surrogate_end': 71,\n",
       "  'surrogate': '2006-6th-Aug',\n",
       "  'types': ['Month Day Year [Month-dd-yy(yy)]',\n",
       "   'Month Day Year [Month dd, yy(yy)]',\n",
       "   'Month Day Year [Month dd, yy(yy)]']}]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "phi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "23f48672",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Mary Edwards starred in The Lord of the Rings, released on 2006-6th-Aug'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ea6dbae",
   "metadata": {},
   "source": [
    "## `pyDeid` Features"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6293156",
   "metadata": {},
   "source": [
    "`display_deid()` visualizes the de-identification in interactive settings such as Jupyter notebooks. This is useful for demonstration and debugging:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b6c64dba",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Elijah\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">NAME</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Wood\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">NAME</span>\n",
       "</mark>\n",
       " starred in The Lord of the Rings, released on \n",
       "<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    December 10, 2001\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
       "</mark>\n",
       "</div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_deid(original_string, phi)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "517f83b3",
   "metadata": {},
   "source": [
    "We can also re-identify the string, recovering the original, by passing the de-identified string and the `phi` list to `reid_string()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b9fa216c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Elijah Wood starred in The Lord of the Rings, released on December 10, 2001'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reid_string(new_string, phi)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f55979e",
   "metadata": {},
   "source": [
    "Some contexts have known custom PHI patterns that we would also like to de-identify.\n",
    "\n",
    "We can do this by supplying a custom regular expression as a named argument to `deid_string()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "0a60e1fd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Supplied custom regexes through **kwargs (see custom_regexes in docstring):\n",
      "\n",
      "- solar_years : \\d\\syén\n",
      "\n",
      "These custom patterns will be replaced with <PHI>.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "original_string = 'Elves use 1 yén to refer to 144 of our years'\n",
    "\n",
    "# raw string so the regex backslashes are passed through unchanged\n",
    "phi, new_string = deid_string(original_string, solar_years=r'\\d\\syén')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "762ee3a7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Elves use <PHI> to refer to 144 of our years'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e6a8dad",
   "metadata": {},
   "source": [
    "Note that the name of the supplied argument is recorded as the PHI type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7325f42c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'phi_start': 10,\n",
       "  'phi_end': 15,\n",
       "  'phi': '1 yén',\n",
       "  'surrogate_start': 10,\n",
       "  'surrogate_end': 15,\n",
       "  'surrogate': '<PHI>',\n",
       "  'types': ['solar_years']}]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "phi"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22ab9daf",
   "metadata": {},
   "source": [
    "Currently, matches for these custom regular expressions are replaced with the `<PHI>` placeholder. In the future, the user will be able to supply a custom replacement string generator function."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60d48532",
   "metadata": {},
   "source": [
    "We can also run a `spaCy` named entity recognition (NER) pass on the string to identify any names that were missed.\n",
    "\n",
    "See how rare names are treated *without* named entity recognition:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "60b9a5a5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">Frodo Baggins was born in Middle Earth.</div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "original_string = 'Frodo Baggins was born in Middle Earth.'\n",
    "\n",
    "phi, new_string = deid_string(original_string)\n",
    "\n",
    "display_deid(original_string, phi)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28fe56e6",
   "metadata": {},
   "source": [
    "And *with* named entity recognition:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "6bf7e72d-c08b-4b08-b8a8-bd4858004948",
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy\n",
    "\n",
    "nlp = spacy.load(\"en_core_web_lg\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "30156f62",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Frodo\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">NAME</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Baggins\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">NAME</span>\n",
       "</mark>\n",
       " was born in Middle Earth.</div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "phi, new_string = deid_string(\n",
    "    original_string, \n",
    "    ner_pipeline=nlp\n",
    ")\n",
    "\n",
    "display_deid(original_string, phi)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68967b47",
   "metadata": {},
   "source": [
    "However, if we have access to patient and doctor names, we will not need to use named entity recognition in our workflow.\n",
    "\n",
    "We can supply any of the following namelists:\n",
    "\n",
    "1. Patient first names (through `custom_patient_first_names`)\n",
    "2. Patient last names (through `custom_patient_last_names`)\n",
    "3. Doctor first names (through `custom_dr_first_names`)\n",
    "4. Doctor last names (through `custom_dr_last_names`)\n",
    "\n",
    "See the example usage below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "407237a4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Frodo\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">NAME</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Baggins\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">NAME</span>\n",
       "</mark>\n",
       " was born in Middle Earth.</div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "phi, new_string = deid_string(\n",
    "    original_string, \n",
    "    custom_patient_first_names={'Frodo'}, \n",
    "    custom_patient_last_names={'Baggins'}\n",
    ")\n",
    "\n",
    "display_deid(original_string, phi)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30bfbd32",
   "metadata": {},
   "source": [
    "Note that these custom namelists are supplied as Python `set` objects for fast lookup."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b31412cb",
   "metadata": {},
   "source": [
    "Details for all of the above options are available in the function docstring:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6515c57",
   "metadata": {},
   "outputs": [],
   "source": [
    "deid_string?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6df8bc33",
   "metadata": {},
   "source": [
    "## Bulk De-identification"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1593a3fa",
   "metadata": {},
   "source": [
    "Many workflows require bulk de-identification of large CSVs containing notes with PHI. For this purpose we use `pyDeid()`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "113993b9",
   "metadata": {},
   "source": [
    "The most basic usage of the function requires only three arguments: the name of the file to be de-identified (`original_file`), the name of the column containing a unique identifier for the encounter (`encounter_id_varname`; in our applications this will generally be the `genc_id`), and the name of the column containing the note text to be de-identified (`note_varname`).\n",
    "\n",
    "**Note** that if the `original_file` contains multiple notes per encounter, the `encounter_id_varname` alone is not enough to uniquely identify a note. In these cases, please supply a `note_id_varname` in addition to the `encounter_id_varname`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "20183349",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processing encounter 3, note Record 3: : 3it [00:01,  2.71it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Diagnostics:\n",
      "                - chars/s = 147.39932995948666\n",
      "                - s/note = 0.36861316363016766\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "pyDeid(\n",
    "    original_file='../../tests/test.csv',\n",
    "    encounter_id_varname='genc_id',\n",
    "    note_id_varname='note_id',\n",
    "    note_varname='note_text'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51ecbfd0",
   "metadata": {},
   "source": [
    "Note that `pyDeid()` accepts many other arguments, which are described in the function docstring:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4dc5d7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "pyDeid?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d1cd1dc",
   "metadata": {},
   "source": [
    "## Putting It All Together"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d660f01a",
   "metadata": {},
   "source": [
    "Just as in `deid_string()`, custom regular expressions can be supplied as named arguments through `**custom_regexes`, named entity recognition can be enabled through `ner_pipeline`, and custom patient and doctor names can be supplied through `custom_{dr/patient}_{first/last}_names`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04d691c1",
   "metadata": {},
   "source": [
    "In the example below, we'll combine all of `pyDeid`'s features:\n",
    "\n",
    "1. Custom namelists for patient first and last names.\n",
    "2. Named entity recognition with `spaCy`.\n",
    "3. Custom regular expressions."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b696f5d-f306-4e25-9f3b-2782bf73234b",
   "metadata": {},
   "source": [
    "By default, `phi_output_file` is saved as a `csv`. In this example, however, we will write it as `json` by setting `phi_output_file_type='json'`.\n",
    "\n",
    "We will also specify a custom output filename for the de-identified file through `new_file`, and for the found PHI details through `phi_output_file`. If these names are not specified, they will default to `{original filename without extension}__DEID.csv` and `{original filename without extension}__PHI.csv` respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aba7ee4e",
   "metadata": {},
   "source": [
    "First, we read the \"MLL\" namelist file into a `pandas` DataFrame and convert its name columns into Python `set` objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "5dd47c62",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Frodo</td>\n",
       "      <td>Baggins</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Samwise</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name\n",
       "0      Frodo   Baggins\n",
       "1    Samwise       NaN"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "MLL_filepath = '../../tests/namelist.csv'\n",
    "MLL = pd.read_csv(MLL_filepath)\n",
    "\n",
    "MLL.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "e96756be",
   "metadata": {},
   "outputs": [],
   "source": [
    "# drop missing values so NaN does not end up in the namelists\n",
    "first_names = set(MLL.first_name.dropna())\n",
    "last_names = set(MLL.last_name.dropna())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "a1c77dc6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Supplied custom regexes through **kwargs (see custom_regexes in docstring):\n",
      "\n",
      "- solar_years : \\d\\syén\n",
      "\n",
      "These custom patterns will be replaced with <PHI>.\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processing encounter 3, note Record 1: : 4it [00:01,  3.47it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Diagnostics:\n",
      "                - chars/s = 139.8456593424458\n",
      "                - s/note = 0.2878172993659973\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "pyDeid(\n",
    "    # specify the name and format of the input file\n",
    "    original_file='../../tests/basic_usage_tutorial.csv',\n",
    "    encounter_id_varname='id',\n",
    "    note_id_varname='note_id',\n",
    "    note_varname='text',\n",
    "    \n",
    "    # specify the name of the de-identified result file\n",
    "    new_file='../../tests/test_deid',\n",
    "    \n",
    "    # specify the name and format of the found PHI output file\n",
    "    phi_output_file='../../tests/phi.json',\n",
    "    phi_output_file_type='json',\n",
    "    \n",
    "    # use custom namelists since they are available\n",
    "    custom_patient_first_names=first_names,\n",
    "    custom_patient_last_names=last_names,\n",
    "    \n",
    "    # pass a spaCy NER pipeline to catch names outside our namelist\n",
    "    ner_pipeline=nlp,\n",
    "    \n",
    "    # use custom regex for elven year format \"yén\"\n",
    "    solar_years=\"\\\\d\\syén\"\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}