{ "cells": [ { "cell_type": "markdown", "id": "02cd1fa3-be0f-4734-9310-1b746b36f22d", "metadata": {}, "source": [ "# Evaluating `pyDeid`" ] }, { "cell_type": "markdown", "id": "6aae382e-b48a-464a-96c0-a9cb0097d42d", "metadata": {}, "source": [ "We show how to evaluate `pyDeid` on a custom dataset of Canadian admission notes, as described in the paper\n", "\n", "> pyDeid: An Improved, Fast, Flexible, and Generalizable Rules-based Approach for De-identification of Free-text Medical Records\n", "\n", "We also show how to evaluate it on the popular `n2c2` benchmark dataset of American discharge notes, using the [ETUDE engine](https://github.com/MUSC-TBIC/etude-engine)." ] }, { "cell_type": "markdown", "id": "6d5aa805-609d-4096-9578-c58a00831acf", "metadata": {}, "source": [ "## Testing CSV output against a gold standard dataset with `CSVEvaluator`" ] }, { "cell_type": "markdown", "id": "a5e1ab74-40cf-4574-be85-46499fd2282f", "metadata": {}, "source": [ "Given a test dataset of clinical notes formatted as a `csv`, such as the one found in `tests/test.csv`, and a \"gold standard\" dataset with each note split by token and annotated with the appropriate PHI type, we can evaluate the performance of `pyDeid` using the `CSVEvaluator` class.\n", "\n", "First, we run `pyDeid` on `tests/test.csv`, outputting in `csv` format."
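, "\n", "\n", "For reference, the input is expected to contain one row per note, with columns for an encounter id, a note id, and the free-text note. As a quick sanity check of that shape, the following sketch writes a toy file with the same three column names that we pass to `pyDeid` (the toy file name and note text here are made up for illustration):\n", "\n", "```python\n", "import csv\n", "\n", "# Toy input shaped like tests/test.csv: one row per note, with the\n", "# encounter id, note id, and free-text columns pyDeid expects.\n", "rows = [\n", "    {\"genc_id\": \"1\", \"note_id\": \"Record 1\",\n", "     \"note_text\": \"Patient seen in clinic today.\"},\n", "]\n", "\n", "with open(\"toy_test.csv\", \"w\", newline=\"\") as f:\n", "    writer = csv.DictWriter(f, fieldnames=[\"genc_id\", \"note_id\", \"note_text\"])\n", "    writer.writeheader()\n", "    writer.writerows(rows)\n", "```"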
] }, { "cell_type": "code", "execution_count": 1, "id": "aedb1387-7848-45ed-bb0e-617c25372804", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Processing encounter 3, note Record 3: : 3it [00:01, 2.75it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Diagnostics:\n", " - chars/s = 149.41586216889525\n", " - s/note = 0.36363832155863446\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "from pyDeid import pyDeid\n", "\n", "pyDeid(\n", " original_file = \"../../tests/test.csv\",\n", " note_varname = \"note_text\",\n", " encounter_id_varname = \"genc_id\",\n", " note_id_varname = \"note_id\"\n", ")" ] }, { "cell_type": "markdown", "id": "539454f8-cd7d-4e7e-9e6a-eb8a36630e0c", "metadata": {}, "source": [ "From here, we use the annotated ground truth dataset.\n", "\n", "We show how to create a ground truth CSV given raw notes in the form of `./tests/test.csv`.\n", "\n", "1. Begin by tokenizing the raw notes using `tokenize_csv()`.\n", "2. Manually annotate the notes for PHI.\n", "3. Using `melt_annotations()`, combine multi-token PHI into a single entry.\n", "\n", "In particular, dates and locations are usually multi-token, so we handle them specifically." ] }, { "cell_type": "code", "execution_count": 2, "id": "2675b657-bcfa-40bf-af66-45b22e790624", "metadata": {}, "outputs": [], "source": [ "from pyDeid.phi_types import tokenize_csv, melt_annotations\n", "\n", "tokenize_csv(\n", " input_file = \"../../tests/test.csv\",\n", " output_file = \"../../tests/test_tokenized.csv\",\n", " encounter_id_varname = \"genc_id\",\n", " note_id_varname = \"note_id\",\n", " note_text_varname = \"note_text\"\n", ")" ] }, { "cell_type": "markdown", "id": "0f32579b-ee61-4b3c-a7ff-2ad13d26de85", "metadata": {}, "source": [ "Then we annotate the file `./tests/test_tokenized.csv` by adding the PHI type to the `annotation` column.\n", "\n", "Once that is complete, we combine multi-token PHI. 
In this example file, dates and locations are split across multiple tokens." ] }, { "cell_type": "code", "execution_count": 3, "id": "5e241a1f-0fb4-4a57-ae67-094c92420fa0", "metadata": {}, "outputs": [], "source": [ "melt_annotations(\n", " input_file = \"../../tests/test_tokenized.csv\",\n", " output_file = \"../../tests/ground_truth_processed.csv\",\n", " merge_annotations = [\"d\", \"l\"]\n", ")" ] }, { "cell_type": "markdown", "id": "c977fa8e-1db8-482f-8775-6a8d05f66b7f", "metadata": {}, "source": [ "Now we have a ground truth dataset that we can use to compare against the output of `pyDeid`.\n", "\n", "To do this, we use the `CSVEvaluator` class." ] }, { "cell_type": "code", "execution_count": 4, "id": "28eefc44-fd09-419f-b8be-03e5ef7ab390", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Precision: 1.0\n", " Recall: 1.0\n", " F1: 1.0\n", " \n" ] } ], "source": [ "from pyDeid.phi_types.CSVEvaluator import CSVEvaluator\n", "\n", "evaluator = CSVEvaluator()\n", "\n", "precision, recall, f1 = evaluator.add_ground_truth_file(\"../../tests/ground_truth.csv\", note_id_varname = \"note_id\")\\\n", " .add_result_file(\"../../tests/test__PHI.csv\")\\\n", " .evaluate()\n", "\n", "print(\n", " f\"\"\"\n", " Precision: {precision}\n", " Recall: {recall}\n", " F1: {f1}\n", " \"\"\"\n", ")" ] }, { "cell_type": "markdown", "id": "65ca2682-ebcd-4e64-ac88-19d7e2927a1f", "metadata": {}, "source": [ "## Using the ETUDE Engine" ] }, { "cell_type": "markdown", "id": "ee2870b0-10f2-4d38-ad01-254cb407e1be", "metadata": {}, "source": [ "The ETUDE engine is a well-established, standard tool for analyzing de-identification performance against various benchmark dataset formats, such as the `n2c2` format.\n", "\n", "In this section, we use the ETUDE engine to evaluate the performance of `pyDeid` on the `n2c2` dataset. We begin by cloning the repository."
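, "\n", "\n", "ETUDE reports raw TP/FP/FN counts per PHI category rather than scores. For reference, converting those counts into precision, recall, and F1 uses the standard definitions; the sketch below plugs in the micro-average counts from the ETUDE output shown at the end of this section:\n", "\n", "```python\n", "def prf(tp, fp, fn):\n", "    # Precision: fraction of predicted PHI spans that are correct.\n", "    precision = tp / (tp + fp) if tp + fp else 0.0\n", "    # Recall: fraction of true PHI spans that were found.\n", "    recall = tp / (tp + fn) if tp + fn else 0.0\n", "    # F1: harmonic mean of precision and recall.\n", "    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0\n", "    return precision, recall, f1\n", "\n", "# Micro-average counts from the ETUDE run below.\n", "precision, recall, f1 = prf(tp=8280, fp=1153, fn=2816)\n", "```"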
] }, { "cell_type": "raw", "id": "a37c6eaa-dfb0-44a0-9af8-1cb81fa166a1", "metadata": {}, "source": [ "!git clone git@github.com:MUSC-TBIC/etude-engine.git" ] }, { "cell_type": "markdown", "id": "be2e37cc-ac37-4e98-b04a-578de9c4bf23", "metadata": {}, "source": [ "In order to run the evaluation on `n2c2`, `pyDeid` requires the ability to read and write `xml`. This is available through the `pyDeid_n2c2()` function.\n", "\n", "Ensure that the `n2c2` dataset is saved to some directory such as `./tests/n2c2`." ] }, { "cell_type": "code", "execution_count": null, "id": "fe0c309f-a090-4881-ab8c-9cb88df1bbd2", "metadata": {}, "outputs": [], "source": [ "from pyDeid.n2c2 import pyDeid_n2c2\n", "\n", "pyDeid_n2c2(\n", " input_dir = \"path/to/n2c2_test_data\",\n", " output_dir = \"path/to/pydeid_n2c2_output\",\n", ")" ] }, { "cell_type": "markdown", "id": "34c27aea-5539-4b30-8a65-fc9a5ff24894", "metadata": {}, "source": [ "This output is essentially ready for evaluation; however, there is a significant difference between how `pyDeid` recognizes names and how names are annotated in the `n2c2` ground truth. In `pyDeid`, first and last names are considered separately, so we must separate these annotations in the `n2c2` ground truth using `split_multi_word_tags()`." ] }, { "cell_type": "code", "execution_count": null, "id": "e30a1752-832d-47d7-9162-345ee35e831d", "metadata": {}, "outputs": [], "source": [ "from pyDeid import split_multi_word_tags\n", "\n", "split_multi_word_tags(\n", " input_dir = \"path/to/n2c2_test_data\",\n", " output_dir = \"path/to/n2c2_test_data_preprocessed\"\n", ")" ] }, { "cell_type": "markdown", "id": "342ba012-bbc2-497f-95a4-009ab74510f6", "metadata": {}, "source": [ "Now we are ready to run the `ETUDE engine`.\n", "\n", "In order to compare a given reference file with a given test file, the `ETUDE engine` uses one config for the reference file and another for the tool's output. 
We provide both of the relevant configs under the `./tests` directory.\n", "\n", "With this, we run the following command from the directory in which we cloned the `ETUDE engine` repository." ] }, { "cell_type": "raw", "id": "8414a3a5-79f2-4e10-a966-b9062dd03d22", "metadata": {}, "source": [ "import os\n", "os.chdir(\"path/to/etude-engine\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "41eb8aba-832f-4af7-ad20-baf46afc42d6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "exact\tTP\tFP\tTN\tFN\n", "micro-average\t8280.0\t1153.0\t0.0\t2816.0\n", "Address\t116.0\t4.0\t0.0\t420.0\n", "Contact Information\t83.0\t48.0\t0.0\t135.0\n", "Identifiers\t81.0\t33.0\t0.0\t536.0\n", "Names\t4031.0\t845.0\t0.0\t714.0\n", "Time\t3969.0\t223.0\t0.0\t1011.0\n", "macro-average by type\t8280.0\t1153.0\t0.0\t2816.0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n", " 0%| | 0/514 [00:00