pyDeid package
Subpackages
- pyDeid.phi_types package
- Submodules
- pyDeid.phi_types.AddressPHIFinder module
- pyDeid.phi_types.AmericanPIIFinder module
- pyDeid.phi_types.DatesPHIFinder module
Date
DatesPHIFinder
DatesPHIFinder.date()
DatesPHIFinder.date_range()
DatesPHIFinder.date_with_context_check()
DatesPHIFinder.days
DatesPHIFinder.find()
DatesPHIFinder.find_time()
DatesPHIFinder.holiday()
DatesPHIFinder.monthly()
DatesPHIFinder.months
DatesPHIFinder.name
DatesPHIFinder.season_year()
DatesPHIFinder.seasons
DatesPHIFinder.year_with_context_check()
Time
- pyDeid.phi_types.EmailPHIFinder module
- pyDeid.phi_types.HospitalNamePHIFinder module
- pyDeid.phi_types.MrnPHIFinder module
- pyDeid.phi_types.NamesPHIFinder module
- pyDeid.phi_types.OhipPHIFinder module
- pyDeid.phi_types.PHITypeFinder module
- pyDeid.phi_types.PostalCodePHIFinder module
- pyDeid.phi_types.SinPHIFinder module
- pyDeid.phi_types.TelephoneFaxPHIFinder module
- pyDeid.phi_types.utils module
- Module contents
- pyDeid.process_note package
- pyDeid.wordlists package
Submodules
pyDeid.CSVEvaluator module
- class pyDeid.CSVEvaluator.CSVEvaluator¶
Bases:
object
- add_ground_truth_file(filepath: str, encounter_id_varname: str = 'genc_id', note_id_varname: str | None = None, start_index_varname: str = 'start', end_index_varname: str = 'end', annotation_varname: str = 'annotation', ignore_annotations: List[str] | None = ['t', 'T']) None ¶
Set the ground truth file and its column names.
- Parameters:
filepath (str) – Path to the ground truth CSV file.
encounter_id_varname (str) – Name of the encounter ID column.
note_id_varname (Optional[str]) – Name of the note ID column (if applicable).
start_index_varname (str) – Name of the start index column.
end_index_varname (str) – Name of the end index column.
annotation_varname (str) – Name of the annotation column.
- add_result_file(filepath: str, encounter_id_varname: str = 'encounter_id', phi_start_varname: str = 'phi_start', phi_end_varname: str = 'phi_end', types_varname: str = 'types') None ¶
Set the result file and its column names.
- Parameters:
filepath (str) – Path to the result CSV file.
encounter_id_varname (str) – Name of the encounter ID column.
phi_start_varname (str) – Name of the PHI start index column.
phi_end_varname (str) – Name of the PHI end index column.
types_varname (str) – Name of the PHI types column.
- calculate_metrics() Dict[str, float] ¶
Calculate precision, recall, and F1 score based on manual and de-identified annotations.
Returns: Dict[str, float]: A dictionary containing ‘precision’, ‘recall’, and ‘f1_score’.
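As a rough illustration of how the three metrics relate, the sketch below scores predicted PHI spans against ground-truth spans by exact (start, end) match. The exact-match rule and the `span_metrics` helper are assumptions made for this example; CSVEvaluator may match spans differently.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets; hypothetical representation


def span_metrics(truth: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Score predicted spans against ground truth using exact span equality."""
    true_positives = len(truth & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1_score = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1_score}


truth = {(0, 4), (10, 15), (20, 28)}
predicted = {(0, 4), (10, 15), (30, 35), (40, 42)}
metrics = span_metrics(truth, predicted)
```

Here two of four predictions are correct (precision 0.5) and two of three true spans are found (recall 2/3).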
- combine_files() None ¶
Merge ground truth and result files, handling edge cases.
- evaluate() Dict[str, float] ¶
Evaluate the de-identification results and return metrics.
Returns: Dict[str, float]: A dictionary containing ‘precision’, ‘recall’, and ‘f1_score’.
- standardize_phi() None ¶
Standardize PHI types in the merged DataFrame.
- static validate_and_read_file(filepath, columns) DataFrame ¶
Validate that all required columns are present in input files.
- pyDeid.CSVEvaluator.melt_annotations(input_file: str, output_file: str, merge_annotations: List[str] | None = None) None ¶
Combine annotated tokens that are part of the same “multitoken” for specified annotations.
This function reads a CSV file with tokenized text and annotations, and combines tokens that have the same annotation (from the specified list) and are close to each other (less than 3 characters apart).
- Parameters:
input_file (str) – Path to the input CSV file.
output_file (str) – Path to the output CSV file to be created.
merge_annotations (Optional[List[str]]) – List of annotations for which tokens should be merged. If None, all non-empty annotations will be considered for merging.
- Returns:
None
Note
The input CSV should have the following columns: [encounter_id, note_id, token, start, end, annotation] Only tokens with annotations in the merge_annotations list will be combined.
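The merging rule described above can be sketched on in-memory rows as follows. This is not the library's implementation: `melt_tokens` is a hypothetical helper, rows are assumed to be sorted by position, and joining merged tokens with a single space is an assumption.

```python
from typing import Dict, List, Optional


def melt_tokens(rows: List[Dict], merge_annotations: Optional[List[str]] = None,
                max_gap: int = 3) -> List[Dict]:
    """Merge neighbouring tokens that share an annotation and are < max_gap apart."""
    merged: List[Dict] = []
    for row in rows:
        label = row["annotation"]
        # Empty annotations are never merged; None means "all non-empty labels".
        mergeable = bool(label) and (merge_annotations is None or label in merge_annotations)
        prev = merged[-1] if merged else None
        if (prev is not None and mergeable
                and prev["annotation"] == label
                and prev["note_id"] == row["note_id"]
                and row["start"] - prev["end"] < max_gap):
            prev["token"] = prev["token"] + " " + row["token"]  # assumed join character
            prev["end"] = row["end"]
        else:
            merged.append(dict(row))
    return merged


rows = [
    {"note_id": 1, "token": "John", "start": 4, "end": 8, "annotation": "n"},
    {"note_id": 1, "token": "Smith", "start": 9, "end": 14, "annotation": "n"},
    {"note_id": 1, "token": "saw", "start": 15, "end": 18, "annotation": ""},
    {"note_id": 1, "token": "pt", "start": 19, "end": 21, "annotation": ""},
]
merged = melt_tokens(rows)
```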
- pyDeid.CSVEvaluator.tokenize_csv(input_file: str, output_file: str, encounter_id_varname: str = 'genc_id', note_id_varname: str = 'note_id', note_text_varname: str = 'note_text') None ¶
Tokenize the text in a CSV file and write the results to a new CSV file.
This function reads a CSV file, tokenizes the text in a specified column, and writes the tokens along with their start and end indices to a new CSV file. It preserves the encounter ID and note ID from the input file.
- Parameters:
input_file (str) – Path to the input CSV file.
output_file (str) – Path to the output CSV file to be created.
encounter_id_varname (str, optional) – Name of the column containing encounter IDs. Defaults to “genc_id”.
note_id_varname (str, optional) – Name of the column containing note IDs. Defaults to “note_id”.
note_text_varname (str, optional) – Name of the column containing the text to be tokenized. Defaults to “note_text”.
- Raises:
ValueError – If any of the specified column names are not found in the input file.
- Returns:
None
Note
The function uses NLTK’s word_tokenize for tokenization.
Punctuation is removed from the end of each token.
Empty tokens after cleaning are skipped.
The output CSV will have the following columns: [encounter_id_varname, note_id_varname, ‘token’, ‘start’, ‘end’]
The ‘start’ and ‘end’ columns represent the character indices of each token in the original text.
Example
>>> tokenize_csv('input.csv', 'output.csv', 'encounter_id', 'note_id', 'text')
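To make the output columns concrete, here is a minimal sketch of token extraction with character offsets. It splits on whitespace as a stand-in for NLTK's word_tokenize, so token boundaries will differ from the real function; `tokenize_with_offsets` is a hypothetical helper, and treating 'end' as exclusive is an assumption.

```python
import re
import string


def tokenize_with_offsets(text):
    """Return tokens with start/end character offsets into the original text."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        # Remove punctuation from the end of the token, as the notes above describe.
        token = match.group().rstrip(string.punctuation)
        if not token:
            continue  # skip tokens that are empty after cleaning
        start = match.start()
        tokens.append({"token": token, "start": start, "end": start + len(token)})
    return tokens


tokens = tokenize_with_offsets("Seen by Dr. Smith, on 2020-01-05 !")
```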
pyDeid.Deidentifier module
- class pyDeid.Deidentifier.Deidentifier¶
Bases:
object
- run(verbose=True)¶
Generic run function that will direct to the appropriate ._run_on_X() method depending on the input file type.
Currently we only support CSVs but this will allow us to expand to XML, etc.
- Parameters:
verbose (bool) – Whether or not to print a progress bar with de-identification progress updates.
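Since pyDeidBuilder.build() returns a Deidentifier, one possible end-to-end flow is sketched below. It relies only on the method names and signatures documented on this page and has not been run against the library; the wrapper `deidentify_csv` and the chosen PHI types are illustrative.

```python
def deidentify_csv(path):
    """Configure a Deidentifier through the builder and run it on a CSV file."""
    # Imported lazily so this sketch can be loaded without pyDeid installed.
    from pyDeid.pyDeidBuilder import pyDeidBuilder

    deidentifier = (
        pyDeidBuilder()
        .set_input_file(path, encounter_id_varname="genc_id", note_varname="note_text")
        .replace_phi(enable_replace=True, return_surrogates=True)
        .set_phi_types(["names", "dates", "contact"])
        .set_valid_years(two_digit_threshold=30, valid_year_low=1900, valid_year_high=2050)
        .build()
    )
    # run() dispatches on the input file type; only CSV is currently supported.
    deidentifier.run(verbose=False)
```

Each setter returns the builder itself, which is what makes the chained style possible.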
pyDeid.cli module
pyDeid.n2c2 module
pyDeid.pyDeid module
- pyDeid.pyDeid.deid_string(note: str, custom_dr_first_names: Set[str] = None, custom_dr_last_names: Set[str] = None, custom_patient_first_names: Set[str] = None, custom_patient_last_names: Set[str] = None, named_entity_recognition: bool = False, two_digit_threshold: int = 30, valid_year_low: int = 1900, valid_year_high: int = 2050, types: List[str] = ['names', 'dates', 'sin', 'ohip', 'mrn', 'locations', 'hospitals', 'contact'], **custom_regexes: str)¶
Remove and replace PHI from a single string for debugging
- Parameters:
note – String with PHI to de-identify.
custom_dr_first_names – (Optional) set containing site-specific physician first names, generally taken from the physician mapping file. This set should exist in RAM and will remain in RAM during de-identification.
custom_dr_last_names – (Optional) set similar to custom_dr_first_names.
custom_patient_first_names – (Optional) set containing site-specific patient first names, generally taken from the master linking log. This set should exist in RAM and will remain in RAM during de-identification.
custom_patient_last_names – (Optional) set similar to custom_patient_first_names.
named_entity_recognition – Whether to use NER as implemented in the spaCy package for better detection of names.
detect_only – Whether to output only the detected PHI, without the de-identified string.
**custom_regexes – These are named arguments that will be taken as regexes to be scrubbed from the given note. The keyword/argument name itself will be used to label the PHI in the phi_output. Note that all custom patterns will be replaced with <PHI>.
- Returns:
If detect_only=True, only a dictionary of found PHI is returned. Otherwise, a tuple is returned whose first element is a dictionary of found PHI and whose second element is the de-identified string.
- Return type:
None
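The **custom_regexes mechanism, where the keyword name becomes the PHI label, can be illustrated with a small standalone sketch. `find_custom_phi` is a hypothetical helper, the dictionary keys are borrowed from the result-file column defaults on this page as an assumption, and overlapping matches are not handled.

```python
import re


def find_custom_phi(note, **custom_regexes):
    """Label each regex match with its keyword name and scrub it with <PHI>."""
    found = []
    for label, pattern in custom_regexes.items():
        for match in re.finditer(pattern, note):
            found.append({
                "phi": match.group(),
                "phi_start": match.start(),
                "phi_end": match.end(),
                "types": label,  # the keyword name is used as the PHI label
            })
    found.sort(key=lambda item: item["phi_start"])
    scrubbed = note
    # Replace from the end so earlier offsets stay valid.
    for item in reversed(found):
        scrubbed = scrubbed[:item["phi_start"]] + "<PHI>" + scrubbed[item["phi_end"]:]
    return found, scrubbed


note = "Study ID ABC-1234 enrolled at site 07."
found, scrubbed = find_custom_phi(note, study_id=r"[A-Z]{3}-\d{4}", site=r"site \d{2}")
```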
- pyDeid.pyDeid.display_deid(original_string, phi)¶
Visualize pyDeid output with the help of spaCy
- Parameters:
original_string – Text which was passed through a de-identification function.
phi – A list of dictionaries with start and end positions for the original PHI in the original string, the PHI itself which was replaced, and the start and end positions for the surrogate it was replaced with. This type of data structure is the same as what is output by deid_string.
- pyDeid.pyDeid.pyDeid(original_file: str | Path, encounter_id_varname: str = 'genc_id', note_varname: str = 'note_text', note_id_varname: str | None = None, enable_replace: bool = True, return_surrogates: bool = True, max_field_size: Literal['auto', 131072] | int = 131072, file_encoding: str = 'utf-8', read_error_handling: str = None, new_file: str | Path | None = None, phi_output_file: str | Path | None = None, phi_output_file_type: Literal['json', 'csv'] = 'csv', mll_file: str | None = None, named_entity_recognition: bool = False, two_digit_threshold: int = 30, valid_year_low: int = 1900, valid_year_high: int = 2050, custom_dr_first_names: Set[str] | None = None, custom_dr_last_names: Set[str] | None = None, custom_patient_first_names: Set[str] | None = None, custom_patient_last_names: Set[str] | None = None, verbose: bool = True, types: List[str] = ['names', 'dates', 'sin', 'ohip', 'mrn', 'locations', 'hospitals', 'contact'], encounter_id_varname_mll: str | None = 'genc_id', num_threads: int | None = 1, **custom_regexes: CustomRegex | str)¶
Remove and replace PHI from free text
- Parameters:
original_file – Path to the original file containing all PHI in CSV format.
new_file – Desired path to write the output CSV. If not supplied, will default to {original filename without extension}__DEID.csv.
phi_output_file –
If phi_output_file_type == ‘json’, the desired path to write the output JSON. If phi_output_file_type == ‘csv’, the desired path to write the output CSV. If not supplied, will default to:
{original filename without extension}__PHI.{phi_output_file_type}.
note_varname – Column name in original_file with the free text note to de-identify.
encounter_id_varname – Column name in original_file with the encounter-level ID. Can be any identifier column and need not be unique.
note_id_varname – Column name in original_file with the note-level ID. Could be any unique identifier column.
phi_output_file_type – If ‘json’, will format the phi_output_file into an efficient JSON nested data structure, which is lighter on disk space. If ‘csv’, will output a tidy dataframe formatted as a CSV to phi_output_file. This data structure contains redundant information, but is lighter on memory.
custom_dr_first_names – (Optional) set containing site-specific physician first names, generally taken from the physician mapping file. This set should exist in RAM and will remain in RAM during de-identification.
custom_dr_last_names – (Optional) set similar to custom_dr_first_names.
custom_patient_first_names – (Optional) set containing site-specific patient first names, generally taken from the master linking log. This set should exist in RAM and will remain in RAM during de-identification.
custom_patient_last_names – (Optional) set similar to custom_patient_first_names.
verbose – Show a progress bar while running through the file with information about the current note being processed.
named_entity_recognition – Whether to use NER as implemented in the spaCy package for better detection of names.
file_encoding – Encoding of the file being read, which is also used for the file to which the result is written. Defaults to ‘utf-8’.
read_error_handling – How to handle characters in the input file that cannot be decoded with the specified encoding. See the errors argument in the documentation of Python’s built-in open. Use ignore to skip such characters, replace to substitute a placeholder character, etc.
max_field_size – For very large notes, prevents _csv.Error: field larger than field limit. ‘auto’ will find the max size that does not result in an OverflowError. The default is usually 131072.
regex_replace – Indicate if replacing PHIs using regex is desired or not
mll_file – Filepath for MLL if the MLL replacement option is desired.
num_threads – Number of parallel processing workers to use.
**custom_regexes – These are named arguments that will be taken as regexes to be scrubbed from the given note. The keyword/argument name itself will be used to label the PHI in the phi_output. Note that all custom patterns will be replaced with <PHI>.
- Returns:
Nothing is explicitly returned. As side effects, a de-identified CSV file is written to new_file and the PHI that was found is written to phi_output_file.
- Return type:
None
- pyDeid.pyDeid.reid_string(x: str, phi: List[Dict[str, int | str]])¶
Replace surrogates from a single string with original PHI
- Parameters:
x – String with surrogates to re-identify.
phi – A list of dictionaries with start and end positions for the original PHI in the original string, the PHI itself which was replaced, and the start and end positions for the surrogate it was replaced with. This type of data structure is the same as what is output by deid_string.
- Returns:
The original string which was de-identified, with surrogates replaced with original PHI.
- Return type:
str
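The idea behind re-identification can be sketched as splicing each original PHI back over its surrogate span. The key names `phi`, `surrogate_start` and `surrogate_end` are assumptions for this example; check the actual output of deid_string for the real keys.

```python
def reidentify(text, phi):
    """Replace each surrogate span in text with the original PHI it stands for."""
    restored = text
    # Work from the end of the string so earlier offsets remain valid.
    for item in sorted(phi, key=lambda p: p["surrogate_start"], reverse=True):
        restored = (
            restored[:item["surrogate_start"]]
            + item["phi"]
            + restored[item["surrogate_end"]:]
        )
    return restored


deidentified = "Seen by Dr. Jones on 2019-03-02."
phi = [
    {"phi": "Smith", "surrogate_start": 12, "surrogate_end": 17},
    {"phi": "2020-01-05", "surrogate_start": 21, "surrogate_end": 31},
]
original = reidentify(deidentified, phi)
```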
pyDeid.pyDeidBuilder module
- class pyDeid.pyDeidBuilder.pyDeidBuilder¶
Bases:
object
- build()¶
Build a Deidentifier object.
- Returns:
An object that is designed to de-identify a file given the configuration specified by the pyDeidBuilder.
- Return type:
Deidentifier
- replace_phi(enable_replace=True, return_surrogates: bool = True)¶
Replaces found instances of PHI in the note to de-identify.
- Parameters:
enable_replace (bool, optional) – In the processed, de-identified output file, whether or not to replace all instances of found PHI with surrogates. Defaults to True.
return_surrogates (bool, optional) – In the PHI output file, whether or not to output which surrogates the PHI were replaced with. Defaults to True.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
- set_custom_namelists(custom_dr_first_names: Set = None, custom_dr_last_names: Set = None, custom_patient_first_names: Set = None, custom_patient_last_names: Set = None)¶
Supply custom name lists for patient and doctor names when available.
- Parameters:
custom_dr_first_names (Set, optional) – Set of known doctor first names. Defaults to None.
custom_dr_last_names (Set, optional) – Set of known doctor last names. Defaults to None.
custom_patient_first_names (Set, optional) – Set of known patient first names. Defaults to None.
custom_patient_last_names (Set, optional) – Set of known patient last names. Defaults to None.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
- set_custom_regex(pattern: str, phi_type: str = 'custom_regexes', surrogate_builder_fn: Callable = None, arguments: list = [])¶
Specify known patterns specific to a file that need to be removed.
- Parameters:
pattern (str) – A regular expression to be detected in the note.
phi_type (str, optional) – The name of the custom PHI. Defaults to “custom_regexes”.
surrogate_builder_fn (Callable, optional) – A function that takes no arguments, and returns a random surrogate for the phi_type of interest. Defaults to None, in which case matches are replaced with <PHI>.
- Raises:
ValueError – replace_phi must be called prior to defining custom_regex with a surrogate builder function.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
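The role of a surrogate builder function can be shown with a standalone sketch: every match of the pattern is replaced by a fresh call to the function, falling back to <PHI> when none is given. `replace_with_surrogates` and `random_study_id` are hypothetical names, and this is not how pyDeid applies the function internally.

```python
import random
import re

rng = random.Random(0)  # seeded only to keep the example reproducible


def random_study_id():
    """Take no arguments and return a random surrogate, as described above."""
    return "STUDY-" + "".join(rng.choice("0123456789") for _ in range(4))


def replace_with_surrogates(note, pattern, surrogate_builder_fn=None):
    """Replace each match with a new surrogate, or with <PHI> by default."""
    def build(_match):
        return surrogate_builder_fn() if surrogate_builder_fn else "<PHI>"
    return re.sub(pattern, build, note)


note = "Enrolled as ABC-1234."
with_surrogate = replace_with_surrogates(note, r"[A-Z]{3}-\d{4}", random_study_id)
with_default = replace_with_surrogates(note, r"[A-Z]{3}-\d{4}")
```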
- set_deid_output_file(new_file: str | Path = None)¶
Allows for a custom filename for the de-identified output file.
- Parameters:
new_file (Union[str, Path], optional) – Custom name for the output file. Defaults to None.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
- set_input_file(original_file: str | Path, encounter_id_varname: str = 'genc_id', note_varname: str = 'note_text', note_id_varname: str = None, max_field_size: int = 131072, file_encoding: str = 'utf-8', read_error_handling: str = None)¶
Specify the file to be de-identified.
- Parameters:
original_file (Union[str, Path]) – The path to the file to de-identify.
encounter_id_varname (str) – The unique identifier column name in the file to de-identify.
note_varname (str) – The column name of the input file that contains the actual note to be de-identified.
note_id_varname (str, optional) – When the input file contains multiple notes per encounter, the unique note identifier. Defaults to None.
max_field_size (int, optional) – Passed to csv.field_size_limit. Defaults to 131072.
file_encoding (str, optional) – Passed to csv.DictReader. Defaults to “utf-8”.
read_error_handling (str, optional) – Passed to csv.DictReader. Defaults to None.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
- set_max_field_size(max_field_size: Literal['auto'] | int = 131072)¶
Avoids _csv.Error: field larger than field limit.
- Parameters:
max_field_size (Union[Literal['auto'], int], optional) – Maximum field size allowed by the parser. Defaults to 131072.
- Raises:
ValueError – Could not determine the field size limit automatically.
ValueError – Provided field_size_limit was neither ‘auto’ nor int.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
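One common way to find a usable limit automatically is to start from a very large value and shrink it until csv.field_size_limit stops raising OverflowError. The sketch below shows that idea; `find_max_field_size` is a hypothetical helper and may not match what the ‘auto’ option does internally.

```python
import csv
import sys


def find_max_field_size(start=sys.maxsize):
    """Return the largest tried limit that csv.field_size_limit accepts."""
    limit = start
    while limit > 0:
        try:
            previous = csv.field_size_limit(limit)  # sets the limit, returns the old one
        except OverflowError:
            limit //= 10  # too large for this platform's C long; try smaller
        else:
            csv.field_size_limit(previous)  # restore the caller's setting
            return limit
    raise ValueError("Could not determine the field size limit automatically.")


limit = find_max_field_size()
```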
- set_mll(filename: str | Path, encounter_id_varname: str = 'genc_id', file_encoding: str = 'utf-8', read_error_handling: str = None, use_namelists: bool = True, patient_name_columns: Dict[str, str] = {'first_name': 'first_name', 'last_name': 'last_name'}, doctor_name_columns: Dict[str, List[str]] = {'first_name': ['disphy_first_name', 'admphy_first_name', 'mrp_first_name'], 'last_name': ['disphy_last_name', 'admphy_last_name', 'mrp_last_name']})¶
Provide a master linking log that has encounter-specific PHI that can be linked to the notes corresponding to that encounter for improved PHI detection.
- Parameters:
filename (Union[str, Path]) – Path to the master linking log. Expected to be in CSV format.
encounter_id_varname (str) – The unique identifier column name in the file to de-identify.
file_encoding (str, optional) – Passed to csv.DictReader. Defaults to “utf-8”.
read_error_handling (str, optional) – Passed to csv.DictReader. Defaults to None.
use_namelists (bool, optional) – Use ALL patient and doctor names from the MLL as a namelist for all notes.
patient_name_columns (dict[str, str], optional) – Mapping of name type to column name for patient names. Example: {“first_name”: “patient_first”, “last_name”: “patient_last”}
doctor_name_columns (dict[str, list[str]], optional) –
Mapping of name type to list of column names for doctor names. Example: {“first_name”: [“attending_first”, “consulting_first”], “last_name”: [“attending_last”, “consulting_last”]}
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
- Raises:
ValueError – If encounter_id_varname is not found in the CSV file.
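To show how the two column mappings are shaped, the sketch below gathers patient and doctor names from a small in-memory master linking log. The CSV content, its column names, and the `collect_names` helper are all made up for illustration and do not reflect pyDeid's internals.

```python
import csv
import io

MLL_CSV = """genc_id,patient_first,patient_last,attending_first,attending_last,consulting_first,consulting_last
1,Ann,Lee,Raj,Patel,Mia,Wong
2,Bob,Khan,Raj,Patel,,
"""


def collect_names(rows, patient_name_columns, doctor_name_columns):
    """Build name sets from MLL rows using the column mappings."""
    names = {
        "patient": {"first_name": set(), "last_name": set()},
        "doctor": {"first_name": set(), "last_name": set()},
    }
    for row in rows:
        # Patients: one column per name type.
        for name_type, column in patient_name_columns.items():
            if row.get(column):
                names["patient"][name_type].add(row[column])
        # Doctors: several columns may contribute to each name type.
        for name_type, columns in doctor_name_columns.items():
            for column in columns:
                if row.get(column):
                    names["doctor"][name_type].add(row[column])
    return names


names = collect_names(
    csv.DictReader(io.StringIO(MLL_CSV)),
    patient_name_columns={"first_name": "patient_first", "last_name": "patient_last"},
    doctor_name_columns={
        "first_name": ["attending_first", "consulting_first"],
        "last_name": ["attending_last", "consulting_last"],
    },
)
```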
- set_multithreading(threads: int)¶
- set_ner_pipeline(model: Language = None)¶
Adds a named entity recognition step using a spaCy NER pipeline.
- Parameters:
model (Language, optional) – Any spaCy NER pipeline. See the “Using spaCy with pyDeid” tutorial for more information. Defaults to None.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
- set_phi_output_file(phi_output_file: str | Path = None, phi_output_file_type: Literal['csv'] = 'csv')¶
Allows for a custom filename for the PHI output file.
- Parameters:
phi_output_file (Union[str, Path], optional) – Custom name for the output file. Defaults to None.
phi_output_file_type (Literal['csv'], optional) – What format to output the PHI to. Currently only supports “csv”.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
- set_phi_types(types: List[str])¶
Specify which types of PHI to de-identify.
- Parameters:
types (List[str]) – Currently, pyDeid supports names, dates, sin, ohip, mrn, locations, hospitals, and contact.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
- set_valid_years(two_digit_threshold=30, valid_year_low=1900, valid_year_high=2050)¶
Specifies cutoffs for “reasonable” years in the note. When set appropriately, reduces false positives.
- Parameters:
two_digit_threshold (int, optional) – Lowest expected two digit year. Defaults to 30.
valid_year_low (int, optional) – Lowest expected four digit year. Defaults to 1900.
valid_year_high (int, optional) – Highest expected four digit year. Defaults to 2050.
- Returns:
Instance of the pyDeidBuilder class, allowing method chaining.
- Return type:
pyDeidBuilder
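A small sketch of why year bounds reduce false positives: four-digit numbers outside the expected range are not treated as years. Only the four-digit bounds are modelled here, the bounds are assumed to be inclusive, and `plausible_years` is a hypothetical helper rather than part of pyDeid.

```python
import re


def plausible_years(text, valid_year_low=1900, valid_year_high=2050):
    """Return four-digit numbers in text that fall within the valid year range."""
    years = []
    for match in re.finditer(r"\b\d{4}\b", text):
        year = int(match.group())
        if valid_year_low <= year <= valid_year_high:  # inclusive bounds (assumption)
            years.append(year)
    return years


years = plausible_years("Born 1954, MRN 8841, follow-up in 2024, dose 1200 mg")
```

With the defaults, the MRN fragment and the dose are rejected while the two genuine years are kept.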