Generate Simulated Diagnosis Data Table
This function generates a simulated data table resembling the ipdiagnosis
or erdiagnosis tables, which can be used for testing or demonstration purposes.
It internally calls the sample_icd() function to sample ICD-10 codes and
accepts arguments passed on to sample_icd() for customizing the sampling scheme.
Usage
dummy_diag(
nid = 1000,
n_hospitals = 10,
cohort = NULL,
ipdiagnosis = TRUE,
diagnosis_type = NULL,
seed = NULL,
...
)
Arguments
- nid
(integer)
Number of unique encounter IDs (genc_id) to simulate. Value must be greater than 0. Optional when cohort is provided.
- n_hospitals
(integer)
Number of hospitals to simulate in the resulting data table. Optional when cohort is provided.
- cohort
(data.frame or data.table)
Optional, the administrative data frame with the columns:
genc_id (integer): GEMINI encounter ID
hospital_num (integer): hospital ID number
cohort takes precedence over the parameters nid and n_hospitals. When cohort is not NULL, nid and n_hospitals are ignored.
- ipdiagnosis
(logical)
Defaults to TRUE, which returns a simulated "ipdiagnosis" table. If FALSE, returns a simulated "erdiagnosis" table. See tables in GEMINI Data Repository Dictionary.
- diagnosis_type
(character vector)
The type(s) of diagnosis to return. Possible diagnosis types are ("M", "1", "2", "3", "4", "5", "6", "9", "W", "X", and "Y"). Regardless of the diagnosis_type input, the ipdiagnosis table always returns type "M" for the first row of each encounter.
- seed
(integer)
Optional, a number to set the random seed to.
- ...
Additional arguments for the ICD code sampling scheme. See sample_icd() for details.
Value
(data.table)
A data table containing simulated data of
genc_id, (er_)diagnosis_code, (er_)diagnosis_type, hospital_num,
and other fields found in the respective diagnosis table.
Details
To ensure the simulated table resembles the "ipdiagnosis" or "erdiagnosis" table, the following characteristics are applied to its fields:
- genc_id: Numerical identification of encounters starting from 1. The number of unique encounters is defined by n. The total number of rows is defined by nrow, where the number of rows for each encounter is random, but each encounter has at least one row.
- hospital_num: Numerical identification of hospitals from 1 to 5. All rows of an encounter are linked to a single hospital.
- diagnosis_code: "ipdiagnosis" table only. Simulated ICD-10 diagnosis codes. Each encounter can be associated with multiple diagnosis codes in long format.
- diagnosis_type: "ipdiagnosis" table only. The first row of each encounter is always assigned the diagnosis type "M". For the remaining rows, if diagnosis_type is specified by the user, diagnosis types are sampled randomly from the values provided; if diagnosis_type is NULL, diagnosis types are sampled from ("1", "2", "3", "4", "5", "6", "9", "W", "X", and "Y"), with sampling probability proportional to their prevalence in the "ipdiagnosis" table.
- diagnosis_cluster: "ipdiagnosis" table only. Proportionally sampled from values that have a prevalence of more than 1% in the "diagnosis_cluster" field of the "ipdiagnosis" table, which are ("", "A", "B").
- diagnosis_prefix: "ipdiagnosis" table only. Proportionally sampled from values that have a prevalence of more than 1% in the "diagnosis_prefix" field of the "ipdiagnosis" table, which are ("", "N", "Q", "6").
- er_diagnosis_code: "erdiagnosis" table only. Simulated ICD-10 diagnosis codes. Each encounter can be associated with multiple diagnosis codes in long format.
- er_diagnosis_type: "erdiagnosis" table only. Proportionally sampled from values that have a prevalence of more than 1% in the "er_diagnosis_type" field of the "erdiagnosis" table, which are ("", "M", "9", "3", "O").
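The row layout described above (at least one row per encounter, one hospital per encounter, type "M" on the first row) can be sketched in base R. This is not the package's implementation: the helper name sketch_diag_types(), the Poisson row counts, and the uniform sampling of the remaining types are assumptions made purely for illustration, whereas the real function samples types in proportion to their prevalence.

```r
# Hypothetical helper, not part of the package: builds a long-format table in
# which every encounter has at least one row and its first row has type "M".
sketch_diag_types <- function(nid = 5, n_hospitals = 2,
                              types = c("1", "2", "3"), seed = 1) {
  set.seed(seed)
  # 1 + a Poisson draw guarantees at least one row per encounter (assumed scheme)
  rows_per_id <- 1L + rpois(nid, lambda = 2)
  genc_id <- rep(seq_len(nid), times = rows_per_id)
  # One hospital per encounter, reused for all rows of that encounter
  hosp_per_id <- sample.int(n_hospitals, nid, replace = TRUE)
  # Uniform sampling is a simplification of prevalence-weighted sampling
  diagnosis_type <- sample(types, length(genc_id), replace = TRUE)
  # Overwrite the first row of each encounter with type "M"
  diagnosis_type[!duplicated(genc_id)] <- "M"
  data.frame(
    genc_id = genc_id,
    hospital_num = hosp_per_id[genc_id],
    diagnosis_type = diagnosis_type,
    stringsAsFactors = FALSE
  )
}

sketch <- sketch_diag_types()
```

Because the first-row rule is applied after sampling, it holds regardless of which values are passed in types, which mirrors the documented behaviour of the diagnosis_type argument.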
Note
The fields (er_)diagnosis_code, (er_)diagnosis_type, diagnosis_cluster, and diagnosis_prefix
are simulated independently.
Therefore, the simulated combinations may not reflect the interrelationships of these fields in actual data.
For example, specific diagnosis codes may be associated with specific diagnosis types,
diagnosis clusters, or diagnosis prefixes in reality.
However, these relationships are not maintained for the purpose of generating dummy data.
Users who require specific linkages between these fields should consider customizing
the output data or manually generating the desired combinations.
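As one way to customize the output, a linkage between codes and types can be imposed after simulation. The sketch below uses a small hand-written data frame as a hypothetical stand-in for the output of dummy_diag(), and the rule it applies (E11 codes on non-"M" rows get type "3") is an arbitrary assumption chosen only to show the pattern.

```r
# Hypothetical stand-in for a simulated ipdiagnosis table; values are made up
ipdiag <- data.frame(
  genc_id = c(1, 1, 2, 2, 2),
  diagnosis_code = c("E119", "I10", "J449", "E119", "N179"),
  diagnosis_type = c("M", "3", "M", "1", "W"),
  stringsAsFactors = FALSE
)

# Assumed rule for illustration: any non-"M" row with an E11 code gets type "3"
is_e11 <- grepl("^E11", ipdiag$diagnosis_code)
ipdiag$diagnosis_type[is_e11 & ipdiag$diagnosis_type != "M"] <- "3"
```

Rows with type "M" are left untouched so that the first row of each encounter keeps its default type.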
Examples
### Simulate an erdiagnosis table for 50 unique encounters across 2 hospitals:
if (FALSE) { # \dontrun{
set.seed(1)
erdiag <- dummy_diag(nid = 50, n_hospitals = 2, ipdiagnosis = FALSE)
} # }
### Simulate an erdiagnosis table for the encounters in `cohort`
cohort <- dummy_admdad()
erdiag <- dummy_diag(cohort = cohort, ipdiagnosis = FALSE) # the default, TRUE, would return ipdiagnosis
### Simulate an ipdiagnosis table with diagnosis codes starting with "E11":
if (FALSE) { # \dontrun{
set.seed(1)
ipdiag <- dummy_diag(nid = 50, n_hospitals = 20, ipdiagnosis = TRUE, pattern = "^E11")
} # }
### Simulate an ipdiagnosis table with random diagnosis codes in diagnosis type 3 or 6 only:
if (FALSE) { # \dontrun{
library(dplyr) # provides %>% and filter()
set.seed(1)
ipdiag <- dummy_diag(nid = 50, n_hospitals = 10, diagnosis_type = c("3", "6")) %>%
  filter(diagnosis_type != "M") # remove the default type "M" row from each encounter
} # }
### Simulate an ipdiagnosis table with ICD-10-CA codes:
if (FALSE) { # \dontrun{
library(RPostgreSQL) # attaches DBI and provides the "PostgreSQL" driver
drv <- dbDriver("PostgreSQL")
dbcon <- DBI::dbConnect(drv,
  dbname = "db",
  host = "172.XX.XX.XXX",
  port = 1234,
  user = getPass::getPass("Enter user:"),
  password = getPass::getPass("password")
)
set.seed(1)
ipdiag <- dummy_diag(nid = 5, n_hospitals = 2, ipdiagnosis = TRUE, dbcon = dbcon, source = "icd_lookup")
} # }