Check data coverage
data_coverage.Rd
This function facilitates data coverage checks that analysts should perform during data exploration & cohort generation. Specifically, this function checks for "gaps" in the data timeline (e.g., are there any hospitals/time periods for which GEMINI did not receive any data at all?), and also provides more detailed insights into data volume & coverage by month and hospital (e.g., what percentage of encounters have an entry in a given table?). All checks performed by this function are applied at the table level (i.e., missing values in individual columns are not considered and should be checked separately).
Usage
data_coverage(
  dbcon,
  cohort,
  table,
  plot_timeline = TRUE,
  plot_coverage = TRUE,
  hospital_label = NULL,
  hospital_group = NULL,
  custom_dates = NULL,
  as_plotly = FALSE,
  ...
)
Arguments
- dbcon
  (DBIConnection) A database connection to any GEMINI database.
- cohort
  (data.frame or data.table) Cohort table with all relevant encounters of interest, where each row corresponds to a single encounter. Must contain the following columns:
  - genc_id: GEMINI Encounter ID
  - hospital_num | hospital_id: Hospital identifier
  - discharge_date_time
- table
  (character) Which table(s) to include. If multiple, specify a character vector (e.g., table = c("lab", "pharmacy", "radiology")).
  For HPC4Health users: Please specify the full table name as it is listed in your datacut (e.g., "admdad_subset" instead of only "admdad").
- plot_timeline
  (logical) Flag indicating whether to plot an overview of data timelines by hospital and table. Note: This plot only shows a rough overview of min-max dates per table for each hospital. Gaps in this plot illustrate time periods where no data were available for >28 consecutive days. Importantly, for time periods without any gaps, overall data volume may still be low and certain entries/columns may still be missing. Users should further inspect the coverage plots (plot_coverage = TRUE) and perform additional customized checks based on their needs.
- plot_coverage
  (logical) Flag indicating whether to plot data coverage. These plots show the percentage of genc_ids with an entry in the table(s) of interest. Data coverage is plotted by hospital and discharge month (because GEMINI data are pulled based on encounters' discharge date). Users should carefully inspect these plots for any drops/gaps in coverage that may not be visible in the timeline plots (see plot_timeline above). When plotting the admdad table, the function will plot the number of encounters by discharge month to show overall data volume (note: by definition, 100% of genc_ids in GEMINI data have an entry in admdad).
- hospital_label
  (character) Optional: Name of variable in cohort table that corresponds to custom label for hospitals (e.g., letters A-E instead of hospital_num 101-105). Will be used for plotting purposes.
- hospital_group
  (character) Optional: Name of variable in cohort table that corresponds to grouping of hospitals (e.g., Teaching vs. Non-teaching). Hospitals will be grouped accordingly in all plots/output tables.
- custom_dates
  (data.frame | data.table) Optional input allowing users to specify a customized timeline (in 'yyyy-mm-dd' format) to be included for a given hospital*table combination. The user-provided input overwrites the corresponding row(s) in the lookup_data_coverage table. This can be used to exclude time periods (e.g., due to data quality issues) and generate customized timeline plots. For example, let's say you identified a data quality issue in the transfusion table at hospital 104 for discharge dates < 2019-01-01. To only include transfusion data for encounters discharged after this time period, specify:
    custom_dates <- data.frame(
      data = "transfusion",
      hospital_num = 104,
      min_date = "2019-01-01",
      max_date = "2022-06-30"
    )
  This will overwrite the min_date/max_date values for site 104 transfusion data in the lookup_data_coverage table (data for all other hospital*table combinations will remain the same). Additionally, the coverage flags by genc_id (in returned coverage_flag_enc) and timeline plot (see plot_timeline) will be adjusted according to the user-provided dates. The coverage plot (see plot_coverage) is not affected by the user-specified custom_dates input.
- as_plotly
  (logical) Will return any figures as interactive plots using plotly. Note that this may affect plot aesthetics. The flag will be ignored if the plotly package is not installed.
- ...
  Additional inputs that can be passed to control plot aesthetics in plot_timeline or plot_coverage plots, such as:
  - base_size: Font size (default = 12)
  - colors: Plot color(s) (default = gemini_colors(1))
  - hospital_group: Name of variable in cohort specifying color grouping of hospitals (e.g., Teaching/Non-teaching); for the timeline plot, this is only applied when plotting a single table (otherwise, color grouping is applied to different table names by default)
  For coverage plots only (inputs are passed to plot_over_time()):
  - time_int: Time interval used to aggregate data (e.g., by "month" = default, "quarter", or "year")
  - ylimits: Limits of y-axis (e.g., c(0, 50))
  - scales: Passed to facet wrap to control if y-scales are "fixed" (default) or "free" (only works if no ylimits specified)
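The plot_timeline argument above describes gaps as periods of more than 28 consecutive days without data. The following is a minimal sketch of that idea on toy data, in base R. The helper find_periods and the toy dates are hypothetical and not part of the package; the function's actual gap logic may differ in its details (e.g., in how the 28-day threshold is counted).

```r
# Hypothetical helper (not part of the package): split a vector of dates into
# continuous periods, starting a new period whenever consecutive dates are
# more than `max_gap_days` days apart.
find_periods <- function(dates, max_gap_days = 28) {
  dates <- sort(unique(as.Date(dates)))
  # TRUE for the first date and for every date that follows a large jump
  new_period <- c(TRUE, as.numeric(diff(dates)) > max_gap_days)
  # A date ends a period if the next date starts one, or if it is the last date
  period_end <- c(new_period[-1], TRUE)
  data.frame(min_date = dates[new_period], max_date = dates[period_end])
}

# Toy data: daily entries with no data at all in April and May 2020
lab_dates <- c(
  seq(as.Date("2020-01-01"), as.Date("2020-03-31"), by = "day"),
  seq(as.Date("2020-06-01"), as.Date("2020-08-31"), by = "day")
)

periods <- find_periods(lab_dates)
print(periods)
```

Each row of the result corresponds to one continuous period, so a hospital*table combination with more than one row has at least one gap in its timeline.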
Value
If the plotting flags are set to FALSE, this function will return a single data.table object with a flag for each genc_id indicating whether the encounter was discharged during a time period in which data for a given table (e.g., "lab") were in principle available. If the flag is FALSE, the genc_id was discharged during a time period where GEMINI did not receive any data for the table of interest. If the flag is TRUE, the genc_id was discharged during a time period where GEMINI received some data from a given hospital (however, coverage may still be low, so users are advised to perform additional coverage checks, e.g., by using the plotting features of this function).
When the plotting flags are set to TRUE (default), the function will return additional data tables (output[["data"]]) and plots (output[["plots"]]):
- If plot_timeline = TRUE: Returns timeline_plot showing an overview of data timelines by hospital for each table. Will also return the corresponding data table timeline_data that lists the min - max dates of each table per site (in long format, where tables with multiple rows per hospital indicate gaps in the data timeline).
- If plot_coverage = TRUE: Returns coverage_plot showing the percentage of genc_ids with an entry in a given table by discharge month and hospital. Will also return the plotted data as coverage_data to facilitate further inspection.
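To illustrate what the encounter-level flag means, here is a small base R sketch that derives a comparable flag from a toy timeline table. All values, the object names timeline and discharge_date, and the column name lab_available are hypothetical; this is not the function's internal code, only an illustration of the rule "discharged within a period for which data were received".

```r
# Toy timeline (hypothetical values): one row per continuous period of lab
# data per hospital; hospital 101 has a gap between its two periods
timeline <- data.frame(
  hospital_num = c(101, 101, 102),
  min_date = as.Date(c("2018-01-01", "2019-06-01", "2018-01-01")),
  max_date = as.Date(c("2018-12-31", "2020-12-31", "2020-12-31"))
)

# Toy cohort with the columns required by the cohort argument
cohort <- data.frame(
  genc_id = 1:4,
  hospital_num = c(101, 101, 102, 102),
  discharge_date_time = c(
    "2018-05-10 14:30", "2019-02-01 09:00",
    "2019-02-01 09:00", "2021-03-15 11:45"
  )
)

# Keep only the date part (first 10 characters, yyyy-mm-dd)
discharge_date <- as.Date(substr(cohort$discharge_date_time, 1, 10))

# Flag is TRUE if the discharge date falls inside any period of that hospital
cohort$lab_available <- vapply(seq_len(nrow(cohort)), function(i) {
  site <- timeline[timeline$hospital_num == cohort$hospital_num[i], ]
  any(discharge_date[i] >= site$min_date & discharge_date[i] <= site$max_date)
}, logical(1))

print(cohort[, c("genc_id", "hospital_num", "lab_available")])
```

In this toy example, encounters 2 and 4 are flagged FALSE because they were discharged in a gap or after the last available date. A TRUE flag only says that some data were received for that period, not that the encounter itself has an entry.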
Details
data_coverage provides analysts with a tool to inform their decisions about which hospitals/time periods to include in their analyses, depending on the data tables of interest. For example, if a project relies on lab data (e.g., mlaps variable), users should carefully inspect lab data coverage (see the plot_coverage argument). Even if lab data coverage is high (e.g., >95% of genc_ids have an entry in the lab table), individual genc_ids may still not exist in the lab table (e.g., because lab testing was not indicated) and individual lab columns might still have missing values. Users should carefully consider how to handle these cases depending on the context of their analyses.
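The coverage measure discussed above (percentage of genc_ids with an entry in a table, by hospital and discharge month) can be sketched in base R as follows. The toy data and the names has_lab, discharge_month, and prct_lab are hypothetical; the function computes and plots this for you, so the sketch is only meant to show what the percentages represent.

```r
# Toy cohort (hypothetical values) with the required columns
cohort <- data.frame(
  genc_id = 1:6,
  hospital_num = c(101, 101, 101, 102, 102, 102),
  discharge_date_time = c(
    "2020-01-05 10:00", "2020-01-20 16:00", "2020-02-03 08:30",
    "2020-01-11 12:00", "2020-02-14 09:15", "2020-02-27 18:45"
  )
)

# Toy lab table: an encounter can have several rows, or none at all
lab <- data.frame(genc_id = c(1, 1, 3, 4, 5))

# Table-level check: does the encounter have at least one entry?
cohort$has_lab <- cohort$genc_id %in% lab$genc_id

# Discharge month as "yyyy-mm"
cohort$discharge_month <- substr(cohort$discharge_date_time, 1, 7)

# Percentage of encounters with an entry, by hospital and discharge month
coverage <- aggregate(
  has_lab ~ hospital_num + discharge_month,
  data = cohort,
  FUN = function(x) 100 * mean(x)
)
names(coverage)[names(coverage) == "has_lab"] <- "prct_lab"

print(coverage)
```

Note that this check is at the table level only: an encounter counted as covered may still have missing values in individual lab columns.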
Warning!
Data coverage checks should generally be performed on the whole dataset, prior to applying any additional cohort inclusions/exclusions.
If you are an HPC4Health user and your datacut has been pre-filtered based on certain inclusion/exclusion criteria (e.g., diagnosis codes), please keep in mind that the coverage plots (see plot_coverage) may be skewed in smaller/pre-filtered samples. Please reach out to the GEMINI team if you need additional support.
Data coverage checks are particularly relevant for clinical data tables (e.g., lab, pharmacy, radiology, transfusions, vitals, clinical notes etc.).
This function should be used as a starting point for data coverage checks, but users are advised to perform additional checks based on their specific needs.
Examples
if (FALSE) { # \dontrun{
library(DBI)
library(RPostgreSQL) # provides the "PostgreSQL" driver
library(getPass) # interactive prompts for credentials

drv <- dbDriver("PostgreSQL")
dbcon <- DBI::dbConnect(drv,
  dbname = "db",
  host = "domain_name.ca",
  port = 1234,
  user = getPass("Enter user:"),
  password = getPass("password")
)

# cohort must contain genc_id, a hospital identifier and discharge_date_time
# (use hospital_id instead if your database has no hospital_num column)
cohort <- DBI::dbGetQuery(
  dbcon,
  "SELECT genc_id, hospital_num, discharge_date_time FROM admdad;"
)

## run function with default flags to create all plots
# Note: This might take a while to run...
coverage <- data_coverage(dbcon, cohort, table = c("admdad", "radiology"))

# get flags per encounter based on encounter's discharge date
enc_flag <- coverage[["data"]][["coverage_flag_enc"]]

# get data timeline (min-max dates)
data_timeline <- coverage[["data"]][["timeline_data"]]

# get % encounters with entry in each table by discharge month & hospital
prct_coverage <- coverage[["data"]][["coverage_data"]]

## run function without any plots
# (will only return data.table with encounter-level flag)
coverage <- data_coverage(
  dbcon,
  cohort,
  table = c("admdad", "radiology"),
  plot_timeline = FALSE,
  plot_coverage = FALSE
)
} # }
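As a further illustration, the custom_dates argument could be combined with a timeline-only run roughly as sketched below. The sites, tables, and dates are made up for illustration, and passing more than one row at a time is an assumption based on the argument description ("overwrites the corresponding row(s)"). The call itself needs a live database connection (dbcon) and a cohort table, so it is wrapped in if (FALSE) and not run.

```r
# Hypothetical exclusions: the tables, sites and dates below are made up
custom_dates <- data.frame(
  data = c("transfusion", "lab"),
  hospital_num = c(104, 102),
  min_date = c("2019-01-01", "2018-04-01"),
  max_date = c("2022-06-30", "2022-06-30")
)

# Basic sanity check before passing the table on: dates in yyyy-mm-dd format
# and each min_date before its max_date
stopifnot(all(as.Date(custom_dates$min_date) < as.Date(custom_dates$max_date)))

# Not run: requires `dbcon` and `cohort` as created in the example above
if (FALSE) {
  coverage <- data_coverage(
    dbcon,
    cohort,
    table = c("transfusion", "lab"),
    plot_coverage = FALSE, # the coverage plot ignores custom_dates anyway
    custom_dates = custom_dates
  )

  # timeline data adjusted according to the user-provided dates
  coverage[["data"]][["timeline_data"]]
}
```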