Cell Suppression and Table 1
cell_suppression_and_table1.Rmd
Introduction
Cell suppression is a technique to withhold or “suppress” confidential or identifying data in tabular formats.
It is common practice in clinical and epidemiological research to present a table of baseline patient characteristics in the study population. This is commonly referred to as Table 1.
In order to preserve patient privacy and reduce the risk of re-identification, GEMINI employs a cell suppression strategy for any Table 1 that is published or shared externally.
The general rule is to hide any subgroup with fewer than 6 patients (<= 5). There are some special cases which will be discussed in detail below.
Note that there are a variety of methods to produce a Table 1, for example the table1 and tableone packages. Both packages provide their own functionality, but neither implements cell suppression. Because of its design, table1 was chosen as the candidate for this added functionality, which is implemented in the various functions described below.
Setup
Dummy data
For this vignette, we will create a dummy dataset to summarize:
set.seed(3)
gender <- c(sample(c("M", "F"), size = 95, replace = TRUE), rep(NA, times = 5))
exposure <- sample(
  c("pre-pandemic", "pandemic", "post-pandemic"),
  size = 100, replace = TRUE
)
age <- rnorm(100, mean = 70, sd = 5)
condition <- sample(
  c("DVT", "CVD", "DM", "Pneumonia", "Dementia"),
  size = 100, replace = TRUE, prob = c(0.01, 0.20, 0.4, 0.35, 0.05)
)
data <- data.frame(gender = gender, exposure = exposure, age = age, condition = condition)
Using table1
Although the table1 vignette provides an in-depth exploration of all table1 functionality, we review the function with a very simple example below. We summarize all available patient characteristics, stratifying by the exposure variable:
library(table1)
table1(~ gender + age + condition | exposure, data = data)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%) |
age | ||||
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4] |
condition | ||||
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%) |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%) |
Note that patients with particular conditions can be easily identified, sometimes to the precision of the exact patient (such as the patient with DVT pre-pandemic). The goal of this vignette is to show how to protect these patients’ privacy.
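As a minimal sketch of how such cells can be spotted, the snippet below re-enters the condition counts from the table above into a matrix and flags every non-zero cell with fewer than six patients. The object names counts and at_risk are ours for illustration and are not part of table1 or Rgemini.

```r
# Condition counts per exposure stratum, copied from the table above.
# matrix() fills column by column, so each row of numbers below is one stratum.
counts <- matrix(
  c(5, 2, 15, 12, 0,   # pandemic
    7, 2, 7, 11, 2,    # post-pandemic
    7, 0, 15, 14, 1),  # pre-pandemic
  nrow = 5,
  dimnames = list(
    condition = c("CVD", "Dementia", "DM", "Pneumonia", "DVT"),
    exposure = c("pandemic", "post-pandemic", "pre-pandemic")
  )
)

# A cell is at risk when it is non-zero but holds fewer than six patients.
at_risk <- counts > 0 & counts < 6

# Row and column positions of the at-risk cells.
which(at_risk, arr.ind = TRUE)
```

True zeros are deliberately excluded from the flag, in line with the suppression rules discussed in the next section.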
Cell suppression
Default
In order to perform cell suppression with table1(), Rgemini exports render_cell_suppression.default().
library(Rgemini)
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render = render_cell_suppression.default
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Missing | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) |
age | ||||
Mean (SD) | 69.527 (± 5.527) | 69.723 (± 5.695) | 70.220 (± 5.929) | 69.840 (± 5.677) |
condition | ||||
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%) |
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed) |
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed) |
Important conceptual notes:
- True zeros are not suppressed.
True zeros risk identifying no patient, and therefore we do not hide valuable information that does not bring with it the risk of re-identification.
- Cells that originally had 6 or more patients may also be suppressed.
Consider the pre-pandemic subgroup. Note that patients with CVD pre-pandemic were suppressed even though there were originally seven patients in that cell. Why is that? Had we only suppressed patients with DVT (one patient), then knowing the total number of patients pre-pandemic (37), we could have reverse-calculated the number of patients with DVT: 37 - (7 + 0 + 15 + 14) = 1.
Therefore the algorithm is designed to continue suppressing successively larger groups until the total number of suppressed patients is six or more, such that they cannot be reverse-calculated with reasonable precision. The default option shows the desired behaviour and is the most conservative with respect to patient privacy.
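The successive suppression described above can be sketched in base R as follows. This is an illustration under stated assumptions, not Rgemini's actual implementation: suppress_counts_sketch() is a hypothetical helper, it works on a single column of counts, and it breaks ties by row order, which may differ from the package (in the post-pandemic column above, the package suppressed DM rather than CVD).

```r
# Hypothetical helper (not part of Rgemini): suppress small cells in one
# column of counts, then keep suppressing the next-smallest non-zero cell
# until the suppressed cells add up to at least `threshold` patients.
suppress_counts_sketch <- function(counts, threshold = 6) {
  # Step 1: suppress non-zero cells below the threshold (true zeros stay).
  suppressed <- counts > 0 & counts < threshold

  # Step 2: widen the suppression while the hidden total is still too small.
  if (any(suppressed)) {
    while (sum(counts[suppressed]) < threshold &&
           any(!suppressed & counts > 0)) {
      candidates <- which(!suppressed & counts > 0)
      next_cell <- candidates[which.min(counts[candidates])]
      suppressed[next_cell] <- TRUE
    }
  }

  # Step 3: format the result as text.
  out <- as.character(counts)
  out[suppressed] <- "(suppressed)"
  names(out) <- names(counts)
  out
}

# Pre-pandemic condition counts from the first table.
pre_pandemic <- c(CVD = 7, Dementia = 0, DM = 15, Pneumonia = 14, DVT = 1)
suppress_counts_sketch(pre_pandemic)
```

For the pre-pandemic column, the sketch hides DVT (one patient) and then CVD (seven patients), while leaving the true zero for Dementia visible, which matches the rendered table above.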
Customizing output for continuous variables - change to medians
We can choose between the mean and the median for continuous variables by supplying the continuous_fn argument.
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render = render_cell_suppression.default,
  continuous_fn = 'median'
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Missing | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) |
age | ||||
Median [Q1, Q3] | 68.478 [65.871, 72.230] | 70.639 [65.050, 73.640] | 70.341 [65.556, 74.247] | 70.180 [65.335, 73.544] |
condition | ||||
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%) |
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed) |
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed) |
Note however that we can render medians without enabling cell suppression, if desired.
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.continuous = render_median.continuous
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%) |
age | ||||
Median [Q1, Q3] | 68.478 [65.871, 72.230] | 70.639 [65.050, 73.640] | 70.341 [65.556, 74.247] | 70.180 [65.335, 73.544] |
condition | ||||
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%) |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%) |
We can also specify a render function only for a particular variable type:
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.continuous = render_cell_suppression.continuous,
  continuous_fn = 'median'
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%) |
age | ||||
Median [Q1, Q3] | 68.478 [65.871, 72.230] | 70.639 [65.050, 73.640] | 70.341 [65.556, 74.247] | 70.180 [65.335, 73.544] |
condition | ||||
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%) |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%) |
Finer control
render_cell_suppression.default is a wrapper around individual render functions for each covariate datatype. For finer control, use the primary cell suppression functionality as implemented in the following render functions:
- render_cell_suppression.categorical()
- render_cell_suppression.continuous()
- render_cell_suppression.discrete()
- render_cell_suppression.missing()
- render_cell_suppression.strat()
We can apply cell suppression directly to covariates of particular data types by supplying these custom renderer functions.
Suppress categorical variables only
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%) |
age | ||||
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4] |
condition | ||||
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%) |
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed) |
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed) |
We may also want to display only a single level for binary variables (such as gender). We can do this through the optional single_level_binary argument.
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical,
  single_level_binary = TRUE
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%) |
age | ||||
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4] |
condition | ||||
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%) |
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed) |
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed) |
Suppress cells with counts fewer than six only
Although the default option shows the desired behaviour, Rgemini also exports a function that suppresses only those cells with counts fewer than six, if needed.
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_strict_cell_suppression.categorical
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%) |
age | ||||
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4] |
condition | ||||
CVD | < 6 obs. (suppressed) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%) |
Dementia | < 6 obs. (suppressed) | < 6 obs. (suppressed) | 0 (0.0%) | < 6 obs. (suppressed) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0.0%) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) |
Suppress missing values
Note that to suppress missing values, we use the render_cell_suppression.missing() function:
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_strict_cell_suppression.categorical,
  render.missing = render_cell_suppression.missing
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Missing | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) |
age | ||||
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4] |
condition | ||||
CVD | < 6 obs. (suppressed) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%) |
Dementia | < 6 obs. (suppressed) | < 6 obs. (suppressed) | 0 (0.0%) | < 6 obs. (suppressed) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0.0%) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) |
Note that there is an issue here. Although we were able to suppress cells with counts fewer than six where the gender was missing, it is very obvious what that count should be (i.e. it can be easily reverse-calculated). One approach to deal with this is to code the missing gender values as a new category for that variable, and apply the more conservative render_cell_suppression.categorical().
# Recode missing gender as an explicit category, then set the level order
data$gender[is.na(data$gender)] <- "Not Available"
data$gender <- factor(data$gender, levels = c("F", "M", "Not Available"))
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | (suppressed) | (suppressed) | (suppressed) | (suppressed) |
Not Available | (suppressed) | (suppressed) | (suppressed) | (suppressed) |
age | ||||
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4] |
condition | ||||
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%) |
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed) |
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed) |
Although it might not make much practical sense in this case, this approach should be taken for any categorical variable with missing values where we would like to employ the most conservative cell suppression strategy for patient privacy.
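The reverse calculation that motivates this recoding is simple arithmetic. A minimal sketch, using the gender counts shown in the tables above (the variable names are illustrative only):

```r
# Stratum sizes and visible gender counts, copied from the tables above.
n_total <- c(pandemic = 34, `post-pandemic` = 29, `pre-pandemic` = 37, Overall = 100)
n_female <- c(20, 15, 18, 53)
n_male <- c(12, 12, 18, 42)

# If only the "Missing" row is hidden, it is fully determined by the rest.
n_missing <- n_total - n_female - n_male
n_missing
```

Every suppressed "Missing" count is recovered exactly, which is why suppressing the missing row on its own offers little protection.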
Strategies to suppress continuous variables
It is also possible to suppress continuous variables if desired. See the example below.
Set up
set.seed(1)
continuous_data <- data.frame(
  "age" = rnorm(100, mean = 70, sd = 10),
  "laps" = abs(rnorm(100, mean = 1, sd = 1)),
  "nobel" = sample(c("nobel prize won", "nobel prize not won"), 100, replace = TRUE, prob = c(0.01, 0.99))
)
table1(~ age + laps | nobel, data = continuous_data)
| | nobel prize not won (N=99) | nobel prize won (N=1) | Overall (N=100) |
---|---|---|---|
age | |||
Mean (SD) | 70.9 (8.91) | 85.1 (NA) | 71.1 (8.98) |
Median [Min, Max] | 70.7 [47.9, 94.0] | 85.1 [85.1, 85.1] | 71.1 [47.9, 94.0] |
laps | |||
Mean (SD) | 1.09 (0.819) | 0.364 (NA) | 1.08 (0.818) |
Median [Min, Max] | 0.841 [0.0158, 3.31] | 0.364 [0.364, 0.364] | 0.832 [0.0158, 3.31] |
Suppress continuous variables
We use render_cell_suppression.continuous to suppress any summary statistics for groups with a size smaller than six.
table1(
  ~ age + laps | nobel,
  data = continuous_data,
  render.continuous = render_cell_suppression.continuous
)
| | nobel prize not won (N=99) | nobel prize won (N=1) | Overall (N=100) |
---|---|---|---|
age | |||
Mean (SD) | 70.947 (± 8.915) | < 6 obs. (suppressed) | 71.089 (± 8.982) |
laps | |||
Mean (SD) | 1.090 (± 0.819) | < 6 obs. (suppressed) | 1.083 (± 0.818) |
Note, however, that in this case the number of patients can still be read off the strata totals, so we suppress the counts in the strata as well using render_cell_suppression.strat.
Suppress counts in the strata
table1(
  ~ age + laps | nobel,
  data = continuous_data,
  render.continuous = render_cell_suppression.continuous,
  render.strat = render_cell_suppression.strat
)
| | nobel prize not won (N=99) | nobel prize won (N< 6 obs. (suppressed)) | Overall (N=100) |
---|---|---|---|
age | |||
Mean (SD) | 70.947 (± 8.915) | < 6 obs. (suppressed) | 71.089 (± 8.982) |
laps | |||
Mean (SD) | 1.090 (± 0.819) | < 6 obs. (suppressed) | 1.083 (± 0.818) |
Suppress “Overall” count
We encounter a similar issue here: the suppressed strata total can be reverse-calculated from the overall count (100 - 99 = 1). Therefore, in this scenario, we could consider removing the “Overall” column:
table1(
  ~ age + laps | nobel,
  data = continuous_data,
  render.continuous = render_cell_suppression.continuous,
  render.strat = render_cell_suppression.strat,
  overall = FALSE
)
| | nobel prize not won (N=99) | nobel prize won (N< 6 obs. (suppressed)) |
---|---|---|
age | ||
Mean (SD) | 70.947 (± 8.915) | < 6 obs. (suppressed) |
laps | ||
Mean (SD) | 1.090 (± 0.819) | < 6 obs. (suppressed) |
Important conceptual notes:
We don’t employ the same strategy for stratification variables as we do for categorical variables, where we successively suppress additional strata until the total number of suppressed individuals is six or more. This is because any categorical variable with six or more observations in each category will indirectly provide the total for that stratum. Therefore it is safest to apply strict suppression on the strata totals, and remove the “Overall” column as shown above.
Rounding
We can also specify the number of digits to which means, medians, and percentages are rounded using the digits argument, which table1 exposes by default.
table1(
  ~ gender + age + condition | exposure,
  data = data,
  digits = 2
)
| | pandemic (N= 34) | post-pandemic (N= 29) | pre-pandemic (N= 37) | Overall (N=100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%) |
age | ||||
Mean (SD) | 70 (5.5) | 70 (5.7) | 70 (5.9) | 70 (5.7) |
Median [Min, Max] | 68 [61, 83] | 71 [58, 80] | 70 [60, 82] | 70 [58, 83] |
condition | ||||
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%) |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%) |
However, the default behaviour of table1 is to round to a total number of digits (otherwise known as “significant digits”) given by the value of digits. By design, the Rgemini render functions instead round the digits after the decimal place to the value of digits. This means that when default table1 render functions are combined with Rgemini render functions, there will be a mismatch in the way that rounding is handled (by default).
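The difference between the two conventions can be illustrated with base R alone. The sketch below does not call table1 or Rgemini; it only contrasts significant digits with decimal places for a value taken from the tables above.

```r
x <- 69.52687  # mean age in the pandemic stratum, as shown above

# Five significant digits: digits are counted from the first non-zero digit.
sig <- signif(x, digits = 5)

# Five decimal places: digits are counted after the decimal point.
dec <- round(x, digits = 5)

# Padding with trailing zeros to a fixed number of decimal places.
padded <- formatC(53, format = "f", digits = 5)

print(sig)     # 69.527
print(dec, digits = 10)
print(padded)  # "53.00000"
```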
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical,
  # render.continuous = render.continuous, <- this is implicitly using table1's default continuous rendering function
  digits = 5
)
| | pandemic (N= 34) | post-pandemic (N= 29) | pre-pandemic (N= 37) | Overall (N= 100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.82353%) | 15 (51.72414%) | 18 (48.64865%) | 53 (53.00000%) |
M | (suppressed) | (suppressed) | (suppressed) | (suppressed) |
Not Available | (suppressed) | (suppressed) | (suppressed) | (suppressed) |
age | ||||
Mean (SD) | 69.527 (5.5269) | 69.723 (5.6948) | 70.220 (5.9286) | 69.840 (5.6769) |
Median [Min, Max] | 68.478 [61.378, 83.383] | 70.639 [57.982, 79.651] | 70.341 [60.088, 82.320] | 70.180 [57.982, 83.383] |
condition | ||||
CVD | (suppressed) | 7 (24.13793%) | (suppressed) | 19 (19.00000%) |
Dementia | (suppressed) | (suppressed) | 0 (0.00000%) | (suppressed) |
DM | 15 (44.11765%) | (suppressed) | 15 (40.54054%) | 37 (37.00000%) |
Pneumonia | 12 (35.29412%) | 11 (37.93103%) | 14 (37.83784%) | 37 (37.00000%) |
DVT | 0 (0.00000%) | (suppressed) | (suppressed) | (suppressed) |
Note above that continuous variables (which use the default table1 render function) are rounded to five significant digits, while the categorical variables (which use Rgemini's render function) are rounded to five digits after the decimal place. While it is generally discouraged to use cell suppression for only some variable types and not others, we can synchronize the behaviour of the two render functions by supplying the argument rounding.fn = round_pad, which tells the table1 render functions to round to the number of decimal places given by digits.
table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical,
  # render.continuous = render.continuous, <- this is implicit
  digits = 5,
  rounding.fn = round_pad
)
| | pandemic (N= 34) | post-pandemic (N= 29) | pre-pandemic (N= 37) | Overall (N= 100) |
---|---|---|---|---|
gender | ||||
F | 20 (58.82353%) | 15 (51.72414%) | 18 (48.64865%) | 53 (53.00000%) |
M | (suppressed) | (suppressed) | (suppressed) | (suppressed) |
Not Available | (suppressed) | (suppressed) | (suppressed) | (suppressed) |
age | ||||
Mean (SD) | 69.52687 (5.52689) | 69.72315 (5.69476) | 70.22003 (5.92857) | 69.84026 (5.67689) |
Median [Min, Max] | 68.47784 [61.37828, 83.38316] | 70.63923 [57.98168, 79.65122] | 70.34099 [60.08826, 82.32028] | 70.18004 [57.98168, 83.38316] |
condition | ||||
CVD | (suppressed) | 7 (24.13793%) | (suppressed) | 19 (19.00000%) |
Dementia | (suppressed) | (suppressed) | 0 (0.00000%) | (suppressed) |
DM | 15 (44.11765%) | (suppressed) | 15 (40.54054%) | 37 (37.00000%) |
Pneumonia | 12 (35.29412%) | 11 (37.93103%) | 14 (37.83784%) | 37 (37.00000%) |
DVT | 0 (0.00000%) | (suppressed) | (suppressed) | (suppressed) |
Calculating standardized mean differences
One primary difference between tableone and table1 is that tableone can compute and display standardized mean differences (SMDs), whereas table1 does not do so out of the box. To enable this with table1, Rgemini exports max_pairwise_smd().
Note that for a stratification variable with two levels, this corresponds to the absolute value of the SMD (no directionality). For a stratification with more than two levels, this is the maximum pairwise SMD.
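To make the definition concrete, here is a rough base R sketch for a continuous variable. It assumes the common pooled-standard-deviation form of the SMD, |mean_a - mean_b| / sqrt((var_a + var_b) / 2). The helper names are hypothetical, and Rgemini's max_pairwise_smd() may differ in its details (for example in how it handles categorical variables or missing values).

```r
# Hypothetical helper: absolute SMD between two numeric vectors,
# using the average of the two group variances as the pooled variance.
smd_continuous_sketch <- function(a, b) {
  pooled_sd <- sqrt((var(a, na.rm = TRUE) + var(b, na.rm = TRUE)) / 2)
  abs(mean(a, na.rm = TRUE) - mean(b, na.rm = TRUE)) / pooled_sd
}

# Hypothetical helper: largest SMD over all pairs of strata.
# `x` is a named list holding one numeric vector per stratum.
max_pairwise_smd_sketch <- function(x, ...) {
  pairs <- combn(seq_along(x), 2, simplify = FALSE)
  smds <- vapply(
    pairs,
    function(p) smd_continuous_sketch(x[[p[1]]], x[[p[2]]]),
    numeric(1)
  )
  max(smds)
}

# Three strata: every pair is compared and the largest SMD is returned.
groups <- split(mtcars$disp, mtcars$cyl)
max_pairwise_smd_sketch(groups)
```

With only two strata there is a single pair, so the result reduces to the absolute SMD between the two groups.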
It can be supplied to table1 through the extra.col argument, which takes a named list of extra columns to append to the table. Usage is as follows:
table1(
  ~ gender + age + condition | exposure,
  data = data,
  extra.col = list("Maximum Standardized Mean Difference" = max_pairwise_smd)
)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) | Maximum Standardized Mean Difference |
---|---|---|---|---|---|
gender | | | | | 0.295 |
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) | |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) | |
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%) | |
age | | | | | 0.121 |
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68) | |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4] | |
condition | | | | | 0.574 |
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%) | |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%) | |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%) | |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) | |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%) | |
Replicating table1 functionality
Note that since these extensions are passed as arguments to table1, the entire original API for table1 is still exposed. Therefore we replicate some of the same examples here.
Labelling
Change row labels
We change the row labels below. In order to do this, we need to use the default (i.e. “non-formula”) interface to table1:
labels <- list(
  variables = list(age = "Age (years)", gender = "Sex", condition = "MRDx")
)
strata <- split(data, data$exposure)
table1(strata, labels = labels)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) |
---|---|---|---|
Age (years) | |||
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] |
Sex | |||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) |
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) |
MRDx | |||
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) |
Change strata labels
Next we change the strata labels. To change strata labels, we change the names of the levels of the factor variable corresponding to the strata:
levels(data$exposure) <- c("During Pandemic", "After Pandemic", "Before Pandemic")
strata <- split(data, data$exposure)
table1(strata, labels = labels)
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) |
---|---|---|---|
Age (years) | |||
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] |
Sex | |||
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) |
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) |
MRDx | |||
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) |
Add grouping to strata and standardized mean differences
Next we also add grouping to the strata variable, specifying group names in the labels list, as well as a groupspan, which is used to span a label over multiple groups in the table1 call. We also add standardized mean differences using extra.col to put it all together.
labels$groups <- list("After COVID", "")
extra_col <- list()
extra_col$`SMD` <- max_pairwise_smd
table1(strata, labels = labels, groupspan = c(2, 1), extra.col = extra_col)
| | After COVID | | | |
---|---|---|---|---|
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | SMD |
Age (years) | | | | 0.121 |
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | |
Sex | | | | 0.295 |
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | |
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | |
MRDx | | | | 0.574 |
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | |
Putting it all together
Now we combine custom labelling, cell suppression, and a column of standardized mean differences as follows:
table1(
  strata,
  labels = labels,
  groupspan = c(2, 2),
  render = render_cell_suppression.default,
  extra.col = list("SMD" = max_pairwise_smd)
)
| | After COVID | | | |
---|---|---|---|---|
| | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | SMD |
Age (years) | | | | 0.121 |
Mean (SD) | 69.527 (± 5.527) | 69.723 (± 5.695) | 70.220 (± 5.929) | |
Sex | | | | 0.295 |
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | |
M | (suppressed) | (suppressed) | (suppressed) | |
Not Available | (suppressed) | (suppressed) | (suppressed) | |
MRDx | | | | 0.574 |
CVD | (suppressed) | 7 (24.1%) | (suppressed) | |
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | |
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | |
DVT | 0 (0.0%) | (suppressed) | (suppressed) | |
Customization and extending functionality
Note that table1 offers much more flexibility than what is briefly described above with respect to labelling. There is even the possibility to fine-tune the table’s appearance with CSS. See more examples in the table1 vignette.
There is also the possibility to define new render functions and new extra columns.
Creating render functions
Note that render functions must be designed to take as input a single argument corresponding to a vector (numeric or factor), and output a named character vector, where the first element is unnamed and always "". See the examples below for how the output should be formatted:
library(dplyr)
x <- mtcars$am %>% as.factor()
render_cell_suppression.categorical(x)
##               0            1
## "" "19 (59.4%)" "13 (40.6%)"
render_cell_suppression.continuous(1:20)
##             Mean (SD)
## "" "10.500 (± 5.916)"
render_cell_suppression.missing(c(1:19, NA))
##                    Missing
## "" "< 6 obs. (suppressed)"
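Following the same output convention, a new render function could look like the sketch below. It is a hypothetical example rather than an Rgemini function: render_median_sketch(), its threshold and digits arguments, and the "(suppressed)" label are all assumptions made for illustration.

```r
# Hypothetical render function: reports the median with quartiles,
# or a suppression label when fewer than `threshold` values are observed.
render_median_sketch <- function(x, ..., threshold = 6, digits = 1) {
  x <- x[!is.na(x)]
  if (length(x) < threshold) {
    value <- "(suppressed)"
  } else {
    q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
    q_txt <- formatC(q, format = "f", digits = digits)
    value <- sprintf("%s [%s, %s]", q_txt[2], q_txt[1], q_txt[3])
  }
  # First element is unnamed and empty; the second carries the row label.
  c("", "Median [Q1, Q3]" = value)
}

render_median_sketch(1:20)
render_median_sketch(1:3)
```

A function of this shape could then be supplied to table1 through the render.continuous argument, in the same way as the Rgemini render functions shown earlier.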
Creating extra columns
Note that functions defined for extra columns must always take a named list as the first argument, and an ellipsis (...) as the last. The named list corresponds to a variable in the data split by the stratifying variable, such as in the example below:
x <- split(mtcars$disp, mtcars$am)
my_max_col <- function(x, ...) {
  lapply(x, max) %>% unlist() %>% max()
}
my_max_col(x)
## [1] 472
Now this newly defined function can be added as an extra column.
| | 0 (N=19) | 1 (N=13) | Overall (N=32) | My Max |
---|---|---|---|---|
disp | | | | 472 |
Mean (SD) | 290 (110) | 144 (87.2) | 231 (124) | |
Median [Min, Max] | 276 [120, 472] | 120 [71.1, 351] | 196 [71.1, 472] | |