Cell Suppression and Table 1

Introduction

Cell suppression is a technique to withhold or “suppress” confidential or identifying data in tabular formats.

It is common practice in clinical and epidemiological research to present a table of baseline patient characteristics in the study population. This is commonly referred to as Table 1.

In order to preserve patient privacy and reduce the risk of re-identification, GEMINI employs a cell suppression strategy for any Table 1 that is published or shared externally.

The general rule is to hide any subgroup containing between one and five patients (i.e. fewer than six). True zeros and some other special cases are discussed in detail below.

Note that a variety of methods exist to produce a Table 1, for example the table1 and tableone R packages.

Both packages provide their own functionality, but neither implements cell suppression. Because of its design, table1 was chosen as the candidate for this added functionality, which is implemented in the various functions described below.

Setup

Dummy data

For this vignette, we will create a dummy dataset to summarize:

set.seed(3)

gender <- c(sample(c("M", "F"), size = 95, replace = TRUE), rep(NA, times = 5))

exposure <- sample(
  c("pre-pandemic", "pandemic", "post-pandemic"),
  size = 100, replace = TRUE
) |> factor(levels = c("pre-pandemic", "pandemic", "post-pandemic"))

age <- rnorm(100, mean = 70, sd = 5)

condition <- sample(
  c("DVT", "CVD", "DM", "Pneumonia", "Dementia"),
  size = 100, replace = TRUE, prob = c(0.01, 0.20, 0.4, 0.35, 0.05)
)

data <- data.frame(gender = gender, exposure = exposure, age = age, condition = condition)

Using `table1`

Although the table1 vignette provides an in-depth exploration of all table1 functionality, we will review the function with a very simple example below. We summarize all available patient characteristics, stratifying by the exposure variable:

library(table1)

table1(~ gender + age + condition | exposure, data = data)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)
Missing	1 (2.7%)	2 (5.9%)	2 (6.9%)	5 (5.0%)
age
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)	69.8 (5.68)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]	70.2 [58.0, 83.4]
condition
CVD	7 (18.9%)	5 (14.7%)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)	37 (37.0%)
DVT	1 (2.7%)	0 (0%)	2 (6.9%)	3 (3.0%)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0%)	2 (5.9%)	2 (6.9%)	4 (4.0%)

Note that patients with particular conditions can be easily identified, sometimes to the precision of the exact patient (such as the patient with DVT pre-pandemic). The goal of this vignette is to show how to protect these patients’ privacy.
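As a rough illustration of which cells are at risk, the sketch below flags every non-zero count below six in the condition rows of the table above. The counts are typed in by hand from that table, and the object names (counts, at_risk) are only illustrative; this is not how Rgemini identifies cells internally.

```r
# Condition counts copied by hand from the unsuppressed table above
counts <- matrix(
  c( 7,  5,  7,
    15, 15,  7,
     1,  0,  2,
    14, 12, 11,
     0,  2,  2),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("CVD", "DM", "DVT", "Pneumonia", "Dementia"),
    c("pre-pandemic", "pandemic", "post-pandemic")
  )
)

# A cell is at risk when it holds between one and five patients;
# true zeros are deliberately not flagged
at_risk <- counts > 0 & counts < 6

sum(at_risk)                            # number of at-risk cells: 5
rownames(counts)[rowSums(at_risk) > 0]  # "CVD" "DVT" "Dementia"
```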

Cell suppression

Default

In order to perform cell suppression with table1(), Rgemini exports render_cell_suppression.default().

library(Rgemini)

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render = render_cell_suppression.default
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)

Missing	< 6 obs. (suppressed)	< 6 obs. (suppressed)	< 6 obs. (suppressed)	< 6 obs. (suppressed)
age
Mean (SD)	70.220 (± 5.929)	69.527 (± 5.527)	69.723 (± 5.695)	69.840 (± 5.677)
condition
CVD	(suppressed)	(suppressed)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	(suppressed)	37 (37.0%)
DVT	(suppressed)	0 (0.0%)	(suppressed)	(suppressed)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0.0%)	(suppressed)	(suppressed)	(suppressed)

Important conceptual notes:

True zeros are not suppressed.

A true zero cannot identify any patient. We therefore do not hide this valuable information, since it carries no risk of re-identification.

Cells that originally had six or more patients may also be suppressed.

Consider the pre-pandemic subgroup. Patients with CVD pre-pandemic were suppressed even though there were originally seven patients in that cell. Why is that? Had we suppressed only the DVT cell (one patient), its count could have been reverse-calculated from the total number of pre-pandemic patients and the remaining visible cells: 37 - 7 - 15 - 14 - 0 = 1.

Therefore the algorithm continues to suppress successively larger groups until the total number of suppressed patients is six or more, so that individual counts cannot be reverse-calculated with reasonable precision. The default option implements this desired behaviour and is the most conservative with respect to patient privacy.
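The successive suppression described above can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the Rgemini implementation: suppress_mask() is a hypothetical helper, and ties between equally sized cells are broken by position here, which may differ from what Rgemini does.

```r
# Hypothetical helper (not part of Rgemini): returns TRUE for cells to hide
suppress_mask <- function(counts, threshold = 6) {
  # Step 1: hide every non-zero count below the threshold; true zeros stay visible
  hide <- counts > 0 & counts < threshold

  # Step 2: while the hidden cells sum to fewer than `threshold` patients,
  # also hide the smallest non-zero cell that is still visible
  while (any(hide) && sum(counts[hide]) < threshold && any(!hide & counts > 0)) {
    candidates <- which(!hide & counts > 0)
    hide[candidates[which.min(counts[candidates])]] <- TRUE
  }
  hide
}

# Condition counts for the pre-pandemic column of the table above
pre_pandemic <- c(CVD = 7, DM = 15, DVT = 1, Pneumonia = 14, Dementia = 0)
mask <- suppress_mask(pre_pandemic)

names(pre_pandemic)[mask]  # "CVD" "DVT"
sum(pre_pandemic[mask])    # 8 hidden patients in total
```

For the pre-pandemic column this matches the pattern in the table above: DVT is hidden first, and CVD is hidden as well so that the hidden cells sum to at least six.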

Note that in this case the number of DM patients in the post-pandemic group can still be reverse-calculated from the “Overall” column: the total number of DM patients is 37 and the other two groups have 15 each, leaving 37 - 15 - 15 = 7. To prevent this, we can remove the “Overall” column by setting overall = FALSE in the table1() call. See the Suppress “Overall” count section for an example.

Customizing output for continuous variables - change to medians

We can choose whether continuous variables are summarized by their mean or their median through the continuous_fn argument.

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render = render_cell_suppression.default,
  continuous_fn = "median"
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)

Missing	< 6 obs. (suppressed)	< 6 obs. (suppressed)	< 6 obs. (suppressed)	< 6 obs. (suppressed)
age
Median [Q1, Q3]	70.341 [65.556, 74.247]	68.478 [65.871, 72.230]	70.639 [65.050, 73.640]	70.180 [65.335, 73.544]
condition
CVD	(suppressed)	(suppressed)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	(suppressed)	37 (37.0%)
DVT	(suppressed)	0 (0.0%)	(suppressed)	(suppressed)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0.0%)	(suppressed)	(suppressed)	(suppressed)

Note however that we can render medians without enabling cell suppression, if desired.

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.continuous = render_median.continuous
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)
Missing	1 (2.7%)	2 (5.9%)	2 (6.9%)	5 (5.0%)
age
Median [Q1, Q3]	70.341 [65.556, 74.247]	68.478 [65.871, 72.230]	70.639 [65.050, 73.640]	70.180 [65.335, 73.544]
condition
CVD	7 (18.9%)	5 (14.7%)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)	37 (37.0%)
DVT	1 (2.7%)	0 (0%)	2 (6.9%)	3 (3.0%)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0%)	2 (5.9%)	2 (6.9%)	4 (4.0%)

We can also specify a render function only for a particular variable type:

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.continuous = render_cell_suppression.continuous,
  continuous_fn = "median"
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)
Missing	1 (2.7%)	2 (5.9%)	2 (6.9%)	5 (5.0%)
age
Median [Q1, Q3]	70.341 [65.556, 74.247]	68.478 [65.871, 72.230]	70.639 [65.050, 73.640]	70.180 [65.335, 73.544]
condition
CVD	7 (18.9%)	5 (14.7%)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)	37 (37.0%)
DVT	1 (2.7%)	0 (0%)	2 (6.9%)	3 (3.0%)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0%)	2 (5.9%)	2 (6.9%)	4 (4.0%)

Finer control

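render_cell_suppression.default is a wrapper around individual render functions for each covariate data type. For finer control, use the individual render functions directly. Those used in this vignette include render_cell_suppression.categorical, render_cell_suppression.continuous, render_cell_suppression.missing and render_cell_suppression.strat, each of which is demonstrated below.

Supplying one of these custom render functions to table1() applies cell suppression only to covariates of the corresponding data type.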

Suppress categorical variables only

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)
Missing	1 (2.7%)	2 (5.9%)	2 (6.9%)	5 (5.0%)
age
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)	69.8 (5.68)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]	70.2 [58.0, 83.4]
condition
CVD	(suppressed)	(suppressed)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	(suppressed)	37 (37.0%)
DVT	(suppressed)	0 (0.0%)	(suppressed)	(suppressed)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0.0%)	(suppressed)	(suppressed)	(suppressed)

We may also want to display only a single level for binary variables (such as gender). We can do this through the optional single_level_binary argument.

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical,
  single_level_binary = TRUE
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
Missing	1 (2.7%)	2 (5.9%)	2 (6.9%)	5 (5.0%)
age
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)	69.8 (5.68)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]	70.2 [58.0, 83.4]
condition
CVD	(suppressed)	(suppressed)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	(suppressed)	37 (37.0%)
DVT	(suppressed)	0 (0.0%)	(suppressed)	(suppressed)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0.0%)	(suppressed)	(suppressed)	(suppressed)

Suppress cells with counts fewer than six only

Although the default option shows the desired behaviour, Rgemini also exports a function that suppresses only those cells with counts fewer than six, should that be needed.

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_strict_cell_suppression.categorical
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)
Missing	1 (2.7%)	2 (5.9%)	2 (6.9%)	5 (5.0%)
age
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)	69.8 (5.68)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]	70.2 [58.0, 83.4]
condition
CVD	7 (18.9%)	< 6 obs. (suppressed)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)	37 (37.0%)
DVT	< 6 obs. (suppressed)	0 (0.0%)	< 6 obs. (suppressed)	< 6 obs. (suppressed)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0.0%)	< 6 obs. (suppressed)	< 6 obs. (suppressed)	< 6 obs. (suppressed)

Suppress missing values

Note that to suppress missing values, we use the render_cell_suppression.missing() function:

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_strict_cell_suppression.categorical,
  render.missing = render_cell_suppression.missing
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)

Missing	< 6 obs. (suppressed)	< 6 obs. (suppressed)	< 6 obs. (suppressed)	< 6 obs. (suppressed)
age
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)	69.8 (5.68)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]	70.2 [58.0, 83.4]
condition
CVD	7 (18.9%)	< 6 obs. (suppressed)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)	37 (37.0%)
DVT	< 6 obs. (suppressed)	0 (0.0%)	< 6 obs. (suppressed)	< 6 obs. (suppressed)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0.0%)	< 6 obs. (suppressed)	< 6 obs. (suppressed)	< 6 obs. (suppressed)

Note that there is an issue here. Although we were able to suppress the cells with fewer than six patients where gender was missing, it is obvious what those counts should be, because they can easily be reverse-calculated (for example, 37 - 18 - 18 = 1 in the pre-pandemic group). One approach to deal with this is to code the missing gender values as a new category of that variable and apply the more conservative render_cell_suppression.categorical().

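# recode missing gender as its own category; the explicit levels fix the display order
data$gender <- as.character(data$gender)
data$gender[is.na(data$gender)] <- "Not Available"
data$gender <- factor(data$gender, levels = c("F", "M", "Not Available"))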

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	(suppressed)	(suppressed)	(suppressed)	(suppressed)
Not Available	(suppressed)	(suppressed)	(suppressed)	(suppressed)
age
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)	69.8 (5.68)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]	70.2 [58.0, 83.4]
condition
CVD	(suppressed)	(suppressed)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	(suppressed)	37 (37.0%)
DVT	(suppressed)	0 (0.0%)	(suppressed)	(suppressed)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0.0%)	(suppressed)	(suppressed)	(suppressed)

Although it might not make much practical sense in this case, this approach should be taken for any categorical variable with missing values where we would like to employ the most conservative cell suppression strategy for patient privacy.
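When several categorical variables contain missing values, the recoding step can be wrapped in a small helper. The sketch below uses base R only; recode_missing() and its label argument are hypothetical names chosen for illustration and are not part of Rgemini or table1.

```r
# Hypothetical helper: turn NA into an explicit category placed after the observed levels
recode_missing <- function(x, label = "Not Available") {
  x <- as.character(x)
  x[is.na(x)] <- label
  observed <- sort(setdiff(unique(x), label))
  # only add the extra level when at least one value was actually missing
  factor(x, levels = c(observed, if (label %in% x) label))
}

g <- recode_missing(c("M", "F", NA, "F"))
levels(g)  # "F" "M" "Not Available"
table(g)
```

The helper could then be applied to a set of columns, for example with data[cols] <- lapply(data[cols], recode_missing), where cols is a character vector of column names chosen by the analyst.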

Strategies to suppress continuous variables

It is also possible to suppress continuous variables if desired. See the example below.

Set up

set.seed(1)

continuous_data <- data.frame(
  "age" = rnorm(100, mean = 70, sd = 10),
  "laps" = abs(rnorm(100, mean = 1, sd = 1)),
  "nobel" = sample(c("nobel prize won", "nobel prize not won"), 100, replace = TRUE, prob = c(0.01, 0.99))
)

table1(~ age + laps | nobel, data = continuous_data)

	nobel prize not won (N=99)	nobel prize won (N=1)	Overall (N=100)
age
Mean (SD)	70.9 (8.91)	85.1 (NA)	71.1 (8.98)
Median [Min, Max]	70.7 [47.9, 94.0]	85.1 [85.1, 85.1]	71.1 [47.9, 94.0]
laps
Mean (SD)	1.09 (0.819)	0.364 (NA)	1.08 (0.818)
Median [Min, Max]	0.841 [0.0158, 3.31]	0.364 [0.364, 0.364]	0.832 [0.0158, 3.31]

Suppress continuous variables

We use render_cell_suppression.continuous to suppress any summary statistics for groups with a size smaller than six.

table1(
  ~ age + laps | nobel,
  data = continuous_data,
  render.continuous = render_cell_suppression.continuous
)

	nobel prize not won (N=99)	nobel prize won (N=1)	Overall (N=100)
age
Mean (SD)	70.947 (± 8.915)	< 6 obs. (suppressed)	71.089 (± 8.982)
laps
Mean (SD)	1.090 (± 0.819)	< 6 obs. (suppressed)	1.083 (± 0.818)

Note, however, that in this case the strata totals in the header still show exactly how many patients are in the suppressed group (N=1), so we also suppress the counts in the strata using render_cell_suppression.strat.

Suppress counts in the strata

table1(
  ~ age + laps | nobel,
  data = continuous_data,
  render.continuous = render_cell_suppression.continuous,
  render.strat = render_cell_suppression.strat
)

	nobel prize not won (N=99)	nobel prize won (N< 6 obs. (suppressed))	Overall (N=100)
age
Mean (SD)	70.947 (± 8.915)	< 6 obs. (suppressed)	71.089 (± 8.982)
laps
Mean (SD)	1.090 (± 0.819)	< 6 obs. (suppressed)	1.083 (± 0.818)

Suppress “Overall” count

We encounter a similar issue here: the suppressed strata total can be reverse-calculated from the overall count (100 - 99 = 1). In this scenario we should therefore consider removing the “Overall” column:

table1(
  ~ age + laps | nobel,
  data = continuous_data,
  render.continuous = render_cell_suppression.continuous,
  render.strat = render_cell_suppression.strat,
  overall = FALSE
)

	nobel prize not won (N=99)	nobel prize won (N< 6 obs. (suppressed))
age
Mean (SD)	70.947 (± 8.915)	< 6 obs. (suppressed)
laps
Mean (SD)	1.090 (± 0.819)	< 6 obs. (suppressed)

Important conceptual notes:

We don’t employ the same strategy for stratification variables as we do for categorical variables, where we successively suppress additional strata until the total number of suppressed individuals is six or more. This is because any categorical variable with six or more observations in each category will indirectly provide the total for that stratum. Therefore it is safest to apply strict suppression on the strata totals, and remove the “Overall” column as shown above.

Rounding

We can also specify the number of digits used when rounding means, medians, and percentages through the digits argument, which table1 exposes by default.

table1(
  ~ gender + age + condition | exposure,
  data = data,
  digits = 2
)

	pre-pandemic (N= 37)	pandemic (N= 34)	post-pandemic (N= 29)	Overall (N=100)
gender
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)
Not Available	1 (2.7%)	2 (5.9%)	2 (6.9%)	5 (5.0%)
age
Mean (SD)	70 (5.9)	70 (5.5)	70 (5.7)	70 (5.7)
Median [Min, Max]	70 [60, 82]	68 [61, 83]	71 [58, 80]	70 [58, 83]
condition
CVD	7 (18.9%)	5 (14.7%)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)	37 (37.0%)
DVT	1 (2.7%)	0 (0%)	2 (6.9%)	3 (3.0%)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0%)	2 (5.9%)	2 (6.9%)	4 (4.0%)

However, the default behaviour of table1 is to round to digits significant digits (the total number of digits), whereas the Rgemini render functions are designed to round to digits decimal places. This means that when default table1 render functions are combined with Rgemini render functions, there is a mismatch in the way rounding is handled by default.
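The difference between the two rounding conventions can be seen with base R's signif() and round(). This is only an illustration of the concept, using the overall-style mean and SD values that appear in this vignette's tables; it does not call the table1 or Rgemini rounding code.

```r
# Example values taken from the pre-pandemic age summary in this vignette
mean_age <- 70.22003
sd_age   <- 5.92857

# Five significant digits: the total number of digits is fixed
sig <- signif(c(mean_age, sd_age), digits = 5)
sig  # 70.2200 5.9286

# Five decimal places: the number of digits after the decimal point is fixed
dec <- round(c(mean_age, sd_age), digits = 5)
sprintf("%.5f", dec)  # "70.22003" "5.92857"
```

The table1() call that follows shows the same mismatch in a rendered table.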

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical,
  # render.continuous = render.continuous, <- this is implicitly using table1's default continuous rendering function
  digits = 5
)

	pre-pandemic (N= 37)	pandemic (N= 34)	post-pandemic (N= 29)	Overall (N= 100)
gender
F	18 (48.64865%)	20 (58.82353%)	15 (51.72414%)	53 (53.00000%)
M	(suppressed)	(suppressed)	(suppressed)	(suppressed)
Not Available	(suppressed)	(suppressed)	(suppressed)	(suppressed)
age
Mean (SD)	70.220 (5.9286)	69.527 (5.5269)	69.723 (5.6948)	69.840 (5.6769)
Median [Min, Max]	70.341 [60.088, 82.320]	68.478 [61.378, 83.383]	70.639 [57.982, 79.651]	70.180 [57.982, 83.383]
condition
CVD	(suppressed)	(suppressed)	7 (24.13793%)	19 (19.00000%)
DM	15 (40.54054%)	15 (44.11765%)	(suppressed)	37 (37.00000%)
DVT	(suppressed)	0 (0.00000%)	(suppressed)	(suppressed)
Pneumonia	14 (37.83784%)	12 (35.29412%)	11 (37.93103%)	37 (37.00000%)
Dementia	0 (0.00000%)	(suppressed)	(suppressed)	(suppressed)

Note above that the continuous variables (which use the default table1 render function) are rounded to five significant digits, while the categorical variables (which use Rgemini's render function) are rounded to five digits after the decimal place. While applying cell suppression to only some variable types is generally discouraged, we can synchronize the behaviour of the two render functions by supplying the argument rounding.fn = round_pad, which tells the table1 render functions to round to digits decimal places.

table1(
  ~ gender + age + condition | exposure,
  data = data,
  render.categorical = render_cell_suppression.categorical,
  # render.continuous = render.continuous, <- this is implicit
  digits = 5,
  rounding.fn = round_pad
)

	pre-pandemic (N= 37)	pandemic (N= 34)	post-pandemic (N= 29)	Overall (N= 100)
gender
F	18 (48.64865%)	20 (58.82353%)	15 (51.72414%)	53 (53.00000%)
M	(suppressed)	(suppressed)	(suppressed)	(suppressed)
Not Available	(suppressed)	(suppressed)	(suppressed)	(suppressed)
age
Mean (SD)	70.22003 (5.92857)	69.52687 (5.52689)	69.72315 (5.69476)	69.84026 (5.67689)
Median [Min, Max]	70.34099 [60.08826, 82.32028]	68.47784 [61.37828, 83.38316]	70.63923 [57.98168, 79.65122]	70.18004 [57.98168, 83.38316]
condition
CVD	(suppressed)	(suppressed)	7 (24.13793%)	19 (19.00000%)
DM	15 (40.54054%)	15 (44.11765%)	(suppressed)	37 (37.00000%)
DVT	(suppressed)	0 (0.00000%)	(suppressed)	(suppressed)
Pneumonia	14 (37.83784%)	12 (35.29412%)	11 (37.93103%)	37 (37.00000%)
Dementia	0 (0.00000%)	(suppressed)	(suppressed)	(suppressed)

Calculating standardized mean differences

One primary difference between tableone and table1 is that tableone can compute and display standardized mean differences (SMDs), whereas table1 does not provide this out of the box. To add this capability to table1, Rgemini exports max_pairwise_smd().

Note that for a stratification variable with two levels, this corresponds to the absolute value of the SMD (no directionality). For a stratification with more than two levels, this is the maximum pairwise SMD.

It can be supplied to table1 through the extra.col argument, which takes a named list of extra columns to append to the table. Usage is as follows:

table1(
  ~ gender + age + condition | exposure,
  data = data,
  extra.col = list("Maximum Standardized Mean Difference" = max_pairwise_smd)
)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)	Overall (N=100)	Maximum Standardized Mean Difference
gender					0.295
F	18 (48.6%)	20 (58.8%)	15 (51.7%)	53 (53.0%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)	42 (42.0%)
Not Available	1 (2.7%)	2 (5.9%)	2 (6.9%)	5 (5.0%)
age					0.121
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)	69.8 (5.68)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]	70.2 [58.0, 83.4]
condition					0.574
CVD	7 (18.9%)	5 (14.7%)	7 (24.1%)	19 (19.0%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)	37 (37.0%)
DVT	1 (2.7%)	0 (0%)	2 (6.9%)	3 (3.0%)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)	37 (37.0%)
Dementia	0 (0%)	2 (5.9%)	2 (6.9%)	4 (4.0%)
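To make the idea of a maximum pairwise SMD concrete, the sketch below computes it for a continuous variable in base R. It assumes the common definition of the SMD for two groups, the absolute difference in means divided by the square root of the average of the two variances. Both helpers (smd_continuous and max_pairwise_smd_sketch) are hypothetical and are not the Rgemini implementation, which may differ, particularly for categorical variables.

```r
# Hypothetical helper: SMD between two numeric vectors (assumed definition, see above)
smd_continuous <- function(a, b) {
  abs(mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)
}

# Hypothetical helper: largest SMD over all pairs of groups
max_pairwise_smd_sketch <- function(groups) {
  pairs <- combn(seq_along(groups), 2)  # one column per pair of groups
  smds <- apply(pairs, 2, function(idx) {
    smd_continuous(groups[[idx[1]]], groups[[idx[2]]])
  })
  max(smds)
}

# Toy data: three groups with means 2, 3 and 5 and a variance of 1 each
groups <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))
max_pairwise_smd_sketch(groups)  # 3, from the pair a and c
```

With only two groups there is a single pair, so the result reduces to the absolute value of the SMD, as noted above.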

Replicating `table1` functionality

Note that since these extensions are passed as arguments to table1, the entire original API for table1 is still exposed. Therefore we replicate some of the same examples here.

Labelling

Change row labels

We change the row labels below. To do this, we need to use the default (i.e. “non-formula”) interface to table1:

labels <- list(
  variables = list(age = "Age (years)", gender = "Sex", condition = "MRDx")
)

strata <- split(data, data$exposure)
table1(strata, labels = labels)

	pre-pandemic (N=37)	pandemic (N=34)	post-pandemic (N=29)
Age (years)
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]
Sex
F	18 (48.6%)	20 (58.8%)	15 (51.7%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)
Not Available	1 (2.7%)	2 (5.9%)	2 (6.9%)
MRDx
CVD	7 (18.9%)	5 (14.7%)	7 (24.1%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)
DVT	1 (2.7%)	0 (0%)	2 (6.9%)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)
Dementia	0 (0%)	2 (5.9%)	2 (6.9%)

Change strata labels

Next we change the strata labels. To do so, we rename the levels of the factor variable corresponding to the strata:

levels(data$exposure) <- c("Before Pandemic", "During Pandemic", "After Pandemic")
strata <- split(data, data$exposure)

table1(strata, labels = labels)

	Before Pandemic (N=37)	During Pandemic (N=34)	After Pandemic (N=29)
Age (years)
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]
Sex
F	18 (48.6%)	20 (58.8%)	15 (51.7%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)
Not Available	1 (2.7%)	2 (5.9%)	2 (6.9%)
MRDx
CVD	7 (18.9%)	5 (14.7%)	7 (24.1%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)
DVT	1 (2.7%)	0 (0%)	2 (6.9%)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)
Dementia	0 (0%)	2 (5.9%)	2 (6.9%)

Add grouping to strata and standardized mean differences

Next we add grouping to the strata variable by specifying group names in the labels list, together with a groupspan in the table1 call, which is used to span a label over multiple groups. We also add standardized mean differences using extra.col.

labels$groups <- list("", "Since COVID")

extra_col <- list()
extra_col$`SMD` <- max_pairwise_smd

table1(strata, labels = labels, groupspan = c(1, 2), extra.col = extra_col)

		Since COVID
	Before Pandemic (N=37)	During Pandemic (N=34)	After Pandemic (N=29)	SMD
Age (years)				0.121
Mean (SD)	70.2 (5.93)	69.5 (5.53)	69.7 (5.69)
Median [Min, Max]	70.3 [60.1, 82.3]	68.5 [61.4, 83.4]	70.6 [58.0, 79.7]
Sex				0.295
F	18 (48.6%)	20 (58.8%)	15 (51.7%)
M	18 (48.6%)	12 (35.3%)	12 (41.4%)
Not Available	1 (2.7%)	2 (5.9%)	2 (6.9%)
MRDx				0.574
CVD	7 (18.9%)	5 (14.7%)	7 (24.1%)
DM	15 (40.5%)	15 (44.1%)	7 (24.1%)
DVT	1 (2.7%)	0 (0%)	2 (6.9%)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)
Dementia	0 (0%)	2 (5.9%)	2 (6.9%)

Putting it all together

Now we combine custom labelling, cell suppression, and a column of standardized mean differences as follows:

table1(
  strata,
  labels = labels,
  groupspan = c(1, 2),
  render = render_cell_suppression.default,
  extra.col = list("SMD" = max_pairwise_smd)
)

		Since COVID
	Before Pandemic (N=37)	During Pandemic (N=34)	After Pandemic (N=29)	SMD
Age (years)				0.121
Mean (SD)	70.220 (± 5.929)	69.527 (± 5.527)	69.723 (± 5.695)
Sex				0.295
F	18 (48.6%)	20 (58.8%)	15 (51.7%)
M	(suppressed)	(suppressed)	(suppressed)
Not Available	(suppressed)	(suppressed)	(suppressed)
MRDx				0.574
CVD	(suppressed)	(suppressed)	7 (24.1%)
DM	15 (40.5%)	15 (44.1%)	(suppressed)
DVT	(suppressed)	0 (0.0%)	(suppressed)
Pneumonia	14 (37.8%)	12 (35.3%)	11 (37.9%)
Dementia	0 (0.0%)	(suppressed)	(suppressed)

Customization and extending functionality

Note that table1 offers much more flexibility than what is briefly described above with respect to labelling. It is even possible to fine-tune the table’s appearance with CSS. See the table1 vignette for more examples.

There is also the possibility to define new render functions and new extra columns.

Creating render functions

Note that render functions must be designed to take as input a single argument corresponding to a vector (numeric or factor), and output a named character vector, where the first element is unnamed and always "". See the examples below for how the output should be formatted:

library(dplyr)

# convert the binary variable to a factor so that it is rendered as categorical
x <- mtcars$am %>% as.factor()
render_cell_suppression.categorical(x)

##                         0            1 
##           "" "19 (59.4%)" "13 (40.6%)"

render_cell_suppression.continuous(1:20)

##                                           Mean (SD) 
##                        "" "10.500 (&plusmn; 5.916)"

render_cell_suppression.missing(c(1:19, NA))

##                                               Missing 
##                         "" "&lt; 6 obs. (suppressed)"
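Following the same contract, a new render function can be written from scratch. The sketch below is a hypothetical example (render_range.continuous is not part of Rgemini or table1) that reports the observed range of a numeric vector and returns a named character vector whose first element is unnamed and empty.

```r
# Hypothetical custom render function: report the observed range of a numeric vector
render_range.continuous <- function(x, ...) {
  x <- x[!is.na(x)]  # ignore missing values
  # the first element is unnamed and empty, as required for render functions
  c("", "Range" = sprintf("%.1f to %.1f", min(x), max(x)))
}

out <- render_range.continuous(c(3.2, 7.5, NA, 5.1))
out
```

Such a function could then be passed to table1() through the render.continuous argument, in the same way as the Rgemini render functions shown earlier. Note that this sketch performs no cell suppression.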

Creating extra columns

Note that functions defined for extra columns must always take a named list as the first argument and an ellipsis (...) as the last. The named list corresponds to a variable in the data split by the stratifying variable, such as in the example below:

x <- split(mtcars$disp, mtcars$am)

my_max_col <- function(x, ...) {
  lapply(x, max) %>%
    unlist() %>%
    max()
}

my_max_col(x)

## [1] 472

Now this newly defined function can be added as an extra column.

table1(~ disp | am, data = mtcars, extra.col = list("My Max" = my_max_col))

	0 (N=19)	1 (N=13)	Overall (N=32)	My Max
disp				472
Mean (SD)	290 (110)	144 (87.2)	231 (124)
Median [Min, Max]	276 [120, 472]	120 [71.1, 351]	196 [71.1, 472]