
Introduction

Cell suppression is a technique to withhold or “suppress” confidential or identifying data in tabular formats.

It is common practice in clinical and epidemiological research to present a table of baseline patient characteristics in the study population. This is commonly referred to as Table 1.

In order to preserve patient privacy and reduce the risk of re-identification, GEMINI employs a cell suppression strategy for any Table 1 that is published or shared externally.

The general rule is to suppress any subgroup with fewer than six patients (i.e. five or fewer). Some special cases are discussed in detail below.

Note that there are a variety of methods to produce a Table 1, for example the table1 and tableone packages.

Both packages provide their own functionality, but neither implements cell suppression. Because of its design, table1 was chosen as the candidate for this added functionality, which is implemented in the various functions described below.

Setup

Dummy data

For this vignette, we will create a dummy dataset to summarize:

set.seed(3)

gender <- c(sample(c("M", "F"), size = 95, replace = TRUE), rep(NA, times = 5))

exposure <- sample(
  c("pre-pandemic", "pandemic", "post-pandemic"), 
  size = 100, replace = TRUE
  )

age <- rnorm(100, mean = 70, sd = 5)

condition <- sample(
  c("DVT", "CVD", "DM", "Pneumonia", "Dementia"), 
  size = 100, replace = TRUE, prob = c(0.01, 0.20, 0.4, 0.35, 0.05)
  )

data <- data.frame(gender = gender, exposure = exposure, age = age, condition = condition)

Using table1

Although the table1 vignette provides an in-depth exploration of all table1 functionality, we review the function with a very simple example below. We summarize all available patient characteristics, stratifying by the exposure variable:

library(table1)

table1(~ gender + age + condition | exposure, data = data)
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%)
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%)
age
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68)
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4]
condition
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%)
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%)
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%)

 

Note that patients with particular conditions can be easily identified, sometimes down to an individual patient (such as the single patient with DVT pre-pandemic). The goal of this vignette is to show how to protect these patients’ privacy.

Cell suppression

Default

In order to perform cell suppression with table1(), Rgemini exports render_cell_suppression.default().

library(Rgemini)

table1(
  ~ gender + age + condition | exposure, 
  data = data, 
  render = render_cell_suppression.default
  )
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%)
Missing | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed)
age
Mean (SD) | 69.527 (± 5.527) | 69.723 (± 5.695) | 70.220 (± 5.929) | 69.840 (± 5.677)
condition
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%)
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed)
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed)

 

Important conceptual notes:

  1. True zeros are not suppressed.

A true zero identifies no patient, so it carries no risk of re-identification. We therefore do not hide this valuable information.

  2. Cells that originally had 6 or more patients are also suppressed.

Consider the pre-pandemic subgroup. Patients with CVD pre-pandemic were suppressed even though there were originally seven patients in that cell. Why is that? Had we only suppressed patients with DVT (one patient), anyone knowing the total number of patients pre-pandemic (37) could have reverse-calculated the number of patients with DVT by subtracting the visible counts from that total.

Therefore the algorithm is designed to continue suppressing successively larger groups until the total number of suppressed patients is six or more, such that individual counts cannot be reverse-calculated with reasonable precision. This default behaviour is the desired one, and it is the most conservative with respect to patient privacy.
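
The successive-suppression rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the Rgemini implementation: the helper name suppress_counts and its threshold argument are hypothetical, the input must be a named vector of counts, and ties between equally sized groups are broken arbitrarily.

```r
# Hypothetical helper (not part of Rgemini): suppress small cells, then keep
# suppressing the next-smallest non-zero cell until the hidden total is >= threshold.
suppress_counts <- function(counts, threshold = 6) {
  # Primary suppression: counts between 1 and threshold - 1; true zeros stay visible
  suppressed <- counts > 0 & counts < threshold
  if (any(suppressed)) {
    # Remaining non-zero cells, ordered from smallest to largest
    candidates <- names(sort(counts[!suppressed & counts > 0]))
    while (sum(counts[suppressed]) < threshold && length(candidates) > 0) {
      suppressed[candidates[1]] <- TRUE
      candidates <- candidates[-1]
    }
  }
  out <- as.character(counts)
  out[suppressed] <- "(suppressed)"
  names(out) <- names(counts)
  out
}

# Counts for the pre-pandemic column of the table above
pre_pandemic <- c(CVD = 7, Dementia = 0, DM = 15, Pneumonia = 14, DVT = 1)
pre_pandemic_suppressed <- suppress_counts(pre_pandemic)
print(pre_pandemic_suppressed)
```

With these counts, DVT (1) is suppressed first; because one hidden patient could be recovered from the column total, CVD (7) is hidden as well, while the true zero for Dementia stays visible.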

Customizing output for continuous variables - change to medians

We can specify whether to use a median or mean for continuous variables by specifying a continuous_fn.

table1(
  ~ gender + age + condition | exposure, 
  data = data, 
  render = render_cell_suppression.default, 
  continuous_fn = 'median'
  )
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%)
Missing | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed)
age
Median [Q1, Q3] | 68.478 [65.871, 72.230] | 70.639 [65.050, 73.640] | 70.341 [65.556, 74.247] | 70.180 [65.335, 73.544]
condition
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%)
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed)
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed)

 

Note however that we can render medians without enabling cell suppression, if desired.

table1(
  ~ gender + age + condition | exposure, 
  data = data, 
  render.continuous = render_median.continuous
  )
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%)
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%)
age
Median [Q1, Q3] | 68.478 [65.871, 72.230] | 70.639 [65.050, 73.640] | 70.341 [65.556, 74.247] | 70.180 [65.335, 73.544]
condition
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%)
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%)
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%)

 

We can also specify a render function only for a particular variable type:

table1(
  ~ gender + age + condition | exposure, 
  data = data, 
  render.continuous = render_cell_suppression.continuous, 
  continuous_fn = 'median'
  )
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%)
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%)
age
Median [Q1, Q3] | 68.478 [65.871, 72.230] | 70.639 [65.050, 73.640] | 70.341 [65.556, 74.247] | 70.180 [65.335, 73.544]
condition
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%)
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%)
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%)

 

Finer control

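render_cell_suppression.default is a wrapper around individual render functions for each covariate datatype. For finer control, use the primary cell suppression functionality as implemented in the individual render functions used in the rest of this vignette, such as render_cell_suppression.categorical, render_cell_suppression.continuous, render_cell_suppression.missing, and render_cell_suppression.strat.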

We can apply cell suppression directly to covariates of particular data types by supplying these custom renderer functions.

Suppress categorical variables only

table1(
  ~ gender + age + condition | exposure, 
  data = data, 
  render.categorical = render_cell_suppression.categorical
  )
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%)
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%)
age
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68)
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4]
condition
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%)
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed)
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed)

 

We may also want to display only a single level for binary variables (such as gender). We can do this through the optional single_level_binary argument.

table1(
  ~ gender + age + condition | exposure, 
  data = data, 
  render.categorical = render_cell_suppression.categorical,
  single_level_binary = TRUE
  )
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%)
age
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68)
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4]
condition
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%)
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed)
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed)

Suppress cells with counts fewer than six only

Although the default option shows the desired behaviour, Rgemini also exports a function that suppresses only those cells with counts fewer than six, if needed.

table1(
  ~ gender + age + condition | exposure, 
  data = data, 
  render.categorical = render_strict_cell_suppression.categorical
  )
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%)
Missing | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%)
age
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68)
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4]
condition
CVD | < 6 obs. (suppressed) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%)
Dementia | < 6 obs. (suppressed) | < 6 obs. (suppressed) | 0 (0.0%) | < 6 obs. (suppressed)
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0.0%) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed)
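
For comparison with the default behaviour, the strict rule used above can be sketched in base R. The helper name strict_suppress and its threshold argument are hypothetical and used only for illustration; this is not the code behind render_strict_cell_suppression.categorical.

```r
# Hypothetical helper (not part of Rgemini): hide only cells with 1 to threshold - 1 patients.
strict_suppress <- function(counts, threshold = 6) {
  out <- as.character(counts)
  # True zeros are kept; no additional (secondary) suppression is applied
  out[counts > 0 & counts < threshold] <- sprintf("< %d obs. (suppressed)", threshold)
  names(out) <- names(counts)
  out
}

# Counts for the pandemic column of the table above
pandemic <- c(CVD = 5, Dementia = 2, DM = 15, Pneumonia = 12, DVT = 0)
strict_result <- strict_suppress(pandemic)
print(strict_result)
```

Only CVD and Dementia are hidden here. Together they account for 7 patients, which can be recovered from the column total, but the individual values cannot.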

 

Suppress missing values

Note that to suppress missing values, we use the render_cell_suppression.missing() function:

table1(
  ~ gender + age + condition | exposure, 
  data = data, 
  render.categorical = render_strict_cell_suppression.categorical,
  render.missing = render_cell_suppression.missing
  )
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%)
Missing | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed)
age
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68)
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4]
condition
CVD | < 6 obs. (suppressed) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%)
Dementia | < 6 obs. (suppressed) | < 6 obs. (suppressed) | 0 (0.0%) | < 6 obs. (suppressed)
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0.0%) | < 6 obs. (suppressed) | < 6 obs. (suppressed) | < 6 obs. (suppressed)

 

Note that there is an issue here. Although we were able to suppress cells with counts fewer than six where the gender was missing, the suppressed count is obvious (i.e. it can easily be reverse-calculated from the column total and the visible counts). One approach to deal with this is to code the missing gender values as a new category for that variable, and apply the more conservative render_cell_suppression.categorical().

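# Recode missing values as an explicit category, then convert to a factor.
# (Assigning levels() to a character vector does not create a factor.)
data$gender <- as.character(data$gender)
data$gender[is.na(data$gender)] <- "Not Available"
data$gender <- factor(data$gender, levels = c("F", "M", "Not Available"))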

table1(
  ~ gender + age + condition | exposure, 
  data = data, 
  render.categorical = render_cell_suppression.categorical
  )
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | (suppressed) | (suppressed) | (suppressed) | (suppressed)
Not Available | (suppressed) | (suppressed) | (suppressed) | (suppressed)
age
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68)
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4]
condition
CVD | (suppressed) | 7 (24.1%) | (suppressed) | 19 (19.0%)
Dementia | (suppressed) | (suppressed) | 0 (0.0%) | (suppressed)
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0.0%) | (suppressed) | (suppressed) | (suppressed)

 

Although it might not make much practical sense in this case, this approach should be taken for any categorical variable with missing values where we would like to employ the most conservative cell suppression strategy for patient privacy.
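
The reverse calculation that motivates this recoding is simple arithmetic. The sketch below uses the pandemic column from the tables above; the variable names are chosen only for illustration.

```r
# With only the "Missing" row suppressed, the hidden count follows from the column total
n_pandemic <- 34
shown_strict <- c(F = 20, M = 12)
recovered_missing <- n_pandemic - sum(shown_strict)
print(recovered_missing)

# After recoding and conservative suppression, only F remains visible, so just the
# combined size of the hidden cells (M plus Not Available) can be recovered
shown_conservative <- c(F = 20)
hidden_total <- n_pandemic - sum(shown_conservative)
print(hidden_total)
```

In the first case the exact number of patients with missing gender (2) is recovered. In the second, only the combined total of 14 is known, and it cannot be split between the two suppressed categories.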

Strategies to suppress continuous variables

It is also possible to suppress continuous variables if desired. See the example below.

Set up
set.seed(1)

continuous_data <- data.frame(
  "age" = rnorm(100, mean = 70, sd = 10),
  "laps" = abs(rnorm(100, mean = 1, sd = 1)),
  "nobel" = sample(c("nobel prize won", "nobel prize not won"), 100, replace = TRUE, prob = c(0.01, 0.99))
)

table1(~ age + laps | nobel, data = continuous_data)
 | nobel prize not won (N=99) | nobel prize won (N=1) | Overall (N=100)
age
Mean (SD) | 70.9 (8.91) | 85.1 (NA) | 71.1 (8.98)
Median [Min, Max] | 70.7 [47.9, 94.0] | 85.1 [85.1, 85.1] | 71.1 [47.9, 94.0]
laps
Mean (SD) | 1.09 (0.819) | 0.364 (NA) | 1.08 (0.818)
Median [Min, Max] | 0.841 [0.0158, 3.31] | 0.364 [0.364, 0.364] | 0.832 [0.0158, 3.31]

 

Suppress continuous variables

We use render_cell_suppression.continuous to suppress any summary statistics for groups with a size smaller than six.

table1(
  ~ age + laps | nobel, 
  data = continuous_data, 
  render.continuous = render_cell_suppression.continuous
  )
 | nobel prize not won (N=99) | nobel prize won (N=1) | Overall (N=100)
age
Mean (SD) | 70.947 (± 8.915) | < 6 obs. (suppressed) | 71.089 (± 8.982)
laps
Mean (SD) | 1.090 (± 0.819) | < 6 obs. (suppressed) | 1.083 (± 0.818)

 

Note in this case however, using the strata totals we can reverse-calculate the number of patients, so we suppress the counts in the strata as well using render_cell_suppression.strat.

Suppress counts in the strata
table1(
  ~ age + laps | nobel, 
  data = continuous_data, 
  render.continuous = render_cell_suppression.continuous,
  render.strat = render_cell_suppression.strat
  )
 | nobel prize not won (N=99) | nobel prize won (N< 6 obs. (suppressed)) | Overall (N=100)
age
Mean (SD) | 70.947 (± 8.915) | < 6 obs. (suppressed) | 71.089 (± 8.982)
laps
Mean (SD) | 1.090 (± 0.819) | < 6 obs. (suppressed) | 1.083 (± 0.818)

 

Suppress “Overall” count

We encounter a similar issue where we can reverse calculate the strata total using the overall count. Therefore in this scenario we could consider removing the “Overall” count:

table1(
  ~ age + laps | nobel, 
  data = continuous_data, 
  render.continuous = render_cell_suppression.continuous,
  render.strat = render_cell_suppression.strat,
  overall = FALSE
  )
 | nobel prize not won (N=99) | nobel prize won (N< 6 obs. (suppressed))
age
Mean (SD) | 70.947 (± 8.915) | < 6 obs. (suppressed)
laps
Mean (SD) | 1.090 (± 0.819) | < 6 obs. (suppressed)

 

Important conceptual notes:

We don’t employ the same strategy for stratification variables as we do for categorical variables, where we successively suppress additional strata until the total number of suppressed individuals is six or more. This is because any categorical variable with six or more observations in each category will indirectly provide the total for that stratum. Therefore it is safest to apply strict suppression on the strata totals, and remove the “Overall” column as shown above.
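
As a small illustration of why the “Overall” column matters, the suppressed stratum size in the example above can be recovered by subtraction whenever the overall count is displayed. The variable names below are chosen only for this sketch.

```r
# Overall N and the one visible stratum total from the table above
overall_n <- 100
visible_strata <- c("nobel prize not won" = 99)

# The suppressed stratum size follows directly from the difference
recovered_stratum_n <- overall_n - sum(visible_strata)
print(recovered_stratum_n)
```

Removing the “Overall” column removes the total that this subtraction relies on.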

Rounding

We can also specify the number of digits to which means, medians, and percentages are rounded. Note that table1 exposes this by default through the digits argument.

table1(
  ~ gender + age + condition | exposure, 
  data = data,
  digits = 2
  )
 | pandemic (N= 34) | post-pandemic (N= 29) | pre-pandemic (N= 37) | Overall (N=100)
gender
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%)
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%)
age
Mean (SD) | 70 (5.5) | 70 (5.7) | 70 (5.9) | 70 (5.7)
Median [Min, Max] | 68 [61, 83] | 71 [58, 80] | 70 [60, 82] | 70 [58, 83]
condition
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%)
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%)
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%)
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%)

 

However, the default behaviour of table1 is to round to a total number of digits (otherwise known as “significant digits”) given by the value of digits. By design, the Rgemini render functions instead round to that many digits after the decimal place. This means that when default table1 render functions are combined with Rgemini render functions, rounding is handled inconsistently by default.
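
The difference between the two rounding conventions can be seen with base R alone. This sketch does not call table1 or Rgemini; it only contrasts significant digits with decimal places, using a mean age taken from the tables in this vignette.

```r
x <- 69.52687

# Five significant digits (the convention table1 uses by default)
sig <- signif(x, digits = 5)
print(sig)

# Five digits after the decimal place (the convention of the Rgemini render functions)
dec <- round(x, digits = 5)
print(dec)

# Fixed-point formatting pads with trailing zeros, as in "53.00000%"
padded <- formatC(53, format = "f", digits = 5)
print(padded)
```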

table1(
  ~ gender + age + condition | exposure, 
  data = data,
  render.categorical = render_cell_suppression.categorical,
  #render.continuous = render.continuous, <- this is implicitly using table1's default continuous rendering function
  digits = 5
  )
 | pandemic (N= 34) | post-pandemic (N= 29) | pre-pandemic (N= 37) | Overall (N= 100)
gender
F | 20 (58.82353%) | 15 (51.72414%) | 18 (48.64865%) | 53 (53.00000%)
M | (suppressed) | (suppressed) | (suppressed) | (suppressed)
Not Available | (suppressed) | (suppressed) | (suppressed) | (suppressed)
age
Mean (SD) | 69.527 (5.5269) | 69.723 (5.6948) | 70.220 (5.9286) | 69.840 (5.6769)
Median [Min, Max] | 68.478 [61.378, 83.383] | 70.639 [57.982, 79.651] | 70.341 [60.088, 82.320] | 70.180 [57.982, 83.383]
condition
CVD | (suppressed) | 7 (24.13793%) | (suppressed) | 19 (19.00000%)
Dementia | (suppressed) | (suppressed) | 0 (0.00000%) | (suppressed)
DM | 15 (44.11765%) | (suppressed) | 15 (40.54054%) | 37 (37.00000%)
Pneumonia | 12 (35.29412%) | 11 (37.93103%) | 14 (37.83784%) | 37 (37.00000%)
DVT | 0 (0.00000%) | (suppressed) | (suppressed) | (suppressed)

 

Note above that continuous variables (which use the default table1 render function) are rounded to five significant digits, while the categorical variables (which use Rgemini's render function) are rounded to five digits after the decimal place. While it is generally discouraged to use cell suppression for only some variable types and not others, we can synchronize the behaviour of the two render functions by supplying the argument rounding.fn = round_pad, which tells the table1 render functions to use the value of digits as the number of digits after the decimal place.

table1(
  ~ gender + age + condition | exposure, 
  data = data,
  render.categorical = render_cell_suppression.categorical,
  #render.continuous = render.continuous, <- this is implicit
  digits = 5,
  rounding.fn = round_pad
  )
 | pandemic (N= 34) | post-pandemic (N= 29) | pre-pandemic (N= 37) | Overall (N= 100)
gender
F | 20 (58.82353%) | 15 (51.72414%) | 18 (48.64865%) | 53 (53.00000%)
M | (suppressed) | (suppressed) | (suppressed) | (suppressed)
Not Available | (suppressed) | (suppressed) | (suppressed) | (suppressed)
age
Mean (SD) | 69.52687 (5.52689) | 69.72315 (5.69476) | 70.22003 (5.92857) | 69.84026 (5.67689)
Median [Min, Max] | 68.47784 [61.37828, 83.38316] | 70.63923 [57.98168, 79.65122] | 70.34099 [60.08826, 82.32028] | 70.18004 [57.98168, 83.38316]
condition
CVD | (suppressed) | 7 (24.13793%) | (suppressed) | 19 (19.00000%)
Dementia | (suppressed) | (suppressed) | 0 (0.00000%) | (suppressed)
DM | 15 (44.11765%) | (suppressed) | 15 (40.54054%) | 37 (37.00000%)
Pneumonia | 12 (35.29412%) | 11 (37.93103%) | 14 (37.83784%) | 37 (37.00000%)
DVT | 0 (0.00000%) | (suppressed) | (suppressed) | (suppressed)

 

Calculating standardized mean differences

One primary difference between tableone and table1 is that tableone can compute and display standardized mean differences (SMDs), whereas table1 does not provide this out of the box. To enable this in table1, Rgemini exports max_pairwise_smd().

Note that for a stratification variable with two levels, this corresponds to the absolute value of the SMD (no directionality). For a stratification with more than two levels, this is the maximum pairwise SMD.

It can be supplied to table1 through the extra.col argument, which takes a named list of extra columns to append to the table. Usage is as follows:

table1(~ gender + age + condition | exposure, data = data, extra.col = list("Maximum Standardized Mean Difference" = max_pairwise_smd))
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37) | Overall (N=100) | Maximum Standardized Mean Difference
gender | | | | | 0.295
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) | 53 (53.0%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) | 42 (42.0%) |
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) | 5 (5.0%) |
age | | | | | 0.121
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) | 69.8 (5.68) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] | 70.2 [58.0, 83.4] |
condition | | | | | 0.574
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) | 19 (19.0%) |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) | 4 (4.0%) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) | 37 (37.0%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) | 37 (37.0%) |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) | 3 (3.0%) |
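
For intuition about what a maximum pairwise SMD represents, the sketch below computes it for a continuous variable in base R. It assumes one common definition of the SMD (absolute difference in means divided by the square root of the average of the two variances). The function names are hypothetical, and the actual max_pairwise_smd() in Rgemini may compute the statistic differently, in particular for categorical variables.

```r
# Hypothetical helper: absolute SMD between two numeric vectors
smd_continuous <- function(a, b) {
  abs(mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)
}

# Hypothetical helper following the extra.col convention:
# x is a named list with one numeric vector per stratum
max_pairwise_smd_sketch <- function(x, ...) {
  pairs <- combn(names(x), 2, simplify = FALSE)
  max(vapply(
    pairs,
    function(p) smd_continuous(x[[p[1]]], x[[p[2]]]),
    numeric(1)
  ))
}

# Toy strata with means 2, 3 and 5 and unit variance
toy <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))
toy_smd <- max_pairwise_smd_sketch(toy)
print(toy_smd)
```

With two strata there is only one pair, so the result reduces to the absolute SMD. With three or more, the largest of all pairwise values is reported.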

 

Replicating table1 functionality

Note that since these extensions are passed as arguments to table1, the entire original API for table1 is still exposed. Therefore we replicate some of the same examples here.

Labelling

Change row labels

We change the row labels below. To do this, we need to use the default (i.e. “non-formula”) interface to table1:

labels <- list(
  variables=list(age="Age (years)", gender="Sex", condition="MRDx")
  )

strata <- split(data, data$exposure)
table1(strata, labels = labels)
 | pandemic (N=34) | post-pandemic (N=29) | pre-pandemic (N=37)
Age (years)
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93)
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3]
Sex
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%)
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%)
MRDx
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%)
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%)
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%)
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%)

 

Change strata labels

Next we change the strata labels. To change strata labels, we change the names of the levels of the factor variable corresponding to the strata. Because assigning levels() to a character vector does not relabel it, we create the factor explicitly with the desired labels:

data$exposure <- factor(
  data$exposure,
  levels = c("pandemic", "post-pandemic", "pre-pandemic"),
  labels = c("During Pandemic", "After Pandemic", "Before Pandemic")
)
strata <- split(data, data$exposure)

table1(strata, labels = labels)
 | During Pandemic (N=34) | After Pandemic (N=29) | Before Pandemic (N=37)
Age (years)
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93)
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3]
Sex
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%)
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%)
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%)
MRDx
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%)
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%)
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%)
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%)
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%)

 

Add grouping to strata and standardized mean differences

Next we also add grouping to the strata variable, specifying group names in the labels list, as well as a groupspan, which is used to span a label over multiple groups in the table1 call. We also add standardized mean differences using extra.col.

labels$groups <- list("After COVID", "")

extra_col <- list()
extra_col$`SMD` <- max_pairwise_smd

table1(strata, labels = labels, groupspan = c(2, 1), extra.col = extra_col)
 | After COVID | After COVID | |
 | During Pandemic (N=34) | After Pandemic (N=29) | Before Pandemic (N=37) | SMD
Age (years) | | | | 0.121
Mean (SD) | 69.5 (5.53) | 69.7 (5.69) | 70.2 (5.93) |
Median [Min, Max] | 68.5 [61.4, 83.4] | 70.6 [58.0, 79.7] | 70.3 [60.1, 82.3] |
Sex | | | | 0.295
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) |
M | 12 (35.3%) | 12 (41.4%) | 18 (48.6%) |
Not Available | 2 (5.9%) | 2 (6.9%) | 1 (2.7%) |
MRDx | | | | 0.574
CVD | 5 (14.7%) | 7 (24.1%) | 7 (18.9%) |
Dementia | 2 (5.9%) | 2 (6.9%) | 0 (0%) |
DM | 15 (44.1%) | 7 (24.1%) | 15 (40.5%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) |
DVT | 0 (0%) | 2 (6.9%) | 1 (2.7%) |

 

Putting it all together

Now we combine custom labeling, cell suppression, and a column of standardized mean differences as follows:

table1(
  strata, 
  labels = labels,
  groupspan = c(2, 2),
  render = render_cell_suppression.default,
  extra.col = list("SMD" = max_pairwise_smd)
  )
 | After COVID | After COVID | |
 | During Pandemic (N=34) | After Pandemic (N=29) | Before Pandemic (N=37) | SMD
Age (years) | | | | 0.121
Mean (SD) | 69.527 (± 5.527) | 69.723 (± 5.695) | 70.220 (± 5.929) |
Sex | | | | 0.295
F | 20 (58.8%) | 15 (51.7%) | 18 (48.6%) |
M | (suppressed) | (suppressed) | (suppressed) |
Not Available | (suppressed) | (suppressed) | (suppressed) |
MRDx | | | | 0.574
CVD | (suppressed) | 7 (24.1%) | (suppressed) |
Dementia | (suppressed) | (suppressed) | 0 (0.0%) |
DM | 15 (44.1%) | (suppressed) | 15 (40.5%) |
Pneumonia | 12 (35.3%) | 11 (37.9%) | 14 (37.8%) |
DVT | 0 (0.0%) | (suppressed) | (suppressed) |

 

Customization and extending functionality

Note that table1 offers much more flexibility than what is briefly described above with respect to labeling. It is even possible to fine-tune the table’s appearance with CSS. See the table1 vignette for more examples.

There is also the possibility to define new render functions and new extra columns.

Creating render functions

Note that render functions must be designed to take as input a single argument corresponding to a vector (numeric or factor), and output a named character vector, where the first element is unnamed and always "". See the examples below for how the output should be formatted:

##                         0            1 
##           "" "19 (59.4%)" "13 (40.6%)"
##                                           Mean (SD) 
##                        "" "10.500 (&plusmn; 5.916)"
##                                               Missing 
##                         "" "&lt; 6 obs. (suppressed)"
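
A minimal custom render function following this contract might look like the sketch below. The name render_range.continuous and the "Range" label are hypothetical and chosen only for illustration. The function also accepts ... so that any extra arguments forwarded to it are ignored rather than causing an error.

```r
# Hypothetical render function: summarize a numeric vector by its range.
# Returns a named character vector whose first element is "" and unnamed.
render_range.continuous <- function(x, ...) {
  c(
    "",
    "Range" = sprintf("%.1f to %.1f", min(x, na.rm = TRUE), max(x, na.rm = TRUE))
  )
}

range_example <- render_range.continuous(mtcars$disp)
print(range_example)
```

Such a function could then be supplied through the render.continuous argument of table1(), in the same way as the Rgemini render functions shown earlier.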

Creating extra columns

Note that functions defined for extra columns must always take a named list as the first argument, and ellipses (...) as the last. The named list corresponds to a variable in the data split by the stratifying variable, such as in the example below:

x <- split(mtcars$disp, mtcars$am)

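my_max_col <- function(x, ...) {
  # x is a named list with one vector per stratum; return the largest value overall
  # (base R only, so no pipe operator needs to be loaded)
  max(unlist(lapply(x, max)))
}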

my_max_col(x)
## [1] 472

Now this newly defined function can be added as an extra column.

table1(~ disp | am, data = mtcars, extra.col = list("My Max" = my_max_col))
 | 0 (N=19) | 1 (N=13) | Overall (N=32) | My Max
disp | | | | 472
Mean (SD) | 290 (110) | 144 (87.2) | 231 (124) |
Median [Min, Max] | 276 [120, 472] | 120 [71.1, 351] | 196 [71.1, 472] |