Univariate Analysis

This section gives some extra coding options in R in order to be able to do some simple descriptive statistics on your data (chi-square, t-test, Kruskal-Wallis) and calculate Odds Ratios as if you were to conduct a case control analysis univariate analysis.

For this we will use the following functions:

  • tab_linelist
  • tab_univariate

Create data frame to be used for the analysis

In the example of this AJS data, the current data frame (dataset) you are using contains “Confirmed”, “Suspected” and “Probable cases”.

We would like to compare the exposures between “Confirmed” and “Suspected” cases ONLY, so we need to create a data frame that only includes those cases.

# Creating a data frame needed for the univariate analysis
linelist_cc <- linelist_cleaned %>%
  filter(case_def == "Confirmed" | case_def == "Suspected")

Create list of variables to change

The tab_univariate() function requires your exposure and outcomes variables to be TRUE/FALSE statements.

Thus, you will need to modify the variables you are using to be TRUE/FALSE (in this example: case definition, vomit, jaundice, and patient_facility_type).

To do this, you can gather all of the binary variables into a single vector. With this, you can use mutate_at() to apply str_detect() from the stringr package to all of the binary variables to return TRUE if the elements match either Confirmed, Oui, or Inpatient.

## Create vector that specifies the variables we want to convert
binary_vars <- c("case_def", "ptvomit", "ptjaundice", "patient_facility_type")

## Apply str_detect on each of the columns to return TRUE for each element that 
## matches either Confirmed, Oui, or Inpatient
linelist_cc <- linelist_cc %>%
  mutate_at(.vars = binary_vars, 
            .funs = str_detect,
            pattern = "Confirmed|Oui|Inpatient")

Use Chi-square tests to check the difference in characteristics or exposures between confirmed and suspected cases for categorical variables

To compare the proportion of confirmed and suspected cases exposed to certain categorical variables, we will want to compare them using the chi-square test.

linelist_cc %>%
tab_linelist(age_group, sex, strata = case_def, na.rm = FALSE) %>%
  ## call variables something more accessible for the table output
  rename(suspected_n = "FALSE n",
         confirmed_n = "TRUE n") %>%
  ## group by variable for the chisq test
  group_by(variable) %>%
  ## run chi-sq test on the contingency table
  mutate(binom = list(broom::tidy(chisq.test(cbind(suspected_n, confirmed_n))))) %>%
  ## make results of chisq test available
  tidyr::unnest(cols = c(binom)) %>%
  ## ungroup to be able to change the names
  ungroup() %>%
  ## get rid of duplicate var names and pvals
  mutate(variable = replace(variable, duplicated(variable), NA),
         p.value = replace(p.value, duplicated(p.value), NA)) %>%
  ## select and rename columns
  select(
    variable,
    "Value" = value,
    "Suspected (n)" = suspected_n,
    "%" = "FALSE proportion",
    "Confirmed (n)" = confirmed_n,
    "%" = "TRUE proportion",
    "P-value" = p.value
  ) %>%
  knitr::kable(digits = 1)
variable Value Suspected (n) % Confirmed (n) % P-value
age_group 0-2 40 5.8 4 4 0.1
- 3-14 284 40.9 30 30 -
- 15-29 246 35.4 46 46 -
- 30-44 95 13.7 15 15 -
- 45+ 29 4.2 4 4 -
- Missing 1 0.1 1 1 -
sex F 346 49.8 59 59 0.1
- M 349 50.2 41 41 -

Use T-test to check the difference in the characteristics or exposures for continuous variables between confirmed and suspected cases (for normally distributed data)

To compare the means of continuous variables between the confirmed and suspected cases, we will use the t-test.

## run ttest
t.test(age_years ~ case_def, var.equal = TRUE, data = linelist_cc) %>%
  ## convert to a data frame
  broom::tidy() %>%
  ## select and rename columns
  select("Suspected (mean)" = estimate1,
         "Confirmed (mean)" = estimate2,
         "p-value" = p.value) %>%
  ## create a column for the variable name
  tibble::add_column(Variable = "Age (years)", .before = 1) %>%
  knitr::kable(digits = 1)
Variable Suspected (mean) Confirmed (mean) p-value
Age (years) 18 19.6 0.2

Use Kruskal-wallis to check the difference in the characteristics or exposures for continuous variables between confirmed and suspected cases (for non-normally distributed data)

As most of the time, your contiuous variables will not be normally distributed, instead of calculating the difference between your cases and controls (confirmed and suspected cases in this example) with the t-test, we use the Kruskal-Wallis test.

## first create a table with medians and standard deviation
## then a table with the kruskal value
## then bind together

medians_tab <- linelist_cc %>%
  group_by(case_def) %>%
  summarise(Median = median(age_years, na.rm = TRUE),
            SD = sd(age_years, na.rm = TRUE)) %>%
  tidyr::pivot_wider(names_from = case_def, values_from = c("Median", "SD"))

## perform the Kruskal-Wallace test and save the results
kw <- kruskal.test(age_years ~ case_def, data = linelist_cc)

medians_tab %>%
  ## add the variable and p-value columns
  tibble::add_column(Variable = "Age (years)", p.value = kw$p.value) %>%
  ## select and rename the columns in the right order
  select(Variable,
         "Control (median)" = Median_FALSE,
         "SD" = SD_FALSE,
         "Case (median)" = Median_TRUE,
         "SD" = SD_TRUE,
         "p-value" = p.value) %>%
  knitr::kable(digits = 1)
Variable Control (median) SD Case (median) SD p-value
Age (years) 15 13.2 20 13 0.1

Calculating odds ratios for a univariate case control analysis

Now you have a good sense of the variables that you might want to include in your univariate case control analysis.

Please note that the coding now switches to the use of “cases” and “controls” to highlight the type of analysis you are conducting. You need to be clear for yourself how you have defined your cases (in this example, “confirmed cases”) and your controls (in this example, “suspected cases”).

## Odds ratios
## other values are already set at the correct defaults for CC
linelist_cc %>%
  tab_univariate(case_def,                                    # select outcome variable
                 ptvomit, ptjaundice, patient_facility_type,  # select exposure variables
                 measure = "OR",                              # calculate odds ratios
                 mergeCI = TRUE,                              # paste lower and upper together
                 digits = 1)   %>%                            # limit decimal places to 1
  select("Exposure" = variable,                               # select and rename columns
         "Exposed cases" = exp_cases,
         "Unexposed cases" = unexp_cases,
         "Exposed controls" = exp_controls,
         "Unexposed controls" = unexp_controls,
         "OR (95%CI)" = est_ci) %>%
  knitr::kable(digits = 1)
Exposure Exposed cases Unexposed cases Exposed controls Unexposed controls OR (95%CI)
ptvomit 17 21 203 279 1.1 (0.6–2.2)
ptjaundice 35 4 452 31 0.6 (0.2–1.8)
patient_facility_type 44 52 42 614 12.4 (7.4–20.6)

Calculating stratified odds ratios for a univariate case control analysis

As you will want to double check on particular confounding (before you do a multivariable analysis), you might want to check whether odds ratios change if you conduct the same analysis in a stratified manner.

In this case, we give an example of calculating Odds Ratios for confirmed cases and suspected cases in the stratum ‘patient_facility_type’. This allows us to compare whether the ORs change when comparing groups that are admitted to hospital and those that are not admitted t hospital.

Note: in the current example dataset, this analysis doesnt provide much more insight. But you should as standard practice check for confounding in your outbreak data analysis.

## stratified odds ratios
linelist_cc %>%
  tab_univariate(case_def,                          # select outcome variable
                 ptvomit, ptjaundice,               # select exposure variables
                 strata = patient_facility_type,    # select stratifying variable
                 measure = "OR",                    # calculate odds ratios
                 mergeCI = TRUE,                    # paste lower and upper together
                 digits = 1,                        # limit decimal places to 1
                 woolf_test = TRUE) %>%             # calculate p val between strata
  # get rid of duplicate var names
  mutate(variable = replace(variable, duplicated(variable), NA)) %>%
  select("Exposure" = variable,                     # select and rename columns
         "Measure"  = est_type,
         "Exposed cases" = exp_cases,
         "Unexposed cases" = unexp_cases,
         "Exposed controls" = exp_controls,
         "Unexposed controls" = unexp_controls,
         "OR (95%CI)" = est_ci,
         "p-value" = p.value) %>%
  knitr::kable(digits = 1)
Exposure Measure Exposed cases Unexposed cases Exposed controls Unexposed controls OR (95%CI) p-value
ptvomit crude 17 20 195 262 1.1 (0.6–2.2) 0.7
- patient_facility_type: TRUE 6 5 2 4 2.4 (0.3–19.0) 0.4
- patient_facility_type: FALSE 11 15 193 258 1.0 (0.4–2.2) 1.0
- MH - - - - 1.1 (0.5–2.3) -
- woolf - - - - NA (NA–NA) 0.5
ptjaundice crude 33 4 426 31 0.6 (0.2–1.8) 0.4
- patient_facility_type: TRUE 11 0 7 0 NaN (NaN–NaN) -
- patient_facility_type: FALSE 22 4 419 31 0.4 (0.1–1.3) 0.1
- MH - - - - 0.4 (0.1–1.3) -
- woolf - - - - NA (NA–NA) 0.5