This section gives some extra coding options in R in order to be able to do some simple descriptive statistics on your data (chi-square, t-test, Kruskal-Wallis) and calculate Odds Ratios as if you were to conduct a case control analysis univariate analysis.
For this we will use the following functions:
In the example of this AJS data, the current data frame (dataset) you are using contains “Confirmed”, “Suspected” and “Probable cases”.
We would like to compare the exposures between “Confirmed” and “Suspected” cases ONLY, so we need to create a data frame that only includes those cases.
# Creating a data frame needed for the univariate analysis linelist_cc <- linelist_cleaned %>% filter(case_def == "Confirmed" | case_def == "Suspected")
tab_univariate() function requires your exposure and outcomes variables to be TRUE/FALSE statements.
Thus, you will need to modify the variables you are using to be TRUE/FALSE (in this example: case definition, vomit, jaundice, and patient_facility_type).
To do this, you can gather all of the binary variables into a single vector. With this, you can use
mutate_at() to apply
str_detect() from the stringr package to all of the binary variables to return TRUE if the elements match either Confirmed, Oui, or Inpatient.
## Create vector that specifies the variables we want to convert binary_vars <- c("case_def", "ptvomit", "ptjaundice", "patient_facility_type") ## Apply str_detect on each of the columns to return TRUE for each element that ## matches either Confirmed, Oui, or Inpatient linelist_cc <- linelist_cc %>% mutate_at(.vars = binary_vars, .funs = str_detect, pattern = "Confirmed|Oui|Inpatient")
To compare the proportion of confirmed and suspected cases exposed to certain categorical variables, we will want to compare them using the chi-square test.
linelist_cc %>% tab_linelist(age_group, sex, strata = case_def, na.rm = FALSE) %>% ## call variables something more accessible for the table output rename(suspected_n = "FALSE n", confirmed_n = "TRUE n") %>% ## group by variable for the chisq test group_by(variable) %>% ## run chi-sq test on the contingency table mutate(binom = list(broom::tidy(chisq.test(cbind(suspected_n, confirmed_n))))) %>% ## make results of chisq test available tidyr::unnest(cols = c(binom)) %>% ## ungroup to be able to change the names ungroup() %>% ## get rid of duplicate var names and pvals mutate(variable = replace(variable, duplicated(variable), NA), p.value = replace(p.value, duplicated(p.value), NA)) %>% ## select and rename columns select( variable, "Value" = value, "Suspected (n)" = suspected_n, "%" = "FALSE proportion", "Confirmed (n)" = confirmed_n, "%" = "TRUE proportion", "P-value" = p.value ) %>% knitr::kable(digits = 1)
|variable||Value||Suspected (n)||%||Confirmed (n)||%||P-value|
To compare the means of continuous variables between the confirmed and suspected cases, we will use the t-test.
## run ttest t.test(age_years ~ case_def, var.equal = TRUE, data = linelist_cc) %>% ## convert to a data frame broom::tidy() %>% ## select and rename columns select("Suspected (mean)" = estimate1, "Confirmed (mean)" = estimate2, "p-value" = p.value) %>% ## create a column for the variable name tibble::add_column(Variable = "Age (years)", .before = 1) %>% knitr::kable(digits = 1)
|Variable||Suspected (mean)||Confirmed (mean)||p-value|
As most of the time, your contiuous variables will not be normally distributed, instead of calculating the difference between your cases and controls (confirmed and suspected cases in this example) with the t-test, we use the Kruskal-Wallis test.
## first create a table with medians and standard deviation ## then a table with the kruskal value ## then bind together medians_tab <- linelist_cc %>% group_by(case_def) %>% summarise(Median = median(age_years, na.rm = TRUE), SD = sd(age_years, na.rm = TRUE)) %>% tidyr::pivot_wider(names_from = case_def, values_from = c("Median", "SD")) ## perform the Kruskal-Wallace test and save the results kw <- kruskal.test(age_years ~ case_def, data = linelist_cc) medians_tab %>% ## add the variable and p-value columns tibble::add_column(Variable = "Age (years)", p.value = kw$p.value) %>% ## select and rename the columns in the right order select(Variable, "Control (median)" = Median_FALSE, "SD" = SD_FALSE, "Case (median)" = Median_TRUE, "SD" = SD_TRUE, "p-value" = p.value) %>% knitr::kable(digits = 1)
|Variable||Control (median)||SD||Case (median)||SD||p-value|
Now you have a good sense of the variables that you might want to include in your univariate case control analysis.
Please note that the coding now switches to the use of “cases” and “controls” to highlight the type of analysis you are conducting. You need to be clear for yourself how you have defined your cases (in this example, “confirmed cases”) and your controls (in this example, “suspected cases”).
## Odds ratios ## other values are already set at the correct defaults for CC linelist_cc %>% tab_univariate(case_def, # select outcome variable ptvomit, ptjaundice, patient_facility_type, # select exposure variables measure = "OR", # calculate odds ratios mergeCI = TRUE, # paste lower and upper together digits = 1) %>% # limit decimal places to 1 select("Exposure" = variable, # select and rename columns "Exposed cases" = exp_cases, "Unexposed cases" = unexp_cases, "Exposed controls" = exp_controls, "Unexposed controls" = unexp_controls, "OR (95%CI)" = est_ci) %>% knitr::kable(digits = 1)
|Exposure||Exposed cases||Unexposed cases||Exposed controls||Unexposed controls||OR (95%CI)|
As you will want to double check on particular confounding (before you do a multivariable analysis), you might want to check whether odds ratios change if you conduct the same analysis in a stratified manner.
In this case, we give an example of calculating Odds Ratios for confirmed cases and suspected cases in the stratum ‘patient_facility_type’. This allows us to compare whether the ORs change when comparing groups that are admitted to hospital and those that are not admitted t hospital.
Note: in the current example dataset, this analysis doesnt provide much more insight. But you should as standard practice check for confounding in your outbreak data analysis.
## stratified odds ratios linelist_cc %>% tab_univariate(case_def, # select outcome variable ptvomit, ptjaundice, # select exposure variables strata = patient_facility_type, # select stratifying variable measure = "OR", # calculate odds ratios mergeCI = TRUE, # paste lower and upper together digits = 1, # limit decimal places to 1 woolf_test = TRUE) %>% # calculate p val between strata # get rid of duplicate var names mutate(variable = replace(variable, duplicated(variable), NA)) %>% select("Exposure" = variable, # select and rename columns "Measure" = est_type, "Exposed cases" = exp_cases, "Unexposed cases" = unexp_cases, "Exposed controls" = exp_controls, "Unexposed controls" = unexp_controls, "OR (95%CI)" = est_ci, "p-value" = p.value) %>% knitr::kable(digits = 1)
|Exposure||Measure||Exposed cases||Unexposed cases||Exposed controls||Unexposed controls||OR (95%CI)||p-value|
|-||patient_facility_type: TRUE||6||5||2||4||2.4 (0.3–19.0)||0.4|
|-||patient_facility_type: FALSE||11||15||193||258||1.0 (0.4–2.2)||1.0|
|-||patient_facility_type: TRUE||11||0||7||0||NaN (NaN–NaN)||-|
|-||patient_facility_type: FALSE||22||4||419||31||0.4 (0.1–1.3)||0.1|