Multiple Comparison

Introduction

On this page, the approach to dealing with multiple contrasts is explained: specifically, whether and how to adjust the tests of multiple contrasts to account for an inflated family wise error rate. A few techniques are described, with a focus on when to use each technique. Lastly, instructions and example code are provided for carrying out each technique in R.

In most good experiments, researchers are interested in more than just one contrast. Conducting multiple tests on the levels of a factor can inflate the family wise Type I error rate, as illustrated in the “Multiple T-tests” section of the ANOVA page. There is considerable disagreement among statisticians about how to approach the issue of multiple tests. The debate primarily focuses on whether or not to proactively take steps that mitigate the inflated family wise error rate. To further complicate matters, if an adjustment is desired, there are multiple techniques to choose from. An exhaustive presentation of the arguments on either side is not attempted in this text, though a few of the key ideas will naturally emerge as the different approaches are discussed.

The terms “contrast” and “comparison” are treated here as synonymous. Comparison is used more often to refer to testing a difference in factor level means. Read Contrasts for more explanation of these terms.

No Adjustment

Many statisticians believe that no adjustment is necessary when testing multiple contrasts. Most statisticians would qualify this bold claim by adding that it holds only, or at least especially, when there are just a few pre-planned contrasts to be tested. The argument is stronger still if the set of contrasts is orthogonal, since orthogonality places a natural limit on the number of contrasts and ensures a clean partitioning of the sum of squares.

The main idea behind this approach is more philosophical than mathematical. Proponents of no adjustment argue that a handful of pre-planned contrasts are interesting in isolation and should be tested that way. Thus, the statistical significance of the contrasts is not influenced by characteristics of the experimental design that have no bearing on whether the relationship exists in the real world. For example, the number of factor levels and the number of other planned contrasts should not affect your ability to find significance in a premeditated comparison.

Making no adjustment to the tests increases the likelihood a researcher will find statistical significance among the contrasts. The danger is that a researcher may be tempted to abuse this fact by planning to test many contrasts. In that case, the researcher has not taken the time to distill the key hypotheses he or she wants the experiment to answer. A large collection of contrasts to test probably indicates the experiment is being used to generate hypotheses rather than test them. An exploratory exercise of this nature should be treated differently and is not in harmony with the philosophy of those who subscribe to “no adjustment”.

Adjustment Techniques

There is a range of scenarios in which multiple contrasts are used, from just a few pre-planned contrasts, to somewhere in between, to a completely exploratory analysis, and there are many techniques to choose from. Here we present just four adjustment techniques:

  • Bonferroni: for a few, pre-planned contrasts
  • Scheffé: for contrasts resulting from data exploration, or “data snooping”
  • Tukey: for testing all pairwise comparisons, which may be the result of pre-planned contrasts or from data exploration
  • Fisher: for a less conservative approach in testing all pairwise comparisons

Bonferroni

The Bonferroni adjustment is best for a handful of pre-planned contrasts. As the number of contrasts grows, the adjustment quickly becomes too conservative (i.e. makes it too hard to find significance). In that case, other methods may strike a better balance between Type I and Type II errors.

A Type I error occurs when a true null hypothesis is rejected. A Type II error occurs when a false null hypothesis is not rejected.

It can be shown that in order to not exceed a desired family wise error rate of \(\alpha_\text{fw}\), for a set of \(k\) contrasts, each individual contrast should be tested at a significance level of \(\frac{\alpha_\text{fw}}{k}\). For example, if we desire a family wise error rate of 0.05 and plan to do 5 tests, each individual contrast must have a p-value less than \(\frac{0.05}{5} = 0.01\) to be considered significant. The test itself has not changed, only the benchmark p-value for claiming significance. This adjustment is easy to understand and to calculate.
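As a quick sketch in base R (using made-up p-values purely for illustration), the adjusted cutoff can be computed and applied directly:

Code
# Family wise error rate of 0.05 split across k = 5 planned contrasts
alpha_fw <- 0.05
k <- 5
(alpha_each <- alpha_fw / k)  # 0.01

# Hypothetical raw p-values for the 5 contrasts
raw_p <- c(0.004, 0.030, 0.008, 0.200, 0.012)
raw_p < alpha_each  # TRUE where a contrast is significant after adjustment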

Scheffé

Using experimental results to suggest which contrasts to test is often referred to as “data snooping” or exploratory analysis. For example, you may look at graphical and numerical summaries to see which means (or combinations of means) look promising to test. The problem with this approach is that you have essentially done a quick, informal test of many differences when you looked at descriptive statistics and plots. In other words, you have already tested the means and combinations of the means in your mind. The Type I error rate of the contrasts will be higher than stated because you are only formally applying the test to those that already look significant.

Scheffé’s method is most useful when many contrasts are tested, especially when going beyond pairwise comparisons of factor level means to test combined factor levels (i.e. complex comparisons). In data exploration, there are theoretically an infinite number of contrasts you could test. This is not a problem for Scheffé’s adjustment because, unlike other adjustment techniques, Scheffé’s adjustment does not depend on the number of comparisons to be made. Rather, it is determined only by the number of factor levels and the number of observations.

The critical value used to test a contrast under Scheffé’s adjustment is related to the critical value of the omnibus F test for the factor itself, and is given by:

\[ F_\text{Scheffé} = (k-1) \cdot F \]

Where \(F\) is the critical value for the usual F test of the factor, with \(df_\text{numerator} = k-1\) and \(df_\text{denominator} = df_\text{residual}\), and \(k\) is the number of levels in that factor. A contrast is declared significant when its F statistic exceeds \(F_\text{Scheffé}\). Scheffé’s test output is often in terms of a t test; recall that the t statistic is simply the square root of the F statistic.
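As a small sketch of this arithmetic in base R (assuming, purely for illustration, a factor with \(k = 4\) levels and 20 residual degrees of freedom):

Code
# Assumed design values, for illustration only
k <- 4          # number of factor levels
df_resid <- 20  # residual degrees of freedom
alpha <- 0.05

# Critical value for the usual omnibus F test of the factor
f_crit <- qf(1 - alpha, df1 = k - 1, df2 = df_resid)

# Scheffé critical value: (k - 1) times the usual critical value
(f_scheffe <- (k - 1) * f_crit)

# On the t scale, the cutoff is the square root
sqrt(f_scheffe)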

If you are interested in only a specific set of hypotheses (all pairwise comparisons, all treatment levels vs. control, all levels compared to the “best” level, etc.), there may be another adjustment technique that provides better statistical power. Two such techniques are described below.

Methods Designed for All Pairwise Comparisons

Whether it is an exploratory analysis or a pre-planned set of contrasts, many researchers want to test all factor level means against each other. This is usually referred to as testing all pairwise comparisons. Because this situation is so common, two approaches are described below.

Tukey’s HSD

If a multiple comparison adjustment is desired for testing all pairwise comparisons, a standard approach is to apply Tukey’s Honest Significant Difference (HSD) technique. Occasionally, when there are few factor levels, Bonferroni’s adjustment may result in more significant findings. If that is the case, use Bonferroni’s correction instead.

The calculation for this test is based on the distribution of \(Q\), the studentized range statistic. “Range” is a reference to the numerator, where the difference between a maximum and a minimum is calculated. “Studentized” refers to dividing by the estimated standard error, a standardization famously employed in Student’s t test (Student being the pen name of William Sealy Gosset). A by-hand sketch of the calculation follows the list below. \(Q\) is calculated as:

\[ Q = \frac{\max(T_i) - \min(T_i)}{\sqrt{\frac{MSE}{n}}} \tag{1}\]

  • \(MSE\) is an estimate of the random error variance
  • \(n\) is the number of replicates at each factor level. For unequal sample sizes the formula changes somewhat and the result of this test may have a lower Type I error rate than claimed
  • The observations for each level are generated by a different random variable: \(k\) variables total, one for each level. Each random variable is normally distributed with mean zero and standard deviation estimated by the denominator of Equation 1. \(max(T_i)\) is the maximum value of the \(k\) random variables, \(min(T_i)\) is the minimum.
  • The distribution of \(Q\) will depend on the number of treatments being compared, \(k\), and the number of degrees of freedom for error.
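To make Equation 1 concrete, here is a by-hand sketch in base R using made-up means and MSE (assuming \(k = 4\) levels and \(n = 6\) replicates per level). Base R provides qtukey() and ptukey() for the studentized range distribution:

Code
# Assumed values, purely for illustration
k <- 4                   # number of factor levels
n <- 6                   # replicates per level
df_resid <- k * (n - 1)  # residual degrees of freedom
means <- c(10.2, 13.3, 9.8, 15.1)  # hypothetical factor level means
mse <- 7.5                         # hypothetical mean squared error

# Q from Equation 1: the range of the means over the standard error
(q_stat <- (max(means) - min(means)) / sqrt(mse / n))

# Critical value and p-value from the studentized range distribution
qtukey(0.95, nmeans = k, df = df_resid)
ptukey(q_stat, nmeans = k, df = df_resid, lower.tail = FALSE)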

Fisher’s LSD

Making adjustments for multiple contrasts is a conservative approach, meaning it is more difficult to claim significance compared to when no adjustment is made. Though these adjustments prevent understating the probability of Type I error, they increase the probability of Type II error for each individual contrast. In exploratory analysis where you are looking for hints of what to study further, a Type II error may be a greater concern than Type I.

For example, consider a screening study intended to narrow down the number of factors studied in a future experiment. In this case, accidentally ruling out something early on that actually does have a significant impact on the response is more egregious than letting a non-significant factor through to the next experiment, where its non-significance will be discovered.

Fisher’s Least Significant Difference (LSD) employs no adjustment at all to the pairwise comparisons. However, before proceeding to test pairwise comparisons, the F test for the factor must be significant. Using the less powerful F test as a gatekeeper to the more powerful pairwise t tests serves as partial protection against extreme Type I error inflation.

In summary, Fisher’s LSD is a two step process. First, verify the F test for the factor is significant. Second, if it is, proceed with all pairwise comparisons without any adjustment. Fisher’s LSD tends to be used in studies where many factors are present, especially screening/exploratory studies.

Summary Table

Adjustment   | When to Use                                                                                             | Implications
------------ | ------------------------------------------------------------------------------------------------------ | ------------
Bonferroni   | A small number of pre-planned contrasts                                                                  | Too difficult to find significance if many contrasts are tested.
Scheffé      | Post-hoc contrasts/data snooping, particularly useful for complex comparisons                            | Most severe adjustment. Difficult to find significance, but keeps Type I error in check.
Tukey’s HSD  | Testing all pairwise comparisons when control of the family wise significance level is desired           | A standard adjustment for all pairwise comparisons: pre-planned or post-hoc.
Fisher’s LSD | Testing all pairwise comparisons when willing to accept higher Type I error, e.g. in a screening study   | Easier to find significance than if an adjustment is applied, but higher risk of Type I error.

R Instructions

For most adjustment techniques, before testing the contrasts the data needs to be read in and a model created. The adjustment techniques will be illustrated using data from the toothbrush BF[1] example.

Code
# read_csv() comes from the readr package (part of the tidyverse)
library(readr)

df <- read_csv("data/toothpaste_BF2.csv")
plaque_aov <- aov(Plaque ~ Brush, data = df)

There are many different commands and packages that handle adjustments to account for multiple contrast testing. The contrast() function from the emmeans package is used here to reduce the number of new commands you need to learn. This also keeps the set of commands consistent with what was previously described in the Contrasts page.

Before performing any contrasts or comparisons, save the estimated marginal means to an object, which we name brush_means.

Code
# Be sure to load the emmeans package first
library(emmeans)
brush_means <- emmeans(plaque_aov, "Brush")

Bonferroni

There are two ways to implement the Bonferroni adjustment, illustrated below using the contrasts that were tested in the R Instructions section of the Contrast page.

Recalculate an Alpha Level by Hand

First, the Bonferroni adjustment is shown using output provided for a test of contrasts without adjustment. The code and output from the R Instructions section of the Contrast page is provided again here for convenience. These two contrasts test the mean of the manual brush against the oscillating brush mean, as well as the manual brush mean against the mean of all the other brushes combined.

Code
contrast_results <- contrast(brush_means,
                             list(man_v_osc = c(1,-1,0,0),
                                  man_v_others = c(1,-(1/3),-(1/3),-(1/3))), 
                             adjust="none")

#kable commands are for formatting the output
contrast_results |> 
  kable(digits = 2) |> 
  kable_styling(full_width = TRUE)
contrast     estimate   SE df t.ratio p.value
man_v_osc        3.12 1.58 20    1.97    0.06
man_v_others     0.44 1.29 20    0.34    0.74


A new alpha level against which to compare the p-values can be calculated by hand. Since there are two tests, if the intent was to keep the family wise error rate of 0.05, the alpha level for each individual test is \(\frac{0.05}{2} = 0.025\). To be considered significant, the contrast’s p-value must be less than 0.025.

man_v_osc has a p-value of 0.06; man_v_others has a p-value of 0.74. Since neither contrast has a p-value less than 0.025 we conclude that the contrasts are not statistically significant.

Make the Adjustment in R

Second, the Bonferroni adjustment is applied by changing the value of the adjust = argument in the contrast() function from “none” to “bon”.

Code
bon_results <- contrast(brush_means,
                        list(man_v_osc = c(1,-1,0,0),
                             man_v_others = c(1,-(1/3),-(1/3),-(1/3))), 
                        adjust="bon")

#kable commands are for formatting the output
bon_results |> 
  kable(digits = 2) |> 
  kable_styling(full_width = TRUE)
contrast     estimate   SE df t.ratio p.value
man_v_osc        3.12 1.58 20    1.97    0.13
man_v_others     0.44 1.29 20    0.34    1.00


In these results the p-values have simply been multiplied by 2 because there were 2 contrasts in the set being tested. Each test can be compared to our desired family wise error rate as usual. Since each test has a p-value greater than 0.05 we conclude that the contrasts are not statistically significant.

Note, for man_v_others the p-value is capped at 1.00 since a p-value cannot exceed 1.00.
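As a quick check in base R, p.adjust() reproduces this doubling (and capping at 1.00) from the rounded, unadjusted p-values reported earlier; the small difference from the 0.13 shown above is due to rounding of the raw p-value:

Code
# Rounded unadjusted p-values from the earlier output
raw_p <- c(man_v_osc = 0.06, man_v_others = 0.74)

# Bonferroni: multiply each p-value by the number of tests, capping at 1
p.adjust(raw_p, method = "bonferroni")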

Scheffé

Similar to how the Bonferroni adjustment was applied, to apply a Scheffé adjustment, change the value of the adjust = argument in the contrast() function from “none” to “scheffe”.

Code
contrast(brush_means, 
         list(man_v_osc = c(1,-1,0,0),
              man_v_others = c(1,-(1/3),-(1/3),-(1/3))),
         adjust="scheffe") |> 

  #kable commands are for formatting the output
  kable(digits = 2) |> 
  kable_styling(full_width = TRUE)
contrast     estimate   SE df t.ratio p.value
man_v_osc        3.12 1.58 20    1.97    0.17
man_v_others     0.44 1.29 20    0.34    0.94


Each contrast’s p-value can be compared to our desired family wise error rate of 0.05. Since each test has a p-value greater than 0.05 we conclude that the contrasts are not statistically significant. Notice the p-values with Scheffé’s adjustment are higher than the p-values with Bonferroni’s adjustment. This will always be the case if the number of contrasts being tested is small.

Tukey

In our toothbrush example, if we want to compare each factor level mean to all the others, we can apply Tukey’s HSD (Honest Significant Difference) adjustment. To do this, instead of inputting custom contrasts to the contrast() command in R, we set the method argument to pairwise. We also need to change the adjust argument to tukey. The output provides an estimate of each pairwise comparison, as well as adjusted p-values.

For a Tukey adjustment to all pairwise comparisons, this base R shortcut can also be used. The arguments to the function are the name of the model and the name of the factor whose level means should be tested.

TukeyHSD(plaque_aov, "Brush")
Code
contrast(brush_means, method = "pairwise", adjust = "tukey")
 contrast                 estimate   SE df t.ratio p.value
 Manual - Oscillating        3.117 1.58 20   1.967  0.2332
 Manual - Sonic              0.418 1.58 20   0.264  0.9933
 Manual - Ultrasonic        -2.220 1.58 20  -1.401  0.5130
 Oscillating - Sonic        -2.698 1.58 20  -1.703  0.3480
 Oscillating - Ultrasonic   -5.337 1.58 20  -3.369  0.0149
 Sonic - Ultrasonic         -2.638 1.58 20  -1.666  0.3670

P value adjustment: tukey method for comparing a family of 4 estimates 

With a p-value of 0.0149, only the Oscillating - Ultrasonic comparison is significant at the 0.05 level. This significance is driven by the large difference in means between the two levels: Ultrasonic’s mean is 5.337 higher than Oscillating’s. All other pairwise comparisons have p-values greater than 0.05 and so are not considered significant.

Confidence Intervals Instead of p-values

If confidence intervals are desired instead of p-values, the output of the contrast() command can be saved to an object, which then becomes the input to the confint() command. This is true for any contrasts computed with emmeans::contrast(), not just Tukey-adjusted pairwise comparisons. The adjustment specified in contrast() is applied to the confidence intervals as well.

Code
for_ci <- contrast(brush_means, method = "pairwise", adjust = "tukey") 
confint(for_ci)
 contrast                 estimate   SE df lower.CL upper.CL
 Manual - Oscillating        3.117 1.58 20    -1.32    7.550
 Manual - Sonic              0.418 1.58 20    -4.02    4.852
 Manual - Ultrasonic        -2.220 1.58 20    -6.65    2.214
 Oscillating - Sonic        -2.698 1.58 20    -7.13    1.735
 Oscillating - Ultrasonic   -5.337 1.58 20    -9.77   -0.903
 Sonic - Ultrasonic         -2.638 1.58 20    -7.07    1.795

Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 4 estimates 

Fisher

First, look at the ANOVA summary table to see if the F test for brush is significant.

Code
summary(plaque_aov)
            Df Sum Sq Mean Sq F value Pr(>F)  
Brush        3  86.31  28.769   3.822 0.0258 *
Residuals   20 150.56   7.528                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The p-value for Brush is significant at the 0.05 level. You can then proceed with an unadjusted test of all pairwise comparisons for the Brush factor. The adjust argument in the contrast() command in this case is none.

Code
contrast(brush_means, method = "pairwise", adjust = "none")
 contrast                 estimate   SE df t.ratio p.value
 Manual - Oscillating        3.117 1.58 20   1.967  0.0632
 Manual - Sonic              0.418 1.58 20   0.264  0.7944
 Manual - Ultrasonic        -2.220 1.58 20  -1.401  0.1764
 Oscillating - Sonic        -2.698 1.58 20  -1.703  0.1040
 Oscillating - Ultrasonic   -5.337 1.58 20  -3.369  0.0031
 Sonic - Ultrasonic         -2.638 1.58 20  -1.666  0.1114


The p-value for Oscillating - Ultrasonic is 0.0031 and is the only significant pairwise comparison at the 0.05 level; though Manual - Oscillating is close, with a p-value of 0.0632.
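As a by-hand check (a sketch, separate from the emmeans workflow), the “least significant difference” that gives the method its name can be computed from the ANOVA table above. MSE = 7.528 and the 20 residual degrees of freedom come straight from that table; n = 6 replicates per brush is implied by the 23 total degrees of freedom (24 observations across 4 levels):

Code
mse <- 7.528    # Mean Sq for Residuals from the ANOVA table
df_resid <- 20  # residual degrees of freedom
n <- 6          # replicates per brush (24 observations / 4 levels)

# Smallest difference between two level means significant at the 0.05 level
(lsd <- qt(0.975, df_resid) * sqrt(2 * mse / n))

The result is roughly 3.3. Only the Oscillating - Ultrasonic difference (5.337) exceeds it, matching the p-values above.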

Bonus: Estimating and Testing Effect Sizes

If no custom contrasts are provided to the contrast() command, and no argument for method is provided, the default output is an estimate (and test) of each factor level effect, as seen below!

Code
(effect_sizes <- contrast(brush_means)) 
 contrast           estimate   SE df t.ratio p.value
 Manual effect        0.3287 0.97 20   0.339  0.9273
 Oscillating effect  -2.7879 0.97 20  -2.874  0.0323
 Sonic effect        -0.0896 0.97 20  -0.092  0.9273
 Ultrasonic effect    2.5488 0.97 20   2.627  0.0323

P value adjustment: fdr method for 4 tests 

Note the behavior of the defaults is a little different than what might be expected. “fdr” stands for “false discovery rate”. This adjustment controls the expected proportion of Type I errors among the tests that are declared significant (e.g. 5% of the significant findings will be Type I errors), which is somewhat less stringent than keeping the family wise Type I error rate fixed.
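If a stricter adjustment is preferred for these effect tests, the default can be overridden through the adjust argument, just as with custom contrasts. A brief sketch (output not shown):

Code
# Override the default "fdr" adjustment with Bonferroni
contrast(brush_means, adjust = "bon")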

You can use confint() to get confidence intervals for the effects (with or without adjustments) as well.

Code
confint(effect_sizes)
 contrast           estimate   SE df lower.CL upper.CL
 Manual effect        0.3287 0.97 20   -2.333    2.991
 Oscillating effect  -2.7879 0.97 20   -5.450   -0.126
 Sonic effect        -0.0896 0.97 20   -2.752    2.573
 Ultrasonic effect    2.5488 0.97 20   -0.113    5.211

Confidence level used: 0.95 
Conf-level adjustment: bonferroni method for 4 estimates