```r
library(readr)   # provides read_csv()

# Read in the data and fit the one-way ANOVA model
df <- read_csv("data/toothpaste_BF2.csv")
plaque_aov <- aov(Plaque ~ Brush, data = df)
```
This page explains the approach to dealing with multiple contrasts: specifically, whether and how to adjust the tests of multiple contrasts to account for an inflated family wise error rate. A few techniques are described, with a focus on when to use which technique. Lastly, instructions and example code are provided for carrying out each technique in R.
In most good experiments, researchers are interested in more than just one contrast. Conducting multiple tests on the levels of a factor can inflate the family wise Type I error rate, as illustrated in the “Multiple T-tests” section of the ANOVA page. There is considerable disagreement among statisticians about how to approach the issue of multiple tests. The debate primarily focuses on whether or not to proactively take steps to mitigate the inflated family wise error rate. To further complicate matters, if an adjustment is desired, there are multiple techniques to choose from. An exhaustive presentation of the arguments on either side is not attempted in this text, though a few of the key ideas will naturally emerge as the different approaches are discussed.
The terms “contrast” and “comparison” are treated here as synonymous. Comparison is used more often to refer to testing a difference in factor level means. Read Contrasts for more explanation of these terms.
Many statisticians believe that no adjustment is necessary when testing multiple contrasts. Most would qualify this bold claim by adding that it holds only (or especially) if there are just a few, pre-planned contrasts to be tested. The argument is stronger if the set of contrasts is orthogonal, since orthogonality places a natural limit on the number of contrasts and ensures a clean partitioning of the sum of squares.
The main idea behind this approach is more philosophical than mathematical. Proponents of no adjustment argue that a handful of pre-planned contrasts are interesting in isolation and should be tested that way. Thus, the statistical significance of a contrast should not be influenced by characteristics of the experimental design that have no bearing on whether the relationship exists in the real world. For example, how many factor levels exist and how many other contrasts are planned should not affect your ability to find significance in a premeditated comparison.
Making no adjustment to the tests increases the likelihood a researcher will find statistical significance among the contrasts. The danger is that a researcher may be tempted to abuse this fact by planning to test many contrasts. Such a researcher has likely not taken the time to distill the key hypotheses the experiment is meant to answer. A large collection of contrasts to test probably indicates the experiment is being used to generate hypotheses rather than test them. An exploratory exercise of this nature should be treated differently and is not in harmony with the philosophy of those who subscribe to “no adjustment”.
There is a range of scenarios where multiple contrasts are used: from just a few pre-planned contrasts, to a completely exploratory analysis, to somewhere in between. There are many adjustment techniques to choose from; here we present just four:
The Bonferroni adjustment is best for a handful of pre-planned contrasts. As the number of contrasts grows, the adjustment quickly becomes too conservative (i.e. makes it too hard to find significance). In that case, other methods may strike a better balance between Type I and Type II errors.
A Type I error occurs when a true null hypothesis is rejected. A Type II error occurs when a false null hypothesis is not rejected.
It can be shown that in order to not exceed a desired family wise error rate of \(\alpha_\text{fw}\), for a set of \(k\) contrasts, each individual contrast should be tested at a significance level of \(\frac{\alpha_\text{fw}}{k}\). For example, if we desire a family wise error rate of 0.05 and plan to do 5 tests, each individual contrast must have a p-value less than \(\frac{0.05}{5} = 0.01\) to be considered significant. The test itself has not changed, only the benchmark p-value for claiming significance. This adjustment is easy to understand and to calculate.
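As a quick sketch of the arithmetic (the p-values below are hypothetical), comparing each raw p-value to \(\frac{\alpha_\text{fw}}{k}\) is equivalent to multiplying each p-value by \(k\) (capped at 1) and comparing to \(\alpha_\text{fw}\), which is what base R’s `p.adjust()` does:

```r
# Hypothetical p-values from k = 5 pre-planned contrasts
p_raw <- c(0.003, 0.012, 0.030, 0.200, 0.650)

# Compare each raw p-value to alpha_fw / k = 0.05 / 5 = 0.01
p_raw < 0.05 / 5
#> [1]  TRUE FALSE FALSE FALSE FALSE

# Equivalently, multiply each p-value by k (capping at 1) and compare to 0.05
p.adjust(p_raw, method = "bonferroni") < 0.05
#> [1]  TRUE FALSE FALSE FALSE FALSE
```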
Using experimental results to suggest which contrasts to test is often referred to as “data snooping” or exploratory analysis. For example, you may look at graphical and numerical summaries to see which means (or combinations of means) will be promising to test. The problem with this approach is that you have essentially done a quick, informal test of many differences when you looked at descriptive statistics and plots. In other words, you have already tested the means and combinations of the means in your mind. The Type I error rate of the contrasts will be different (higher) than stated because you are only formally applying the test to those that look significant already.
Scheffé’s method is most useful when many contrasts are tested, especially when going beyond pairwise comparisons of factor level means to test combined factor levels (i.e. complex comparisons). In data exploration, there are theoretically an infinite number of contrasts you could test. This is not a problem for Scheffé’s adjustment because, unlike other adjustment techniques, Scheffé’s adjustment does not depend on the number of comparisons to be made. Rather, it is determined only by the number of factor levels and the number of observations.
The F statistic used to test a contrast in Scheffé’s adjustment is related to the omnibus F test for the factor itself, and is given by:
\[ F_\text{Scheffé} = (k-1) \cdot F \]
where \(F\) is the statistic for the F test of the factor as usual, and \(k\) is the number of levels in that factor. \(F_\text{Scheffé}\) has \(df_\text{numerator} = k-1\) and \(df_\text{denominator} = df_\text{residual}\). Scheffé’s test output is often given in terms of a t test; recall that the t statistic is simply the square root of the F statistic.
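One way to view Scheffé’s adjustment is as an inflated critical value: the usual F critical value is multiplied by \(k-1\). Below is a sketch using numbers that appear in the toothbrush output later on this page (4 factor levels, 20 residual degrees of freedom):

```r
k <- 4          # number of factor levels
df_resid <- 20  # residual degrees of freedom from the ANOVA

# Usual critical value for an F test at alpha = 0.05
f_crit <- qf(0.95, df1 = k - 1, df2 = df_resid)

# Scheffé critical value: a contrast must exceed this to be significant
(k - 1) * f_crit   # roughly 9.3 for these inputs
```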
If you are interested in only a specific set of hypotheses (all pairwise comparisons, or all treatment levels vs. control, or all levels compared to the “best” level, etc.), there may be another adjustment technique that provides better statistical power. Two of these are described below.
Whether it is an exploratory analysis or a pre-planned set of contrasts, many researchers want to test all factor level means against each other. This is usually referred to as testing all pairwise comparisons. Because this situation is so common, two approaches are described below.
If a multiple comparison adjustment is desired for testing all pairwise comparisons, a standard approach is to apply Tukey’s Honest Significant Difference (HSD) technique. Occasionally, in the case of few factor levels, Bonferroni’s adjustment may result in more significant findings. If that is the case, use Bonferroni’s correction instead.
The calculation for this test is based on the distribution of \(Q\). The test statistic \(Q\) follows what is called the studentized range distribution. “Range” refers to the numerator, where the difference between a maximum and a minimum is calculated. “Studentized” refers to dividing by the estimated standard error, a standardizing technique famously employed in Student’s t test (“Student” being the pen name of William Gosset). \(Q\) is calculated as:
\[ Q = \frac{\max(T_i) - \min(T_i)}{\sqrt{\frac{MSE}{n}}} \tag{1}\]

where \(T_i\) is the mean of the \(i^\text{th}\) factor level and \(n\) is the number of observations in each level (assuming equal group sizes).
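Base R exposes the studentized range distribution through `ptukey()` and `qtukey()`. As a sketch using the toothbrush numbers that appear later on this page (4 levels, 20 residual degrees of freedom, MSE = 7.528, n = 6 observations per level):

```r
k <- 4          # number of factor level means being compared
df_resid <- 20  # residual degrees of freedom
mse <- 7.528    # mean squared error from the ANOVA table
n <- 6          # observations per factor level

# Critical value of Q at a family wise alpha of 0.05
q_crit <- qtukey(0.95, nmeans = k, df = df_resid)

# The "honest significant difference": two means must differ by more
# than this amount to be declared significantly different
q_crit * sqrt(mse / n)   # roughly 4.4 for these inputs
```

Consistent with the Tukey results shown later on this page, only the Oscillating vs. Ultrasonic difference (5.337) exceeds this threshold.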
Making adjustments for multiple contrasts is a conservative approach, meaning it is more difficult to claim significance compared to when no adjustment is made. Though these adjustments prevent understating the probability of Type I error, they increase the probability of Type II error for each individual contrast. In exploratory analysis where you are looking for hints of what to study further, a Type II error may be a greater concern than Type I.
For example, consider a screening study intended to narrow down the number of factors studied in a future experiment. In this case, accidentally ruling out something early on that actually does have a significant impact on the response is more egregious than letting a non-significant factor through to the next experiment, where its non-significance will be discovered.
Fisher’s Least Significant Difference (LSD) employs no adjustment at all to the pairwise comparisons. However, before proceeding to test pairwise comparisons, the F test for the factor must be significant. Using the less powerful F test as a gatekeeper to the more powerful pairwise t tests serves as partial protection against extreme Type I error inflation.
In summary, Fisher’s LSD is a two step process. First, verify the F test for the factor is significant. Second, if it is, proceed with all pairwise comparisons without any adjustment. Fisher’s LSD tends to be used in studies where many factors are present, especially screening/exploratory studies.
Adjustment | When to Use | Implications |
---|---|---|
Bonferroni | A small number of pre-planned contrasts | Too difficult to find significance if many contrasts are tested. |
Scheffé | Post-hoc contrasts/data snooping, particularly useful for complex comparisons | Most severe adjustment. Difficult to find significance, but keeps Type I error in check. |
Tukey’s HSD | When testing all pairwise comparisons and want to control family wise significance level | A standard adjustment for all pairwise comparisons: pre-planned or post-hoc. |
Fisher’s LSD | When testing all pairwise comparisons and willing to accept higher Type I error, e.g. in a screening study | Easier to find significance than if an adjustment is applied, but higher risk of Type I error. |
For most adjustment techniques, before testing the contrasts the data needs to be read in and a model created. The adjustment techniques will be illustrated using data from the toothbrush BF[1] example.
There are many different commands and packages that handle adjustments to account for multiple contrast testing. The `contrast()` statement from the `emmeans` package is used here to reduce the number of new commands you need to learn. This also keeps the set of commands consistent with what was previously described in the Contrasts page.
Before performing any contrasts or comparisons, save the `emmeans()` output to an object, which we name `brush_means`.
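A minimal sketch of that setup, assuming the `plaque_aov` model created at the top of this page:

```r
library(emmeans)

# Estimated marginal means for each level of Brush
brush_means <- emmeans(plaque_aov, "Brush")
```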
There are two ways to implement the Bonferroni adjustment, both illustrated below using the contrasts that were tested in the R Instructions section of the Contrasts page.
First, the Bonferroni adjustment is shown using the output for a test of contrasts without adjustment. The code and output from the R Instructions section of the Contrasts page are provided again here for convenience. These two contrasts test the mean of the manual brush against the oscillating brush mean, as well as the manual brush mean against the mean of all the other brushes combined.
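A sketch of that call is below. It assumes the factor levels are ordered alphabetically (Manual, Oscillating, Sonic, Ultrasonic), with the contrast coefficients written accordingly:

```r
contrast(brush_means,
         method = list(man_v_osc    = c(1, -1, 0, 0),
                       man_v_others = c(1, -1/3, -1/3, -1/3)),
         adjust = "none")
```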
contrast | estimate | SE | df | t.ratio | p.value |
---|---|---|---|---|---|
man_v_osc | 3.12 | 1.58 | 20 | 1.97 | 0.06 |
man_v_others | 0.44 | 1.29 | 20 | 0.34 | 0.74 |
A new alpha level against which to compare the p-values can be calculated by hand. Since there are two tests, if the intent is to keep the family wise error rate at 0.05, the alpha level for each individual test is \(\frac{0.05}{2} = 0.025\). To be considered significant, a contrast’s p-value must be less than 0.025.
`man_v_osc` has a p-value of 0.06; `man_v_others` has a p-value of 0.74. Since neither contrast has a p-value less than 0.025, we conclude that the contrasts are not statistically significant.
Second, the Bonferroni adjustment is applied by changing the value of the `adjust =` argument in the `contrast()` function from `"none"` to `"bon"`.
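Assuming the same contrast coefficients as in the sketch above, only the `adjust` argument changes:

```r
contrast(brush_means,
         method = list(man_v_osc    = c(1, -1, 0, 0),
                       man_v_others = c(1, -1/3, -1/3, -1/3)),
         adjust = "bon")   # Bonferroni adjustment
```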
contrast | estimate | SE | df | t.ratio | p.value |
---|---|---|---|---|---|
man_v_osc | 3.12 | 1.58 | 20 | 1.97 | 0.13 |
man_v_others | 0.44 | 1.29 | 20 | 0.34 | 1.00 |
In these results the p-values have simply been multiplied by 2 because there were 2 contrasts in the set being tested. Each test can be compared to our desired family wise error rate as usual. Since each test has a p-value greater than 0.05 we conclude that the contrasts are not statistically significant.
Note, for `man_v_others` the p-value is capped at 1.00, since a p-value cannot exceed 1.00.
Similar to how the Bonferroni adjustment was applied, to apply a Scheffé adjustment change the value of the `adjust =` argument in the `contrast()` function from `"none"` to `"scheffe"`.
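Again assuming the same contrast coefficients, a sketch:

```r
contrast(brush_means,
         method = list(man_v_osc    = c(1, -1, 0, 0),
                       man_v_others = c(1, -1/3, -1/3, -1/3)),
         adjust = "scheffe")   # Scheffé adjustment
```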
contrast | estimate | SE | df | t.ratio | p.value |
---|---|---|---|---|---|
man_v_osc | 3.12 | 1.58 | 20 | 1.97 | 0.17 |
man_v_others | 0.44 | 1.29 | 20 | 0.34 | 0.94 |
Each contrast’s p-value can be compared to our desired family wise error rate of 0.05. Since each test has a p-value greater than 0.05 we conclude that the contrasts are not statistically significant. Notice the p-values with Scheffé’s adjustment are higher than the p-values with Bonferroni’s adjustment. This will always be the case if the number of contrasts being tested is small.
In our toothbrush example, if we want to compare each factor level mean to the others, we can apply Tukey’s HSD (Honest Significant Difference) adjustment. To do this, instead of inputting custom contrasts to the `contrast()` command in R, we set the `method` argument to `pairwise`. We also need to change the `adjust` argument to `tukey`. The output provides an estimate of each pairwise comparison, as well as adjusted p-values.
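A sketch of the call (for pairwise comparisons, Tukey is also the emmeans default adjustment):

```r
contrast(brush_means, method = "pairwise", adjust = "tukey")
```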
contrast estimate SE df t.ratio p.value
Manual - Oscillating 3.117 1.58 20 1.967 0.2332
Manual - Sonic 0.418 1.58 20 0.264 0.9933
Manual - Ultrasonic -2.220 1.58 20 -1.401 0.5130
Oscillating - Sonic -2.698 1.58 20 -1.703 0.3480
Oscillating - Ultrasonic -5.337 1.58 20 -3.369 0.0149
Sonic - Ultrasonic -2.638 1.58 20 -1.666 0.3670
P value adjustment: tukey method for comparing a family of 4 estimates
With a p-value of 0.0149, only the `Oscillating - Ultrasonic` contrast is significant at the 0.05 level. This significance is driven by the large difference in means between the two levels: Ultrasonic’s mean is 5.337 higher than Oscillating’s. All other pairwise comparisons have p-values greater than 0.05 and so are not considered significant.
If confidence intervals are desired instead of p-values, the output of the `contrast()` command can be saved to an object, which then becomes the input to the `confint()` command. This is true for ANY contrasts computed with `emmeans::contrast()`, not just Tukey and not just pairwise comparisons. The adjustment specified in `contrast()` is then applied to the confidence interval as well.
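A sketch, using `tukey_pairs` as an arbitrary name for the saved object:

```r
# Save the adjusted comparisons, then request intervals instead of p-values
tukey_pairs <- contrast(brush_means, method = "pairwise", adjust = "tukey")
confint(tukey_pairs)
```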
contrast estimate SE df lower.CL upper.CL
Manual - Oscillating 3.117 1.58 20 -1.32 7.550
Manual - Sonic 0.418 1.58 20 -4.02 4.852
Manual - Ultrasonic -2.220 1.58 20 -6.65 2.214
Oscillating - Sonic -2.698 1.58 20 -7.13 1.735
Oscillating - Ultrasonic -5.337 1.58 20 -9.77 -0.903
Sonic - Ultrasonic -2.638 1.58 20 -7.07 1.795
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 4 estimates
First, look at the ANOVA summary table to see if the F test for `Brush` is significant.
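One way to produce that table, using the model fit at the top of this page:

```r
summary(plaque_aov)
```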
Df Sum Sq Mean Sq F value Pr(>F)
Brush 3 86.31 28.769 3.822 0.0258 *
Residuals 20 150.56 7.528
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for `Brush` is significant at the 0.05 level. You can then proceed with an unadjusted test of all pairwise comparisons for the `Brush` factor. The adjustment argument in the `contrast()` command in this case is `none`.
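A sketch of the unadjusted call:

```r
contrast(brush_means, method = "pairwise", adjust = "none")
```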
contrast estimate SE df t.ratio p.value
Manual - Oscillating 3.117 1.58 20 1.967 0.0632
Manual - Sonic 0.418 1.58 20 0.264 0.7944
Manual - Ultrasonic -2.220 1.58 20 -1.401 0.1764
Oscillating - Sonic -2.698 1.58 20 -1.703 0.1040
Oscillating - Ultrasonic -5.337 1.58 20 -3.369 0.0031
Sonic - Ultrasonic -2.638 1.58 20 -1.666 0.1114
The p-value for `Oscillating - Ultrasonic` is 0.0031 and is the only significant pairwise comparison at the 0.05 level, though `Manual - Oscillating` is close, with a p-value of 0.0632.
If no custom contrasts are provided to the `contrast()` command, and no argument for `method` is provided, the default output is an estimate (and test) of each factor level effect, as seen below.
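In other words, a minimal call like the sketch below produces these effect estimates:

```r
contrast(brush_means)
```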
contrast estimate SE df t.ratio p.value
Manual effect 0.3287 0.97 20 0.339 0.9273
Oscillating effect -2.7879 0.97 20 -2.874 0.0323
Sonic effect -0.0896 0.97 20 -0.092 0.9273
Ultrasonic effect 2.5488 0.97 20 2.627 0.0323
P value adjustment: fdr method for 4 tests
Note the behavior of the defaults is a little different than what might be expected. “fdr” stands for “false discovery rate”. It adjusts p-values to control the expected proportion of significant findings that are actually Type I errors (e.g. no more than 5% of the significant results are expected to be false positives). This is somewhat less stringent than keeping the family wise Type I error rate fixed.
You can use `confint()` to get confidence intervals for the effects (with or without adjustments) as well.
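A sketch; note in the output below that emmeans switches from “fdr” to a Bonferroni adjustment for the intervals, since an FDR adjustment does not translate to confidence intervals:

```r
confint(contrast(brush_means))
```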
contrast estimate SE df lower.CL upper.CL
Manual effect 0.3287 0.97 20 -2.333 2.991
Oscillating effect -2.7879 0.97 20 -5.450 -0.126
Sonic effect -0.0896 0.97 20 -2.752 2.573
Ultrasonic effect 2.5488 0.97 20 -0.113 5.211
Confidence level used: 0.95
Conf-level adjustment: bonferroni method for 4 estimates