Independent Two-sample T-test

Introduction

In many situations we would like to compare averages from different populations. In these situations, we take 2 random samples from each population and perform statistical tests to determine if the population means are significantly different. Because these two groups of individuals are sampled independently, we call this analysis Independent 2-Sample t-test.

Alternatively, in many experimental designs, participants are randomly assigned into a treatment and a control group. The randomization process ensures that there is no association between participants in either group. They are independent.

When 2 random samples are taken from 2 separate populations, or when a group of people are randomly assigned into treatment groups, the samples are independent.

Some examples include:

Comparing salaries of men and women (randomly sample men and women separately)
Testing a new medication compared to a placebo (participants randomly assigned to treatment groups)
Comparing average GPA of Math majors and Economics majors (randomly select from each population)

Hypothesis Tests for Independent Samples

The null hypothesis test for independent samples is:

$H_{o} : μ_{1} = μ_{2}$ The alternative hypothesis depends on context of the research question and can be, as before:

$H_{a} : μ_{1} (>, <, \neq) μ_{2}$

Independent t-test in R

A 2-sample independent t-test in R requires a slight modification to the 1-sample t-test previously discussed. However, the syntax should look familiar.

Recall that when we created a boxplot() or did favstats() for one set of data it looked like:

boxplot(data$response_variable)
favstats(data$response_variable)

with data$response_variable corresponding to our quantitative variable of interest.

When we wanted to break the analysis down by a grouping factor we used the ~ notation to add a group variable:

boxplot(data$response_variable ~ data$grouping_variable)
favstats(data$response_variable ~ data$grouping_variable)

We use the exact same modification for a t-test with 2 groups:

t.test(data$response_variable ~ data$grouping_variable, alterntive = "greater")

The t.test() function uses mu=0 as a default, we do not need to specify it in the function because when comparing 2 groups, the difference is assumed to be 0 in most cases.

When specifying an alternative hypothesis in R, a reference group must be defined. This serves as the baseline for comparison. R will assess how the mean of the second group differs from that of the first.

By default, R decides which group is the reference group alphabetically. Whichever category label is first alphabetically is the reference group.

NOTE: In R, group 1 and 2 are determined alphabetically according to the labels in the explanatory variable column.

EXAMPLE: If my explanatory variable is sex, and they are labeled “Female” and “Male”, then “Female” would be the reference group. If I suspect that Females had smaller feet than males I would use alternative = "less".

Checking Requirements

When we looked at 1-sample t-tests, we had to check that the sample mean was normally distributed. Recall that the sample mean is approximately normal if:

The source population is normally distributed
The sample size, $n \geq 30$

For a 2-sample t-test, we must check that this is true for both samples. If both sample sizes are bigger than 30 then we can trust the CLT and the statistics based on it.

If the sample sizes are small, we can make a qqPlot() for both groups:

qqPlot(data$response_variable ~ data$explanatory_variable)

Confidence Intervals

Recall that confidence intervals are necessarily two-sided. So the code for a 99% confidence interval looks like:

t.test(data$response_variable ~ data$grouping_variable, conf.level = .99)$conf.int

We interpret a confidence interval for the difference of means as follows:

I am 99% confident that the true difference of the means is between [lower limit] and [upper limit].

We can usually do better within the context of a research question:

Class 1 did, on average, between 3.21 and 5.67 percent better than class 2 on the last exam.
Store A outperforms Store B by between $27,022 and $36,977 on average

Practice Together

Does the frequency of heart attacks increase during the World Cup?

The number of heart attacks in the Greater Munich area was measured before and during the period when Germany hosted the FIFA World Cup. This study was published in the New England Journal of Medicine

Step 1: Read in Data

# Load Libraries

library(tidyverse)
library(mosaic)
library(rio)
library(car)


# Load Data

fifa_heart_attacks <- import("https://byuistats.github.io/M221R/Data/fifa_heart_attacks.xlsx")

Step 2: Review Data

Look at the data.

Create summary statistics tables of the number of heart attacks for each group.

Create a side-by-side boxplot for the during the World Cup and the Control.

Do you notice any outliers or data that may need to be omitted for analysis?

Check to see if the means from both groups are normally distributed:

Is n > 30 for both groups?
Create a qqPlot()

qqPlot(fifa_heart_attacks$heart_attacks ~ fifa_heart_attacks$time_period)

Can we trust that the central limit theorem applies?

Step 3: Prepare Data for Analysis

These data look ready for analysis.

Step 4: Perform the appropriate analysis

Hypothesis Test

Are the individuals in each group dependent or independent of each other?

Write out your null and alternative hypotheses.

Ho: Ha:

Which group is considered group 1 and which is group 2 in R?

Check the alphabetical order:

unique(fifa_heart_attacks$time_period)

[1] "Control"   "World Cup"

Perform the appropriate t-test.

What is your test statistic?

What is your p-value?

State your conclusion:

97% Confidence interval

Calculate the 97% confidence interval for the difference of the means.

In context of the research question, interpret the confidence interval.

Practice with Classmates

New Zealand Rugby

Rugby is a popular sport in the United Kingdom, France, Australia, New Zealand and South Africa. It is gaining popularity in the US, Canada, Japan and parts of Europe. Some of the rules of the game have recently been changed to make play more exciting. In a study to examine the effects of the rule changes, Hollings and Triggs (1993) collected data on some recent games.

Typically, a game consists of bursts of activity that terminate when points are scored, if the ball is moved out of the field of play or if a violation of the rules occurs. In 1992, the investigators gathered data on ten international matches which involved the New Zealand national team, the All Blacks. The first five games were the last international games played under the old rules, and the second set of five were the first internationals played under the new rules.

For each of the ten games, the data give the successive times (in seconds) of each passage of play in that game.

You will investigate whether the mean duration of the passages has dropped under the new rules.

Use a level of significance of 0.01.

Load the Data

rugby <- import("https://byuistats.github.io/M221R/Data/quiz/R/nz_rugby.csv")

Explore the Data

Create a side-by-side boxplot for the amount of reported passage of play before and after the rule changes.

Add a title and change the colors of the boxes.

What do you observe?

Create a table of summary statistics of play time for before and after the rule change. (favstats()):

Perform the Appropriate Analysis

Hypothesis Test

State your null and alternative hypotheses:

NOTE: The default for R is to set group order alphabetically. This means Group 1 = NewRules

Compare the the time per play under the new and old rules:

qqPlot(rugby$time, groups = rugby$period)

Do the data for each group appear normally distributed?

Why is it OK to continue with the analysis?

Perform a t-test.

What is the value of the test statistic?

How many degrees of freedom for this test?

What is the p-value?

What do you conclude?

Confidence Interval

Create a confidence interval for the difference of the average Importance Score between both groups:

Chronic Obstructive Pulmonary Disease Treatment

The CDC provided the following information about COPD:

“Chronic obstructive pulmonary disease, or COPD, refers to a group of diseases that cause airflow blockage and breathing-related problems. It includes emphysema and chronic bronchitis. COPD makes breathing difficult for the 16 million Americans who have this disease. (Source: https://www.cdc.gov/copd, accessed December 1, 2022.)”

A study was conducted in which COPD patients walked as many steps as they could. They were then randomly assigned to either a hospital-based or community-based treatment program. At the conclusion of the program, the number of steps the patients could walk without stopping was measured again. The difference in the number of steps (post - pre) is recorded in the data frame copd_rehab.

Step 1: Read in the data

copd <- import("https://byuistats.github.io/M221R/Data/copd_rehab.xlsx") %>% pivot_longer(cols=c("community", "hospital"), names_to = "Treatment", values_to = "Steps") %>% select(Treatment, Steps) %>% arrange(Treatment)

Step 2: Review the data

Create side-by-side boxplots and summary statistics for the community and hospital groups:

Check to see if the means are expected to be normally distributed.

Can trust the CLT for our test statistic and P-value?

Step 3: Prepare Data for Analysis

The data cleansing has been performed for you. You’re welcome.

Step 4: Perform the appropriate analysis

Hypothesis Test

State your null and alternative hypotheses.

Ho:

Ha:

Which group is considered group 1 in this data?

Run the appropriate t-test.

#t.test()

State your conclusion about the hypothesis test.

Confidence Interval

Create a 95% confidence interval for the difference between the means

Interpret the 95% confidence interval for the mean difference between the community-based and hospital-based groups.