Introducing the T-Distribution

The Problem: We Don’t Know \(\sigma\)

In the previous section, we calculated Confidence Intervals using the Z-score formula:

\[CI = \bar{x} \pm z^* \left( \frac{\sigma}{\sqrt{n}} \right)\]

This works perfectly if we know the true population standard deviation (\(\sigma\)). But in the real world, this is rarely the case.
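As a quick reminder of how the formula turns into numbers, the interval can be computed directly in R. This is only a minimal sketch: the sample values and the "known" \(\sigma = 15\) are made up for illustration and do not come from the course data.

```r
# Hypothetical sample and a population standard deviation we pretend to know
x <- c(52, 61, 47, 58, 66, 49, 55, 63)
sigma <- 15            # assumed known (rarely true in practice)
conf_level <- 0.95

n <- length(x)
x_bar <- mean(x)

# z* is the percentile that leaves (1 - conf_level) / 2 in each tail
z_star <- qnorm(1 - (1 - conf_level) / 2)

margin <- z_star * sigma / sqrt(n)
z_ci <- c(lower = x_bar - margin, upper = x_bar + margin)
z_ci
```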

A reasonable idea is to use the sample standard deviation, \(s\), instead of the population standard deviation, \(\sigma\), which would look like:

\[\text{Reasonable but incorrect CI} = \bar{x} \pm z^* \left( \frac{s}{\sqrt{n}} \right)\]

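Unfortunately, there’s no such thing as a free lunch. Because \(s\) is itself an estimate that varies from sample to sample, we can no longer use percentiles, \(z^*\), from a standard normal distribution to calculate the confidence interval. Doing so produces intervals that are too narrow, especially for small samples.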
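A small simulation can make this concrete. The sketch below assumes a normal population with mean 50 and standard deviation 10 (arbitrary choices), draws many samples of size 5, and checks how often the \(z\)-based interval built with \(s\) captures the true mean. If the interval were valid, that share would be close to 95%.

```r
set.seed(221)          # arbitrary seed so the result is reproducible

mu <- 50               # true mean (assumed for the simulation)
sigma <- 10            # true standard deviation (assumed)
n <- 5                 # small sample size
reps <- 10000          # number of simulated samples

z_star <- qnorm(0.975) # percentile for 95% confidence

covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  margin <- z_star * sd(x) / sqrt(n)
  # TRUE when the interval contains the true mean
  (mean(x) - margin <= mu) && (mu <= mean(x) + margin)
})

coverage_z <- mean(covered)
coverage_z             # noticeably below the nominal 0.95
```

With samples this small, the share typically comes out near 88% rather than 95%, which is the sense in which the interval above is "incorrect".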

Student’s T-Distribution

When we replace \(\sigma\) with \(s\), we introduce extra uncertainty. The \(t\)-distribution was created specifically to account for that additional uncertainty.

The \(t\)-distribution is similar to the standard normal distribution. Both are symmetric and centered at zero, but the \(t\)-distribution changes shape depending on how much data we have. The more data we have, the closer the \(t\)-distribution is to the standard normal distribution.

Technically, the shape of the \(t\)-distribution depends on something called degrees of freedom.

DEFINITION: For a single sample, the degrees of freedom are \(df = n-1\), where \(n\) is the sample size.

Visualizing the Difference

The \(t\)-distribution looks like the standard normal (\(z\)) distribution, but it has heavier tails, meaning values far from zero are more likely. Below are several examples of \(t\)-distributions for different values of degrees of freedom.
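One way such curves could be drawn is with base R's built-in density functions dt() and dnorm(). This is a sketch only: the degrees of freedom shown (2, 5, and 30), the colors, and the object names are arbitrary choices for illustration.

```r
df_values <- c(2, 5, 30)                 # illustrative degrees of freedom
line_colors <- c("firebrick", "darkorange", "steelblue")

# Standard normal curve as the reference (dashed)
curve(dnorm(x), from = -4, to = 4, lwd = 2, lty = 2,
      xlab = "Value", ylab = "Density",
      main = "t-distributions vs. the standard normal")

# Overlay one t curve per degrees-of-freedom value
for (i in seq_along(df_values)) {
  curve(dt(x, df = df_values[i]), from = -4, to = 4,
        add = TRUE, col = line_colors[i], lwd = 2)
}

legend("topright",
       legend = c("Normal (z)", paste("t, df =", df_values)),
       col = c("black", line_colors), lty = c(2, 1, 1, 1), lwd = 2)

# Heavier tails: probability of landing more than 2 units from zero
tail_prob <- c(normal = 2 * pnorm(-2),
               sapply(df_values, function(d) 2 * pt(-2, df = d)))
names(tail_prob)[-1] <- paste0("t_df", df_values)
tail_prob
```

The tail probabilities shrink toward the normal value as the degrees of freedom increase, which is the numerical counterpart of the curves moving closer together.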

Confidence Intervals with the \(t\)-Distribution

We can now write the proper formula for a confidence interval when \(\sigma\) is unknown, using a percentile, \(t^*\), from the \(t\)-distribution:

\[\text{Correct CI} = \bar{x} \pm t^* \left( \frac{s}{\sqrt{n}} \right)\]

where \(t^*\) is selected based on the desired level of confidence. Specifically, \(\pm t^*\) are the cutoffs that enclose a central area equal to the confidence level under a \(t\)-distribution with \(n-1\) degrees of freedom.
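To connect the formula to code, the sketch below computes the interval by hand using qt(), R's percentile function for the \(t\)-distribution. The sample values are hypothetical and only there to make the example runnable.

```r
x <- c(52, 61, 47, 58, 66, 49, 55, 63)   # hypothetical sample
conf_level <- 0.95

n <- length(x)
x_bar <- mean(x)
s <- sd(x)                                # sample standard deviation

# t* leaves (1 - conf_level) / 2 in each tail of a t-distribution with n - 1 df
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)

margin <- t_star * s / sqrt(n)
t_ci <- c(lower = x_bar - margin, upper = x_bar + margin)
t_ci

# For comparison: the z percentile is smaller, so a z-based interval is narrower
z_star <- qnorm(1 - (1 - conf_level) / 2)
c(t_star = t_star, z_star = z_star)
```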

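The following table shows \(t^*\) values for several degrees of freedom and confidence levels. As the degrees of freedom grow, the values approach the corresponding \(z^*\) values in the last row: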

| \(df\) | 90% confidence | 95% confidence | 99% confidence |
|---|---|---|---|
| 5 | 2.015 | 2.571 | 4.032 |
| 10 | 1.812 | 2.228 | 3.169 |
| 50 | 1.676 | 2.009 | 2.678 |
| 100 | 1.660 | 1.984 | 2.626 |
| \(\infty\) (Z-score) | 1.645 | 1.960 | 2.576 |
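The values in the table can be reproduced with qt() and qnorm(). The following sketch rebuilds it; the object names are arbitrary.

```r
df_values <- c(5, 10, 50, 100)
conf_levels <- c(0.90, 0.95, 0.99)

# Upper-tail percentile for each confidence level, e.g. 0.975 for 95%
upper_p <- 1 - (1 - conf_levels) / 2

# Rows are degrees of freedom, columns are confidence levels
t_table <- t(sapply(df_values, function(d) qt(upper_p, df = d)))

# Add the standard normal row, the limit as df grows without bound
t_table <- rbind(t_table, qnorm(upper_p))

dimnames(t_table) <- list(df = c(df_values, "Inf (z)"),
                          confidence = paste0(conf_levels * 100, "%"))
round(t_table, 3)
```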

Confidence Intervals in R

Fortunately, R handles all these calculations automatically. There’s no need to look up these values. For now, it’s only important to understand a little about what’s happening behind the scenes in R.

The t.test() function in R can be used to calculate confidence intervals based on the \(t\)-distribution. In general, we pass in the sample data and specify the confidence level as a decimal:

t.test(data$y, conf.level = 0.95)$conf.int

NOTE: The t.test() function will be used for more than confidence intervals and will provide lots of output that isn’t currently needed. For now, we can use the selector, $, to instruct R to only return the confidence interval and avoid the unnecessary output.
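To see what the $ selector is doing, it can help to look at the object that t.test() returns. The sketch below uses base R's t.test() and a hypothetical numeric vector, y, standing in for data$y.

```r
y <- c(52, 61, 47, 58, 66, 49, 55, 63)   # hypothetical data standing in for data$y

result <- t.test(y, conf.level = 0.95)

names(result)        # all the pieces t.test() computes
result$conf.int      # just the confidence interval

# The interval carries its confidence level as an attribute;
# as.numeric() drops it if only the two endpoints are wanted
as.numeric(result$conf.int)
attr(result$conf.int, "conf.level")
```

The conf.level attribute is why the printed interval is followed by an extra attr(,"conf.level") line.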

Let’s create a 99% confidence interval for the average extroversion score for Math 221 students.

library(tidyverse)
library(mosaic)
library(rio)
library(car)


# Read in data
big5 <- import('https://raw.githubusercontent.com/byuistats/Math221D_Cannon/master/Data/All_class_combined_personality_data.csv')

Using the t.test() function:

t.test(big5$Extroversion, conf.level = 0.99)$conf.int
[1] 54.26631 59.69903
attr(,"conf.level")
[1] 0.99

Interpretation: We are 99% confident that the true mean Extroversion percentile of Math 221 students is between 54.27 and 59.70.

Your Turn

Create a 93% confidence interval for the true mean Neuroticism percentile of Math 221 students by filling in the arguments to t.test():

t.test()