library(tidyverse)
library(mosaic)
library(rio)
library(car)
# Read in data
big5 <- import('https://raw.githubusercontent.com/byuistats/Math221D_Cannon/master/Data/All_class_combined_personality_data.csv')
Introducing the T-Distribution
The Problem: We Don’t Know \(\sigma\)
In the previous section, we calculated Confidence Intervals using the Z-score formula:
\[CI = \bar{x} \pm z^* \left( \frac{\sigma}{\sqrt{n}} \right)\]
This works perfectly if we know the true population standard deviation (\(\sigma\)). But in the real world, this is rarely the case.
A reasonable idea is to use the sample standard deviation, \(s\), instead of the population standard deviation, \(\sigma\), which would look like:
\[\text{Reasonable but incorrect CI} = \bar{x} \pm z^* \left( \frac{s}{\sqrt{n}} \right)\]
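Unfortunately, there’s no such thing as a free lunch. Because \(s\) is only an estimate of \(\sigma\), intervals built this way tend to be too narrow, especially for small samples. We can no longer use percentiles, \(z^*\), from a standard normal distribution to calculate the confidence interval.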
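A small simulation can illustrate the problem. This is only a sketch under assumed settings that do not come from the class data: a normal population with mean 50 and standard deviation 10, samples of size 5, and 10,000 repetitions. It builds the "reasonable but incorrect" 95% interval each time and records how often the interval captures the true mean.

```r
set.seed(221)                 # arbitrary seed so the run is repeatable
mu <- 50; sigma <- 10         # assumed population values (hypothetical)
n <- 5                        # deliberately small sample size
z_star <- qnorm(0.975)        # about 1.96 for 95% confidence

covers <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  margin <- z_star * sd(x) / sqrt(n)   # uses s in place of sigma
  (mean(x) - margin <= mu) && (mu <= mean(x) + margin)
})

coverage <- mean(covers)      # proportion of intervals containing mu
coverage
```

With samples this small, the proportion comes out noticeably below the advertised 0.95, which is the extra uncertainty the next section addresses.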
Student’s T-Distribution
When we replace \(\sigma\) with \(s\), we introduce extra uncertainty. The \(t\)-distribution was created specifically to account for that additional uncertainty.
The \(t\)-distribution is similar to the standard normal distribution. Both are symmetric and centered at zero, but the \(t\)-distribution changes shape depending on how much data we have. The more data we have, the closer the \(t\)-distribution is to the standard normal distribution.
Technically, the shape of the \(t\)-distribution depends on something called degrees of freedom.
DEFINITION: For a confidence interval for a single mean, the degrees of freedom are \(df = n-1\), where \(n\) is our sample size.
Visualizing the Difference
The \(t\)-distribution looks like the standard normal (\(z\)) distribution, but it has heavier tails, so values far from zero are more likely. Below are several examples of \(t\)-distributions for different values of degrees of freedom.
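One way to draw such a comparison yourself is sketched below using base R graphics. The degrees of freedom (2, 5, and 30) and the colors are arbitrary choices for illustration, not necessarily the ones used in the original figure.

```r
# Standard normal density as the reference curve
curve(dnorm(x), from = -4, to = 4, lwd = 2,
      xlab = "Value", ylab = "Density",
      main = "t-distributions vs. the standard normal")

# Overlay t densities for a few illustrative degrees of freedom
dfs  <- c(2, 5, 30)
cols <- c("red", "orange", "blue")
for (i in seq_along(dfs)) {
  curve(dt(x, df = dfs[i]), add = TRUE, col = cols[i], lty = 2)
}

legend("topright",
       legend = c("Normal", paste("t, df =", dfs)),
       col = c("black", cols), lty = c(1, 2, 2, 2), lwd = c(2, 1, 1, 1))

# Area in the upper tail beyond 3 for each curve
tail_normal <- pnorm(3, lower.tail = FALSE)
tail_t <- pt(3, df = dfs, lower.tail = FALSE)
```

The tail areas in `tail_t` shrink toward `tail_normal` as the degrees of freedom grow, which is the "heavier tails" idea expressed numerically.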

Confidence Intervals with the \(t\)-Distribution
We can now write the proper formula for a confidence interval when \(\sigma\) is unknown, using a percentile, \(t^*\), from the \(t\)-distribution:
\[\text{Correct CI} = \bar{x} \pm t^* \left( \frac{s}{\sqrt{n}} \right)\]
where \(t^*\) is selected based on the desired level of confidence. We use the values \(\pm t^*\) that enclose the desired central area under the curve of a \(t\)-distribution with the appropriate number of degrees of freedom, \(n-1\).
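To see the formula at work, the sketch below computes a 95% interval by hand. The eight data values are made up purely for illustration, and `qt()` is used to find \(t^*\).

```r
x <- c(52, 61, 47, 58, 66, 49, 55, 60)   # hypothetical sample
n <- length(x)
conf <- 0.95

x_bar <- mean(x)                          # sample mean
s <- sd(x)                                # sample standard deviation

# t* leaves (1 - conf)/2 in each tail of a t-distribution with n - 1 df
t_star <- qt(1 - (1 - conf) / 2, df = n - 1)

margin <- t_star * s / sqrt(n)            # margin of error
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
ci
```

Changing `conf` to another level, such as 0.99, changes only `t_star` and therefore the width of the interval.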
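The following table shows \(t^*\) values for several degrees of freedom and confidence levels. As \(df\) grows, \(t^*\) approaches the corresponding \(z^*\) value in the last row: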
| \(df\) | \(\text{90\% Conf}\) | \(\text{95\% Conf}\) | \(\text{99\% Conf}\) |
|---|---|---|---|
| 5 | 2.015 | 2.571 | 4.032 |
| 10 | 1.812 | 2.228 | 3.169 |
| 50 | 1.676 | 2.009 | 2.678 |
| 100 | 1.660 | 1.984 | 2.626 |
| \(\infty\) (Z-score) | 1.645 | 1.960 | 2.576 |
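The entries in this table can be reproduced with R's `qt()` and `qnorm()` functions. The following sketch is one way to do so; the object names are arbitrary.

```r
dfs  <- c(5, 10, 50, 100)
conf <- c(0.90, 0.95, 0.99)
upper_p <- 1 - (1 - conf) / 2          # upper percentile for each level

# One row per df, one column per confidence level
t_table <- t(sapply(dfs, function(d) qt(upper_p, df = d)))
t_table <- rbind(t_table, qnorm(upper_p))   # limiting normal (z) values
dimnames(t_table) <- list(c(dfs, "Inf (z)"), paste0(conf * 100, "%"))

round(t_table, 3)
```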
Confidence Intervals in R
Fortunately, R handles all these calculations automatically. There’s no need to look up these values. For now, it’s only important to understand a little about what’s happening behind the scenes in R.
The t.test() function in R can be used to calculate confidence intervals based on the \(t\)-distribution. Generically, we input the sample data we want to use to create a confidence interval for the population mean, and specify the confidence level (as a decimal) as follows:
t.test(data$y, conf.level = 0.95)$conf.int
NOTE: The t.test() function will be used for more than confidence intervals and will provide lots of output that isn’t currently needed. For now, we can use the selector, $, to instruct R to only return the confidence interval and avoid the unnecessary output.
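To see what `$conf.int` is selecting from, the sketch below stores the full result of `t.test()` and pulls out individual pieces. It uses `mtcars$mpg`, a dataset that ships with R, purely as a stand-in; it is not part of this lesson's data.

```r
result <- t.test(mtcars$mpg, conf.level = 0.95)   # stand-in data

names(result)        # all the components t.test() returns
result$conf.int      # just the confidence interval
result$estimate      # the sample mean, x-bar

ci <- as.numeric(result$conf.int)   # drop the conf.level attribute
```

Storing the result first is handy when more than one piece of the output is needed later.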
Let’s create a 99% confidence interval for the average extroversion score for Math 221 students.
Using the t.test() function:
t.test(big5$Extroversion, conf.level = 0.99)$conf.int
[1] 54.26631 59.69903
attr(,"conf.level")
[1] 0.99
Interpretation: We are 99% confident that the true mean Extroversion percentile of Math 221 students is between 54.27 and 59.70.
Your Turn
Create a 93% confidence interval for the true mean Neuroticism percentile of Math 221 students by filling in the arguments to t.test():
t.test()
Error in t_test.default(): argument "x" is missing, with no default