Introducing the T-Distribution

The Problem: We Don’t Know $\sigma$

In the previous section, we calculated Confidence Intervals using the Z-score formula:

\[CI = \bar{x} \pm z^* \left( \frac{\sigma}{\sqrt{n}} \right)\]

This works perfectly if we know the true population standard deviation ($\sigma$). But in the real world, this is rarely the case.

A reasonable idea is to use the sample standard deviation, $s$, instead of the population standard deviation, $\sigma$, which would look like:

\[\text{Reasonable but incorrect CI} = \bar{x} \pm z^* \left( \frac{s}{\sqrt{n}} \right)\]

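Unfortunately, there’s no such thing as a free lunch. Because $s$ is itself an estimate that varies from sample to sample, we can no longer use percentiles, $z^*$, from a standard normal distribution to calculate the confidence interval. Doing so produces intervals that are too narrow, especially when the sample is small.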
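One way to see the problem is with a small simulation. The sketch below is illustrative only: the population mean, standard deviation, sample size, and seed are made-up values, not taken from the course data. It builds many "reasonable but incorrect" intervals using $z^*$ with $s$ and counts how often they contain the true mean.

```r
# Simulation sketch: how often does the "z* with s" interval capture the true mean?
# All values below are hypothetical choices for illustration.
set.seed(221)            # arbitrary seed so the run is repeatable
mu    <- 50              # assumed true population mean
sigma <- 10              # assumed true population SD (unknown in practice)
n     <- 5               # deliberately small sample size
reps  <- 10000           # number of simulated samples

z_star <- qnorm(0.975)   # z* for a nominal 95% interval

covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)       # draw one sample
  half_width <- z_star * sd(x) / sqrt(n)     # margin of error using s, not sigma
  abs(mean(x) - mu) <= half_width            # TRUE if the interval contains mu
})

coverage_z <- mean(covered)   # proportion of intervals that contain mu
coverage_z
```

If the interval were truly a 95% interval, `coverage_z` would be close to 0.95. With a sample this small it should come out noticeably lower, which is the shortfall the $t$-distribution is meant to correct.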

Student’s T-Distribution

When we replace $\sigma$ with $s$, we introduce extra uncertainty. The $t$-distribution was created specifically to account for that additional uncertainty.

The $t$-distribution is similar to the standard normal distribution. They are both symmetric and centered at zero, but the $t$-distribution changes shape depending on how much data we have. The more data we have, the closer the $t$-distribution is to the standard normal distribution.

Technically, the shape of the $t$-distribution depends on something called degrees of freedom.

DEFINITION: For a confidence interval for a single mean, the degrees of freedom are $df = n-1$, where $n$ is our sample size.

Visualizing the Difference

The $t$-distribution looks like the standard normal ($z$) distribution, but it has heavier tails, so values far from zero are more likely. Below are several examples of $t$-distributions for different values of degrees of freedom.
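A rough way to reproduce this comparison in R is sketched here. The degrees of freedom chosen (2, 5, and 30) and the cutoff of 2 are arbitrary illustrative values. The code first compares the probability of landing more than 2 units from zero, then overlays the density curves.

```r
# Illustrative degrees of freedom (arbitrary choices)
dfs <- c(2, 5, 30)

# Probability of falling more than 2 units from zero (both tails combined)
tail_t <- 2 * pt(-2, df = dfs)   # t-distributions
tail_z <- 2 * pnorm(-2)          # standard normal
round(c(setNames(tail_t, paste0("df=", dfs)), z = tail_z), 3)

# Overlay the density curves: standard normal (thick line) and each t-distribution
curve(dnorm(x), from = -4, to = 4, lwd = 2,
      xlab = "value", ylab = "density",
      main = "Standard normal vs. t-distributions")
for (i in seq_along(dfs)) {
  curve(dt(x, df = dfs[i]), add = TRUE, lty = i + 1)
}
legend("topright",
       legend = c("z", paste("t, df =", dfs)),
       lty = 1:4, lwd = c(2, 1, 1, 1))
```

The tail probabilities shrink toward the normal value as the degrees of freedom increase, and the plotted curves move closer to the thick normal curve.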

Confidence Intervals with the $t$-Distribution

We can now write the proper formula for a confidence interval when $\sigma$ is unknown, using a percentile, $t^*$, from the $t$-distribution:

\[\text{Correct CI} = \bar{x} \pm t^* \left( \frac{s}{\sqrt{n}} \right)\]

where $t^*$ is selected based on the desired level of confidence. The values $\pm t^*$ enclose the desired central area (for example, 95%) under the curve of a $t$-distribution with the appropriate number of degrees of freedom, $n-1$.
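The formula can be worked through step by step in R. This is a minimal sketch using a small made-up sample; the vector `x` and the variable names are hypothetical and not part of the course data. The function `qt()` returns percentiles of the $t$-distribution.

```r
# Hypothetical sample, for illustration only
x <- c(52, 61, 47, 58, 66, 49, 55, 60)
conf_level <- 0.95

n     <- length(x)   # sample size
x_bar <- mean(x)     # sample mean
s     <- sd(x)       # sample standard deviation

# t* leaves (1 - conf_level) / 2 in each tail of a t-distribution with n - 1 df
alpha  <- 1 - conf_level
t_star <- qt(1 - alpha / 2, df = n - 1)

# Margin of error and the interval itself
margin    <- t_star * s / sqrt(n)
ci_manual <- c(lower = x_bar - margin, upper = x_bar + margin)
ci_manual
```

For a 95% interval, 2.5% is left in each tail, which is why the percentile requested from `qt()` is 0.975 rather than 0.95.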

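The following table shows $t^*$ values for several degrees of freedom and confidence levels. As $df$ grows, they approach the corresponding $z^*$ values in the last row: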

$df$	$\text{90\% Conf}$	$\text{95\% Conf}$	$\text{99\% Conf}$
5	2.015	2.571	4.032
10	1.812	2.228	3.169
50	1.676	2.009	2.678
100	1.660	1.984	2.626
$\infty$ (Z-score)	1.645	1.960	2.576
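The table values can be checked with `qt()`. The sketch below rebuilds the same grid; the object names are arbitrary, and `df = Inf` is used to stand in for the standard normal row.

```r
# Degrees of freedom and confidence levels from the table
dfs         <- c(5, 10, 50, 100, Inf)   # Inf corresponds to the z row
conf_levels <- c(0.90, 0.95, 0.99)

# For each confidence level, t* is the percentile with (1 - level) / 2 in the upper tail
t_table <- sapply(conf_levels, function(cl) qt(1 - (1 - cl) / 2, df = dfs))
dimnames(t_table) <- list(df = dfs, conf = c("90%", "95%", "99%"))

round(t_table, 3)
```

Each column shrinks as the degrees of freedom increase, and the `Inf` row matches what `qnorm()` returns for the same percentiles.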

Confidence Intervals in R

Fortunately, R handles all these calculations automatically. There’s no need to look up these values. For now, it’s only important to understand a little about what’s happening behind the scenes in R.

The t.test() function in R can be used to calculate confidence intervals based on the $t$-distribution. In general, we pass in the sample data we want to use to create a confidence interval for the population mean, and specify the confidence level (as a decimal) as follows:

t.test(data$y, conf.level = 0.95)$conf.int

NOTE: The t.test() function will be used for more than confidence intervals and will provide lots of output that isn’t currently needed. For now, we can use the selector, $, to instruct R to only return the confidence interval and avoid the unnecessary output.
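To see what the `$` selector is picking out, it can help to store the full result first. The sketch below uses simulated numbers rather than the course data; the vector `scores` and its parameters are made up for illustration.

```r
# Simulated data, for illustration only (not the course data set)
set.seed(1)
scores <- rnorm(40, mean = 55, sd = 20)

# Store the full result instead of printing it
result <- t.test(scores, conf.level = 0.95)

names(result)             # the pieces stored in the result, including "conf.int"

ci <- result$conf.int     # keep only the confidence interval
ci                        # lower and upper limits, plus the confidence level
as.numeric(ci)            # just the two limits as a plain numeric vector
attr(ci, "conf.level")    # the confidence level attached to the interval
```

The `attr(,"conf.level")` line that R prints under an interval is this attached confidence level, not a third endpoint.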

Let’s create a 99% confidence interval for the average extroversion score for Math 221 students.

library(tidyverse)
library(mosaic)
library(rio)
library(car)


# Read in data
big5 <- import('https://raw.githubusercontent.com/byuistats/Math221D_Cannon/master/Data/All_class_combined_personality_data.csv')

Using the t.test() function:

t.test(big5$Extroversion, conf.level = 0.99)$conf.int

[1] 54.26631 59.69903
attr(,"conf.level")
[1] 0.99

Interpretation: We are 99% confident that the true mean Extroversion percentile of Math 221 students is between 54.27 and 59.70.

Your Turn

Create a 93% confidence interval for the true mean Neuroticism percentile of Math 221 students by filling in the data and confidence level arguments:

t.test()
