Unit 1 introduced you to some foundational ideas in statistical theory. A firm grasp of these concepts will allow you to connect deeply with the statistical tools that you will be presented with in Units 2, 3, and 4. Use this page to review the Lesson Outcomes and Summaries in preparation for the Unit 1 Exam. If there are any lesson outcomes that you are not confident with, return to that lesson and review those outcomes in detail.
The Statistical Process has five steps: Design the study, Collect the data, Describe the data, Make inference, Take action. These can be remembered by the mnemonic “Daniel Can Discern More Truth.”
In a designed experiment, researchers control the conditions of the study, typically with a treatment group and a control group, and then observe how the treatments impact the subjects. In a purely observational study, researchers don’t control the conditions but only observe what happens.
The population is the entire group of all possible subjects that could be included in the study. The sample is the subset of the population that is actually selected to participate in the study. Statistics use information from the sample to make claims about what is true about the entire population.
There are many sampling methods used to obtain a sample from a population. The best methods use some sort of randomness (like pulling names out of a hat, rolling dice, flipping coins, or using a computer generated list of random numbers) to avoid bias.
The best way to avoid bias when trying to make conclusions about a population from a single sample of that population is to use a random sampling method to obtain the sample.
Quantitative variables represent things that are numeric in nature, such as the value of a car or the number of students in a classroom. Categorical variables represent non-numerical data that can only be considered as labels, such as colors or brands of shoes.
Histograms are created by dividing the number line into several equal parts, starting at or below the minimum value occurring in the data and ending at or above the maximum value in the data. The number of data points occurring in each interval (called a bin) are counted. A bar is then drawn for each bin so that the height of the bar shows the number of data points contained in that bin.
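The same counting procedure can be sketched in code. Below is a minimal Python illustration (not part of the course, which uses Excel); the data values, bin width, and start/stop points are made up for the example.

```python
# Sketch of histogram binning: count how many data points fall in each bin.
data = [2.1, 3.5, 3.7, 4.2, 4.8, 5.0, 5.5, 6.9, 7.3, 8.8]  # made-up values

bin_width = 2
start = 2    # at or below the minimum (2.1)
stop = 10    # at or above the maximum (8.8)

edges = [start + i * bin_width for i in range((stop - start) // bin_width)]
for left in edges:
    right = left + bin_width
    count = sum(left <= x < right for x in data)  # points in [left, right)
    print(f"[{left}, {right}): {'#' * count}  ({count})")
```

Each row of `#` characters plays the role of a bar: taller “bars” mark the bins where data are most common.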
A histogram allows us to visually interpret data to quickly recognize which values are most common and which values are least common in the data.
Histograms can be left-skewed (the majority of the data is on the right of the histogram, less common values stretch to the left side), right-skewed (majority of the data is on the left side with less common values stretching to the right), or symmetrical and bell-shaped (most data is in the middle with less common values stretching out to either side).
The mean, median, and mode are measures of the center of a distribution. The mean is the most common measure of center and is computed by adding up the observed data and dividing by the number of observations in the data set. The median represents the 50th percentile in the data. The mean can be calculated in Excel using `=AVERAGE(...)`, the median by using `=MEDIAN(...)`, and the mode by `=MODE(...)`, where the `...` in each case consists of the cell references that highlight the data.
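If you want to check these Excel results another way, Python's standard `statistics` module offers the same summaries. This is just an illustrative sketch with made-up data, not part of the course materials.

```python
import statistics

data = [3, 7, 7, 2, 9, 7, 4, 5]  # made-up sample

print(statistics.mean(data))    # 5.5 -- same idea as Excel's =AVERAGE(...)
print(statistics.median(data))  # 6.0 -- same idea as =MEDIAN(...)
print(statistics.mode(data))    # 7   -- same idea as =MODE(...)
```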
When comparing the centers of distributions using graphical and numerical summaries, the direction of the skew shown in the histogram generally corresponds with the direction the mean is pulled.
| Right-skewed | Symmetric & Bell-shaped | Left-skewed |
|---|---|---|
| Mean: $10.45<br>Median: $9.04<br>Mean is to the right of the median. | Mean: 71.1 inches<br>Median: 71 inches<br>Mean and median are roughly equal. | Mean: 3.42<br>Median: 3.45<br>Mean is to the left of the median. |
In a symmetrical and bell-shaped distribution of data, the mean, median, and mode are all roughly the same in value. However, in a skewed distribution, the mean is strongly influenced by outliers and tends to be pulled in the direction of the skew. In a left-skewed distribution, the mean will tend to be to the left of the median. In a right-skewed distribution, the mean will tend to be to the right of the median.
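To see this pull numerically (with made-up data), note how a single large outlier drags the mean toward the skew while the median barely moves. A quick Python sketch:

```python
import statistics

data = [4, 5, 5, 6, 7]     # fairly symmetric: mean 5.4, median 5
skewed = [4, 5, 5, 6, 30]  # same data, but with one large outlier

print(statistics.mean(data), statistics.median(data))      # 5.4  5
print(statistics.mean(skewed), statistics.median(skewed))  # 10.0 5 -- mean pulled right
```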
A parameter is a true (but usually unknown) number that describes a population. A statistic is an estimate of a parameter obtained from a sample of the population.
A percentile is calculated in Excel using `=PERCENTILE(..., 0.#)`, where the `0.#` is the percentile written as a decimal number. So the 20th percentile would be written as 0.2.
A percentile is a number such that a specified percentage of the data are at or below this number. For example, if 80% of college students were 70 inches tall or shorter, then the 80th percentile of heights of college students would be 70 inches.
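As an optional cross-check outside Excel, Python's `statistics.quantiles` can produce percentiles; with `method='inclusive'` it should use the same linear interpolation as Excel's `PERCENTILE` / `PERCENTILE.INC`. The data here are invented.

```python
import statistics

data = [12, 15, 15, 18, 20, 22, 25, 28, 30, 35]  # made-up values

# quantiles(..., n=100) returns the 1st through 99th percentiles as cut points;
# index 19 is the 20th percentile, like =PERCENTILE(data, 0.2) in Excel.
pct = statistics.quantiles(data, n=100, method='inclusive')
print(pct[19])  # 20th percentile
print(pct[79])  # 80th percentile
```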
Standard deviation is calculated in Excel for a sample of data using `=STDEV.S(...)`.
The standard deviation is a number that describes how spread out the data typically are from the mean of that data. A larger standard deviation means the data are more spread out from their mean than data with a smaller standard deviation. The standard deviation is never negative. A standard deviation of zero implies all values in the data set are exactly the same.
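Outside Excel, the sample standard deviation can be checked with Python's `statistics.stdev`, which (like `=STDEV.S`) divides by \(n-1\). The data below are made up to illustrate the two key facts just stated.

```python
import statistics

identical = [5, 5, 5, 5]    # no spread at all
spread_out = [1, 4, 7, 10]  # values vary widely around their mean

print(statistics.stdev(identical))   # 0.0 -- all values exactly the same
print(statistics.stdev(spread_out))  # about 3.87 -- more spread from the mean
```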
To compute any of the five-number summary values in Excel, use the Excel function `=QUARTILE.INC(..., #)`, where `#` is either a 0 (gives the minimum), 1 (gives the first quartile), 2 (gives the second quartile, i.e., median), 3 (gives the third quartile), or 4 (gives the maximum).
The five-number summary consists of (1) the minimum value in the data, (2) the first quartile (25th percentile) of the data, (3) the median of the data (50th percentile), (4) the third quartile (75th percentile) of the data, and (5) the maximum value occurring in the data.
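A sketch of the full five-number summary in Python, with made-up data; `method='inclusive'` is intended to mirror `QUARTILE.INC`.

```python
import statistics

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # made-up values

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
print(min(data), q1, q2, q3, max(data))  # min, Q1, median, Q3, max
```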
To create a boxplot in Excel, highlight the data, go to Insert on the menu ribbon, choose the histogram icon, and select the Boxplot option from the menu that appears.
Boxplots are a visualization of the five-number summary of a data set.
A normal density curve is symmetric and bell-shaped with a mean of \(\mu\) and a standard deviation of \(\sigma\). The curve lies above the horizontal axis and the total area under the curve is equal to 1. A standard normal distribution has a mean of 0 and a standard deviation of 1.
A z-score is calculated as: \(\displaystyle{z = \frac{\text{value}-\text{mean}}{\text{standard deviation}} = \frac{x-\mu}{\sigma}}\)
A z-score tells us how many standard deviations above (\(+Z\)) or below (\(-Z\)) the mean (\(\mu\)) a given value (\(x\)) is.
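For example (using made-up numbers), if exam scores have a mean of \(\mu = 80\) and a standard deviation of \(\sigma = 5\), then a score of \(x = 90\) has \(z = \frac{90 - 80}{5} = 2\): that score is 2 standard deviations above the mean.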
To calculate probabilities for an observation \(x\), calculate the \(z\)-score using \(\mu\), \(\sigma\), and \(x\) and then use the Normal Probability Applet to shade the appropriate area of the distribution for the desired probability. The area shaded depends on both the direction of interest (above, below, or between) and the sign of the z-score. In every case, the probability is given by the Area box at the top of the applet.
The 68-95-99.7% rule states that when data are normally distributed, approximately 68% of the data lie within \(z=1\) standard deviation (\(\sigma\)) from the mean, approximately 95% of the data lie within \(z=2\) standard deviations from the mean, and approximately 99.7% of the data lie within \(z=3\) standard deviations from the mean. For example, this rule approximates that 2.5% of observations will be less than a z-score of \(z=-2\).
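If you want to double-check an applet answer numerically, Python's `statistics.NormalDist` gives the same areas under the normal curve. This optional sketch also shows that the 68-95-99.7% rule is a good approximation rather than an exact statement.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

print(Z.cdf(-2))             # about 0.0228 -- close to the 2.5% from the rule
print(Z.cdf(1) - Z.cdf(-1))  # about 0.683  -- the "68%" of the rule
print(Z.cdf(2) - Z.cdf(-2))  # about 0.954  -- the "95%" of the rule
```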
Percentiles can be calculated using the Normal Probability Applet by (1) shading the left tail only, (2) entering the desired percentile in the “Area” box, and (3) using the z-score from where the blue shaded region ends to solve for \(x\) in the equation \(z=\frac{x-\mu}{\sigma}\).
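The same steps the applet walks through can be sketched with `NormalDist.inv_cdf`; the mean and standard deviation below are invented for illustration.

```python
from statistics import NormalDist

mu, sigma = 70, 3  # hypothetical heights: mean 70 inches, sd 3 inches

z = NormalDist().inv_cdf(0.80)   # z-score with 80% of the area to its left
x = mu + z * sigma               # solve z = (x - mu)/sigma for x
print(round(z, 3), round(x, 2))  # about 0.842 and 72.52 inches
```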
The distribution of sample means is a distribution of all possible sample means (\(\bar x\)) for a particular sample size.
The Central Limit Theorem states that the sampling distribution of the sample mean will be approximately normal if the sample size \(n\) of a sample is sufficiently large. In this class, \(n\ge 30\) is considered to be sufficiently large.
The mean of the distribution of sample means is the mean \(\mu\) of the population: \(\mu_{\bar{x}} = \mu\).
The standard deviation of the distribution of sample means is the standard deviation \(\sigma\) of the population divided by the square root of \(n\): \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\).
The distribution of sample means is normal in either of two situations: (1) when the population is normally distributed or (2) when, thanks to the Central Limit Theorem (CLT), our sample size (\(n\)) is large.
The Law of Large Numbers states that as the sample size (\(n\)) gets larger, the sample mean (\(\bar x\)) will get closer to the population mean (\(\mu\)). This can be seen in the equation \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\): as \(n\) increases, \(\sigma_{\bar{x}}\) gets smaller.
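Both facts can be seen in a small simulation. This is only a sketch: the population here is an arbitrary right-skewed one (exponential with mean 1) chosen for illustration.

```python
import random
import statistics

random.seed(1)  # reproducible runs

def sample_mean(n):
    # Mean of n draws from an exponential(1) population (mu = 1, right-skewed).
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Law of Large Numbers: larger n pulls the sample mean toward mu = 1.
for n in (5, 50, 500, 5000):
    print(n, round(sample_mean(n), 3))

# CLT: for n = 30, sample means pile up roughly normally around mu = 1,
# with standard deviation sigma / sqrt(n) = 1 / sqrt(30), about 0.18.
means = [sample_mean(30) for _ in range(1000)]
print(round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```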
A z-score for a sample mean is calculated as: \(\displaystyle{z = \frac{\text{value}-\text{mean}}{\text{standard deviation}} = \frac{\bar x-\mu}{\sigma/\sqrt{n}}}\)
When the distribution of sample means is normally distributed, we can use a z-score and the Normal Probability Applet to calculate the probability that a sample mean is above, below, between, or more extreme than some given value (or values).
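Putting the pieces together (all numbers invented): suppose a population has \(\mu = 100\) and \(\sigma = 15\), and we observe a sample mean of \(\bar x = 105\) from a sample of size \(n = 36\). The z-score and tail probability can be checked as follows.

```python
from statistics import NormalDist
import math

mu, sigma, n = 100, 15, 36  # hypothetical population and sample size
x_bar = 105                 # observed sample mean

z = (x_bar - mu) / (sigma / math.sqrt(n))  # (105 - 100) / (15/6) = 2.0
print(z)                                   # 2.0
print(1 - NormalDist().cdf(z))             # P(x-bar > 105), about 0.0228
```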
Copyright © 2020 Brigham Young University-Idaho. All rights reserved.