Probability Calculations for Means (Reading)

Review of Sampling Distributions

When is the sample mean normally distributed? This happens when either of the following two conditions is satisfied:

  • The population is normally distributed, so the sample mean is automatically normally distributed.
  • The sample size is large, so the Central Limit Theorem implies that the sample mean is approximately normally distributed.

For the mean \(\bar X\) of \(n\) independent draws from a random variable with mean \(\mu\) and standard deviation \(\sigma\), the following are true:

  • The mean of the random variable \(\bar X\) is \(\mu\).
  • The standard deviation of the random variable \(\bar X\) is \(\displaystyle{\frac{\sigma}{\sqrt{n}}}\).
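
The two facts above can be checked with a quick simulation. The following is a minimal sketch, not part of the original activity: the values of `mu`, `sigma`, and `n`, the seed, and the number of repetitions are all assumptions chosen for illustration.

```r
set.seed(221)          # arbitrary seed so the run is reproducible
mu    <- 7             # assumed population mean
sigma <- 3             # assumed population standard deviation
n     <- 4             # size of each simulated sample

# Draw 10,000 samples of size n and record the mean of each one
xbars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbars)            # should be close to mu
sd(xbars)              # should be close to sigma / sqrt(n)
sigma / sqrt(n)        # theoretical standard deviation of the sample mean
```

Because the sample means are simulated, the first two values will not match the theory exactly, but they should land close to \(\mu\) and \(\sigma/\sqrt{n}\).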

Previously, you learned how to use the normal probability applet and R to convert a \(z\)-score to a corresponding area under the curve. These skills will be applied again in this activity. The following questions will help you brush up on how to calculate probabilities for a normal distribution.


Answer the following questions:
  1. Look at the output from the applet, given above. What is the probability that a standard normal random variable will be below 1.25?
Solution:
\(0.8944\)

1a. Compare this to the R output using pnorm()

Solution:
pnorm(1.25)
[1] 0.8943502
  1. Look at the output from the applet given above. What is the probability that a standard normal random variable will be above 1.25?
Solution:
\(1 - 0.8944 = 0.1056\)
  1. Challenge problem: Use R to determine the probability that a standard normal random variable will be between -1.3 and 1.25.
Solution:
pnorm(1.25) - pnorm(-1.3)
[1] 0.7975497
  1. Use R to find the probability that a standard normal random variable will be greater than -0.75.
Solution:
1-pnorm(-.75)
[1] 0.7733726
# Equivalently
pnorm(-.75, lower.tail = FALSE)
[1] 0.7733726
  1. If the mean of a normally distributed random variable is -23 and the standard deviation is 7, what is the probability that the random variable will have a value that is less than -25?
Solution:
  • The \(z\)-score is:
    \(\displaystyle{z = \frac{-25 - (-23)}{7} = -0.2857}\)

    Using R:
x <- -25
mu <- -23
sigma <- 7

z <- (x-mu)/sigma
z
[1] -0.2857143
pnorm(z)
[1] 0.3875485
We find that the area to the left of \(z = -0.2857\) is \(0.3875\).
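
As a cross-check on the calculation above, `pnorm()` also accepts the mean and standard deviation directly through its `mean` and `sd` arguments, so the standardizing step can be left to R. This is a small sketch using the values from the question; the variable names `p_direct` and `p_from_z` are illustrative only.

```r
# Area to the left of -25 for a normal distribution with mean -23 and sd 7
p_direct <- pnorm(-25, mean = -23, sd = 7)
p_direct

# Standardizing by hand first gives the same area
p_from_z <- pnorm((-25 - (-23)) / 7)
all.equal(p_direct, p_from_z)
```

Both routes describe the same area under the curve. Computing the \(z\)-score by hand is still useful because it shows how many standard deviations the value lies from the mean.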


Probabilities Involving a Mean

If the sample mean is normally distributed, then we can consider the sample mean, \(\bar x\), as one observation from a normal distribution with mean \(\mu\) and standard deviation \(\displaystyle{\frac{\sigma}{\sqrt{n}}}\). With this in mind, we can compute the \(z\)-score for any value of this random variable:

\[z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}=\frac{\bar x-\mu}{\sigma / \sqrt{n} }\]

Notice that we just replaced the “value” with the normal random variable, the “mean” with its mean and “standard deviation” with its standard deviation.

The \(z\)-score follows a standard normal distribution. That is, it has a mean of 0 and a standard deviation of 1. Now we can simply input the \(z\)-score into pnorm() to get the probability that a randomly selected mean will be above or below a given value of \(\bar x\).

Worked Example: Finding the Area under a Normal Curve (Based on a Sample Mean)

After finding the \(z\)-score, we can use pnorm() to find the area under the curve (i.e., the probability). Suppose that a sample of size \(n = 4\) has been drawn from a normal population with mean \(\mu = 7\) and standard deviation \(\sigma = 3\). Since the parent population is normal, we know the sampling distribution of the sample mean \(\bar x\) will be normal with mean \(\mu = 7\) and standard deviation \(\displaystyle{ \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{4}} = 1.5}\). We can use this information to find the probability that \(\bar x\) will be greater than 10.

The \(z\)-score is

\[z=\frac{\bar x-\mu}{\sigma / \sqrt{n} } = \frac{10 - 7}{3 / \sqrt{4} }=2.0\]

The probability that \(z\) will be greater than \(2.00\) is the area under the standard normal distribution to the right of \(2.00\). We find this area using R:

xbar <- 10
mu <- 7
sigma <- 3
n <- 4
sigma_xbar <- sigma/sqrt(n)

z <- (xbar-mu)/sigma_xbar
z
[1] 2
#Area to the left:
pnorm(z)
[1] 0.9772499
#Area to the right:
1-pnorm(z)
[1] 0.02275013

The area to the right of \(z = 2.00\) is \(0.02275\). This is the probability that the random sample of \(n = 4\) items will have a mean that is greater than \(10\).

In this example, the sample mean was automatically normally distributed because the parent population was normally distributed. This will always be true, no matter what size sample is drawn.

Worked Example: Finding the Area under a Normal Curve (Based on a Sample Mean)

What do we do if the parent population is not normally distributed? If the sample size is large, then the Central Limit Theorem guarantees that the sample mean will be approximately normally distributed. Based on this, we can still do normal probability calculations for the mean of a random sample.

The distribution of the weekly costs incurred by Global Solutions Unlimited is right skewed. The population mean of the costs is $26,400 and the standard deviation is $23,200. A random sample of \(n = 40\) weeks is to be selected. What is the probability that the mean weekly cost for the sample will be less than $20,000?

Since the number of observations is large, the Central Limit Theorem assures that the sample mean will be approximately normally distributed. The \(z\)-score for \(\bar x =\) $20,000 is:

\[z = \frac{\bar x-\mu}{\sigma / \sqrt{n} } = \frac{20,000-26,400}{23,200 / \sqrt{40}} = -1.745 \]

Using R, we can find the area to the left of this \(z\)-score:

xbar <- 20000
mu <- 26400
sigma <- 23200
n <- 40
sigma_xbar <- sigma/sqrt(n)

z <- (xbar-mu) / sigma_xbar
z
[1] -1.744705
#Area to the left:
pnorm(z)
[1] 0.04051812
#Area to the right:
1-pnorm(z)
[1] 0.9594819

NOTE: Recall that the Normal Probability Applet rounds the \(z\)-scores first and then calculates the probabilities, so its results differ slightly from those given by R.

The area to the left of \(z= -1.745\) is \(0.04052\). This is the probability that the mean costs for the \(n = 40\) weeks will be less than $20,000.
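
To get a feel for how well the normal approximation works here, we can simulate sample means from a right-skewed population. The reading does not say which distribution the weekly costs follow, so the sketch below assumes a gamma distribution whose mean and standard deviation match the example. The seed, the number of repetitions, and the names `shape`, `scale`, `p_sim`, and `p_norm` are also assumptions made for illustration.

```r
set.seed(221)                 # arbitrary seed so the run is reproducible
mu    <- 26400                # population mean from the example
sigma <- 23200                # population standard deviation from the example
n     <- 40                   # number of weeks in each sample

# Assumption: a right-skewed gamma population with the same mean and sd
shape <- (mu / sigma)^2
scale <- sigma^2 / mu

# Simulate 10,000 samples of 40 weeks and keep the mean of each sample
xbars <- replicate(10000, mean(rgamma(n, shape = shape, scale = scale)))

mean(xbars)                   # compare with mu
sd(xbars)                     # compare with sigma / sqrt(n)

p_sim  <- mean(xbars < 20000)                           # simulated probability
p_norm <- pnorm(20000, mean = mu, sd = sigma / sqrt(n)) # normal approximation
c(simulated = p_sim, normal = p_norm)
```

The two probabilities should be of similar size but need not agree exactly. With \(n = 40\), some skewness can remain in the distribution of the sample mean, which is why the Central Limit Theorem result is described as an approximation.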


Example: Environmental Clean Up

We will now consider a complete example that shows how these probabilities are used in practice.

The United States Government decided to open some land near a uranium enrichment facility to public use. After a few years of the public hiking, biking, and sometimes even hunting on this land, workers from the facility noticed that there were several unnatural-looking mounds in the earth near the area. Because this land was once used by the facility and nobody knew the origin of these piles, the government closed public access to the land until they could assess if the mounds were safe.


Step 1: Design the study.

Measurements were taken from the mounds to assess one of the contaminants, lead. The tests involved are very expensive. Each sample costs about $600 to process. The Environmental Protection Agency (EPA) has set a “No Action Level” (NAL) for lead. If the mean concentration in the soil of the contaminant is less than the NAL, then the area can be declared safe for public use. If the concentration of the contaminant reaches or exceeds the NAL, the site must be cleaned additionally before it is declared safe. The NAL for lead is 50 milligrams of lead per kilogram of soil (mg/kg).

The hypotheses for this test are:

\(H_0:~~\mu = 50 \frac{\text{mg}}{\text{kg}}\)

\(H_a:~~\mu < 50 \frac{\text{mg}}{\text{kg}}\)

In environmental testing, we always assume the site is dirty. That is, our null hypothesis is that the mean level of contamination is at the NAL. We gather data to determine if there is sufficient evidence to support rejecting the null hypothesis.


Step 2: Collect Data

Scientists collected \(n = 61\) measurements of the lead concentration in the soil, measured in mg/kg. The data are given in the file Uranium Plant Data-Lead.


Step 3: Describe the Data

Answer the following question:
  1. Create a histogram of the lead contamination data.
Solution:
library(rio)
library(mosaic)
uranium <- import('https://github.com/byuistats/BYUI_M221_Book/raw/master/Data/UraniumPlantData-Lead.xlsx')

histogram(uranium$`Lead (mg/kg)`, xlab="Lead (mg/kg)")

  1. What is the shape of the distribution of the data?
Solution:
  • The lead concentration data are right skewed.
  1. Are any of the measured values above the NAL (50 mg/kg)? What might this suggest?
Solution:
  • Yes, there are several observations that are above the NAL. This may suggest that the lead concentrations are very high in some isolated soil samples. It may also be an indication of variability in the test results.
  1. Visually, does it look like the mean lead concentration is less than 50 mg/kg?
Solution:
  • Answers may vary.
  1. What is the value of the sample mean?
Solution:
knitr::kable(favstats(uranium$`Lead (mg/kg)`))
min Q1 median Q3 max mean sd n missing
9.8 14.1 19.8 33.1 115 29.13279 23.26021 61 0
  • The sample mean is 29.13 mg/kg.


Step 4: Make Inferences

We assume the null hypothesis is true and gather evidence against this assumption. We will find the probability that the sample mean lead concentration is less than \(\bar x\) = 29.13 mg/kg, assuming that the true mean lead concentration is \(\mu\) = 50 mg/kg. This probability is called the \(P\)-value.

Since the sample size is large (\(n = 61\)), we can conclude that the sample mean is approximately normally distributed. So, we can use R to find the probability that the sample mean will be less than 29.13. Historical analyses show that the population standard deviation is \(\sigma\) = 24 mg/kg.

First, we compute the \(z\)-score.

\[z=\frac{\bar x-\mu}{\sigma / \sqrt{n} } = \frac{29.13 - 50}{24 / \sqrt{61} }=-6.79\]

x <- 29.13
mu <- 50
sigma <- 24
n <- 61
sigma_xbar <- sigma/sqrt(n)

z <- (x-mu)/sigma_xbar
z
[1] -6.791663
pnorm(z)
[1] 5.542414e-12

Using the pnorm() function, we find the area to the left of \(z = -6.79\). This is the \(P\)-value.

The \(P\)-value for this test is so small we have to resort to scientific notation: \(5.54 \times 10^{-12} = 0.000~000~000~005~54\). This is a very small probability. Assuming the null hypothesis is true (that is, the site is unacceptably contaminated), it is very unlikely that we would find such a low mean contamination level among the \(n = 61\) randomly selected soil samples.

Since the \(P\)-value is low, we reject the null hypothesis.
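
The steps of this left-tailed test can be collected into a small function. This is only a sketch: `z_test_left()` is a hypothetical helper written for this illustration, not a function from R or the mosaic package, and the 0.05 significance level in the last line is an assumption because the reading does not state one.

```r
# Hypothetical helper: left-tailed z-test for a mean when sigma is known
z_test_left <- function(xbar, mu0, sigma, n) {
  se <- sigma / sqrt(n)        # standard deviation of the sample mean
  z  <- (xbar - mu0) / se      # z-score, assuming the null hypothesis is true
  p  <- pnorm(z)               # area to the left of z, the P-value
  list(z = z, p_value = p)
}

# Values from the lead example
result <- z_test_left(xbar = 29.13, mu0 = 50, sigma = 24, n = 61)
result$z
result$p_value
result$p_value < 0.05          # TRUE means reject the null at the assumed 0.05 level
```

Wrapping the calculation in a function makes it easy to repeat the test for another contaminant by changing only the four inputs.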


Step 5: Take Action

There is sufficient evidence to suggest that the mean lead level is less than 50 mg/kg. We conclude that the lead concentration in the soil is low enough that it is not a danger to the public. Based on the results of this and other similar test results, the government has reopened public access to this area.


Review of Key Concepts

  1. To compute probabilities involving the sample mean \(\bar{x}\) we must first determine if \(\bar{x}\) is normally distributed. If it is not, we cannot compute probabilities. There are two cases when \(\bar{x}\) is guaranteed to be normally distributed. They are:
  • The parent population is Normally distributed.

  • The Central Limit Theorem guarantees the distribution of \(\bar{x}\) is approximately Normal if the sample size \(n\) is large enough. For this course, we require \(n \geq 30\).

  2. The collection of all possible sample means \(\bar{x}\) is also a population. The mean of the distribution of sample means is the population mean \(\mu\). The standard deviation of the distribution of sample means is \(\frac{\sigma}{\sqrt{n}}\), where \(\sigma\) is the parent population standard deviation and \(n\) is the sample size.

  3. Once we have determined that the sample mean is Normally distributed, we can compute probabilities with \(\bar{x}\). We first compute \(z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\) and then use R to compute probabilities about \(z\).


Summary

Remember…
  1. A z-score for a sample mean is calculated as: \(\displaystyle{z = \frac{\text{value}-\text{mean}}{\text{standard deviation}} = \frac{\bar x-\mu}{\sigma/\sqrt{n}}}\)

  2. When the distribution of sample means is normally distributed, we can use a z-score and R to calculate the probability that a sample mean is above, below, or between some given value (or values).
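
The reading works through "above" and "below" probabilities for a sample mean. The sketch below illustrates the "between" case by reusing the population from the first worked example (\(\mu = 7\), \(\sigma = 3\), \(n = 4\)). The lower bound of 5.5 is a hypothetical value chosen for illustration.

```r
mu    <- 7
sigma <- 3
n     <- 4
se    <- sigma / sqrt(n)     # standard deviation of the sample mean

lower <- 5.5                 # hypothetical lower bound for the sample mean
upper <- 10                  # upper bound, as in the worked example

# Area below the upper bound minus area below the lower bound
p_between <- pnorm(upper, mean = mu, sd = se) - pnorm(lower, mean = mu, sd = se)
p_between
```

Here the bounds correspond to \(z = -1\) and \(z = 2\), so the result is the area under the standard normal curve between those two \(z\)-scores, roughly \(0.8186\).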