Describing Quantitative Data (Spread)

Lesson Outcomes

By the end of this lesson, you should be able to:

Calculate a percentile from data
Interpret a percentile
Calculate the standard deviation from data
Interpret the standard deviation
Calculate the five-number summary using software
Interpret the five-number summary
Create a box plot using software
Determine the five-number summary visually from a box plot

Spread of a Distribution

In the previous lesson, we introduced two important characteristics of a distribution: the shape and the center. In this section, you will discover ways to summarize the spread of a distribution of data. The spread of a distribution of data describes how far the observations tend to be from each other. There are many ways to describe the spread of a distribution, but one of the most popular measurements of spread is called the “standard deviation.”

Standard Deviation and Variance

This activity introduces two measures of spread: the standard deviation and the variance.

Bird Flu Fever

Avian Influenza A H5N1, commonly called the bird flu, is a deadly illness that is currently only passed to humans from infected birds. This illness is particularly dangerous because at some point it is likely to mutate to allow human-to-human transmission. Health officials worldwide are preparing for the possibility of a bird flu pandemic.

Dr. K. Y. Yuen led a team of researchers who reported the body temperatures of people admitted to Chinese hospitals with confirmed cases of Avian Influenza. Their research team collected data on the body temperature at the time that people with the bird flu were admitted to the hospital. In the article, they reported on two groups of people, those with relatively uncomplicated cases of the bird flu and those with severe cases.

The table below presents data representative of the body temperatures (in degrees Centigrade) for the two groups of bird flu patients:

Body Temperature	Case Type
38.1	Simple
38.3	Simple
38.4	Simple
39.5	Simple
39.7	Simple
39.1	Severe
39.5	Severe
38.9	Severe
39.2	Severe
39.9	Severe
39.7	Severe
39.0	Severe

Let us focus on the relatively uncomplicated cases. Creating a histogram of such a small dataset does not provide much benefit. With only a handful of values, there is not much shape to the distribution.

We can, however, use numerical summaries to give an indication of the center of the distribution.

Answer the following questions:

What is the median of the body temperatures for the simple cases?

Solution

The median body temperature for the simple cases is 38.4 degrees Centigrade.

What is the mean of the body temperatures for the simple cases?

Solution

The mean body temperature for the simple cases is 38.8 degrees Centigrade.

We will use these data to investigate some measures of the spread in a data set.

There is relatively little difference in the temperatures of the uncomplicated patients. The lowest is $38.1 ^\circ \text{C}$, while the highest temperature is $39.7 ^\circ \text{C}$.

The standard deviation is a measure of the spread in the distribution. If the data tend to be close together, then the standard deviation is relatively small. If the data tend to be more spread out, then the standard deviation is relatively large.

The standard deviation of the body temperatures is $0.742 ^\circ \text{C}$. This number contains information from all the patients. If the patients’ temperatures had been more diverse, the standard deviation would be larger. If the patients’ temperatures were more uniform (i.e. closer together), then the standard deviation would have been smaller. If all the patients somehow had the same temperature, then the standard deviation would be zero.

We are working with a sample. To be explicit, we call $0.742 ^\circ \text{C}$ the sample standard deviation. The symbol for the sample standard deviation is $s$. This is a statistic. The parameter representing the population standard deviation is $\sigma$ (pronounced /SIG-ma/). In practice, we rarely know the value of the population standard deviation, so we use the sample standard deviation $s$ as an approximation for the unknown population standard deviation $\sigma$.

At this point, you probably do not have much intuition regarding the standard deviation. We will use this statistic frequently. By the end of the semester you can expect to become very comfortable with this idea. For now, all you need to know is that if two variables are measured on the same scale, the variable with values that are further apart will have the larger standard deviation.

R Instructions

To calculate the sample standard deviation in R, follow these steps:

data <- c(38.1,38.3,38.4,39.5,39.7,39.1,39.5,38.9,39.2,39.9,39.7,39.0)

sd(data)

[1] 0.5930788

Rounding: As a general rule, when reporting your answers in this class, round to three decimal places unless otherwise specified.
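The R output above uses all twelve temperatures (simple and severe cases together), which is why it differs from the $0.742 ^\circ \text{C}$ quoted earlier for the simple cases alone. As a minimal sketch, assuming the five simple-case temperatures are stored in a vector named `simple` (a name chosen here only for illustration), the same `sd()` function reproduces that value:

```r
# Temperatures (degrees Centigrade) of the five simple cases from the table
simple <- c(38.1, 38.3, 38.4, 39.5, 39.7)

# Sample standard deviation of the simple cases only
s_simple <- sd(simple)

# Report the value rounded to three decimal places
round(s_simple, 3)
```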

Calculating the Standard Deviation by Hand

How is the standard deviation computed? Where does this “magic” number come from? How does one number include the information about the spread of all the points?

It is a little tedious to compute the standard deviation by hand. You will usually compute standard deviation with a computer. However, the process is very instructive and will help you understand conceptually what the statistic represents. As you work through the following steps, please remember the goal is to find a measure of the spread in a data set. We want one number that describes how spread out the data are.

First, observe the number line below, where each x represents the temperature of a patient with a relatively uncomplicated case of bird flu. As mentioned earlier, there is not a huge spread in the temperatures.

On your sketch of the number line, draw a vertical line at 38.8 degrees, the sample mean. Now, draw horizontal lines from the mean to each of your $\times$’s. These horizontal line segments represent the spread of the data about the mean. Your plot should look something like this:

The length of each of the line segments represents how far each observation is from the mean. If the data are close together, these lines will be fairly short. If the distribution has a large spread, the line segments will be longer. The standard deviation is a measure of how long these lines are, as a whole.

The distance between the mean and an observation is referred to as a deviation. In other words, deviations are the lengths of the line segments drawn in the image above.

\[ \begin{array}{lcl} \text{Deviation} & = & \text{Value} - \text{Mean} \\ \text{Deviation} & = & x - \bar x \end{array} \]

If the observed value is greater than the mean, the deviation is positive. If the value is less than the mean, the deviation is negative.

The standard deviation is a complicated sort of average of the deviations. Making a table like the one below will help you keep track of your calculations. Please participate fully in this exercise. Writing your answers at each step and developing a table as instructed will greatly enhance the learning experience. By following these steps, you will be able to compute the standard deviation by hand, and more importantly, understand what it is telling you.

Step 01: The first step in computing the standard deviation by hand is to create a table, like the following. Enter the observed data in the first column.

Observation ($x$)	Deviation from the Mean ($x-\bar x$)
$38.1$	$38.1-38.8=-0.7$
$38.3$
$38.4$
$39.5$
$39.7$
$\bar x = 38.8$

Step 02: The second column of the table contains the deviations from the mean. Complete column 2 of the table above.

Observation (\(x\))	Deviation from the Mean (\(x-\bar x\))
\(38.1\)	\(38.1-38.8=-0.7\)
\(38.3\)	\(38.3-38.8=-0.5\)
\(38.4\)	\(38.4-38.8=-0.4\)
\(39.5\)	\(39.5-38.8=0.7\)
\(39.7\)	\(39.7-38.8=0.9\)
\(\bar x = 38.8\)

Check Results for Step 2

Answer the following questions:

How could we use this table to find the “typical” distance from each point to the mean? Think carefully about this, and then write down your answer before continuing.

Solution

You may have suggested that we compute the mean of these values. This seems like a good idea. If we compute the mean, it will tell us the average deviation from the mean.

Compute the mean of Column 2. What do you get?

Solution

You should have found that the mean of the deviations is zero. This is true for every data set. If you add up the deviations from the mean, the positive values will cancel with the negative values. The sum of the deviations from the mean will be zero, so the mean also must equal zero.
The good news is that you can use this fact to check if you are on the right track. If the deviations from the mean do not add up to zero, then you have made a mistake in the calculations. The bad news is that the deviations always add up to 0, making it look like the distance from the data to the mean is 0. Nonsense!
The mean of the deviations from the mean cannot be used to find a measure of the spread in a data set, but it does provide a guidepost that shows we are on the right track. We must find another way to estimate the spread of a data set.
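The cancellation described here can be checked in R. The following is a small sketch, assuming the simple-case temperatures are stored in a vector called `simple` (an illustrative name). Because computers store decimals approximately, the sum may come out as a tiny number very close to zero rather than exactly zero, so the check uses a tolerance.

```r
# Temperatures (degrees Centigrade) of the five simple cases
simple <- c(38.1, 38.3, 38.4, 39.5, 39.7)

# Column 2 of the table: each value minus the sample mean
deviations <- simple - mean(simple)
round(deviations, 1)

# The deviations add up to zero (up to floating-point rounding),
# so their mean is zero as well
sum_dev <- sum(deviations)
abs(sum_dev) < 1e-9
```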

We need a way to work with the negative deviations from the mean, so they do not cancel with the positive ones. What could we do? (Choose one of the four options below.)

Option 1: Take the absolute value of the deviations

This is an excellent suggestion. This is probably one of the first things statisticians used to estimate the spread in the data.
If we take the absolute value of the deviations, then all the values are positive. By taking the mean of these numbers, we do get a measure of spread. This quantity is called the mean absolute deviation (MAD).
There is good news and bad news. The good news is, you discovered a way to estimate the spread in a data set. (In fact, the MAD is used as one estimate of the volatility of stocks.) The bad news is that the MAD does not have good theoretical properties. A proof of this claim requires calculus, and so will not be discussed here. For most applications, there is a better choice. Please select another option.

Option 2: Square the deviations

If we square the deviations from the mean, the values that were negative will become positive. This leads to an estimator of the spread that has excellent theoretical properties. This is the best of the four options. You will apply this idea in Step 03.

Option 3: Delete the negative deviations

Sorry, you can’t make your troubles go away by deleting things you don’t like. Please try again.

Option 4: Do something entirely different

You probably have an ingenious idea. Surprisingly enough, there is a right answer to the question. Please choose a different option.

Please do not go on to Step 03 until you have finished this exploration.
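To see how Options 1 and 2 treat the negative deviations, here is a brief sketch in R. The vector name `simple` and the other variable names are chosen for illustration. Note that base R's `mad()` function computes a scaled median absolute deviation, which is a different quantity, so the mean absolute deviation is computed directly here.

```r
# Temperatures (degrees Centigrade) of the five simple cases
simple <- c(38.1, 38.3, 38.4, 39.5, 39.7)
deviations <- simple - mean(simple)

# Option 1: take absolute values, then average (mean absolute deviation)
mean_abs_dev <- mean(abs(deviations))
mean_abs_dev   # about 0.64

# Option 2: square the deviations so that none are negative
squared_dev <- deviations^2
round(squared_dev, 2)
```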

“Piled Higher and Deeper” by Jorge Cham

Step 03: Add a third column to your table. To get the values in this column, square the deviations from the mean that you found in Column 2.

Observation $x$	Deviation from the Mean $x-\bar x$	Squared Deviation from the Mean $\left(x-\bar x\right)^2$
$38.1$	$38.1-38.8=-0.7$
$38.3$	$38.3-38.8=-0.5$
$38.4$	$38.4-38.8=-0.4$
$39.5$	$39.5-38.8=0.7$
$39.7$	$39.7-38.8=0.9$
$\bar x = 38.8$	Sum $=0$

Observation $x$	Deviation from the Mean $x-\bar x$	Squared Deviation from the Mean $\left(x-\bar x\right)^2$
$38.1$	$38.1-38.8=-0.7$	$(-0.7)^2=0.49$
$38.3$	$38.3-38.8=-0.5$	$(-0.5)^2=0.25$
$38.4$	$38.4-38.8=-0.4$	$(-0.4)^2=0.16$
$39.5$	$39.5-38.8=0.7$	$(0.7)^2=0.49$
$39.7$	$39.7-38.8=0.9$	$(0.9)^2=0.81$
$\bar x = 38.8$	Sum $=0$

Step 04: Now, add up the squared deviations from the mean.

Observation $x$	Deviation from the Mean $x-\bar x$	Squared Deviation from the Mean $\left(x-\bar x\right)^2$
$38.1$	$38.1-38.8=-0.7$	$(-0.7)^2=0.49$
$38.3$	$38.3-38.8=-0.5$	$(-0.5)^2=0.25$
$38.4$	$38.4-38.8=-0.4$	$(-0.4)^2=0.16$
$39.5$	$39.5-38.8=0.7$	$(0.7)^2=0.49$
$39.7$	$39.7-38.8=0.9$	$(0.9)^2=0.81$
$\bar x = 38.8$	Sum $=0$	Sum $=2.20$

The sum of the squared deviations is 2.20.

Answer the following questions:

Suppose that the researchers had collected body temperature data on 500 bird flu patients instead of 5. What would happen to the sum of the squared deviations, if the distribution of the data is the same for the 500 patients as the 5 patients?

Solution

We would expect the sum of the squared deviations to be a lot larger than it is now. We would be adding squared deviations for 500 observations instead of 5. So, the sum of the squared deviations would be about 100 times larger.
Remember, we are trying to find a measure of the spread of a data set. Our final measure should not be dependent on the sample size. We need to do something else.
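One way to explore this in R is to build a hypothetical sample of 500 patients by repeating the five observed temperatures 100 times, so that the distribution stays the same. This is only a sketch: the repeated data set and the helper function `sum_sq()` are made up for illustration.

```r
# Temperatures (degrees Centigrade) of the five simple cases
simple <- c(38.1, 38.3, 38.4, 39.5, 39.7)

# Hypothetical sample of 500 patients with the same distribution
big <- rep(simple, times = 100)

# Helper (illustrative): sum of squared deviations from the mean
sum_sq <- function(x) sum((x - mean(x))^2)

sum_sq(simple)                 # about 2.2
sum_sq(big)                    # about 220
ratio <- sum_sq(big) / sum_sq(simple)
ratio                          # about 100
```

The sum grows with the sample size even though the spread has not changed, which is why a further step is needed.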

Please do not go on until you have finished this exercise.

Step 05: Recall that an average is adding a bunch of things up and dividing by the number of things. Consider taking the average of the squared deviations by adding them up and dividing by the number of deviations.

Unfortunately, this is what is technically known as a “biased” estimate. We don’t get into what that means in this class, but to correct for the bias, we divide by $n-1$ instead.

The number you computed in Step 05 is called the sample variance. It is a measure of the spread in a data set. It has very nice theoretical properties. The variance plays an important role in Statistics. We denote the sample variance by the symbol $s^2$.

It can be shown that the sample variance is an unbiased estimator of the true population variance (denoted $\sigma^2$). This means that the sample variance can be considered a reasonable estimator of the population variance. If the sample size is large, this estimator tends to be very good.

Results from Step 5

The sum of the squared deviations is the sum of the values in Column 3. This sum equals 2.20. We divide the sum of Column 3 ($2.20$) by $n-1=5-1=4$ to get the sample variance, $s^2$:

\[s^2=\frac{sum}{n-1}=\frac{2.20}{5-1}=0.55\]

This is the sample variance.

Observation $x$	Deviation from the Mean $x-\bar x$	Squared Deviation from the Mean $\left(x-\bar x\right)^2$
$38.1$	$38.1-38.8=-0.7$	$(-0.7)^2=0.49$
$38.3$	$38.3-38.8=-0.5$	$(-0.5)^2=0.25$
$38.4$	$38.4-38.8=-0.4$	$(-0.4)^2=0.16$
$39.5$	$39.5-38.8=0.7$	$(0.7)^2=0.49$
$39.7$	$39.7-38.8=0.9$	$(0.9)^2=0.81$
$\bar x = 38.8$	Sum $=0$	Sum $=2.20$
Variance:	$\displaystyle{s^2=\frac{sum}{n-1}=\frac{2.20}{5-1}=0.55}$

Answer the following questions:

The temperature data for the bird flu patients are in degrees Centigrade. What are the units of the variance?

Solution

The data in Column 1 of the table are in degrees Centigrade. The mean also is in degrees Centigrade.
To get the numbers in Column 2, we subtracted the mean from each of the values in Column 1.
We squared the values in Column 2 to get Column 3. The units for this column are degrees Centigrade squared.
The sum of the numbers in Column 3 will also be in units of degrees Centigrade squared.
When we divided that sum by $n-1$, we obtained the sample variance. The sample variance has units of degrees Centigrade squared. This is not easily interpretable. It would be much easier to think about it if our measure of spread was in the same units as the data.

What operation can we do to the variance to get a quantity with units degrees Centigrade?

Solution

If we take the square root of the variance, we get a quantity that has units of degrees Centigrade. This quantity is the standard deviation.

Step 06: Take the square root of the sample variance to get the sample standard deviation.

The sample standard deviation is defined as the square root of the sample variance.

\[\text{Sample Standard Deviation} = s = \sqrt{ s^2 } = \sqrt{\strut\text{Sample Variance}}\]

The standard deviation has the same units as the original observations. We use the standard deviation heavily in statistics.

The sample standard deviation ($s$) is an estimate of the true population standard deviation ($\sigma$).

Answer the following questions:

What is the sample standard deviation, $s$, of the temperatures of the five patients with relatively uncomplicated cases of the bird flu?

Solution

The sum of the squared deviations is the sum of the values in Column 3. This sum equals 2.20. We divide the sum of Column 3 ($2.20$) by $n-1=5-1=4$ to get the sample variance, $s^2$:

$s^2=\frac{sum}{n-1}=\frac{2.20}{5-1}=0.55$

This is the sample variance.

Observation $x$	Deviation from the Mean $x-\bar x$	Squared Deviation from the Mean $\left(x-\bar x\right)^2$
$38.1$	$38.1-38.8=-0.7$	$(-0.7)^2=0.49$
$38.3$	$38.3-38.8=-0.5$	$(-0.5)^2=0.25$
$38.4$	$38.4-38.8=-0.4$	$(-0.4)^2=0.16$
$39.5$	$39.5-38.8=0.7$	$(0.7)^2=0.49$
$39.7$	$39.7-38.8=0.9$	$(0.9)^2=0.81$
$\bar x = 38.8$	Sum $=0$	Sum $=2.20$
Variance:	$\displaystyle{s^2=\frac{sum}{n-1}=\frac{2.20}{5-1}=0.55}$
Standard Deviation:	$\displaystyle{s = \sqrt{s^2}=\sqrt{0.55} \approx 0.742}$

The sample standard deviation is $s = 0.742$ degrees Centigrade.
Take a few minutes to verify that you can recreate this table on your own.
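If you would like to check your hand-built table, the six steps can be mirrored in R. This is a minimal sketch rather than the required method for the course; the variable names are chosen for illustration, and the last two lines compare the results with R's built-in `var()` and `sd()` functions.

```r
# Step 01: the observed data (degrees Centigrade, simple cases)
simple <- c(38.1, 38.3, 38.4, 39.5, 39.7)
n <- length(simple)
x_bar <- mean(simple)

# Step 02: deviations from the mean
deviation <- simple - x_bar

# Step 03: squared deviations
squared_deviation <- deviation^2

# The table from the lesson, one row per observation
hand_table <- data.frame(x = simple, deviation, squared_deviation)
hand_table

# Step 04: sum of the squared deviations (about 2.20)
sum_squares <- sum(squared_deviation)

# Step 05: divide by n - 1 to get the sample variance (about 0.55)
s2 <- sum_squares / (n - 1)

# Step 06: the square root gives the sample standard deviation (about 0.742)
s <- sqrt(s2)

# Compare with R's built-in functions
all.equal(s2, var(simple))
all.equal(s, sd(simple))
```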

Summary

Standard Deviation

The standard deviation is one number that describes the spread in a set of data. If the data points are close together the standard deviation will be smaller than if they are spread out.

At this point, it may be difficult to understand the meaning and usefulness of the standard deviation. For now, it is enough for you to recognize the following points:

The standard deviation is a measure of how spread out the data are.
If the standard deviation is large, then the data are very spread out.
If the standard deviation is zero, then all the values are identical: there is no spread in the data.
The standard deviation cannot be negative.

Variance

The variance is the square of the standard deviation. The sample variance is denoted by the symbol $s^2$. You found that the sample standard deviation for the temperatures of patients with uncomplicated cases of bird flu in the example above is $s = 0.74162$. So, the sample variance for this data set is $s^2 = 0.74162^2 = 0.550$. Be aware: if you had squared the rounded value $s = 0.742$ in the calculation, you would have gotten an answer of 0.551 instead. This would be considered incorrect!

Rounding: Use unrounded values in interim calculations. Rounding too early in the process can lead to wrong answers.
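The effect of rounding too early can be seen in a short R sketch. The variable names below are illustrative; the data are the five simple-case temperatures.

```r
# Sample standard deviation of the five simple cases, kept unrounded
s <- sd(c(38.1, 38.3, 38.4, 39.5, 39.7))

# Square the unrounded value, then round at the end
variance_correct <- round(s^2, 3)

# Round first, then square: the early rounding changes the final answer
variance_early_rounding <- round(round(s, 3)^2, 3)

variance_correct
variance_early_rounding
```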

R Instructions

To calculate the sample variance in R:

data <- c(38.1,38.3,38.4,39.5,39.7,39.1,39.5,38.9,39.2,39.9,39.7,39.0)

var(data)

[1] 0.3517424

The standard deviation and variance are two commonly used measures of the spread in a data set. Why is there more than one measure of the spread? The standard deviation and the variance each have their own pros and cons.

The variance has excellent theoretical properties. It is an unbiased estimator of the true population variance. That means that if many, many samples of $n$ observations were drawn, the variances computed for all the samples would be centered nicely around the true population variance, $\sigma^2$. Because of these benefits, the variance is regularly used in higher-level statistics applications. One drawback of the variance is that the units for the variance are the square of the units for the original data. In the bird flu example, the body temperatures were measured in degrees Centigrade. So, the variance will have units of degrees Centigrade squared $(^\circ \text{C})^2$. What does degrees Centigrade squared mean? How do you interpret this? It doesn’t make any sense. This is one of the major drawbacks of the sample variance.

Because we take the square root of the variance to get the standard deviation, the standard deviation is in the same units as the original data. This is a great advantage, and is one of the reasons that the standard deviation is commonly used to describe the spread of data.

Answer the following questions:

Enter the patient temperature data for the severe cases of bird flu into R. Then use R to calculate the numerical summaries you have learned so far. As a reminder, the temperatures of patients with a severe case of bird flu are:

39.1, 39.5, 38.9, 39.2, 39.9, 39.7, 39

What is the mean, median, standard deviation and variance of the sample?

Solution

bird_flu <- c(39.1, 39.5, 38.9, 39.2, 39.9, 39.7, 39)

mean(bird_flu)

[1] 39.32857

median(bird_flu)

[1] 39.2

sd(bird_flu)

[1] 0.377334

var(bird_flu)

[1] 0.142381

For the next two questions, consider the histograms below comparing weight (in kilograms) of men (top histogram) to elephant seals (bottom histogram).

Weight of Men Compared to Weight of Seals
<img src="Images/DiveSealsVsMenWeights-Hist.png" width="80%">

Based on the histograms, who has a greater sample mean weight, men or elephant seals?

Solution

The mean is a measure of the center of a distribution. The mean weight of the men is less than the mean weight of the seals. We can see this because the bulk of the data in the histogram for the men’s weight is to the left of the seals’. The center of the distribution of elephant seals is about 195 kg. The center of the distribution of men’s weight is located below 100 kg on the number line.

Based on the histograms, do the weights of men or elephant seals have a larger sample standard deviation?

Solution

Standard deviation is a measure of spread. You will note that the weights of the seals are more spread out than the weights of the men. Therefore, we conclude that the sample standard deviation of elephant seal weights is larger than the sample standard deviation of men’s weights.

Review of Parameters and Statistics

We have now learned some statistics that can be used to estimate population parameters. For example, we use $\bar x$ to estimate the population mean $\mu$. The sample statistic $s$ estimates the true population standard deviation $\sigma$. The following table summarizes what we have done so far:

	Sample Statistic	Population Parameter
Mean	$\bar x$	$\mu$
Standard Deviation	$s$	$\sigma$
Variance	$s^2$	$\sigma^2$
$\vdots$	$\vdots$	$\vdots$

Unless otherwise specified, we will always use R to find the sample variance and sample mean. In each case, the sample statistic estimates the population parameter. The ellipses $\vdots$ in this table hint that we will add rows in the future.

Optional Reading: Formulas for $s$ and $s^2$

Formulas

For those who like formulas, the equations for the sample variance and sample standard deviation are given here.

Sample variance:

\[\displaystyle{ s^2=\frac{\sum\limits_{i=1}^n (x_i-\bar x)^2}{n-1} }\]

Sample standard deviation:

\[\displaystyle{ s=\sqrt{s^2}=\sqrt{\frac{\sum\limits_{i=1}^n (x_i-\bar x)^2}{n-1}} }\]

where $x_i$ is the $i^{th}$ observed data value, and $i=1, 2, \ldots, n$.

Unless otherwise specified, we will always use R to find the sample variance and sample mean.

Why do we divide by $n-1$?

When computing the standard deviation or the variance, we are finding a value that describes the spread of data values. It is a measure of how far the data are from the mean. Since we do not know the true mean ($\mu$), we use the sample mean ($\bar x$) to estimate it. Typically, the data will be closer to $\bar x$ than to $\mu$, since $\bar x$ was computed using the data. To compensate for this, we divide by $n-1$ rather than $n$ when we find the “average” of the squared deviations from the mean. It turns out that subtracting 1 from $n$ inflates this average by the precise amount needed to compensate for the use of $\bar x$ as an estimate for $\mu$. As a result, the sample variance ($s^2$) is a good estimator of the population variance ($\sigma^2$).
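A small simulation can make this concrete. The sketch below is not part of the lesson's data: it assumes a hypothetical normal population with true variance $\sigma^2 = 1$, draws many samples of size $n = 5$, and compares the average result of dividing by $n-1$ with the average result of dividing by $n$. The seed, sample size, and number of repetitions are arbitrary choices.

```r
set.seed(221)   # arbitrary seed so the simulation can be repeated

n <- 5          # observations per sample
reps <- 10000   # number of simulated samples

# Each row is one sample from a hypothetical population with variance 1
samples <- matrix(rnorm(n * reps, mean = 0, sd = 1), nrow = reps)

# var() divides by n - 1
var_n_minus_1 <- apply(samples, 1, var)

# Rescale to see what dividing by n would have given
var_n <- var_n_minus_1 * (n - 1) / n

mean(var_n_minus_1)   # close to the true variance, 1
mean(var_n)           # noticeably smaller, close to 0.8
```

On average, dividing by $n$ underestimates the population variance, while dividing by $n-1$ lands close to it.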

Neither the standard deviation nor the variance is resistant to outliers. This means that when there are outliers in the data set, the standard deviation and the variance become artificially large. It is worth noting that the mean is also not resistant. When there are outliers, the mean will be “pulled” in the direction of the outliers.
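To illustrate, the sketch below changes one simple-case temperature from 39.7 to 49.7, as if a data-entry error had occurred. The altered value is hypothetical and is used only to show how each statistic responds to an outlier.

```r
# Original simple-case temperatures (degrees Centigrade)
simple <- c(38.1, 38.3, 38.4, 39.5, 39.7)

# Same data with a hypothetical data-entry error: 39.7 typed as 49.7
with_outlier <- c(38.1, 38.3, 38.4, 39.5, 49.7)

# The mean is pulled toward the outlier
c(mean(simple), mean(with_outlier))

# The standard deviation is inflated by the outlier
c(sd(simple), sd(with_outlier))

# The median does not change, because it is resistant
c(median(simple), median(with_outlier))
```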

The mean and standard deviation are used to describe the center and spread when the distribution of the data is symmetric and bell-shaped. If data are not symmetric and bell-shaped, we typically use the five-number summary (discussed below) to describe the spread, because this summary is resistant.

Additional Tools to Describe the Data

Recall the five steps of the Statistical Process (and the mnemonic “Daniel Can Discern More Truth”).

Step 1:	Daniel	Design the study
Step 2:	Can	Collect data
Step 3:	Discern	Describe the data
Step 4:	More	Make inferences
Step 5:	Truth	Take action

Step 3 of this process is “Describe the data.” You have already learned about the mean, median, mode, standard deviation, variance and histograms. These can be good ways to describe the data. The following information on percentiles, quartiles, 5-number summaries, and boxplots will help you learn other common ways to describe data, especially if the data are skewed or contain outliers.

For symmetric, bell-shaped data, the mean and standard deviation provide a good description of the center and spread of the distribution. The mean and standard deviation are not sufficient to describe a distribution that is skewed or has outliers. An outlier is any observation that is very far from the others. The mean is pulled in the direction of the outlier. Also, the standard deviation is inflated by points that are very far from the mean.

You have probably had some experience with percentiles in the past, especially when you received a score on a standardized test such as the ACT. Even though percentiles are commonly used, they are generally misunderstood. Before examining the wrong site/wrong patient data, let’s review percentiles. Even if you think you understand percentiles, please study this section carefully.

Percentiles and Quartiles

Imagine a very long street with houses on one side. The houses increase in value from left to right. At the left end of the street is a small cardboard box with a leaky roof. Next door is a slightly larger cardboard box that does not leak. The houses eventually get larger and more valuable. The rightmost house on the street is a huge mansion.

Answer the following question:

There are 100 homes with increasing property values. How many fences are needed to separate the 100 properties?

Solution

In order to separate the 100 homes, 99 fences are required.

The home values are representative of data. If we have a list of data, sorted in increasing order, and we want to divide it into 100 equal groups, we only need 99 dividers (like fences) to divide up the data. The first divider is as large or larger than 1% of the data. The second divider is as large or larger than 2% of the data, and so on. The last divider, the 99th, is the value that is as large or larger than 99% of the data. These dividers are called percentiles. A percentile is a number such that a specified percentage of the data are at or below this number. For example, the 99th percentile is a number such that 99% of the data are at or below this value. As another example, half (50%) of the data lie at or below the 50th percentile. The word percent means $\div 100$. This can help you remember that the percentiles divide the data into 100 equal groups.

Quartiles are special percentiles. The word quartile is from the Latin quartus, which means “fourth.” The quartiles divide the data into four equal groups. The quartiles correspond to specific percentiles. The first quartile, Q₁, is the 25th percentile. The second quartile, Q₂, is the same as the 50th percentile or the median. The third quartile, Q₃, is equivalent to the 75th percentile.

Answer the following questions:

How many quartiles are there?

Solution

There are 3 quartiles! To divide the data into 100 equal groups, we needed 99 percentiles. To divide the data into 4 equal groups, we need 3 quartiles.
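As a quick illustration, the three quartiles can be requested together from R's `quantile()` function by passing the percentiles 0.25, 0.50, and 0.75 as decimals. The sketch below uses the seven severe-case temperatures from earlier in the lesson; the variable names are illustrative. R's default method interpolates between neighboring observations when a quartile falls between two data values, so other software may report slightly different numbers.

```r
# Temperatures (degrees Centigrade) of the seven severe cases
severe <- c(39.1, 39.5, 38.9, 39.2, 39.9, 39.7, 39.0)

# Three dividers split the data into four groups
quartiles <- quantile(severe, c(0.25, 0.50, 0.75))
quartiles

# The second quartile (50th percentile) is the median
q2 <- unname(quartiles[2])
all.equal(q2, median(severe))
```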

Wrong Site/Wrong Patient Lawsuits

Percentiles can be used to describe the center and spread of any distribution and are particularly useful when the distribution is skewed or has outliers. To explore this issue, you will use software to calculate percentiles of data on costs incurred by hospitals due to certain lawsuits. The lawsuits in question were about surgeries performed on the wrong patient, or on the right patient but the wrong part of the patient’s body (the wrong site).

But first, we need to learn how to load data into R.

R has many built-in toolboxes. R also has a vast array of toolboxes beyond the built-in ones that we must first install. This is like going to the Home Depot to buy a specialized toolbox and then storing it in your garage. We only have to “buy” it once.

To install a library, we use the install.packages("") command, where we specify the library we want in the quotes inside the parentheses.

rio is a toolbox that is very useful for loading data into R. If you haven’t already done so, install the rio library.

install.packages('rio')

While you only have to install libraries once, you have to load them every time you want to use one. It’s like going to the garage to get the toolbox you need for the job.

Now let’s load the data and calculate some percentiles!

R Instructions

Open R and load the rio library:

library(rio)

Use the import() function to load the dataset:

wrong_site <- import("https://github.com/byuistats/Math221D_Course/raw/main/Data/WrongSiteWrongPatient.xlsx")

To calculate percentiles and quartiles in R, do the following:

Datasets loaded into R may have many columns of information. To specify which column in the dataset should be used for analysis, we use the $ operator. For example, if we wanted to look only at the Wrong_Patient column in the wrong_site dataset:

wrong_site$Wrong_Patient

  [1]  250000  106900   62307  192800   20769    2680    4300   30819   23214
 [10]   26099       0   50000   66600  175000   10384   42900   52928       0
 [19]    8200    2500    6900  126300     900    7700  140000   76000   50000
 [28]  354530    5359    4300   12000   16749   35600    9045   21900    2010
 [37]   22444   50000   85000   40370   39863       0   36100   49000   48908
 [46]   19800   32200    3400       0   75000   21774    2600   30000    7300
 [55]  176940   55000    9500   55272    4690   75000   34168   83700    1005
 [64]   17419   34800   14739       0       0    1000     325   41538  108200
 [73]   63224   15000       0    3900   65657   50000  109205    3900   10000
 [82]    9900   87096   12090       0    1000       0   74701    3900   18000
 [91]       0   33499    1250       0   29813   11724  141363    3685   35508
[100]    2500   12060    5695   50582   82071   55400       0  104400     500
[109]       0   25000   10000   85000   25000       0   24100    3900 1250000
[118]   15074     550    7195  101800   11600    1000    4020   19764   25794
[127]     900   10000   35200   94100       0   16909  128400   60967   50000
[136]   50000   84751   46800  130308   43800   49242   22800   15500   11054
[145]     400   10000  104790   13064    6400  100000   17084   16300   11000
[154]   12500       0    1200       0  200000    3900    3015  172200   25000
[163]   27468  250000   21104   12500   30000   59000   46227     500  131000
[172]    2345    6000       0     670    9714      NA      NA      NA      NA
[181]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[190]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[199]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[208]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[217]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[226]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[235]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[244]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[253]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[262]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[271]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[280]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[289]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[298]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[307]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[316]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[325]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[334]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[343]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[352]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[361]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[370]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[379]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[388]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[397]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[406]      NA      NA      NA      NA      NA      NA

Use R’s quantile() function. This function requires two inputs separated by a comma: the data and the desired percentile, entered as a decimal.
To calculate the 25th percentile for the costs of surgery done on the Wrong Site:

quantile(wrong_site$Wrong_Site, .25, na.rm=TRUE)

  25% 
29496

# Note: the na.rm=TRUE removes the missing values from the dataset

The first quartile ($Q_1$), or 25th percentile, of the wrong-site data (calculated in R) is $29,496. (This result is illustrated in the figure below.) This means that in 25 percent of the wrong-site lawsuits that hospitals lost, they had to pay $29,496 or less. The 25th percentile can be written symbolically as: P₂₅ = $29,496. Other percentiles can be written the same way. For example, the 99th percentile can be written as P₉₉.
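
To see what this interpretation means in practice, here is a minimal sketch on a small made-up vector. The name `toy_costs` and its values are hypothetical and are not taken from the lesson's data file. The sketch computes the 25th percentile and then checks what fraction of the observations fall at or below it.

```r
# Hypothetical toy data, NOT taken from WrongSiteWrongPatient.xlsx
toy_costs <- c(0, 500, 1200, 3900, 10000, 25000, 50000, 85000, NA)

# 25th percentile, ignoring the missing value
p25 <- quantile(toy_costs, 0.25, na.rm = TRUE)

# Proportion of the non-missing values that are at or below p25
prop_at_or_below <- mean(toy_costs <= p25, na.rm = TRUE)

p25
prop_at_or_below
```

With eight non-missing values, two of them (25 percent) are at or below the computed percentile. In real data with ties, or with a sample size that does not divide evenly, the proportion is only approximately 25 percent.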

Percentiles (Calculated in Excel)


1st percentile	0
2nd percentile	0
3rd percentile	0
…	…
24th percentile	28633.4
25th percentile	29496
26th percentile	31067

Answer the following questions:

What is the 13th percentile of the wrong-site data?

Solution

$6343.40

quantile(wrong_site$Wrong_Site, .13, na.rm=TRUE)

   13% 
6343.4

How would you interpret the 13th percentile (assuming the 13th percentile is $6343.40)?

A. 100 of the lawsuits cost more than 13%.
B. 13% of the lawsuits cost the hospital over $6343.40.
C. In 13% of the wrong-site lawsuits, hospitals had to pay $6343.40 or less.
D. For 13% of the wrong-site lawsuits, the hospitals had to pay $6343.40 to the patient.

Solution

Correct Answer: C

Find P₉₀.

Solution

$149,963.00

quantile(wrong_site$Wrong_Site, .9, na.rm=TRUE)

   90% 
149963

The quartiles divide a sorted list of data into four equal groups. So, each group contains 25% of the data. The first quartile is the value that is greater than or equal to 25% of the data. What is another name for this number?

Solution

The 25th percentile.

What is the value of the third quartile?

Solution

$124,280.00

quantile(wrong_site$Wrong_Site, .75, na.rm=TRUE)

   75% 
124280

Half of the wrong-site lawsuit judgments were less than or equal to what value?

Solution

$68,552.00

quantile(wrong_site$Wrong_Site, .5, na.rm=TRUE)

  50% 
68552

#Or
median(wrong_site$Wrong_Site)

[1] 68552

The median is the middle observation in a sorted list of data. What percentile is always equal to the median?

Solution

The 50th percentile.
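
The relationship between quartiles, percentiles, and the median can be checked directly. This sketch uses a made-up vector (`toy_costs` is an assumption for illustration, not the lesson's data) and passes several percentiles to quantile() at once.

```r
# Hypothetical toy data, NOT taken from WrongSiteWrongPatient.xlsx
toy_costs <- c(0, 500, 1200, 3900, 10000, 25000, 50000, 85000, 250000)

# Ask for the three quartiles in a single call
quartiles <- quantile(toy_costs, c(0.25, 0.50, 0.75))
quartiles

# The 50th percentile and the median are the same number
unname(quartiles["50%"]) == median(toy_costs)
```

Giving quantile() a vector of decimals avoids repeating the call once per percentile.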

The Five-Number Summary

Another way to summarize data is with the five-number summary. The five-number summary consists of the minimum, the first quartile, the second quartile (or median), the third quartile, and the maximum.

There is a very easy way to get the five-number summary along with the mean and standard deviation: the favstats() function in the mosaic library gives us all of our favorite statistics.

As before, we will have to install the mosaic library once, then load it when we want to use it.

R Instructions

To find the values for a five-number summary in R, do the following:

Install the mosaic library (Only Once):

install.packages("mosaic")

Load the Library:

library(mosaic)

Input the data into the favstats() function:

favstats(wrong_site$Wrong_Site)

 min    Q1 median     Q3    max     mean       sd   n missing
   0 29496  68552 124280 780575 80041.24 71403.83 411       0

Answer the following questions:

Give the five-number summary for the Wrong Site data.

Solution
\[\displaystyle{\$0;~~\$29,496;~~\$68,552;~~\$124,280;~~\$780,575}\]

Some students mistakenly include the mean in the five-number summary. The third value in the five-number summary is the median.
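
As a cross-check that needs no extra package, note that the five-number summary is simply the 0th, 25th, 50th, 75th, and 100th percentiles, so quantile() can produce it directly. The sketch below runs on hypothetical data (`toy_costs` is made up). On the lesson's data it should reproduce the min, Q1, median, Q3, and max columns shown above, assuming favstats() relies on quantile()'s default method.

```r
# Hypothetical toy data, NOT taken from WrongSiteWrongPatient.xlsx
toy_costs <- c(0, 500, 1200, 3900, 10000, 25000, 50000, 85000, 250000)

# Minimum, Q1, median, Q3, maximum as percentiles 0, 25, 50, 75, 100
five_num <- quantile(toy_costs, c(0, 0.25, 0.50, 0.75, 1), na.rm = TRUE)
five_num
```

Notice that the mean does not appear anywhere in this output; the middle value is the median.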

Boxplots

A boxplot is a graphical representation of the five-number summary. Unlike the mean or standard deviation, a boxplot is resistant to outliers. That means that it won’t be “pulled” one way or the other by extraordinarily large or small values in the data as will a mean, for instance. We will illustrate the process of making a boxplot using the wrong-site data.

Follow the steps below to learn how a boxplot relates to the five-number summary. Learning what each part of the boxplot represents will enable you to interpret the plot correctly.

Step 01: To draw a boxplot, start with a number line.

Step 02: Draw a vertical line segment above each of the quartiles.

Step 03: Connect the tops and bottoms of the line segments, making a box.

Step 04: Make a smaller mark above the values corresponding to the minimum and the maximum.

Step 05: Draw a line from the left side of the box to the minimum, and draw another line from the right side of the box to the maximum.

Step 06: These last two lines look like whiskers, so this is sometimes called a box-and-whisker plot.

R Instructions

To create a boxplot in R, do the following:

Load the data file. For this example, import the file WrongSiteWrongPatient.xlsx.

library(rio)  # provides import(); install once with install.packages("rio") if needed
wrong_site <- import("https://github.com/byuistats/Math221D_Course/raw/main/Data/WrongSiteWrongPatient.xlsx")

Use the boxplot() function to get a boxplot:

boxplot(wrong_site$Wrong_Site)

# We can make it a little nicer by adding labels to the x and y axes and adding a title as follows:

boxplot(wrong_site$Wrong_Site, xlab="Wrong Site", ylab="Cost in $", main="Boxplot of Costs of Operating on the Wrong Site")
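
The hand-drawn steps earlier describe whiskers that reach all the way to the minimum and maximum. By default, R's boxplot() instead stops each whisker at the most extreme point within 1.5 times the box length and draws more distant values as individual points. The following is a minimal sketch, on hypothetical data (`toy_costs` is made up), of how to get the horizontal plot described in the steps by setting range = 0 and horizontal = TRUE.

```r
# Hypothetical toy data, NOT taken from WrongSiteWrongPatient.xlsx
toy_costs <- c(0, 500, 1200, 3900, 10000, 25000, 50000, 85000, 250000)

# range = 0 extends the whiskers to the data extremes;
# horizontal = TRUE draws the box along a horizontal number line
bp <- boxplot(toy_costs, horizontal = TRUE, range = 0,
              xlab = "Cost in $", main = "Boxplot of toy data")

# bp$stats holds the five values the plot is drawn from
bp$stats
```

The box edges are computed as "hinges," which for some sample sizes can differ slightly from the quartiles reported by quantile().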

Answer the following questions:

Create a histogram of the wrong-patient lawsuit data, located in column B of the file WrongSiteWrongPatient.xlsx. What is the shape of the wrong-patient data?

a. Skewed left
b. Symmetric
c. Skewed right
d. Multi-modal
e. Uniform

Solution

To create the histogram, use the histogram() function on the data:

histogram(wrong_site$Wrong_Patient)

From the histogram we clearly see most values bunched near the left and gradually fewer values as we move to the right along the number line, so the correct answer is ‘c. Skewed right’.
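
A rough numerical companion to the histogram is to compare the mean with the median: in right-skewed data, a few very large values tend to pull the mean above the median. The sketch below uses hypothetical data (`toy_costs` is made up) and is only a quick check, not a substitute for looking at the plot.

```r
# Hypothetical right-skewed toy data, NOT taken from WrongSiteWrongPatient.xlsx
toy_costs <- c(0, 500, 1200, 3900, 10000, 25000, 50000, 85000, 250000)

mean(toy_costs)    # pulled upward by the few large values
median(toy_costs)  # resistant to those large values

# TRUE here is consistent with a right-skewed shape
mean(toy_costs) > median(toy_costs)
```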

	Sample Statistic	Population Parameter
Mean	\(\bar x\)	\(\mu\)
Standard Deviation	\(s\)	\(\sigma\)
Variance	\(s^2\)	\(\sigma^2\)
\(\vdots\)	\(\vdots\)	\(\vdots\)