Numerical Summaries

There are many ways to numerically summarize data. The fundamental idea is to describe the center, or most probable values of the data, as well as the spread, or the possible values of the data.

Mean

$$ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} $$

Measure of Center | 1 Quantitative Variable

The “balance point” of the data. The numerical sum of the values divided by the number of values. Typically used in tandem with the standard deviation to describe relatively symmetrically distributed data. Influenced by outliers.

R Instructions
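
A minimal sketch in base R, assuming a hypothetical numeric vector `x` (the data values are made up for illustration). It computes the mean with `mean()` and again directly from the formula above.

```r
# Hypothetical sample data (not from the original text)
x <- c(2, 4, 4, 5, 7, 9, 25)

# Built-in: sum of the values divided by the number of values
x_bar <- mean(x)                    # 8

# Same calculation written out from the formula
x_bar_manual <- sum(x) / length(x)  # 8
```

The single large value 25 pulls the mean above most of the data, which illustrates its sensitivity to outliers. If the data contain missing values, `mean(x, na.rm = TRUE)` drops them first.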

Median

$$ \text{median} = \begin{cases} \dfrac{x_{(n/2)}+x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \\ x_{((n+1)/2)} & \text{if } n \text{ is odd} \end{cases} $$

Measure of Center | 1 Quantitative Variable

The “middle data point,” i.e., the 50th percentile. Half of the data is below the median and half is above the median. Typically used in tandem with the five-number summary to describe either symmetric or skewed data. Not heavily influenced by outliers, i.e., robust.

R Instructions
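
A short sketch of the odd and even cases using base R's `median()`. The vectors `x_odd` and `x_even` are hypothetical data chosen for illustration.

```r
# Hypothetical data with an odd and an even number of values
x_odd  <- c(2, 4, 4, 5, 7, 9, 25)   # n = 7
x_even <- c(2, 4, 5, 7, 9, 25)      # n = 6

# Odd n: the single middle value of the sorted data
med_odd <- median(x_odd)    # 5

# Even n: the average of the two middle values, (5 + 7) / 2
med_even <- median(x_even)  # 6
```

Replacing 25 with 2500 in either vector leaves both medians unchanged, which is what "robust" means here.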

Mode

Most Frequent Value

Measure of Center | 1 Quantitative or Qualitative Variable

The most commonly occurring value. There may be more than one mode. Seldom used, but sometimes useful.

R Instructions
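
Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead), so one common approach is to tabulate the values and keep the most frequent. The sketch below assumes a hypothetical character vector `colors`.

```r
# Hypothetical qualitative data
colors <- c("red", "blue", "red", "green", "blue", "red")

# Count how often each value occurs
counts <- table(colors)

# Keep every value that reaches the highest count (there may be ties)
modes <- names(counts)[counts == max(counts)]  # "red"
```

Because the comparison keeps all values tied for the highest count, `modes` can hold more than one entry when the data are multimodal.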

Minimum

$x_{(1)}$

Measure of Spread | 1 Quantitative Variable

The smallest occurring data value. One of the numerical summaries in the five-number summary. Typically not useful on its own. Gives a good feel for the spread in the left tail of the distribution when used with the five-number summary.

R Instructions
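
As a small illustration of the order-statistic notation, the minimum is the first value of the sorted data. The vector `x` is hypothetical.

```r
# Hypothetical sample data
x <- c(7, 2, 25, 4, 9, 4, 5)

# Built-in minimum
x_min <- min(x)          # 2

# Same value as the first order statistic of the sorted data
x_first <- sort(x)[1]    # 2
```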

Maximum

$x_{(n)}$

Measure of Spread | 1 Quantitative Variable

The largest occurring data value. One of the numerical summaries in the five-number summary. Typically not useful on its own. Gives a good feel for the spread in the right tail of the distribution when used in the five-number summary.

R Instructions
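
The maximum is the last value of the sorted data, the n-th order statistic. A brief sketch with a hypothetical vector `x`:

```r
# Hypothetical sample data
x <- c(7, 2, 25, 4, 9, 4, 5)

# Built-in maximum
x_max <- max(x)                  # 25

# Same value as the n-th order statistic of the sorted data
x_last <- sort(x)[length(x)]     # 25
```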

Quartiles (five-number summary)

25th, 50th, 75th, and 100th Percentiles

Measure of Center & Spread | 1 Quantitative Variable

Good for describing the spread of data, typically for skewed distributions. There are four quartiles; combined with the minimum, they make up the five-number summary. The second quartile is the median (50th percentile) and the fourth quartile is the maximum (100th percentile). The first quartile (Q₁, or lower quartile) and third quartile (Q₃, or upper quartile) bound the “middle 50%” of the data; the distance between them, Q₃ − Q₁, is called the interquartile range. Comparing the interquartile range to the minimum and maximum shows how the possible values spread out around the more probable values.

R Instructions
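
A sketch of the quartiles and five-number summary in base R, assuming a hypothetical vector `x`. Note that `quantile()` and `fivenum()` use slightly different interpolation rules, so they can disagree on some data sets even though they agree here.

```r
# Hypothetical sample data
x <- c(2, 4, 4, 5, 7, 9, 25)

# First, second, and third quartiles (default interpolation method)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))  # 4, 5, 8

# Five-number summary: min, Q1, median, Q3, max
five <- fivenum(x)                             # 2, 4, 5, 8, 25

# Interquartile range, Q3 - Q1
iqr <- IQR(x)                                  # 4
```

Here the gap from Q₃ to the maximum (8 to 25) is far larger than the interquartile range, which points to a long right tail.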

Standard Deviation

$s = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}$

Measure of Spread | 1 Quantitative Variable

Measures how spread out the data are from the mean. It is never negative, and it is zero only when every value is identical. Larger values mean the data are highly variable; smaller values mean the data are more consistent. It is typically used with the mean to describe relatively symmetric data. The order of operations in the formula is important, and for this reason it is sometimes called the “root mean squared error,” though the calculations are performed in the reverse of that order: square the errors (deviations from the mean), average them, then take the square root. (Study the formula above to understand.) The denominator n − 1 is called the degrees of freedom.

R Instructions
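
To make the order of operations concrete, the sketch below computes the standard deviation step by step and compares it with base R's `sd()`. The vector `x` is hypothetical.

```r
# Hypothetical sample data
x <- c(2, 4, 4, 5, 7, 9, 25)
n <- length(x)

# Step 1: errors (deviations from the mean)
errors <- x - mean(x)

# Step 2: square them
squared <- errors^2

# Step 3: "average" them, dividing by the degrees of freedom n - 1
mean_sq <- sum(squared) / (n - 1)

# Step 4: take the square root
s_manual <- sqrt(mean_sq)

# Built-in equivalent
s <- sd(x)
```

Both `s` and `s_manual` hold the same value, since `sd()` also divides by n − 1.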

Variance

$s^2 = \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}$

Measure of Spread | 1 Quantitative Variable

Great theoretical properties, but seldom used when describing data. Difficult to interpret in context of data because it is in squared units. The standard deviation is typically used instead because it is in the original units and is thus easier to interpret.

R Instructions
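
A brief sketch showing that the variance is the square of the standard deviation, using base R's `var()` and `sd()` on a hypothetical vector `x`.

```r
# Hypothetical sample data
x <- c(2, 4, 4, 5, 7, 9, 25)

# Sample variance (divides by n - 1), in squared units
s2 <- var(x)

# Squaring the standard deviation gives the same number
s2_from_sd <- sd(x)^2
```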

Range

max − min = $x_{(n)} - x_{(1)}$

Measure of Spread | 1 Quantitative Variable

The difference between the maximum and minimum values. A general rule of thumb is that the range divided by four is roughly the standard deviation. Quick to obtain, but not as good as using the standard deviation. Was used more frequently before the advent of modern calculators.

R Instructions
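
One point worth knowing: base R's `range()` returns the minimum and maximum as a pair rather than their difference, so the range as defined here needs one more step. A sketch with a hypothetical vector `x`:

```r
# Hypothetical sample data
x <- c(2, 4, 4, 5, 7, 9, 25)

# range() returns c(min, max), not a single number
r_pair <- range(x)          # 2 25

# The range as defined above: max - min
r <- diff(range(x))         # 23

# Rule-of-thumb estimate of the standard deviation
s_rough <- r / 4            # 5.75
```

For this skewed sample the rough estimate differs noticeably from `sd(x)`, a reminder that the rule of thumb is only approximate.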

Percentile

←To the Left

Measure of Location | 1 Quantitative Variable

The percent of data that is equal to or less than a given data point. Useful for describing the relative position of a data point within a data set. If the percentile is close to 100, then the observation is one of the largest. If it is close to zero, then the observation is one of the smallest.

R Instructions
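
Following the definition above, a percentile can be sketched as the share of values at or below a given data point. The vector `x` and the chosen `value` are hypothetical.

```r
# Hypothetical sample data and a data point of interest
x <- c(2, 4, 4, 5, 7, 9, 25)
value <- 7

# Percent of the data that is less than or equal to `value`
pct <- mean(x <= value) * 100     # 5 of 7 values, about 71.4

# The empirical CDF from base R gives the same fraction
pct_ecdf <- ecdf(x)(value) * 100
```

`mean()` works on the logical vector `x <= value` because R treats `TRUE` as 1 and `FALSE` as 0.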

Proportion

$\hat{p}=\frac{x}{n}$

Measure of Center | 1 Qualitative Variable

The fraction of observations in the data that satisfy some requirement. Obtained by dividing the number of successes x by the total number of observations n. Always between 0 and 1; often reported as a percentage after multiplying by 100.

R Instructions
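
A minimal sketch of computing a proportion in base R, assuming a hypothetical vector of survey answers named `answers` where "yes" counts as a success.

```r
# Hypothetical qualitative data
answers <- c("yes", "no", "yes", "yes", "no")

# Number of successes and total number of observations
successes <- sum(answers == "yes")   # 3
n <- length(answers)                 # 5

# Sample proportion
p_hat <- successes / n               # 0.6

# Proportions for every category at once
all_props <- prop.table(table(answers))
```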

Correlation

$r = \frac{\textstyle\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1}$

Measure of Association | 2 Quantitative Variables

Describes the strength and direction of the linear association between two quantitative variables. Restricted to values between -1 and 1. A value of zero denotes no linear association between the two variables, although a nonlinear relationship may still exist. A value of 1 or -1 implies a perfect positive or perfect negative linear association, respectively.

R Instructions
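
The formula above can be sketched directly in base R and checked against `cor()`. The paired vectors `x` and `y` are hypothetical data with a strong positive trend.

```r
# Hypothetical paired data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Standardize each variable: subtract its mean, divide by its standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum the products of the standardized values and divide by n - 1
r_manual <- sum(z_x * z_y) / (n - 1)

# Built-in Pearson correlation
r <- cor(x, y)
```

Both results agree and are close to 1, matching the nearly straight-line increase of `y` with `x`.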