
There are many ways to numerically summarize data. The fundamental idea is to describe the center, or most probable values of the data, as well as the spread, or the possible values of the data.


Mean


$$ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} $$

Measure of Center | 1 Quantitative Variable

The “balance point” of the data. The numerical sum of the values divided by the number of values. Typically used in tandem with the standard deviation to describe relatively symmetrically distributed data. Influenced by outliers.

R Instructions
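
A minimal sketch in base R, assuming a small made-up numeric vector `x` (hypothetical data, not from any real data set). It computes the mean by hand from the formula and with the built-in `mean()`, then shows how a single outlier shifts the result.

```r
# Hypothetical sample; any numeric vector works the same way.
x <- c(2, 4, 4, 5, 7, 9, 11)

# The formula: sum of the values divided by the number of values.
mean_by_hand <- sum(x) / length(x)
mean_by_hand          # 6

# Built-in equivalent.
mean(x)               # 6

# One large outlier pulls the mean upward.
mean_with_outlier <- mean(c(x, 100))
mean_with_outlier     # 17.75
```

If the data contain missing values, `mean(x, na.rm = TRUE)` drops them before averaging.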


Median


$$ \text{Median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)}+x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases} $$

Measure of Center | 1 Quantitative Variable

The “middle data point,” i.e., the 50th percentile. Half of the data is below the median and half is above the median. Typically used in tandem with the five-number summary to describe either symmetric or skewed data. Not heavily influenced by outliers, i.e., robust.

R Instructions
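
A minimal sketch in base R, assuming two small made-up vectors (hypothetical data), one with an odd and one with an even number of values. It shows both cases of the formula and how little an extreme value moves the median.

```r
# Hypothetical samples.
x_odd  <- c(7, 2, 9, 4, 5)        # n = 5
x_even <- c(7, 2, 9, 4, 5, 11)    # n = 6

# Odd n: the middle value of the sorted data, sort(x_odd)[3].
median_odd <- median(x_odd)
median_odd            # 5

# Even n: the average of the two middle sorted values.
median_even <- median(x_even)
median_even           # 6

# Robustness: an extreme value barely changes the median.
median_outlier <- median(c(x_odd, 1000))
median_outlier        # 6
```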


Mode

Most Frequent Value

Measure of Center | 1 Quantitative or Qualitative Variable

The most commonly occurring value. There may be more than one mode. Seldom used, but sometimes useful.

R Instructions
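
Base R has no built-in function for the statistical mode (the function `mode()` reports an object's storage type instead). One possible approach, sketched here on a made-up vector of colors (hypothetical data), is to count the values with `table()` and keep whichever are most frequent.

```r
# Hypothetical qualitative data.
colors <- c("red", "blue", "red", "green", "blue", "red")

# Count how often each value occurs.
counts <- table(colors)
counts

# Keep every value whose count equals the largest count.
# More than one value is returned when there is a tie.
modes <- names(counts)[as.vector(counts) == max(counts)]
modes                 # "red"
```

The same lines work for a quantitative variable, although the values come back as character strings.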


Minimum


$x_{(1)}$

Measure of Spread | 1 Quantitative Variable

The smallest occurring data value. One of the numerical summaries in the five-number summary. Typically not useful on its own. Gives a good feel for the spread in the left tail of the distribution when used with the five-number summary.

R Instructions
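
A minimal sketch in base R, assuming a small made-up vector `x` (hypothetical data). The minimum is the first value of the sorted data, which is what the notation for the first order statistic expresses.

```r
# Hypothetical sample.
x <- c(7, 2, 9, 4, 5)

# Built-in minimum.
smallest <- min(x)
smallest              # 2

# Same value: the first element after sorting.
sort(x)[1]            # 2
```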



Maximum


$x_{(n)}$

Measure of Spread | 1 Quantitative Variable

The largest occurring data value. One of the numerical summaries in the five-number summary. Typically not useful on its own. Gives a good feel for the spread in the right tail of the distribution when used in the five-number summary.

R Instructions
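
A minimal sketch in base R, assuming a small made-up vector `x` (hypothetical data). The maximum is the last value of the sorted data, the n-th order statistic.

```r
# Hypothetical sample.
x <- c(7, 2, 9, 4, 5)

# Built-in maximum.
largest <- max(x)
largest               # 9

# Same value: the last element after sorting.
sort(x)[length(x)]    # 9
```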



Quartiles (five-number summary)

25th, 50th, 75th, and 100th Percentiles

Measure of Center & Spread | 1 Quantitative Variable

Good for describing the spread of data, typically for skewed distributions. There are four quartiles. They make up the five-number summary when combined with the minimum. The second quartile is the median (50th percentile) and the fourth quartile is the maximum (100th percentile). The first quartile (Q1 or lower quartile) and third quartile (Q3 or upper quartile) mark the boundaries of the “middle 50%” of the data; the distance between them, Q3 − Q1, is called the interquartile range. Comparing the interquartile range to the minimum and maximum shows how the possible values spread out around the more probable values.

R Instructions
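
A minimal sketch in base R, assuming a small made-up vector `x` (hypothetical data) chosen so that the quartiles fall exactly on data values. It shows three built-in routes to the five-number summary and the interquartile range.

```r
# Hypothetical sample of nine values.
x <- 1:9

# Minimum, Q1, median, Q3, maximum.
five <- fivenum(x)
five                  # 1 3 5 7 9

# The same five values, labelled by percentile.
quantile(x)

# Five-number summary plus the mean.
summary(x)

# Interquartile range: Q3 - Q1.
iqr <- IQR(x)
iqr                   # 4
```

On other data sets `fivenum()` and `quantile()` can disagree slightly, because there are several conventions for interpolating quartiles; `quantile()` exposes them through its `type` argument.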


Standard Deviation

$s = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}$

Measure of Spread | 1 Quantitative Variable

Measures how spread out the data are from the mean. It is never negative, and it is zero only when every value is identical. Larger values mean the data are highly variable. Smaller values mean the data are consistent and not as variable. It is typically used with the mean to describe relatively symmetric data. The order of operations in the formula is important, and for this reason it is sometimes called the “root mean squared error,” though the calculations are performed in the reverse of that order: first the errors (deviations from the mean) are computed, then they are squared, then averaged (dividing by n − 1), and finally the square root is taken. (Study the formula above to understand.) The denominator n − 1 is called the degrees of freedom.

R Instructions
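
A minimal sketch in base R, assuming a small made-up vector `x` (hypothetical data). It walks through the formula one operation at a time (error, squared, mean, root) and compares the result with the built-in `sd()`.

```r
# Hypothetical sample; its mean is 6.
x <- c(2, 4, 4, 5, 7, 9, 11)

# Step 1 (error): deviation of each value from the mean.
errors <- x - mean(x)

# Step 2 (squared): square each deviation.
squared_errors <- errors^2

# Step 3 (mean): add them up and divide by the degrees of freedom, n - 1.
mean_squared_error <- sum(squared_errors) / (length(x) - 1)
mean_squared_error    # 10

# Step 4 (root): take the square root.
sd_by_hand <- sqrt(mean_squared_error)
sd_by_hand            # sqrt(10), about 3.16

# Built-in equivalent; sd() also divides by n - 1.
sd(x)
```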


Variance

$s^2 = \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}$

Measure of Spread | 1 Quantitative Variable

Great theoretical properties, but seldom used when describing data. Difficult to interpret in context of data because it is in squared units. The standard deviation is typically used instead because it is in the original units and is thus easier to interpret.

R Instructions
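
A minimal sketch in base R, assuming a small made-up vector `x` (hypothetical data). It computes the variance from the formula and with the built-in `var()`, and confirms that it is the square of the standard deviation.

```r
# Hypothetical sample; its mean is 6.
x <- c(2, 4, 4, 5, 7, 9, 11)

# The formula: squared deviations summed, divided by n - 1.
var_by_hand <- sum((x - mean(x))^2) / (length(x) - 1)
var_by_hand           # 10, in squared units

# Built-in equivalent.
var(x)

# The standard deviation squared gives the same value.
sd(x)^2
```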


Range

$\text{max} - \text{min} = x_{(n)} - x_{(1)}$

Measure of Spread | 1 Quantitative Variable

The difference between the maximum and minimum values. A general rule of thumb is that the range divided by four is roughly the standard deviation. Quick to obtain, but not as good as using the standard deviation. Was used more frequently before the advent of modern calculators.

R Instructions
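
A minimal sketch in base R, assuming a small made-up vector `x` (hypothetical data). Note that R's `range()` returns the minimum and maximum as a pair rather than their difference, so the subtraction has to be done separately.

```r
# Hypothetical sample.
x <- c(2, 4, 4, 5, 7, 9, 11)

# range() returns the minimum and the maximum.
range(x)              # 2 11

# The range as a single number: max - min.
data_range <- max(x) - min(x)
data_range            # 9

# Same value via diff().
diff(range(x))        # 9

# Rule-of-thumb comparison; expect only rough agreement.
data_range / 4
sd(x)
```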


Percentile

To the Left

Measure of Location | 1 Quantitative Variable

The percent of data that is equal to or less than a given data point. Useful for describing the relative position of a data point within a data set. If the percentile is close to 100, then the observation is one of the largest. If it is close to zero, then the observation is one of the smallest.

R Instructions
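
A minimal sketch in base R, assuming a small made-up vector `x` and a made-up data point `value` (both hypothetical). It finds the percent of the data that is equal to or less than the given point, first directly and then with the empirical cumulative distribution function `ecdf()`.

```r
# Hypothetical sample and data point of interest.
x <- c(2, 4, 4, 5, 7, 9, 11)
value <- 7

# Proportion of the data equal to or less than the value.
prop_at_or_below <- mean(x <= value)
prop_at_or_below      # 5 of 7 values, about 0.714

# Expressed as a percentile.
percentile <- 100 * prop_at_or_below
percentile            # about 71.4

# ecdf(x) returns a function that gives the same proportion.
ecdf(x)(value)
```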


Proportion

$\hat{p}=\frac{x}{n}$

Measure of Center | 1 Qualitative Variable

The fraction of observations in the data that satisfy some requirement. Obtained by dividing the number of successes x by the total number of observations n. Often reported as a percentage by multiplying by 100.

R Instructions
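
A minimal sketch in base R, assuming a made-up vector of yes/no responses named `outcome` (hypothetical data) in which "yes" counts as a success.

```r
# Hypothetical qualitative data.
outcome <- c("yes", "no", "yes", "yes", "no", "yes", "no", "yes")

# Number of successes and total number of observations.
successes <- sum(outcome == "yes")   # 5
n <- length(outcome)                 # 8

# The formula: successes divided by total observations.
p_hat <- successes / n
p_hat                 # 0.625

# Shortcut: the mean of a logical vector is the proportion of TRUEs.
mean(outcome == "yes")

# Proportions for every category at once.
prop.table(table(outcome))
```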


Correlation

$r = \frac{\textstyle\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1}$

Measure of Association | 2 Quantitative Variables

Describes the strength and direction of the linear association between two quantitative variables. Restricted to values between -1 and 1. A value of zero denotes no linear association between the two variables. A value of 1 or -1 implies a perfect positive or perfect negative linear association, respectively.

R Instructions
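
A minimal sketch in base R, assuming two small made-up vectors `x` and `y` (hypothetical data). It computes the correlation from the formula, as the sum of products of standardized values divided by n − 1, and compares it with the built-in `cor()`.

```r
# Hypothetical paired observations.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Standardize each variable: subtract its mean, divide by its standard deviation.
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# The formula: sum of the products divided by n - 1.
r_by_hand <- sum(z_x * z_y) / (length(x) - 1)
r_by_hand             # about 0.775

# Built-in equivalent.
cor(x, y)

# A perfect positive linear relationship gives r = 1.
r_perfect <- cor(x, 2 * x + 1)
r_perfect
```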