There are many ways to numerically summarize data. The fundamental idea is to describe the center, or most probable values of the data, as well as the spread, or the possible values of the data.
Mean | $$ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} $$ |
Measure of Center | 1 Quantitative Variable | The “balance point” of the data: the sum of the values divided by the number of values. Typically used in tandem with the standard deviation to describe relatively symmetric data. Influenced by outliers. |
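As a minimal sketch in R, the mean can be computed either directly from the formula or with the base function `mean()`. The vector `x` is made-up illustration data, not taken from any real data set.

```r
# Hypothetical data, chosen only for illustration
x <- c(2, 4, 4, 5, 10)

# Mean from the formula: sum of the values divided by the number of values
mean_by_formula <- sum(x) / length(x)

# Base R's built-in function gives the same result
mean_builtin <- mean(x)

# A single extreme value pulls the mean toward it
mean_with_outlier <- mean(c(x, 100))
```

Adding the single value 100 moves the mean from 5 to a little over 20, which illustrates why the mean is described as influenced by outliers.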
Median | $$ \text{median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)}+x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases} $$ |
Measure of Center | 1 Quantitative Variable | The “middle data point,” i.e., the 50th percentile. Half of the data is below the median and half is above the median. Typically used in tandem with the five-number summary to describe either symmetric or skewed data. Not heavily influenced by outliers, i.e., robust. |
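The odd and even cases of the median can be sketched in R as follows. The helper `median_by_formula()` and the data vectors are hypothetical names introduced only for illustration; base R's `median()` does the same job.

```r
# Hypothetical helper that follows the odd/even definition of the median
median_by_formula <- function(x) {
  s <- sort(x)          # order statistics x_(1), ..., x_(n)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # even n: average of the two middle values
  }
}

x_odd  <- c(7, 1, 3, 9, 5)        # made-up data, n = 5
x_even <- c(7, 1, 3, 9, 5, 11)    # made-up data, n = 6

med_odd  <- median_by_formula(x_odd)
med_even <- median_by_formula(x_even)

# Replacing the largest value with an extreme one leaves the median unchanged
med_outlier <- median(c(7, 1, 3, 900, 5))
```

Here `med_odd` and `med_outlier` are both 5, which shows the robustness to outliers, and `med_even` is 6.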
Mode | Most Frequent Value |
Measure of Center | 1 Quantitative or Qualitative Variable | The most commonly occurring value. There may be more than one mode. Seldom used, but sometimes useful. |
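Base R has no built-in function for the statistical mode (the function `mode()` reports an object's storage type instead), so one possible approach is to tabulate the values with `table()`. The data below are invented for illustration.

```r
# Hypothetical qualitative data
x <- c("red", "blue", "red", "green", "blue")

# Count how often each distinct value occurs
counts <- table(x)

# Keep every value whose count equals the largest count;
# this returns all modes when there is a tie
modes <- names(counts)[as.vector(counts) == max(counts)]
```

In this example both "blue" and "red" occur twice, so `modes` contains two values.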
Minimum | $x_{(1)}$ |
Measure of Spread | 1 Quantitative Variable | The smallest occurring data value. One of the numerical summaries in the five-number summary. Typically not useful on its own. Gives a good feel for the spread in the left tail of the distribution when used with the rest of the five-number summary. |
Maximum | $x_{(n)}$ |
Measure of Spread | 1 Quantitative Variable | The largest occurring data value. One of the numerical summaries in the five-number summary. Typically not useful on its own. Gives a good feel for the spread in the right tail of the distribution when used with the rest of the five-number summary. |
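A short R sketch of the minimum and maximum, using the base functions `min()` and `max()` and, equivalently, the first and last order statistics. The vector `x` is made-up data.

```r
# Hypothetical data
x <- c(12, 7, 3, 25, 8)

# Built-in functions
x_min <- min(x)
x_max <- max(x)

# The same values as the first and last order statistics, x_(1) and x_(n)
s <- sort(x)
x_first <- s[1]
x_last  <- s[length(s)]
```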
Quartiles | 25th, 50th, 75th and 100th Percentiles |
Measure of Center & Spread | 1 Quantitative Variable | Good for describing the spread of data, typically for skewed distributions. There are four quartiles. They make up the five-number summary when combined with the minimum. The second quartile is the median (50th percentile) and the fourth quartile is the maximum (100th percentile). The first quartile (Q1 or lower quartile) and third quartile (Q3 or upper quartile) bound the “middle 50%” of the data; the distance between them, Q3 − Q1, is called the interquartile range. Comparing the interquartile range to the minimum and maximum shows how the possible values spread out around the more probable values. |
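In R, one way to obtain these summaries is with the base functions `quantile()`, `fivenum()`, and `IQR()`. The sketch below uses made-up data chosen so that the quartiles fall exactly on data values. On other data, different quartile conventions (for example the `type` argument of `quantile()`) can give slightly different answers.

```r
# Hypothetical data
x <- c(1, 3, 5, 7, 9)

# Minimum, Q1, median, Q3, maximum (0th, 25th, 50th, 75th, 100th percentiles)
five_number <- quantile(x)

# Tukey's five-number summary; agrees with quantile() on this data
five_number_tukey <- fivenum(x)

# Interquartile range: Q3 - Q1
iqr_builtin <- IQR(x)
iqr_by_hand <- unname(five_number["75%"] - five_number["25%"])
```

For this data the five-number summary is 1, 3, 5, 7, 9 and the interquartile range is 4.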
Standard Deviation | $s = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}$ |
Measure of Spread | 1 Quantitative Variable | Measures how spread out the data are from the mean. It is never negative, and it is zero only when all values are identical. Larger values mean the data are more variable. Smaller values mean the data are more consistent. It is typically used with the mean to describe relatively symmetric data. The order of operations in the formula is important: the deviations from the mean are squared, then averaged (using n − 1), and then square-rooted. For this reason it is sometimes loosely called the “root mean squared error,” though the calculations are performed in the reverse order of that name. (Study the formula above to understand.) The denominator n − 1 is called the degrees of freedom. |
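The order of operations in the formula can be traced step by step in R and compared with the base function `sd()`. The data and intermediate variable names are made up for illustration.

```r
# Hypothetical data with mean 5
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Step 1: deviations from the mean ("error")
deviations <- x - mean(x)

# Step 2: square them ("squared")
squared <- deviations^2

# Step 3: average them using n - 1 degrees of freedom ("mean")
mean_squared <- sum(squared) / (n - 1)

# Step 4: take the square root ("root")
s_by_formula <- sqrt(mean_squared)

# Built-in equivalent
s_builtin <- sd(x)
```

For this data the squared deviations sum to 36, so the standard deviation is the square root of 36 / 4, which is 3.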
Variance | $s^2 = \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}$ |
Measure of Spread | 1 Quantitative Variable | Great theoretical properties, but seldom used when describing data. Difficult to interpret in the context of the data because it is in squared units. The standard deviation is typically used instead because it is in the original units and is thus easier to interpret. |
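A brief R sketch of the relationship between the variance and the standard deviation, using the base functions `var()` and `sd()` on made-up data.

```r
# Hypothetical data, e.g. lengths measured in centimeters
x <- c(2, 4, 4, 5, 10)

# Variance from the formula (units: squared centimeters)
var_by_formula <- sum((x - mean(x))^2) / (length(x) - 1)

# Built-in equivalent
var_builtin <- var(x)

# Taking the square root returns to the original units (centimeters)
sd_from_var <- sqrt(var_builtin)
```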
Range | $\text{max} - \text{min} = x_{(n)} - x_{(1)}$ |
Measure of Spread | 1 Quantitative Variable | The difference between the maximum and minimum values. A general rule of thumb is that the range divided by four is roughly the standard deviation. Quick to obtain, but not as informative as the standard deviation because it uses only the two most extreme values. Was used more frequently before the advent of modern calculators. |
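Note that R's base function `range()` returns the minimum and maximum as a pair rather than their difference, so the range as defined here needs one more step. The sketch below uses made-up data and also shows that the divide-by-four rule is only a rough guide.

```r
# Hypothetical data
x <- c(2, 4, 4, 5, 10)

# range() returns c(min, max); diff() turns the pair into a single number
min_and_max <- range(x)
data_range  <- diff(min_and_max)

# Equivalent by hand
data_range_by_hand <- max(x) - min(x)

# Rule-of-thumb estimate of the standard deviation, next to the actual value
sd_rough  <- data_range / 4   # 2 for this data
sd_actual <- sd(x)            # 3 for this data
```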
Percentile | Percent of data to the left |
Measure of Location | 1 Quantitative Variable | The percent of data that is equal to or less than a given data point. Useful for describing the relative position of a data point within a data set. If the percentile is close to 100, then the observation is one of the largest. If it is close to zero, then the observation is one of the smallest. |
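One way to sketch this definition in R is to compute the share of values that are less than or equal to the data point of interest. Base R's `ecdf()` gives the same share. The scores and the chosen `value` are invented for illustration.

```r
# Hypothetical test scores and a data point of interest
scores <- c(55, 60, 70, 85, 90)
value  <- 85

# Percent of the data equal to or less than the value
percentile <- mean(scores <= value) * 100

# Same calculation using the empirical cumulative distribution function
percentile_ecdf <- ecdf(scores)(value) * 100
```

Four of the five scores are at or below 85, so the score of 85 sits at the 80th percentile of this small data set.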
Proportion | $\hat{p}=\frac{x}{n}$ |
Measure of Center | 1 Qualitative Variable | The fraction of observations in the data that satisfy some requirement. Obtained by dividing the number of successes x by the total number of observations n. Often reported as a percentage by multiplying by 100. |
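A minimal R sketch of the sample proportion, assuming made-up survey responses in which "yes" counts as a success.

```r
# Hypothetical qualitative data
responses <- c("yes", "no", "yes", "yes", "no")

# Number of successes x and number of observations n
x <- sum(responses == "yes")
n <- length(responses)

# Sample proportion from the formula
p_hat <- x / n

# Shortcut: the mean of a logical vector is the proportion of TRUE values
p_hat_shortcut <- mean(responses == "yes")

# Expressed as a percentage
p_hat_percent <- p_hat * 100
```

Three of the five responses are "yes", so the proportion is 0.6, or 60%.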
Correlation | $r = \frac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1}$ |
Measure of Association | 2 Quantitative Variables | Describes the strength and direction of the linear association between two quantitative variables. Restricted to values between -1 and 1. A value of zero denotes no linear association between the two variables, although a nonlinear relationship may still be present. A value of 1 or -1 implies a perfect positive or perfect negative linear association, respectively. |
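The correlation formula can be sketched in R by standardizing each variable, multiplying the standardized values pairwise, and dividing the sum by n − 1. Base R's `cor()` returns the same number. The paired data below are made up for illustration.

```r
# Hypothetical paired data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Standardize each variable: subtract its mean, divide by its standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Correlation from the formula
r_by_formula <- sum(z_x * z_y) / (n - 1)

# Built-in equivalent
r_builtin <- cor(x, y)

# A perfectly linear increasing relationship gives r = 1
r_perfect <- cor(x, 2 * x + 1)
```

For this data r is about 0.77, a fairly strong positive linear association.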