Quantitative Data Summaries - Multiple Groups

Introduction

In this section, we will demonstrate how to numerically and visually summarize a quantitative response variable for each level of a categorical explanatory variable.

Lesson Outcomes

  1. Create a table of summary statistics (favstats()) for multiple groups
  2. Create side-by-side boxplots comparing multiple groups
  3. Interpret side-by-side boxplots for group comparisons

Load the data and libraries

We will use Big 5 Personality data from a random sample of Brother Cannon’s students.

library(rio)
library(mosaic)
library(tidyverse)
library(car)

big5 <- import('https://raw.githubusercontent.com/byuistats/Math221D_Cannon/master/Data/All_class_combined_personality_data.csv')

Summarizing a Quantitative Variable for Multiple Categories

Sometimes we would like to compare summary statistics between groups. Much of this class will be about how to make formal, rigorous comparisons between groups. But for now, let’s look at how to get different summaries of quantitative variables for multiple categories.

Summary Statistics

We can easily extend favstats() to output our favorite statistics for multiple groups.

We must first identify the quantitative response variable we want to summarize, then tell R which categorical explanatory variable defines the groups.

For example, we could compare agreeableness between the sexes. In this case, Agreeableness is the quantitative response variable and Sex(M/F) is the categorical explanatory variable.

# This gives us the summary statistics for Agreeableness for all students combined (no grouping)
favstats(big5$Agreeableness)
 min Q1 median Q3 max     mean       sd   n missing
  21 67     75 81 100 73.43457 13.24909 405       0
# Adding the '~' tells R to break the data into groups (determined by the right side of the '~') and calculate the summary statistics of the variable on the left for each group
favstats(big5$Agreeableness ~ big5$`Sex(M/F)`)
  big5$`Sex(M/F)` min Q1 median Q3 max     mean       sd   n missing
1               F  21 69     77 85 100 75.92035 12.94640 226       0
2               M  25 63     73 79  94 70.29609 12.99218 179       0
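The grouped call above can also be written with a `data = ` argument, so the data frame is named only once. The sketch below uses a small made-up data frame so that it runs without downloading anything; `toy`, `score`, and `group` are hypothetical names for illustration and are not part of the Big 5 data. It also cross-checks the group means with base R's `tapply()`.

```r
library(mosaic)

# Hypothetical toy data, for illustration only (not the Big 5 data)
toy <- data.frame(
  score = c(70, 75, 80, 85, 60, 65, 70, 75),
  group = c("A", "A", "A", "A", "B", "B", "B", "B")
)

# Same y ~ x formula, with the data frame supplied once through `data = `
by_group <- favstats(score ~ group, data = toy)
by_group

# Base R cross-check: the mean of score within each group
group_means <- tapply(toy$score, toy$group, mean)
group_means
```

With the real data, the same pattern should look like favstats(Agreeableness ~ `Sex(M/F)`, data = big5); the backticks are needed because the column name contains parentheses and a slash.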

Visual Summaries by Group

We can pass boxplot() the exact same formula we used for favstats():

boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`)
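To see how the picture relates to the numbers, it can help to ask boxplot() for the values it draws instead of the plot itself. This is a minimal sketch on made-up data (`toy`, `score`, and `group` are hypothetical names, not columns of the Big 5 data). Note that boxplot() uses "hinges" for the edges of the box, which can differ slightly from the Q1 and Q3 reported by favstats().

```r
# Hypothetical toy data, for illustration only (not the Big 5 data)
toy <- data.frame(
  score = c(70, 75, 80, 85, 60, 65, 70, 75),
  group = c("A", "A", "A", "A", "B", "B", "B", "B")
)

# plot = FALSE returns the numbers behind the boxplot without drawing it
bp <- boxplot(score ~ group, data = toy, plot = FALSE)

bp$names  # group labels, in the order the boxes are drawn
bp$n      # number of observations in each group
bp$stats  # one column per group: lower whisker end, lower hinge,
          # median, upper hinge, upper whisker end
```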

NOTE: We will use the formula data$response ~ data$explanatory in LOTS of functions this semester. It always takes the form y ~ x, with the response on the left and the explanatory variable on the right.
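As an illustration of how the same y ~ x formula carries over to other functions, here is a short sketch using two base R functions that accept formulas, aggregate() and stripchart(). The data frame `toy` and its columns `score` and `group` are made up for this example; these particular functions are not necessarily ones used later in the course.

```r
# Hypothetical toy data, for illustration only (not the Big 5 data)
toy <- data.frame(
  score = c(70, 75, 80, 85, 60, 65, 70, 75),
  group = c("A", "A", "A", "A", "B", "B", "B", "B")
)

# Median of the response (left of '~') for each level of the
# explanatory variable (right of '~')
group_medians <- aggregate(score ~ group, data = toy, FUN = median)
group_medians

# The same formula, used to draw the individual points for each group
stripchart(score ~ group, data = toy, vertical = TRUE, pch = 16)
```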

Improving Graphs

Throughout this course, we will ease into making better visualizations. For now, here are some basic techniques that apply to most graphing functions in R:

# Changing color by specifying `col = c()`
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, col = c("red", "blue"))

# R also accepts numeric color codes for `col = `.  Try different numbers
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, col = c(2,3))

boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, col = c(4,6))

# Adding better axis labels using `xlab = ` and `ylab = `:
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, xlab = "Biosex", ylab = "Trait Agreeableness")

# Adding a title:
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, main = "Comparing Agreeableness by Biosex")

# Putting it all together:
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, main = "Comparing Agreeableness by Biosex", xlab = "Biosex", ylab = "Trait Agreeableness", col = c(3, 4))
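The numeric color codes used above are positions in R's current color palette. The following sketch, which uses only base R, shows one way to look up what each number stands for. The exact colors can differ between R versions, so no output is shown here.

```r
# Colors currently mapped to the numbers 1, 2, 3, ...
palette()

# The two colors that col = c(3, 4) refers to
palette()[c(3, 4)]

# A few of the built-in color names that can be used as strings, like "red"
head(colors())
```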

Your Turn

Create summary statistics for Conscientiousness based on course section:

Create side-by-side boxplots for Conscientiousness based on course section: