Summarizing Data

Multiple Groups

Introduction

In this document, we will demonstrate how to summarize quantitative data for multiple groups in a dataset.

Load the data and libraries

library(rio)
library(mosaic)
library(tidyverse)
library(car)

big5 <- import('https://raw.githubusercontent.com/byuistats/Math221D_Cannon/master/Data/All_class_combined_personality_data.csv')

Summarizing a Quantitative Variable for Multiple Categories

Sometimes we would like to compare summary statistics between groups. Much of this class will be about how to make formal, rigorous comparisons between groups. But for now, let’s look at how to get different summaries of quantitative variables for multiple categories.

Summary Statistics

We can easily extend favstats() to output our favorite statistics for multiple groups. We first must identify the quantitative factor we want to compare. For example, we could compare agreeableness between the sexes.

# This gives us the summary statistics for Agreeableness across all groups
favstats(big5$Agreeableness)

 min Q1 median Q3 max     mean       sd   n missing
  21 67     75 81 100 73.43457 13.24909 405       0

# Adding the '~' tells R to break the data into groups (determined by the right side of the '~') and calculate the means of the variable on the left
favstats(big5$Agreeableness ~ big5$`Sex(M/F)`)

  big5$`Sex(M/F)` min Q1 median Q3 max     mean       sd   n missing
1               F  21 69     77 85 100 75.92035 12.94640 226       0
2               M  25 63     73 79  94 70.29609 12.99218 179       0

Visual Summaries by Group

We can use the exact same format as we used for favstats() for boxplot():

boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`)

Improving Graphs

Throughout this course, we will ease into making better visualizations. For now, here are some basic techniques that will usually apply to all graphing functions in R:

# Changing color by sepecifying the `col = c()`
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, col = c("red", "blue"))

# R also assigns a numerical value to `col = `.  Try different numbers
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, col = c(2,3))

boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, col = c(4,6))

# Adding better axis labels using `xlab = ` and `ylab = `:
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, xlab = "Biosex", ylab = "Trait Agreeableness")

# Adding a title:
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`, main = "Comparing Agreeableness by Biosex")

# Putting it all together:
boxplot(big5$Agreeableness ~ big5$`Sex(M/F)`,main = "Comparing Agreeableness by Biosex", xlab = "Biosex", ylab = "Trait Agreeableness", col = c(3, 4))

Your Turn

Create summary statistics for Conscientiousness based on course section:

Create side-by-side boxplot for Conscientiousness based on course section: