group_by() + summarise()

Summarizing Data

Consider the High School survey data with 60 columns and 312 respondents.

# Load libraries and data

library(rio)
library(mosaic)
library(tidyverse)
library(car)

survey <- import('https://github.com/byuistats/Math221D_Cannon/raw/master/Data/HighSchoolSeniors_subset.csv') %>% tibble()

The mosaic library's favstats() function gives a quick numerical summary of a quantitative variable. One line of code returns the five-number summary, mean, standard deviation, sample size, and number of missing values:

favstats(survey$Height_cm)

  min  Q1 median      Q3 max     mean       sd   n missing
 1.68 161    170 178.125 999 169.2412 53.54382 312       0

We can add a grouping variable with the formula operator ~ to get the same summary for each level of a group:

favstats(survey$Height_cm ~ survey$Gender)

  survey$Gender  min  Q1 median     Q3   max     mean       sd   n missing
1        Female 5.50 160  162.5 167.15 182.8 158.6461 24.53056 152       0
2          Male 1.68 172  177.9 182.80 999.0 179.3065 69.47610 160       0

This works well for one response (dependent) variable at a time. But we often want specific summaries, often by group, for more than one variable in the dataset.

We can combine two tidyverse functions, group_by() and summarise(), to create custom summary tables.

group_by() signals to R that whatever follows should be done separately for each level of the column(s) named inside the parentheses. We can then “pipe” the grouped dataset into summarise() and define the summary statistics we want. summarise() works much like mutate(): we choose a name for each summary and tell R how to compute it.
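As a minimal sketch of this pattern, the toy tibble below is grouped by one column and then summarised. The values and the names toy, grp, x, and mean_x are made up for illustration and are not part of the survey data.

```r
library(dplyr)

# Toy data (hypothetical values, not from the survey)
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  x   = c(1, 3, 10, 20)
)

toy_summary <- toy %>%
  group_by(grp) %>%            # one group per distinct value of grp
  summarise(mean_x = mean(x))  # name = how to compute it, as in mutate()

toy_summary  # one row per group: mean_x is 2 for "a" and 15 for "b"
```

The result has one row per level of the grouping column, which is the shape of every summary table in the examples that follow.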

EXAMPLE: Let’s calculate the means for Height_cm, Reaction_time, and Social_Websites_Hours for Males and Females:

clean <- survey %>%
  group_by(Gender) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE)
  )

clean

# A tibble: 2 × 4
  Gender mean_height mean_react_time mean_social_media_hrs
  <chr>        <dbl>           <dbl>                 <dbl>
1 Female        159.           0.717                  14.4
2 Male          179.           0.639                  14.0

Pro Tip: Recall that mean() returns NA when the data contain missing values. Adding na.rm=TRUE tells R to drop the missing values before computing the mean, so you get a number back.
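A quick way to see the effect of na.rm is to try it on a short vector. The vector x below is hypothetical and only serves to illustrate the behavior:

```r
# Hypothetical vector with one missing value
x <- c(2, 4, NA, 6)

mean_with_na    <- mean(x)                # NA: the missing value propagates
mean_without_na <- mean(x, na.rm = TRUE)  # 4: the mean of 2, 4, and 6

mean_with_na
mean_without_na
```

Other summary functions such as sd(), median(), and sum() accept the same na.rm argument.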

EXAMPLE: Let’s do the same means but for handedness:

clean <- survey %>%
  group_by(Handed) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE)
  )

clean

# A tibble: 3 × 4
  Handed       mean_height mean_react_time mean_social_media_hrs
  <chr>              <dbl>           <dbl>                 <dbl>
1 Ambidextrous        255.           0.352                  20.2
2 Left-Handed         164.           1.45                   14.5
3 Right-Handed        167.           0.583                  13.9

Combining Tidy Functions

The summaries above are distorted by implausible values, such as the maximum height of 999 cm. We can combine filter() and select() with the functions above to remove outliers, keeping only reaction times under 1 second, heights under 214 cm (about 7 feet), and social media hours under 100.

clean <- survey %>%
  filter(Height_cm < 214,
         Reaction_time < 1,
         Social_Websites_Hours < 100) %>%
  select(Gender, Height_cm, Reaction_time, Social_Websites_Hours)

# Pipe the new clean dataset into the group_by() and summarise() as above:

clean %>%
  group_by(Gender) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    sd_ht = sd(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    sd_react_time = sd(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE),
    sd_social_hrs = sd(Social_Websites_Hours, na.rm=TRUE)
  )

# A tibble: 2 × 7
  Gender mean_height sd_ht mean_react_time sd_react_time mean_social_media_hrs
  <chr>        <dbl> <dbl>           <dbl>         <dbl>                 <dbl>
1 Female        158.  25.1           0.444         0.134                  12.9
2 Male          174.  23.4           0.395         0.139                  12.5
# ℹ 1 more variable: sd_social_hrs <dbl>
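One detail of the filtering step is worth knowing: filter() keeps only the rows where the condition is TRUE, so a row with a missing value in a filtered column is dropped along with the outliers. The sketch below uses a made-up four-row tibble (the id column and the values are hypothetical) to show this, and one possible way to keep the missing rows if that is what you want.

```r
library(dplyr)

# Hypothetical data: one outlier (1.5) and one missing reaction time
toy <- tibble(
  id            = 1:4,
  Reaction_time = c(0.4, 1.5, NA, 0.6)
)

# The comparison NA < 1 evaluates to NA, so row 3 is dropped too
kept <- toy %>%
  filter(Reaction_time < 1)

# Explicitly keep rows with a missing reaction time
kept_with_na <- toy %>%
  filter(Reaction_time < 1 | is.na(Reaction_time))

kept$id          # 1 4
kept_with_na$id  # 1 3 4
```

Whether to keep or drop the missing rows depends on the analysis. Because the summaries here already use na.rm=TRUE, either choice gives the same mean for the filtered column itself, but dropping those rows also removes the student's other measurements.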

Grouping by Multiple Variables

Getting summary statistics for a combination of grouping factors is simple: list each grouping column inside group_by(), separated by commas.

EXAMPLE: Suppose we want the same means calculated above, but for gender and handedness:

clean <- survey %>%
  group_by(Gender, Handed) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE)
  )

clean

# A tibble: 6 × 5
# Groups:   Gender [2]
  Gender Handed       mean_height mean_react_time mean_social_media_hrs
  <chr>  <chr>              <dbl>           <dbl>                 <dbl>
1 Female Ambidextrous        134.           0.361                  27  
2 Female Left-Handed         160.           0.524                  12.8
3 Female Right-Handed        159.           0.750                  14.3
4 Male   Ambidextrous        315.           0.348                  16.8
5 Male   Left-Handed         167.           2.29                   16.1
6 Male   Right-Handed        175.           0.420                  13.5
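The "# Groups: Gender [2]" line in the output means the result is still grouped by Gender: by default, summarise() removes only the last grouping variable. If you want a plain, ungrouped table, one option (assuming dplyr 1.0.0 or later) is the .groups argument. The toy tibble and the names g1, g2, and v below are hypothetical.

```r
library(dplyr)

# Hypothetical data with two grouping columns
toy <- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "y", "x", "y"),
  v  = c(1, 2, 3, 4)
)

# Default: the result stays grouped by g1 (dplyr prints a message saying so)
still_grouped <- toy %>%
  group_by(g1, g2) %>%
  summarise(mean_v = mean(v))

# .groups = "drop" returns an ungrouped tibble
ungrouped <- toy %>%
  group_by(g1, g2) %>%
  summarise(mean_v = mean(v), .groups = "drop")

group_vars(still_grouped)  # "g1"
group_vars(ungrouped)      # character(0)
```

Leftover grouping matters when you keep working with the table, because later mutate() or summarise() calls would still run once per group. Calling ungroup() on the result is an equivalent alternative.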

I can also use the n() function, with no arguments, inside summarise() to count the number of observations in each group:

clean <- survey %>%
  group_by(Gender, Handed) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE),
    N = n()
  )

clean

# A tibble: 6 × 6
# Groups:   Gender [2]
  Gender Handed       mean_height mean_react_time mean_social_media_hrs     N
  <chr>  <chr>              <dbl>           <dbl>                 <dbl> <int>
1 Female Ambidextrous        134.           0.361                  27       3
2 Female Left-Handed         160.           0.524                  12.8    17
3 Female Right-Handed        159.           0.750                  14.3   132
4 Male   Ambidextrous        315.           0.348                  16.8     6
5 Male   Left-Handed         167.           2.29                   16.1    19
6 Male   Right-Handed        175.           0.420                  13.5   135

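This shows me that there are only 3 female ambidextrous students in the sample and 6 male ambidextrous students. With so few observations, a single extreme value can dominate a group's mean, which is likely what is behind the 315 cm mean height for ambidextrous males.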
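Note that n() counts rows, including rows where a variable is missing. When a mean is computed with na.rm=TRUE, the number of values that actually went into it can be smaller than n(). A minimal sketch with made-up data (the names toy, grp, x, n_rows, and n_used are hypothetical) shows one way to report both counts:

```r
library(dplyr)

# Hypothetical data: group "a" has one missing value
toy <- tibble(
  grp = c("a", "a", "a", "b"),
  x   = c(1, NA, 3, 5)
)

counts <- toy %>%
  group_by(grp) %>%
  summarise(
    mean_x = mean(x, na.rm = TRUE),  # mean of the non-missing values
    n_rows = n(),                    # all rows in the group
    n_used = sum(!is.na(x))          # rows with a non-missing x
  )

counts  # group "a": mean_x = 2, n_rows = 3, n_used = 2
```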