# Load libraries and data
library(rio)
library(mosaic)
library(tidyverse)
library(car)
survey <- import('https://github.com/byuistats/Math221D_Cannon/raw/master/Data/HighSchoolSeniors_subset.csv') %>% tibble()
group_by() + summarise()
Summarizing Data
Consider the High School survey data with 60 columns and 312 respondents.
The mosaic library is great for numerical summaries of quantitative variables using the favstats() function. With one line of code it returns the five-number summary, mean, standard deviation, sample size, and number of missing values:
favstats(survey$Height_cm)
min Q1 median Q3 max mean sd n missing
1.68 161 170 178.125 999 169.2412 53.54382 312 0
We can add a grouping variable to get the same summary for each level of a group, using the formula operator ~:
favstats(survey$Height_cm ~ survey$Gender)
survey$Gender min Q1 median Q3 max mean sd n missing
1 Female 5.50 160 162.5 167.15 182.8 158.6461 24.53056 152 0
2 Male 1.68 172 177.9 182.80 999.0 179.3065 69.47610 160 0
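The same grouped summary can also be written with the data = argument of favstats(), so the column names do not need the survey$ prefix. The sketch below uses a small made-up data frame; toy and its values are hypothetical stand-ins, not the survey data:

```r
library(mosaic)

# Hypothetical toy data standing in for the survey
toy <- data.frame(
  Height_cm = c(160, 165, 170, 175, 180, 185),
  Gender    = c("Female", "Female", "Female", "Male", "Male", "Male")
)

# Formula interface: response ~ grouping variable; data = supplies the columns
height_by_gender <- favstats(Height_cm ~ Gender, data = toy)
height_by_gender
```

With this form the first column of the result is labelled Gender rather than survey$Gender.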
This works well for one response (dependent) variable at a time. But we often want specific summaries, often by group, of more than one variable in the dataset.
We can use a combination of tidyverse functions, group_by() and summarise(), to create custom summary tables.
The group_by() function signals to R that whatever follows should be done for each level of the column(s) identified inside the parentheses. We can then “pipe” the grouped dataset into summarise() and define the summary statistics we would like. summarise() works very much like the mutate() function in that we create a name for our summary and tell R how to make it.
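To make the comparison with mutate() concrete, here is a minimal sketch on a made-up tibble (toy, group, and value are hypothetical names). Both functions use the same name = expression pattern; the difference is that mutate() keeps every row while summarise() returns one row per group:

```r
library(dplyr)

# Hypothetical toy data with two groups
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# mutate() keeps all four rows and adds the group mean as a new column
with_means <- toy %>%
  group_by(group) %>%
  mutate(group_mean = mean(value)) %>%
  ungroup()

# summarise() collapses each group to a single row
group_means <- toy %>%
  group_by(group) %>%
  summarise(group_mean = mean(value))
```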
EXAMPLE: Let’s calculate the means for Height_cm, Reaction_time, and Social_Websites_Hours for Males and Females:
clean <- survey %>%
  group_by(Gender) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE)
  )
clean
# A tibble: 2 × 4
Gender mean_height mean_react_time mean_social_media_hrs
<chr> <dbl> <dbl> <dbl>
1 Female 159. 0.717 14.4
2 Male 179. 0.639 14.0
Pro Tip: Recall that the mean() function returns NA when there are missing values in the data. Adding na.rm=TRUE tells the function to drop the missing values first, so you get the mean of the observed values.
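A quick illustration of this behaviour on a made-up vector (x is hypothetical, not a survey column):

```r
# A small vector with one missing value
x <- c(2, 4, NA, 6)

mean_with_na    <- mean(x)                # NA, because one value is missing
mean_without_na <- mean(x, na.rm = TRUE)  # mean of the observed values 2, 4, 6
n_missing       <- sum(is.na(x))          # how many values were dropped
```

Counting the missing values alongside the mean is a useful habit, since na.rm = TRUE silently reduces the sample size.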
EXAMPLE: Let’s do the same means but for handedness:
clean <- survey %>%
  group_by(Handed) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE)
  )
clean
# A tibble: 3 × 4
Handed mean_height mean_react_time mean_social_media_hrs
<chr> <dbl> <dbl> <dbl>
1 Ambidextrous 255. 0.352 20.2
2 Left-Handed 164. 1.45 14.5
3 Right-Handed 167. 0.583 13.9
To filter out outliers before summarising, drop reaction times of 1 second or more, heights of 214 cm or more (about 7 feet), and social media use of 100 hours or more:
clean <- survey %>%
  filter(Height_cm < 214,
         Reaction_time < 1,
         Social_Websites_Hours < 100) %>%
  select(Gender, Height_cm, Reaction_time, Social_Websites_Hours)
# Pipe the new clean dataset into the group_by() and summarise() as above:
clean %>%
  group_by(Gender) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    sd_ht = sd(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    sd_react_time = sd(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE),
    sd_social_hrs = sd(Social_Websites_Hours, na.rm=TRUE)
  )
# A tibble: 2 × 7
Gender mean_height sd_ht mean_react_time sd_react_time mean_social_media_hrs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female 158. 25.1 0.444 0.134 12.9
2 Male 174. 23.4 0.395 0.139 12.5
# ℹ 1 more variable: sd_social_hrs <dbl>
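The summarise() call above repeats the same mean() and sd() pattern for every column. One possible shortcut is dplyr's across(), which applies a list of functions to several columns at once. This is a sketch on made-up data (toy and its values are hypothetical), and the resulting column names follow across()'s default column_function pattern rather than the names used above:

```r
library(dplyr)

# Hypothetical toy data with one missing reaction time
toy <- tibble(
  Gender        = c("Female", "Female", "Male", "Male"),
  Height_cm     = c(160, 170, 175, 185),
  Reaction_time = c(0.40, 0.50, 0.35, NA)
)

# Apply mean and sd (both ignoring NAs) to each listed column, by Gender
toy_summary <- toy %>%
  group_by(Gender) %>%
  summarise(
    across(
      c(Height_cm, Reaction_time),
      list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE))
    )
  )

# width = Inf prints every column instead of truncating wide tibbles
print(toy_summary, width = Inf)
```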
Grouping by Multiple Variables
It is simple to get summary statistics for multiple grouping factors.
EXAMPLE: Suppose we want the same means calculated above, but for gender and handedness:
clean <- survey %>%
  group_by(Gender, Handed) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE)
  )
clean
# A tibble: 6 × 5
# Groups: Gender [2]
Gender Handed mean_height mean_react_time mean_social_media_hrs
<chr> <chr> <dbl> <dbl> <dbl>
1 Female Ambidextrous 134. 0.361 27
2 Female Left-Handed 160. 0.524 12.8
3 Female Right-Handed 159. 0.750 14.3
4 Male Ambidextrous 315. 0.348 16.8
5 Male Left-Handed 167. 2.29 16.1
6 Male Right-Handed 175. 0.420 13.5
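Notice the "# Groups: Gender [2]" line in the output: after summarising by two variables, the result is still grouped by the first one. If that is not wanted, the .groups argument of summarise() controls it. A minimal sketch on made-up data (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toy <- tibble(
  Gender    = c("Female", "Female", "Male", "Male"),
  Handed    = c("Left-Handed", "Right-Handed", "Left-Handed", "Right-Handed"),
  Height_cm = c(160, 170, 175, 185)
)

# "drop_last" removes only the last grouping variable, so the result
# stays grouped by Gender (this matches the output shown above)
still_grouped <- toy %>%
  group_by(Gender, Handed) %>%
  summarise(mean_height = mean(Height_cm), .groups = "drop_last")

# "drop" returns an ordinary, ungrouped tibble
ungrouped <- toy %>%
  group_by(Gender, Handed) %>%
  summarise(mean_height = mean(Height_cm), .groups = "drop")
```

Dropping the groups avoids surprises if the summary table is later piped into another mutate() or summarise().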
We can also use the n() function, which takes no inputs, to count the number of observations in each group:
clean <- survey %>%
  group_by(Gender, Handed) %>%
  summarise(
    mean_height = mean(Height_cm, na.rm=TRUE),
    mean_react_time = mean(Reaction_time, na.rm=TRUE),
    mean_social_media_hrs = mean(Social_Websites_Hours, na.rm=TRUE),
    N = n()
  )
clean
# A tibble: 6 × 6
# Groups: Gender [2]
Gender Handed mean_height mean_react_time mean_social_media_hrs N
<chr> <chr> <dbl> <dbl> <dbl> <int>
1 Female Ambidextrous 134. 0.361 27 3
2 Female Left-Handed 160. 0.524 12.8 17
3 Female Right-Handed 159. 0.750 14.3 132
4 Male Ambidextrous 315. 0.348 16.8 6
5 Male Left-Handed 167. 2.29 16.1 19
6 Male Right-Handed 175. 0.420 13.5 135
This shows that the sample contains only 3 female and 6 male ambidextrous students, so the means for those groups rest on very few observations and can be distorted by a single extreme value.
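When only the group sizes are needed, dplyr's count() is a possible shortcut for group_by() followed by summarise(N = n()). A sketch on made-up data (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: one row per student
toy <- tibble(
  Gender = c("Female", "Female", "Male", "Male", "Male"),
  Handed = c("Left-Handed", "Right-Handed", "Right-Handed",
             "Right-Handed", "Left-Handed")
)

# One row per Gender/Handed combination, with the group size stored in N
group_sizes <- toy %>%
  count(Gender, Handed, name = "N")
group_sizes
```

The result is an ungrouped tibble, so it can be filtered directly, for example to list only the combinations with very few students.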