library(tidyverse)
library(mosaic)
library(rio)
library(ggplot2)
sw <- read_csv('https://raw.githubusercontent.com/byuistats/Math221D_Cannon/master/Data/StarWarsData_clean.csv')
Summarizing Categorical Data
In this section we will show how to summarize data numerically and visually.
We typically summarize categorical variables with counts and proportions. Visually, an ordered bar chart is usually the clearest way to present categorical data. Pie charts, while very common, are problematic because people judge angles and areas less accurately than they judge bar lengths.
Let’s look at a survey carried out by FiveThirtyEight about the first six Star Wars films.
Numerical Summaries
Use the table() function to tabulate counts for a categorical variable. For example, to tabulate the favorability of Han Solo:
table(sw$`Favorability_Han Solo`)
Neither favorably nor unfavorably (neutral)    44
Somewhat favorably                            151
Somewhat unfavorably                            8
Unfamiliar (N/A)                               15
Very favorably                                610
Very unfavorably                                1
You can also get proportions by passing a table to the prop.table() function:
prop.table(table(sw$`Favorability_Han Solo`))
Neither favorably nor unfavorably (neutral)   0.053075995
Somewhat favorably                            0.182147165
Somewhat unfavorably                          0.009650181
Unfamiliar (N/A)                              0.018094089
Very favorably                                0.735826297
Very unfavorably                              0.001206273
Question: What percent of respondents are “Very favorable” towards Han Solo?
Answer:
Multiple Groups
The table() function can make a “cross table” of 2 categorical variables. The rows and columns of the resulting table follow the order of the inputs: table(row, column).
Let’s contrast gender with whether or not a respondent is a fan of Star Wars (Are You a Fan of SW?):
table(sw$Gender, sw$`Are You a Fan of SW?`)
          No Yes
  Female 158 238
  Male   119 303
We can include row and column totals by wrapping our table in the addmargins() function as follows:
addmargins(table(sw$Gender, sw$`Are You a Fan of SW?`))
          No Yes Sum
  Female 158 238 396
  Male   119 303 422
  Sum    277 541 818
These totals can be used to compute row or column percentages by hand (for example, 238 / 396 for the share of Females who are fans). Alternatively, we can use the prop.table() function to get proportions directly.
prop.table(table(sw$Gender, sw$`Are You a Fan of SW?`))
                No       Yes
  Female 0.1931540 0.2909535
  Male   0.1454768 0.3704156
The default for prop.table() is to give overall proportions (counts / table total), so the proportions add to 1 across the whole table.
We can request row or column proportions by specifying a “margin.” In R, margin = 1 corresponds to rows and margin = 2 corresponds to columns.
Compare the difference:
prop.table(table(sw$Gender, sw$`Are You a Fan of SW?`), margin = 1)
                No       Yes
  Female 0.3989899 0.6010101
  Male   0.2819905 0.7180095
Each row of this table sums to 1, meaning that about 60% of Females are fans of Star Wars and about 72% of Males are fans.
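To see what margin = 1 is doing, the row proportions can also be computed by hand from the row totals. The sketch below rebuilds the cross table from the counts printed above rather than from the survey file, so the object name fans and the dimension labels Gender and Fan are assumptions made for illustration only.

```r
# Rebuild the 2 x 2 cross table from the counts shown above
# (the names `fans`, `Gender`, and `Fan` are illustrative, not from the data file)
fans <- as.table(matrix(
  c(158, 119, 238, 303),          # filled column by column: No, then Yes
  nrow = 2,
  dimnames = list(Gender = c("Female", "Male"), Fan = c("No", "Yes"))
))

# Row proportions from prop.table()
row_props <- prop.table(fans, margin = 1)

# The same proportions by hand: each count divided by its row total
by_hand <- fans / rowSums(fans)

# Both approaches should agree
all.equal(as.vector(row_props), as.vector(by_hand))

# Rounded percentages are often easier to read
round(100 * row_props, 1)
```

Dividing by rowSums() works here because R recycles the two row totals down each column, so every count is matched with the total for its own row.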
Now look at margin = 2:
prop.table(table(sw$Gender, sw$`Are You a Fan of SW?`), margin = 2)
                No       Yes
  Female 0.5703971 0.4399261
  Male   0.4296029 0.5600739
Question: What does this table show?
Answer:
NOTE: Which margin to use depends on the order in which the variables were passed to table(): the first variable forms the rows (margin = 1) and the second forms the columns (margin = 2). Be sure to double-check that you calculate the intended percentages.
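The note above can be checked with a small sketch. It rebuilds the gender-by-fan counts by hand (the names fans, Gender, and Fan are assumptions for illustration) and shows that swapping the order of the variables swaps which margin gives the "within gender" proportions.

```r
# Counts copied from the cross table above; object and dimension names are illustrative
fans <- as.table(matrix(
  c(158, 119, 238, 303),
  nrow = 2,
  dimnames = list(Gender = c("Female", "Male"), Fan = c("No", "Yes"))
))

# Same counts with the variables in the opposite order, as if we had
# called table(fan, gender) instead of table(gender, fan)
swapped <- t(fans)

# With Gender in the rows, margin = 1 gives proportions within each gender
within_gender_rows <- prop.table(fans, margin = 1)

# With Gender in the columns, margin = 2 gives the same proportions
within_gender_cols <- prop.table(swapped, margin = 2)

# The two tables hold the same numbers, just transposed
all.equal(as.vector(t(within_gender_rows)), as.vector(within_gender_cols))
```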
Visual Summaries
We can use ggplot() with categorical variables to get summaries of counts using the geom_bar() geometry.
ggplot(sw, aes(x = `Are You a Fan of SW?`)) + geom_bar()
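The introduction recommends an ordered bar chart, but geom_bar() orders the bars alphabetically by default. One possible way to order bars by frequency is fct_infreq() from the forcats package (loaded with the tidyverse). The sketch below uses a small made-up data frame, toy, rather than the survey data, so the responses and counts in it are assumptions for illustration.

```r
library(ggplot2)
library(forcats)

# Hypothetical toy responses, not the survey data
toy <- data.frame(
  response = factor(c(
    "Han", "Greedo", "Han",
    "I don't understand this question",
    "Han", "Greedo"
  ))
)

# fct_infreq() reorders the factor levels from most to least frequent
toy$response <- fct_infreq(toy$response)
levels(toy$response)

# geom_bar() draws bars in level order, so the tallest bar comes first
p <- ggplot(toy, aes(x = response)) +
  geom_bar()
p
```

The same idea should carry over to the survey data by wrapping the x variable, for example aes(x = fct_infreq(who_shot_first)), although missing values would still need to be handled separately.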
We can add another variable to the mix to look at things by gender using the fill = argument inside the aesthetics:
ggplot(sw, aes(x = who_shot_first, fill = Gender)) + geom_bar()
The default for geom_bar() is to stack bars. If we want side-by-side bars, we can add position = "dodge" to the geom_bar() function:
ggplot(sw, aes(x = who_shot_first, fill = Gender)) + geom_bar(position = "dodge")
Dealing with missing values
The graphs above include missing values as their own category. The easiest way to deal with missing values is to create a subset of the data that is prepared for the graph we are interested in creating.
We first select only the columns that we will use in the visualization, then drop all the missing values using drop_na() in the tidy fashion.
NOTE: The drop_na() function drops all rows with ANY missing values. If we use this function on the dataset with all the columns, we may lose rows that are complete for the analysis of interest. This is why we do a select() first. That way we only delete rows that are missing relevant information.
shot_first <- sw %>%
  select(who_shot_first, Gender) %>%
  drop_na()
ggplot(shot_first, aes(x = who_shot_first, fill = Gender)) +
geom_bar(position = "dodge")
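To see why the select() step matters, here is a minimal sketch on a made-up four-row tibble. The toy object and its Age column are hypothetical and exist only to provide a missing value in a column the graph does not use.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: Age is an extra column that is irrelevant to the graph
toy <- tibble(
  who_shot_first = c("Han", "Greedo", NA, "Han"),
  Gender         = c("Male", "Female", "Female", "Male"),
  Age            = c(NA, "18-29", "30-44", NA)
)

# drop_na() on every column also removes rows that are only missing Age
all_cols <- toy %>% drop_na()

# Selecting first keeps every row that is complete for the two plotted columns
selected <- toy %>%
  select(who_shot_first, Gender) %>%
  drop_na()

# drop_na() also accepts column names, which keeps the other columns in place
targeted <- toy %>% drop_na(who_shot_first, Gender)

nrow(all_cols)
nrow(selected)
nrow(targeted)
```

Passing column names to drop_na() is an alternative to select() when the remaining columns are still needed later.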
Cleaning up the Graph
The default visualization elements in ggplot() can always be improved. Here are some options for making the chart more readable:
ggplot(shot_first, aes(x = who_shot_first, fill = Gender)) +
geom_bar(position = "dodge") +
theme_bw() +
labs(
x = "Which Character Shot First?",
y = "Count",
title = "Comparing response to the Question 'Who Shot First' by Gender"
)
A Certain Point of View
With categorical variables, we can group differently depending on which comparisons we would like to emphasize. Above, we grouped by responses to “who shot first” and colored by gender. If we swap the x variable and the color, we get the same bars, but arranged differently.
ggplot(shot_first, aes(x = Gender, fill = who_shot_first)) +
geom_bar(position = "dodge") +
theme_bw() +
labs(
x = "",
y = "Count",
title = "Comparing response to the Question 'Who Shot First' by Gender"
)
This different point of view makes it easier to see the breakdown of responses for each gender separately. We can see more clearly that Females who do not understand the question make up a much larger share of Female responses than the corresponding group does among Males. Males, it seems, largely agree that Han shot first.
Your Turn
Visualization
Create a bar chart for favorability of Han Solo by whether or not respondents are fans of Star Trek (fan_of_star_trek).
Start by making a new dataset called trekky that only includes the 2 relevant columns and drops the missing values.
trekky <- sw %>%
Question: What observations can you make based on the visualization?
Answer:
Proportion Table
We would like to compare the percent of female respondents who do not understand the “who shot first” question with the percent of male respondents who do not understand it.
Create a proportion table that can answer this question:
Question: What percent of female respondents do not understand the question?
Answer:
Question: What percent of male respondents do not understand the question?
Answer: