library(tidyverse)
library(mosaic)
library(rio)
library(ggplot2)
<- read_csv('https://raw.githubusercontent.com/byuistats/Math221D_Cannon/master/Data/StarWarsData_clean.csv') sw
Summarizing Categorical Data
Summarizing Categorical Data
In this section we will show how to summarize data numerically and visually. We will be using survey responses about Star Wars. The survey was carried out by FiveThirtyEight about the first 6 Star Wars films. The survey contains demographic information as well as movie rankings and character favorability rankings.
Load the Data and Libraries
Numerical Summaries
One Variable
We typically summarize categorical variables with counts or proportions. When we want to summarize a single categorical variable by itself, we can use the table()
function to get counts. For example, if we want to tabulate the favorability of Han Solo:
table(sw$`Favorability_Han Solo`)
Neither favorably nor unfavorably (neutral)
44
Somewhat favorably
151
Somewhat unfavorably
8
Unfamiliar (N/A)
15
Very favorably
610
Very unfavorably
1
This shows us the counts of respondents in each response category.
We can also use the prop.table()
function to get the proportion of respondents, rather than the counts, by inputing a table.
prop.table(table(sw$`Favorability_Han Solo`))
Neither favorably nor unfavorably (neutral)
0.053075995
Somewhat favorably
0.182147165
Somewhat unfavorably
0.009650181
Unfamiliar (N/A)
0.018094089
Very favorably
0.735826297
Very unfavorably
0.001206273
NOTE: The prop.table()
function needs a table as an input, not a data column.
Question: What percent of respondents are “Very favorable” towards Han Solo?
Answer:
Question: What percent of respondents are “Very unfavorable” towards Han Solo?
Answer:
Two Variables
We can use Contingency Tables to examine associations between 2 categorical variables. A contingency table displays the counts of combinations of 2 categorical variables each being represented as a row or a column. It is easy to create a contingency table in R
by inputting 2 data columns into the table()
function. The resulting table will have rows and columns which correspond to the order of input table(row, column)
.
Let’s contrast gender with whether or not a respondent is a fan of Star Wars (Are You a Fan of SW
):
table(sw$Gender, sw$`Are You a Fan of SW?`)
No Yes
Female 158 238
Male 119 303
We can include row and column totals by wrapping our table in the addmargins()
function as follows:
addmargins(table(sw$Gender, sw$`Are You a Fan of SW?`))
No Yes Sum
Female 158 238 396
Male 119 303 422
Sum 277 541 818
This can be used to get row or column percentages. Alternatively we can use the prop.table()
function to get proportions.
prop.table(table(sw$Gender, sw$`Are You a Fan of SW?`))
No Yes
Female 0.1931540 0.2909535
Male 0.1454768 0.3704156
The default for prop.table()
is to give the overall percentages (counts / table total). So the proportions add to 1 across the whole table.
We can specify row or column percentages by specifying a “margin.” In R, margin=1
corresponds to rows and margin = 2
corresponds to columns.
Compare the difference:
prop.table(table(sw$Gender, sw$`Are You a Fan of SW?`), margin = 1)
No Yes
Female 0.3989899 0.6010101
Male 0.2819905 0.7180095
This table sums to 1 across the rows, meaning that about 60% of Females are fans of Star Wars and about 72% of Males are fans.
Now look at margin = 2
prop.table(table(sw$Gender, sw$`Are You a Fan of SW?`), margin = 2)
No Yes
Female 0.5703971 0.4399261
Male 0.4296029 0.5600739
Question: What does this table show?
Answer:
NOTE: Which margin we choose to evaluate depends on the order we input columns into the table()
function. Be sure to double check that you calculate the correct percentages.
Visual Summaries
Visually, bar charts are the optimal way to express categorical data. Pie charts, while very common, are problematic because of weaknesses in basic human perception.
We can use ggplot()
with categorical variables to get summaries of counts using the geom_bar()
geometry.
ggplot(sw, aes(x = `Are You a Fan of SW?`)) +
geom_bar()
We can add another variable to the mix to look at things by gender using the fill=
argument inside the aesthetics:
ggplot(sw, aes(x = who_shot_first, fill = Gender)) +
geom_bar()
The default for geom_bar()
is to stack bars. If we want side-by-side bars we can add a “position = ‘dodge’” to the geom_bar() function:
ggplot(sw, aes(x = who_shot_first, fill = Gender)) +
geom_bar(position = "dodge")
Dealing with missing values
The graphs above include missing values as its own category. The easiest way to deal with missing values is to create a subset of the data that is prepared for the graph we are interested in creating.
You can filter out the missing values using filter()
or use the combination of select()
with drop_na()
in the following way:
<- sw %>%
shot_first_data filter(who_shot_first != "",
!= "")
Gender
# Or using drop_na()
<- sw %>%
shot_first_data select(Gender, who_shot_first) %>%
drop_na()
NOTE: The drop_na()
function drops all rows with ANY missing values. If we use this function on the dataset with all the columns, we may end up losing information on the analysis of interest. This is why we do a select()
first. that way we only delete rows missing relevant information.
Now look at the graph without missing values.
ggplot(shot_first_data, aes(x = who_shot_first, fill = Gender)) +
geom_bar(position = "dodge")
Cleaning up the Graph
The default visualization elements in ggplot() can always be improved. Here are some options for making the chart more readable:
ggplot(shot_first_data, aes(x = who_shot_first, fill = Gender)) +
geom_bar(position = "dodge") +
theme_bw() +
labs(
x = "Which Character Shot First?",
y = "Count",
title = "Comparing response to the Question 'Who Shot First' by Gender"
)
A Certain Point of View
With categorical variables, we can group differently depending on which comparisons we would like to emphasize. Above, we grouped by responses to “who shot first” and colored by gender. If we swap the x variable and the color, we get the same bars, but arranged differently.
ggplot(shot_first_data, aes(x = Gender, fill = who_shot_first)) +
geom_bar(position = "dodge") +
theme_bw() +
labs(
x = "",
y = "Count",
title = "Comparing response to the Question 'Who Shot First' by Gender"
)
This different point of view makes it easier to see the breakdown of responses for each gender separately. We can see more clearly that the frequency of Females who do not understand the question is much more pronounced than on the Male side. Males, it seems largely agree that Han shot first.
Alternatively,
ggplot(shot_first_data, aes(x = who_shot_first, fill = Gender)) +
geom_bar(position = "dodge") +
theme_bw() +
labs(
x = "",
y = "Count",
title = "Comparing response to the Question 'Who Shot First' by Gender"
)
Alternative to a side-by-side bar chart, we can “facet” the graph which splits up the panels on one of the variables. This can be useful when we want to emphasize specific comparisons.
Separating on Gender
ggplot(shot_first_data, aes(x = who_shot_first, fill = Gender)) +
geom_bar() +
facet_wrap(~Gender)
Separating on ‘Who Shot First’
ggplot(shot_first_data, aes(x = Gender, fill = Gender)) +
geom_bar(position = "dodge") +
facet_wrap(~who_shot_first)
How you split the graph depends on which comparison is the most important. The first graph emphasizes the differences in how the question was answered. The second makes it easy to compare female and male responses.
Your Turn
Visualization
Create a bar chart for favorability of Han Solo by whether or not they are fans of Star Trek (fan_of_star_trek
).
Start by making a new dataset called trekky
that only includes the 2 relevant columns and drops the missing values.
<- sw %>% trekky
Error in parse(text = input): <text>:4:0: unexpected end of input
2: trekky <- sw %>%
3:
^
Question: What observations can you make based on the visualization?
Answer:
Proportion Table
We would like to compare what percent of female respondents do not understand the question compared to the percent of males who do not understand the question.
Create a proportion table that can answer this question:
Question: What percent of female respondents do not understand the question?
Answer:
Question: What percent of male respondents do not understand the question?
Answer: