Summarizing Categorical Data

Summarizing Categorical Data

In this section we will show how to summarize data numerically and visually.

We typically summarize categorical variables with counts and proportions. Visually, an ordered bar chart is the optimal way to express categorical data. Pie charts, while very common, are problematic because of weaknesses in basic human perception.

Let’s look at survey carried out by FiveThirtyEight about the first 6 Star Wars films.

library(tidyverse)
library(mosaic)
library(rio)
library(ggplot2)

sw <- read_csv('https://raw.githubusercontent.com/byuistats/Math221D_Cannon/master/Data/StarWarsData_clean.csv')

Numerical Summaries

Use the table() function to tabulate counts for a categorical variable. For example, if we want to tabulate the favorability of Han Solo

table(sw$`Favorability_Han Solo`)

Neither favorably nor unfavorably (neutral) 
                                         44 
                         Somewhat favorably 
                                        151 
                       Somewhat unfavorably 
                                          8 
                           Unfamiliar (N/A) 
                                         15 
                             Very favorably 
                                        610 
                           Very unfavorably 
                                          1 

You can also get proportions by inputting a table into the prop.table() function:

prop.table(table(sw$`Favorability_Han Solo`))

Neither favorably nor unfavorably (neutral) 
                                0.053075995 
                         Somewhat favorably 
                                0.182147165 
                       Somewhat unfavorably 
                                0.009650181 
                           Unfamiliar (N/A) 
                                0.018094089 
                             Very favorably 
                                0.735826297 
                           Very unfavorably 
                                0.001206273 

Question: What percent of respondents are “Very favorable” towards Han Solo?
Answer:

Multiple Groups

The table() function can make a “cross table” of 2 categorical variables. The resulting table will have rows and columns which correspond to the order of input table(row, column).

Let’s contrast gender with whether or not a respondent is a fan of Star Wars (Are You a Fan of SW):

table(sw$Gender, sw$`Are You a Fan of SW?`)
        
          No Yes
  Female 158 238
  Male   119 303

We can include row and column totals by wrapping our table in the addmargins() function as follows:

addmargins(table(sw$Gender, sw$`Are You a Fan of SW?`))
        
          No Yes Sum
  Female 158 238 396
  Male   119 303 422
  Sum    277 541 818

This can be used to get row or column percentages. Alternatively we can use the prop.table() function to get proportions.

prop.table(table(sw$Gender, sw$`Are You a Fan of SW?`))
        
                No       Yes
  Female 0.1931540 0.2909535
  Male   0.1454768 0.3704156

The default for prop.table() is to give the overall percentages (counts / table total). So the proportions add to 1 across the whole table.

We can specify row or column percentages by specifying a “margin.” In R, margin=1 corresponds to rows and margin = 2 corresponds to columns.

Compare the difference:

prop.table(table(sw$Gender, sw$`Are You a Fan of SW?`), margin = 1)
        
                No       Yes
  Female 0.3989899 0.6010101
  Male   0.2819905 0.7180095

This table sums to 1 across the rows, meaning that about 60% of Females are fans of Star Wars and about 72% of Males are fans.

Now look at margin = 2

prop.table(table(sw$Gender, sw$`Are You a Fan of SW?`), margin = 2)
        
                No       Yes
  Female 0.5703971 0.4399261
  Male   0.4296029 0.5600739

Question: What does this table show?
Answer:

NOTE: Which margin we choose to evaluate depends on the order we input columns into the table() function. Be sure to double check that you calculate the correct percentages.

Visual Summaries

We can use ggplot() with categorical variables to get summaries of counts using the geom_bar() geometry.

ggplot(sw, aes(x = `Are You a Fan of SW?`)) + geom_bar()

We can add another variable to the mix to look at things by gender using the fill= argument inside the aesthetics:

ggplot(sw, aes(x = who_shot_first, fill = Gender)) + geom_bar()

The default for geom_bar() is to stack bars. If we want side-by-side bars we can add a “position = ‘dodge’” to the geom_bar() function:

ggplot(sw, aes(x = who_shot_first, fill = Gender)) + geom_bar(position = "dodge")

Dealing with missing values

The graphs above include missing values as its own category. The easiest way to deal with missing values is to create a subset of the data that is prepared for the graph we are interested in creating.

We first select only the columns that we will use in the visualization, then drop out all the missing values using drop_na() in the tidy fashion.

NOTE: The drop_na() function drops all rows with ANY missing values. If we use this function on the dataset with all the columns, we may end up losing information on the analysis of interest. This is why we do a select() first. that way we only delete rows missing relevant information.

shot_first <- sw %>%
  select(who_shot_first, Gender) %>%
  drop_na()

ggplot(shot_first, aes(x = who_shot_first, fill = Gender)) + 
  geom_bar(position = "dodge")

Cleaning up the Graph

The default visualization elements in ggplot() can always be improved. Here are some options for making the chart more readable:

ggplot(shot_first, aes(x = who_shot_first, fill = Gender)) + 
  geom_bar(position = "dodge") +
  theme_bw() +
  labs(
    x = "Which Character Shot First?",
    y = "Count",
    title = "Comparing response to the Question 'Who Shot First' by Gender" 
  )

A Certain Point of View

With categorical variables, we can group differently depending on which comparisons we would like to emphasize. Above, we grouped by responses to “who shot first” and colored by gender. If we swap the x variable and the color, we get the same bars, but arranged differently.

ggplot(shot_first, aes(x = Gender, fill = who_shot_first)) + 
  geom_bar(position = "dodge") +
  theme_bw() +
  labs(
    x = "",
    y = "Count",
    title = "Comparing response to the Question 'Who Shot First' by Gender" 
  )

This different point of view makes it easier to see the breakdown of responses for each gender separately. We can see more clearly that the frequency of Females who do not understand the question is much more pronounced than on the Male side. Males, it seems largely agree that Han shot first.

Your Turn

Visualization

Create a bar chart for favorability of Han Solo by whether or not they are fans of Star Trek (fan_of_star_trek).

Start by making a new dataset called trekky that only includes the 2 relevant columns and drops the missing values.


trekky <- sw %>%
Error: <text>:4:0: unexpected end of input
2: trekky <- sw %>%
3: 
  ^

Question: What observations can you make based on the visualization?
Answer:

Proportion Table

We would like to compare what percent of female respondents do not understand the question compared to the percent of males who do not understand the question.

Create a proportion table that can answer this question:

Question: What percent of female respondents do not understand the question?
Answer:

Question: What percent of male respondents do not understand the question?
Answer: