Getting into Data, dplyr and ggplot2

J. Hathaway

Becoming the Critic.

Visualization of the Day

Team Discussion

Case Study 1: Critiquing Visualizations and Slack Setup

The Google spreadsheet link
Case Study 1
How did we do?

Case Study 2: Wealth and Life Expectancy (Gapminder)

Case Study 2

Task 3: Asking the right questions

Task 3

How do we know when we have given voice to data?

Asking the right questions

Harness the power of questions & Clarify Terms

Hans Rosling and Data Interaction

I put them together, so that in each pair of country, one has twice the child mortality of the other. And this means that it’s much bigger a difference than the uncertainty of the data.
I have shown that Swedish top students know statistically significantly less about the world than the chimpanzees. The problem was not ignorance; it was preconceived ideas.
It’s a tremendous variation within Africa which we rarely often make – that it’s equal everything.
Now, this is, more or less, if you look at the average data of the countries – they are like this. Now that’s dangerous, to use average data, because there is such a lot of difference within countries.

Boaz Super (Hiring the best data scientist)

Logical thinking requires one additional, vital component: a commitment to intellectual honesty. That means not allowing oneself to bend to one’s desire for a particular outcome.
[Questions that matter] can be revealing [as to how] they think about the context of a problem as opposed to just carrying out an analysis.

Stephen Few (Effectively Communicating Numbers)

… a more common problem and one that is much more insidious because it is so seldom recognized, is the unintended miscommunication of quantitative information that happens because people have never learned how to communicate it effectively.
- Most business graphs that I see fit into this category. They communicate poorly if at all.

Stephen Few’s Steps (Effectively Communicating Numbers)

Message
Graphics
Data munging
Fine-tune the message
Clarify the point

Wrangling Data

The pipe `%>%`

You can read it as a series of imperative statements: group, then summarize, then filter. As suggested by the reading, a good way to pronounce %>% when reading code is “then”.

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on.
You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
We’ll use piping frequently from now on because it considerably improves the readability of code.
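A minimal sketch of the rewriting rule above. The vector x and the choice of round() and paste() are illustrative assumptions, not part of the reading; any function that takes the data as its first argument would work the same way.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

# Illustrative data (an assumption for this sketch)
x <- c(1.234, 5.678, 9.1011)

# x %>% f(y) is the same call as f(x, y)
piped_once  <- x %>% round(1)
nested_once <- round(x, 1)

# x %>% f(y) %>% g(z) is the same call as g(f(x, y), z)
piped_twice  <- x %>% round(1) %>% paste("cm")
nested_twice <- paste(round(x, 1), "cm")

# Both comparisons should report TRUE
identical(piped_once, nested_once)
identical(piped_twice, nested_twice)
```

The piped versions read in the order the steps happen, while the nested versions have to be read from the inside out.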

library(dplyr) Part 1

filter() - keep only the rows that meet a condition.
arrange() - reorder the rows of your data.
select() - choose specific columns to keep or remove.
mutate() - add new (changed) variables as columns to your data.
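The four verbs above can be chained in one pipeline. This is a sketch on the built-in mtcars data, which is an assumption here and is not used elsewhere in these slides; the column name wt_lb is also made up for illustration.

```r
library(dplyr)

cars_small <- mtcars %>%
  filter(cyl == 4) %>%        # keep only the 4-cylinder cars
  arrange(desc(mpg)) %>%      # sort rows, highest mpg first
  select(mpg, wt, hp) %>%     # keep three columns
  mutate(wt_lb = wt * 1000)   # wt is stored in units of 1000 lbs

head(cars_small)
```

Each verb takes a data frame and returns a data frame, which is what makes the steps easy to reorder or remove while exploring.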

library(dplyr) Part 2

summarise() - build summaries of the columns specified
group_by() - divide your data into groups. Often used with summarise()
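A short sketch of group_by() paired with summarise(), again on the built-in mtcars data (an assumption for illustration; the summary column names n_cars and mean_mpg are made up).

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%                     # one group per cylinder count
  summarise(
    n_cars   = n(),                     # number of rows in each group
    mean_mpg = mean(mpg, na.rm = TRUE)  # average mpg within each group
  )

by_cyl
```

The result has one row per group, so summarise() after group_by() shrinks the data from one row per car to one row per cylinder count.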

Practice reading code

With your table, write this code out in an English paragraph.

library(dplyr)
library(nycflights13)  # provides the flights data set

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

Practice using dplyr

Use filter(), arrange(), select(), mutate(), group_by(), and summarise(). With library(tidyverse) loaded, tackle the following challenges.

Arrange the iris data by Sepal.Length and display the first six rows.
Select the Species and Petal.Width columns and put them into a new data set called testdat.
Create a new table that has the mean for each variable by Species.
Read the help page for summarise_all() (run ?summarise_all) and build a new table with the means and standard deviations for each Species.
Look at the examples in the summarise_all() help file and see if you can find other use cases for the summarise_ or mutate_ functions.

The Grammar of Graphics

Introduction to the Grammar

Introduction to ggplot2

ggplot2 and iris data

Use the iris data to build a faceted visualization whose layer (geometry) maps color, shape, and size to variables.
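As a starting pattern, here is a sketch of the same idea on a different data set, mpg from ggplot2, so the iris exercise stays open. The choice of variables for each aesthetic is an assumption made for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # one point layer with color, shape, and size mapped to variables
  geom_point(aes(color = class, shape = drv, size = cyl), alpha = 0.7) +
  # one panel per model year
  facet_wrap(~ year) +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon")

p
```

Mapping three aesthetics at once is useful practice, though a plot for an audience usually reads better with fewer.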
