Getting into Data, dplyr and ggplot2

J. Hathaway

Becoming the Critic

Visualization of the Day

Team Discussion

Case Study 1: Critiquing Visualizations and Slack Setup

Case Study 2: Wealth and Life Expectancy (Gapminder)

Task 3: Asking the right questions

How do we know when we have given voice to data?

Asking the right questions

Hans Rosling and Data Interaction

Boaz Super (Hiring the best data scientist)

  • Logical thinking requires one additional, vital component: a commitment to intellectual honesty. That means not allowing oneself to bend to one’s desire for a particular outcome.
  • [Questions that matter] can be revealing [as to how] they think about the context of a problem as opposed to just carrying out an analysis.

Stephen Few (Effectively Communicating Numbers)

  • … a more common problem and one that is much more insidious because it is so seldom recognized, is the unintended miscommunication of quantitative information that happens because people have never learned how to communicate it effectively.
    • Most business graphs that I see fit into this category. They communicate poorly if at all.

Stephen Few’s Steps (Effectively Communicating Numbers)

  1. Message
  2. Graphics
  3. Data munging
  4. Fine-tune the message
  5. Clarify the point

Wrangling Data

The pipe %>%

You can read it as a series of imperative statements: group, then summarize, then filter. As suggested by the reading, a good way to pronounce %>% when reading code is “then”.

  • Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on.
  • You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
  • We’ll use piping frequently from now on because it considerably improves the readability of code.
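
As a minimal sketch of the rewriting rule above, the same three steps can be written as a nested call and as a pipeline. The built-in iris data and the object names nested and piped are chosen only for illustration.

```r
library(dplyr)  # attaching dplyr also makes the %>% pipe available

# Nested form: has to be read from the inside out
nested <- head(arrange(filter(iris, Species == "setosa"), Sepal.Length), 3)

# Piped form: read top to bottom, pronouncing %>% as "then"
piped <- iris %>%
  filter(Species == "setosa") %>%   # keep the setosa rows, then
  arrange(Sepal.Length) %>%         # sort by sepal length, then
  head(3)                           # keep the first three rows

# Both forms build the same data frame
identical(nested, piped)
```

The two forms should agree because x %>% f(y) is evaluated as f(x, y); only the reading order changes.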

library(dplyr) Part 1

  • filter() - keep only the rows that meet a condition.
  • arrange() - reorder the rows of your data.
  • select() - choose specific columns to keep or remove.
  • mutate() - add new (or changed) variables as columns in your data.
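
One way the four verbs might be chained together, sketched on the built-in iris data. The column name petal_ratio and the object name part1 are hypothetical and used only for this illustration.

```r
library(dplyr)

part1 <- iris %>%
  filter(Species == "virginica") %>%                 # rows: keep one species
  arrange(desc(Petal.Length)) %>%                    # rows: longest petals first
  select(Species, Petal.Length, Petal.Width) %>%     # columns: keep three
  mutate(petal_ratio = Petal.Length / Petal.Width)   # columns: add a new one

head(part1)
```

Note that filter() and arrange() act on rows, while select() and mutate() act on columns.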

library(dplyr) Part 2

  • summarise() - build summaries of the columns specified
  • group_by() - divide your data into groups. Often used with summarise()
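
A small sketch of how group_by() and summarise() are typically paired, again assuming the built-in iris data. The summary column names are made up for the example.

```r
library(dplyr)

species_summary <- iris %>%
  group_by(Species) %>%                        # one group per species
  summarise(
    n = n(),                                   # rows in each group
    mean_sepal_length = mean(Sepal.Length),    # group mean
    sd_sepal_length = sd(Sepal.Length)         # group standard deviation
  )

species_summary
```

The result has one row per group rather than one row per observation.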

Practice reading code

With your table, write this code out in an English paragraph.

Practice using dplyr

Use filter(), arrange(), select(), mutate(), group_by(), and summarise(). With library(tidyverse) loaded, tackle the following challenges.

  1. Arrange the iris data by Sepal.Length and display the first six rows.
  2. Select the Species and Petal.Width columns and put them into a new data set called testdat.
  3. Create a new table that has the mean for each variable by Species.
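  4. Read the help file for summarise_all() (run ?summarise_all) and build a new table with the means and standard deviations of each variable for each Species. Note that in dplyr 1.0.0 and later, summarise_all() is superseded by across().
  5. Look at the examples in the summarise_all() help file and see if you can find other use cases for the scoped summarise_*() or mutate_*() functions.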
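
As a hedged pointer for the last two challenges, here is how a scoped verb and its across() counterpart compare. It deliberately uses the built-in mtcars data instead of iris so the challenges stay open, and it assumes dplyr 1.0.0 or later for across().

```r
library(dplyr)

# Scoped variant: apply mean() to every non-grouping column
means_scoped <- mtcars %>%
  group_by(cyl) %>%
  summarise_all(mean)

# across() form of the same summary (assumes dplyr >= 1.0.0)
means_across <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(everything(), mean))

# The two tables should match
all.equal(means_scoped, means_across)
```

The help page for across() also describes how to pass several functions at once, which may be useful for challenge 4.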

The Grammar of Graphics

Introduction to the Grammar

Introduction to ggplot2

ggplot2 and iris data

Use the iris data to build a faceted visualization in which color, shape, and size are each mapped to a variable in a geometry layer.
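
For orientation only, a sketch of the same pattern on a different data set, so the iris task is left to you. It assumes the mpg data that ships with ggplot2. The choice of variables for each aesthetic is arbitrary.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +                  # data and position mappings
  geom_point(aes(color = class, shape = drv, size = cyl)) +  # one point layer with three more aesthetics
  facet_wrap(~ year)                                         # one panel per model year

p
```

Each piece added with + is a separate part of the grammar: the data and mappings, the geometry, and the facets.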