Finding Data

J. Hathaway

Quote of the day

It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once.

Francis Galton

Becoming the Critic

Visualization of the Day

Team Discussion

Case Study 3: Becoming a databender

Case Study 4: Reducing Gun Deaths (FiveThirtyEight)

Task 7: Data to Answer Questions

Being Readings

Being a good critiquer

  • What did we like?
  • How can this “bug” reporting guide relate to our reviewer feedback?

What do people do with new data?

  • If you had to summarize this page in one sentence what would you say?
  • What did you not like or disagree with?
  • Questions on their proposed ideas?

Look at the Data (1)

Look at the Data (2)

Look at the Data (3)

Vectors

Quotes from the chapter

  • Vectors are particularly important as most of the functions you will write will work with vectors. It is possible to write functions that work with tibbles (like ggplot2, dplyr, and tidyr), but the tools you need to write such functions are currently idiosyncratic and immature.
  • There is an important variation of [ called [[. [[ only ever extracts a single element, and always drops names. It’s a good idea to use it whenever you want to make it clear that you’re extracting a single item, as in a for loop. The distinction between [ and [[ is most important for lists, as we’ll see shortly.
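The second quote can be tried out directly. The snippet below is a minimal sketch in base R; the objects x and lst are made up for illustration and do not come from the chapter.

```r
# Hypothetical objects for illustration only.
x   <- c(a = 1, b = 2, c = 3)          # named atomic vector
lst <- list(a = 1:3, b = "a string")   # named list

# On an atomic vector, [ keeps the name while [[ drops it.
x["b"]     # named vector of length 1
x[["b"]]   # bare value, no name

# On a list, [ returns a smaller list while [[ returns the element itself.
lst["a"]    # a list of length 1
lst[["a"]]  # the integer vector stored under "a"
```

For lists the difference matters most: lst["a"] is still a list, so functions that expect a plain vector need lst[["a"]].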

Vectors

The chief difference between atomic vectors and lists is that atomic vectors are homogeneous, while lists can be heterogeneous. There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.
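A short base R sketch of these distinctions follows. The variable names are invented for illustration.

```r
# Atomic vectors are homogeneous: mixing types forces coercion to one type.
coerced <- c(1L, "two")
typeof(coerced)                       # every element is now character

# Lists are heterogeneous: each element keeps its own type.
mixed <- list(1L, "two", TRUE)
vapply(mixed, typeof, character(1))   # one type per element

# NA marks a missing value inside a vector, so it still has length 1.
length(NA)

# NULL stands for the absence of a vector and behaves like length 0.
length(NULL)
```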

Checking Truths

                 lgl  int  dbl  chr  list
is_logical()     x
is_integer()          x
is_double()                x
is_numeric()          x    x
is_character()                 x
is_atomic()      x    x    x    x
is_list()                            x
is_vector()      x    x    x    x    x
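The same grid can be reproduced by applying each test to one example value of each type. As an assumption made so that the sketch runs without extra packages, it uses the base R counterparts (is.logical(), is.integer(), and so on) rather than the is_*() helpers named on the slide. The base functions agree with the grid for these simple values but can differ in edge cases, for example vectors carrying attributes.

```r
# One example value per column of the grid (values chosen for illustration).
examples <- list(lgl = TRUE, int = 1L, dbl = 1.5, chr = "a", list = list(1))

# Base R counterparts of the is_*() helpers (an assumption, see lead-in).
checks <- list(
  is.logical   = is.logical,
  is.integer   = is.integer,
  is.double    = is.double,
  is.numeric   = is.numeric,
  is.character = is.character,
  is.atomic    = is.atomic,
  is.list      = is.list,
  is.vector    = is.vector
)

# Rows are tests, columns are types; TRUE corresponds to an "x" in the grid.
grid <- t(sapply(checks, function(f) vapply(examples, f, logical(1))))
grid
```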

Scalars and recycling rules

Write out this line of code and then map the full process to get to the output

Input

1:10 + 1:2

Output

#> [1] 2 4 4 6 6 8 8 10 10 12
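One way to map the process is to make the recycling explicit. In the sketch below, rep_len() repeats the shorter vector until it matches the longer one; the names long, short and recycled are illustrative only.

```r
long  <- 1:10
short <- 1:2

# R silently repeats the shorter vector to the length of the longer one.
recycled <- rep_len(short, length(long))
recycled

# Element-wise addition of the two equal-length vectors
# gives the same result as the original expression.
explicit <- long + recycled
implicit <- 1:10 + 1:2
identical(explicit, implicit)
```

Because 10 is a multiple of 2, R recycles without a warning. A longer length that is not a multiple of the shorter one still recycles, but with a warning.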

Lists

While understanding and using functions is probably more important, understanding how lists work and what they make possible is a key step toward becoming a master R programmer.

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, list(-5, "fish")))

What does this command do?

a[[c(4,2,2)]]
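To work through the question, it helps to take the index vector one step at a time. Passing a vector to [[ indexes recursively, so the sketch below unpacks the same call into three single steps, using the list a defined on this slide. The names step1, step2 and step3 are illustrative.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, list(-5, "fish")))

step1 <- a[[4]]        # the fourth element: list(-1, list(-5, "fish"))
step2 <- step1[[2]]    # its second element: list(-5, "fish")
step3 <- step2[[2]]    # the second element of that: "fish"

# The recursive form reaches the same value in one call.
identical(a[[c(4, 2, 2)]], step3)
```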

data.frame and tbl (1)

What is the difference between tibbles and data frames? Compared with data.frame(), tibble():

  • Never coerces inputs (i.e. strings stay as strings!).
  • Never adds row.names.
  • Never munges column names.
  • Only recycles length 1 inputs.
  • Evaluates its arguments lazily and in order.
  • Adds tbl_df class to output.
  • Automatically adds column names.
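Two of the points above, column-name munging and recycling, can be seen in a small comparison. This is a sketch rather than a full tour: the column names are made up, and the tibble half only runs if the tibble package happens to be installed, which is an assumption about the reader's setup.

```r
# Base data.frame(): rewrites the non-syntactic name and recycles the
# length-2 input to fill four rows.
df <- data.frame(`my var` = 1:4, y = 1:2)
names(df)
df$y

# The tibble half is guarded so the sketch still runs without the package.
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::tibble(`my var` = 1:4, y = 1L)   # a length-1 input is recycled
  print(names(tb))                               # the name is kept as written
  print(class(tb))                               # includes "tbl_df"

  # A length-2 input is not recycled to four rows; tibble() raises an error.
  res <- try(tibble::tibble(x = 1:4, y = 1:2), silent = TRUE)
  print(inherits(res, "try-error"))
}
```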

data.frame and tbl (2)

What is the difference between tibbles and data frames?

  • When printed, a tibble reports the type of each variable. data.frame objects do not.
  • When a tibble is printed to the screen, only the first ten rows are displayed. The number of columns printed depends on the window size.

tbl settings

  • Change how many rows print (a tibble with more than tibble.print_max rows shows only tibble.print_min rows): options(tibble.print_max = 20, tibble.print_min = 6)
  • Always show all rows: options(tibble.print_max = Inf)
  • Always show all columns: options(tibble.width = Inf)
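These settings are global, so it can be convenient to change them temporarily and then restore the previous values. The sketch below relies on the base R behaviour that options() returns the old values of whatever it changes. The names old and current_max are illustrative.

```r
# Set the print options and keep the previous values for later.
old <- options(tibble.print_max = 20, tibble.print_min = 6)

# Confirm the setting that is now in effect.
current_max <- getOption("tibble.print_max")
current_max

# ... print tibbles here ...

# Restore whatever was set before (unset options become unset again).
options(old)
```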