Strings & Joins

J. Hathaway

Becoming the Critic

Visualization of the Day

A visualization and the steps

Visualizing Gun Deaths

  1. Be open to discovering new insights
  2. Think big but start small
  3. Design for your user
  4. Prototype to identify needs (Sketch to code)
  5. Obtain feedback early and often

“If I had asked my customers what they wanted they would have told me faster horses.”

Team Discussion

Case Study 5: I can clean your data

Case Study 6: The collapse of construction in Idaho

Task 11: Strings and grep

Male Heights

What did we learn about Male Heights (1)

Height variations within a population are largely genetic, but height variations between populations are mostly environmental, anthropometric history suggests.

  • If the average Norwegian is taller than the average Nigerian it’s because Norwegians live healthier lives. That’s why the United Nations now uses height to monitor nutrition in developing countries.

New Yorker 2004

What did we learn about Male Heights (2)

  • Yet in Northern Europe over the past twelve hundred years human stature has followed a U-shaped curve: from a high around 800 A.D., to a low sometime in the seventeenth century, and back up again.

New Yorker 2004

Regular Expressions

Some Background on RegEx

A few jokes

Special Characters

  • To include a literal single or double quote in a string you can use  to “escape” it:
  • That means if you want to include a literal backslash, you’ll need to double it up: “\”.
  • Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():

Special Characters (2)

There are a handful of other special characters. The most common are

"\n", newline, 
"\t", tab, 
you can see the complete list by requesting help on ": ?'"', or ?"'". 
You’ll also sometimes see strings like "\u00b5", this is a way of writing non-English characters that works on all platforms:

Using Regex

Don’t forget that you’re in a programming language and you have other tools at your disposal. Instead of creating one complex regular expression, it’s often easier to write a series of simpler regexps. If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.

The Game

Use https://byuistats.github.io/M335/data/randomletters.txt from Task 11

  1. Remove all the ‘e’ and ‘a’ letters and then tell me how long the string is.
  2. How many times is the name ‘jim’ in the string? Can you find any other names?
  3. Show all the sequences with 5 of the same letter in a row.
  4. Tell me which character locations have three “a”’s in a row.
  5. Split the characters so each letter is an individual item in a vector.

Relational Data and R

Defining Terms

  • A primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.
  • A foreign key uniquely identifies an observation in another table. For example, the flights$tailnum is a foreign key because it appears in the flights table where it matches each flight to a unique plane.
  • A Left Join is a mutating join.

Managing Use Problems

Duplicate Keys

Missing Keys

  • semi_join(x, y) keeps all observations in x that have a match in y.
  • anti_join(x, y) drops all observations in x that have a match in y.
  • Filtering joins can be used for other scenarios as well.

Other Cases (merge())

base::merge() can perform all four types of joins:

dplyr merge
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE),
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)

<>

  • specific dplyr verbs more clearly convey the intent of your code: they are concealed in the arguments of merge().
  • dplyr’s joins are considerably faster and don’t mess with the order of the rows.

Other Cases (SQL)

SQL is the inspiration for dplyr’s conventions, so the translation is straightforward:

dplyr SQL
inner_join(x, y, by = “z”) SELECT * FROM x INNER JOIN y USING (z)
left_join(x, y, by = “z”) SELECT * FROM x LEFT OUTER JOIN y USING (z)
right_join(x, y, by = “z”) SELECT * FROM x RIGHT OUTER JOIN y USING (z)
full_join(x, y, by = “z”) SELECT * FROM x FULL OUTER JOIN y USING (z)

<>

  • Note that “INNER” and “OUTER” are optional, and often omitted.
  • SQL supports a wider range of join types than dplyr

Thursday’s Process

Background (Learning how to learn)

Learning how to articulate good questions is strongly correlated with quality structured thinking

Plan for next class meeting

The baseball joining task for Thursday is tricky and requires focus.

  1. Those students that complete it before class will be placed in teams with other students that are having difficulties with the task.
  2. Each group will need to work on their own to structure a plan to complete the task (Think data flow diagrams).
    1. Please store your questions and diagrams with your project.
    2. Remember the structured thinking point, “For doing any discussion with [us], you were enforced to create a note explaining the problem and the solution, [showing] it to [us] before hand and then [we] [will] get involved in a discussion.”