SQL & databases (4 days)

Skill builder (relational database)

For this skill builder, we are exploring some important topics in relational databases. This exercise will require you to create SQL queries through python. You may want to at least scan the readings before beginning this task since this serves as an assessment of your understanding of the assigned readings.

A competent student should be able to finish the exercises within 75 minutes.

Before you start

Make sure you have installed VS-code, pandas, Altair, and datadotworld on your computer.

Also make sure you have gone through the tutorial on under course materials called SQL for Data Science: we assume that you imported datadotworld as dw (import datadotworld as dw).

Exercise 1

Readme file

A database can consist of more than one table/data set. A relational database consists of tables/data sets that share columns. These shared columns then establish the relationship between the tables, thus the name relational database. The relations are sometimes not easily found and they require careful investigations.

To understand what is in a relational database, we can start with understanding the tables and the columns within.

Here is a link to the readme file of the baseball database.

What is the name of the table that records data about pitchers in the regular seasons?

What do the HR and HBP columns mean in that table respectively?

Excercise 2 SELECT and FROM

The simplest SQL query is a query with SELECT and FROM. These are the keywords you will see again and again in SQL. Usually, when constructing a more complex query, it is easier to identify what goes into these two clauses first.

Create a query that shows all columns from the table you found in Exercise 1, save the dataframe in a variable “pitch”

You script should look something like:

result = dw.query('byuidss/cse-250-baseball-database', 
    'SELECT _______ FROM _______')

pitch = result.dataframe

Excercise 2 WHERE

The WHERE keyword allows us to filter down the table horizontally (fewer rows).

It goes after SELECT and FROM.

Using a SQL query, select all rows in the same table where HR is lesser than 10 and gs is greater than 25.

Find out what the columns mean and explain your query in words

Excercise 3 ORDER BY

ORDER BY sort the table you select by one or more columns and goes after WHERE

Using the same query in exercise 2, edit it so that the table is ordered by the year of the season(nearest to furthermost) and the player ID(alphabetically).

Excercise 4 Joins

Joins are used when you wish to create a new table through two different tables. Keep in mind that you have to identify the relationship between two tables before you can correctly join them.

JOIN goes between FROM and WHERE.

Identify the shared columns (keys) and join the table in exercise 2 with the salaries table, then filter the data so that it shows only pitchers in the year 1986.

You should get a dataframe with 306 rows.

Exercise 5 Group by

Group by is a keyword we use to lower the level of granularity of a table. Meaning we are combining rows into one by the given column(s).

Create a query that captures the number of pitchers the Washington Nationals used in each year, then sort the table by year

You should get a dataframe with 23 rows.

For the overachievers

Excercise 6

Research the order of operations for SQL and put the following keywords in that order.

  • SELECT
  • FROM
  • JOIN
  • WHERE
  • HAVING
  • ORDER BY
  • GROUP BY
  • LIMIT