Day 13: How do I check my SQL work?

SQL queries are typed in the following pattern;

SELECT -- <columns> and <column calculations>
FROM -- <table name>
  JOIN -- <table name>
  ON -- <columns to join>
WHERE -- <filter condition>
GROUP BY -- <subsets for column calculations>
ORDER BY -- <how the output is returned in sequence>
LIMIT -- <number of rows to return>

Connecting to SQLite: Lahman SQLite

Download the sqlite file: Lahman sqlite

What is SQLite?

  • Wikipedia: SQLite is a popular choice as embedded database software for local/client storage in application software such as web browsers. It is arguably the most widely deployed database engine, as it is used today by several widespread browsers, operating systems, and embedded systems (such as mobile phones), among others. SQLite has bindings to many programming languages.
  • SQLite.org: SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is the most widely deployed database in the world with more applications than we can count, including several high-profile projects.
  • Codecademy: SQLite is a database engine. It is software that allows users to interact with a relational database. In SQLite, a database is stored in a single file — a trait that distinguishes it from other database engines. This fact allows for a great deal of accessibility: copying a database is no more complicated than copying the file that stores the data, sharing a database can mean sending an email attachment.

Working with SQLite files in Python

The Schema Table in SQLite

# %%
import pandas as pd 
import altair as alt
import numpy as np
import sqlite3

# %%
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)
# %%
# See the tables in the database
table = pd.read_sql_query(
    "SELECT * FROM sqlite_master WHERE type='table'",
    con)
print(table.filter(['name']))
print('\n\n')
# 8 is collegeplaying
print(table.sql[8])

Work with a subset in pandas

Often, you can pull a small subset of the data and work through the logic in Python to make sure you are working out the logic correctly.

  1. Only want to pull small parts of each table needed.
  2. Small part should be a a complete division. For example, lets use Idaho.
  3. Then use Pandas to work out all the table join logic.
  4. Check your work against the SQL call.

How can we check our SQL logic?

For seasons after 1999, which year had the most players selected as All Stars but didn’t play in the All Star game?

  • Provide a summary of how many games, hits, and at bats occurred by those players had in that years post season.
pd.read_sql_query(
'''
SELECT bp.yearid, sum(ab) as ab, sum(h) as h,
    sum(g) as games, count(DISTINCT bp.playerid) as num_players, 
    asf.gp, asf.gameid
FROM BattingPost as bp
JOIN AllstarFull as asf
    ON  bp.playerid = asf.playerid AND
        bp.yearid = asf.yearid
WHERE bp.yearid > 1999
    AND gp == 0
GROUP BY bp.yearid
ORDER BY bp.yearid
''',
con)

Let’s start by pulling a subset of our two tables

bp = pd.read_sql_query(
'''
SELECT *
FROM battingpost
WHERE yearid == 2017
''', con)

alst = pd.read_sql_query(
'''
SELECT *
FROM allstarfull
WHERE yearid == 2017
''', con)

Now we can join them

bp_noals = (bp.merge(
        alst.filter(['playerID', 'yearID', "GP"]), 
        on = ['playerID', 'yearID'])
    .query('GP == 0'))

Using .groupby(), agg(), and .reset_index() lets recreate the year 2017 line. We can use 'sum' and 'nunique' in our .agg() method.

Open Programming Time