SQL queries are written in the following pattern:
SELECT -- <columns> and <column calculations>
FROM -- <table name>
JOIN -- <table name>
ON -- <columns to join>
WHERE -- <filter condition>
GROUP BY -- <subsets for column calculations>
ORDER BY -- <sort order of the returned rows>
LIMIT -- <number of rows to return>
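For example, a full query in this pattern, using the AllstarFull and BattingPost tables that appear later in these notes, might look like the sketch below (the table and column names come from the Lahman database; the query itself is only illustrative):
SELECT asf.yearid, count(DISTINCT asf.playerid) as num_players
FROM AllstarFull as asf
JOIN BattingPost as bp
    ON asf.playerid = bp.playerid AND asf.yearid = bp.yearid
WHERE asf.yearid > 1999
GROUP BY asf.yearid
ORDER BY num_players DESC
LIMIT 5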
Connecting to SQLite: Lahman SQLite
Download the SQLite file: Lahman SQLite
What is SQLite?
- Wikipedia: SQLite is a popular choice as embedded database software for local/client storage in application software such as web browsers. It is arguably the most widely deployed database engine, as it is used today by several widespread browsers, operating systems, and embedded systems (such as mobile phones), among others. SQLite has bindings to many programming languages.
- SQLite.org: SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is the most widely deployed database in the world with more applications than we can count, including several high-profile projects.
- Codecademy: SQLite is a database engine. It is software that allows users to interact with a relational database. In SQLite, a database is stored in a single file — a trait that distinguishes it from other database engines. This fact allows for a great deal of accessibility: copying a database is no more complicated than copying the file that stores the data, sharing a database can mean sending an email attachment.
Working with SQLite files in Python
# %%
import pandas as pd
import altair as alt
import numpy as np
import sqlite3
# %%
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)
# %%
# See the tables in the database (sqlite_master lists every object in the file)
table = pd.read_sql_query(
    "SELECT * FROM sqlite_master WHERE type='table'",
    con)
print(table.filter(['name']))
print('\n\n')
# Row 8 is collegeplaying; the sql column holds each table's CREATE statement
print(table.sql[8])
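As a quick sketch, you can preview a few rows of any table the same way; here the AllstarFull table (used later in these notes) is assumed to be in the file:
# %%
# Preview a few rows of one table to see its columns
preview = pd.read_sql_query('SELECT * FROM AllstarFull LIMIT 5', con)
print(preview)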
Work with a subset in pandas
Often, you can pull a small subset of the data and work through the logic in Python to make sure you have the logic right.
- Pull only the small parts of each table that you need.
- The small part should be a complete slice of the data, such as one season. For example, let's use 2017.
- Then use Pandas to work out all the table join logic.
- Check your work against the SQL call.
How can we check our SQL logic?
For seasons after 1999, which year had the most players who were selected as All-Stars but didn't play in the All-Star game?
- Provide a summary of how many games, hits, and at-bats those players had in that year's postseason.
pd.read_sql_query(
    '''
    SELECT bp.yearid, sum(ab) as ab, sum(h) as h,
        sum(g) as games, count(DISTINCT bp.playerid) as num_players,
        asf.gp, asf.gameid
    FROM BattingPost as bp
    JOIN AllstarFull as asf
        ON bp.playerid = asf.playerid AND
            bp.yearid = asf.yearid
    WHERE bp.yearid > 1999
        AND gp == 0
    GROUP BY bp.yearid
    ORDER BY bp.yearid
    ''',
    con)
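To pull the answer out, you can save the query result to a DataFrame and sort by the player count. This is a minimal sketch; results is a hypothetical name for the DataFrame returned by the pd.read_sql_query() call above, which the original code does not assign:
# `results` = the DataFrame returned by the query above (not assigned in the original)
print(results.sort_values('num_players', ascending=False).head(1))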
Let’s start by pulling a subset of our two tables
bp = pd.read_sql_query(
    '''
    SELECT *
    FROM battingpost
    WHERE yearid == 2017
    ''', con)

alst = pd.read_sql_query(
    '''
    SELECT *
    FROM allstarfull
    WHERE yearid == 2017
    ''', con)
Now we can join them
bp_noals = (bp
    .merge(alst.filter(['playerID', 'yearID', 'GP']),
           on = ['playerID', 'yearID'])
    .query('GP == 0'))
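Note that .merge() defaults to an inner join, so only players who appear in both 2017 tables are kept, and .query('GP == 0') then keeps just the players who were selected as All-Stars but did not play in the game.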
Using .groupby(), .agg(), and .reset_index(), let's recreate the year 2017 line. We can use 'sum' and 'nunique' in our .agg() method.
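A minimal sketch of that check, assuming the column names AB, H, G, and playerID as they appear in the BattingPost table pulled above (adjust the casing if your copy of the database differs):
# Recreate the 2017 row of the SQL output from the joined pandas table
check_2017 = (bp_noals
    .groupby('yearID')
    .agg(
        ab = ('AB', 'sum'),
        h = ('H', 'sum'),
        games = ('G', 'sum'),
        num_players = ('playerID', 'nunique'))
    .reset_index())
print(check_2017)
The row printed here should match the 2017 line returned by the SQL query above.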