SQL queries are typed in the following pattern;
SELECT -- <columns> and <column calculations>
FROM -- <table name>
JOIN -- <table name>
ON -- <columns to join>
WHERE -- <filter condition>
GROUP BY -- <subsets for column calculations>
HAVING -- <filter conditions on groups>
ORDER BY -- <how the output is returned in sequence>
LIMIT -- <number of rows to return>
Connecting to data.world in Python
Let’s make the connection
import datadotworld as dw
results = dw.query('byuidss/cse-250-baseball-database',
'SELECT * FROM batting LIMIT 5')
batting5 = results.dataframe
Let’s view our tables
I want to do a calculation in SQL and return it in a new column in Python?
Use the batting table to show the player and his team with his at batts and runs together with a calculated value of ab / r
that is called runs_atbat
.
- Try do complete the above statement without using the info in the questions below.
q = '''
SELECT *
FROM batting
LIMIT 5
'''
dw.query('byuidss/cse-250-baseball-database', q).dataframe
q = '''
SELECT playerid, teamid, ab, r
FROM batting
LIMIT 5
'''
dw.query('byuidss/cse-250-baseball-database', q).dataframe
q = '''
SELECT playerid, teamid, ab, r, ab/r
FROM batting
LIMIT 5
'''
batting_calc = dw.query('byuidss/cse-250-baseball-database', q).dataframe
q = '''
SELECT playerid, teamid, ab, r, ab/r as runs_atbat
FROM batting
LIMIT 5
'''
batting_calc = dw.query('byuidss/cse-250-baseball-database', q).dataframe
I want to join two tables to help in decision making
For seasons after 1999, which year had the most players selected as All Stars but didn’t play in the All Star game?
- Provide a summary of how many games, hits, and at bats occurred by those players had in that years post season.
import pandas as pd
import altair as alt
import numpy as np
import datadotworld as dw
con_url = 'byuidss/cse-250-baseball-database'
# %%
# allstar table
dw.query(con_url,
'''
SELECT *
FROM AllstarFull
WHERE
AND
LIMIT 5
''').dataframe
dw.query(con_url,
'''
SELECT yearid, -- <stuff to calculate>
FROM AllstarFull
WHERE yearid > 1999
AND gp != 1
GROUP BY --?
ORDER BY --?
''').dataframe
dw.query(con_url,
'''
SELECT *
FROM BattingPost as bp
LIMIT 5
''').dataframe
Let’s only keep players with at least one at bat in the post season
dw.query(con_url,
'''
SELECT -- <columns to keep>
FROM BattingPost as bp
JOIN AllstarFull as asf
ON -- <two columns for the join>
WHERE bp.yearid > 1999
AND gp != 1
AND -- <at bat condition>
LIMIT 15
'''
).dataframe
For seasons after 1999, which year had the most players selected as All Stars but didn’t play in the All Star game?
- Provide a summary of how many games, hits, and at bats occurred by those players had in that years post season.
dw.query('byuidss/cse-250-baseball-database',
'''
SELECT -- <lots of calculations>
FROM BattingPost as bp
JOIN AllstarFull as asf
ON bp.playerid = asf.playerid AND
bp.yearid = asf.yearid
WHERE bp.yearid > 1999
AND gp != 1
AND ab > 0
GROUP BY -- <column>
ORDER BY -- <column>
'''
).dataframe