We will use a baseball relational database to explore SQL in Python for data science applications. Finding relationships in baseball
Completed Readings: SQL for Data Science Readings (read all links) and Why SQL is beating NoSQL, and what this means for the future of data
Use the data.world baseball url for the Data Conneection. You can read the
Connection Instructions for data.world here
Grand Questions
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
- Write an SQL query that provides playerID, yearID, and batting average for players with at least one at bat. Sort the table from highest batting average to lowest, and show the top 5 results in your report.
- Use the same query as above, but only include players with more than 10 “at bats” that year. Print the top 5 results.
- Now calculate the batting average for players over their entire careers (all years combined). Only include players with more than 100 at bats, and print the top 5 results.
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc.). Write an SQL query to get the data you need. Use Python if additional data wrangling is needed, then make a graph in Altair to visualize the comparison. Provide the visualization and the compiled Vega script that would build the visualization.