Project 2: Finding Relationships in Baseball
Walkthrough
SQL Refresher
Setup
SQL setup and test
Background
We will complete six projects during the semester that each take about two weeks (four days of class). On average, a student will spend 2 hours outside of class per hour in class to complete the assigned readings, submit any Canvas items, and complete the project (for a total of 8 hours per project). The instruction for each project will be structured into sections as written on this page.
This first Background section provides context for the project. Make sure you read the background carefully to see the big picture needs and purpose of the project.
When you hear the word “relationship” what is the first thing that comes to mind? Probably not baseball. But a relationship is simply a way to describe how two or more objects are connected. There are many relationships in baseball such as those between teams and managers, players and salaries, even stadiums and concession prices.
The graphs on Data Visualizations from Best Tickets show many other relationships that exist in baseball.
Client Request
For this project, the Client wants SQL queries that they can use to retrieve data for use on their website without needing Python. They would also like to see the results in Lets-Plot charts.
Data
Every data science project should start with data, and our class projects are no different. Each project will have ‘URL’ and ‘Information’ links like the ones below. Project 3 will use the Lahman Baseball Database. You will need to download the database and set it up on your computer to complete the project. Place it inside the DS250 Projects folder within your repository next to the project_3.qmd file. Note: Right click the ‘Download’ link and select “Save Link As” to download the data to your computer.
Download: lahmansbaseballdb
Information: Lahman Data Dictionary
Setup Instructions: See SQL Setup
Readings
The Readings section will contain links to reading assignments that are required for each project, as well as optional references. Remember that you are reading this material to build skills. Take the time to comprehend the readings and the skills contained within.
We recommend reading through the assigned material once for a general understanding before the first day of each project. You will reread and reference the material multiple times as you complete the project.
- SQL Setup and References (Read)
- SQL for Data Science (Read)
Optional References
Questions and Tasks (Core)
This section lists the questions and tasks that need to be completed for the project. Your work on the project must be compiled into a report, rendered to a HTML file and uploaded in Canvas.
There are two types of questions: Core and Stretch. Core questions are required for each project. The course syllabus competencies requires specic a number of projects having all the Stretch questions achived based on your goals for the grade level you are seeking.
Download this Project2 Template template.
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
- Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
- Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
- Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
- Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Lets-Plot to visualize the comparison. What do you learn?
Questions and Tasks (Stretch)
Here is an example Stretch question(s) for this project. Your instructor may assign different Stretch question(s). You must comment in Canvas when submitting your project if you completed any of the Stretch questions.
Advanced Salary Distribution by Position (with Case Statement):
Write an SQL query that provides a summary table showing the average salary for each position (e.g., pitcher, catcher, outfielder). Position information can be found in the fielding table in the POS column.
Include the following columns:
- position
- average_salary
- total_players
- highest_salary
The highest_salary column should display the highest salary ever earned by a player in that position.
Additionally, create a new column called salary_category using a case statement:
- If the average salary is above $3 million, categorize it as “High Salary.”
- If the average salary is between $2 million and $3 million, categorize it as “Medium Salary.”
- Otherwise, categorize it as “Low Salary.”
Order the table by average salary in descending order.
Hint: Beware, it is common for a player to play multiple positions in a single year. For this analysis, each player’s salary should only be counted toward one position in a given year: the position at which they played the most games that year. This will likely require a (sub-query)[https://docs.data.world/documentation/sql/concepts/advanced/WITH.html].
Advanced Career Longevity and Performance (with Subqueries):
Calculate the average career length (in years) for players who have played at least 10 games. Then, identify the top 10 players with the longest careers (based on the number of years they played). Include their:
- playerID
- first_name
- last_name
- career_length
The career_length should be calculated as the difference between the maximum and minimum yearID for each player.
Submission:
When you have completed the report, you will need to follow this process to submit your work:
- Have the
P2_template
open in VS Code and nothing else - Click
Preview Button
in VS Code in the top right of the screen- This will render the project but also entire course work portfolio into
HTML
files for review - Confirm everything displas as you would like it to
- How you see it will be how it is viewed for grading
- If there is an error in any cell of the quarto files, the rendering will stop and you will need to fix the error before rendering again (if you get stuck post your error in Slack)
- This will render the project but also entire course work portfolio into
- Once the report is confirmed close the preview and open a VS Code
Terminal
- click
Terminal
in the top menu bar and thenNew Terminal
- click
- Type the following in the terminal
quarto render
- Then drag and drop
P2_template.qmd
into the terminal this will add the file path to the terminal command - Press Enter
- Then drag and drop
- This will render the project into a
HTML
file in the same location as the.qmd
file - To locate the file in VS Code
- Right click on the file in the file explorer and select
Reveal in File Explorer
(Win) orReveal in Finder
(Mac)
- Right click on the file in the file explorer and select
- Upload
HTML
file into Canvas
Deliverables:
Use this P2_template to submit your Client Report.
- A short elevator pitch that highlights key values or metrics from the results. Describing these key insights to interest or hook the reader to want to read more about your work. The writing style should be more technical with some creative elements. Do not summarize what you did.
- Answers to the questions | tasks. Each should include a written description of your results, code cells with comments, charts and/or tables.
- A short summary of work must be submitted in the comments in Canvas wwhen you submit the URL. Rate your own work on a scale of 1-5. 1 being poor and 5 being excellent. Include a short description of why you rated your work the way you did.
Your report should be written in quarto markdown files and rendered to an HTML File. Upload the HTML file in Canvas. (Do not submit the .qmd
file)
Resubmission:
If you submit before the due date, you will have one opportunity to resubmit the project after you have received feedback. The window for the resubmission will be open through the Wednesday following the due date of the project.