Welcome to Class!
Announcements
- Data Science Society Kickoff! Wednesday at 6 in STC 394
- The data science lab
Completing Last Week
- Quarto - “out of the frying pan and into the fire”
- Finishing the Introduction Project
- Use the QMD Template project template
- Render as HTML and upload in Canvas
The Syllabus has this section which says:
Data science community
To earn credit for the DS Community element you must complete two different tasks from the list below. At the end of the semester, you will be asked to report on which tasks you completed and what you learned from them.
- Attend Data Science Society at least once.
- Sign up for an email newsletter that will teach you more about data science. Data Science Weekly or Data Elixir are good options.
- Listen to a podcast episode about data science. Build a Career in Data Science has some excellent episodes.
- Watch a professional presentation on YouTube about data science. Be prepared to share the link and a summary of the video.
- Reach out to someone who works in a data-related field and ask them for 15 minutes of their time. Use this time to conduct an “informational interview” and learn more about their responsibilities and career path.
- Research and apply to at least 5 data-related jobs or internships.
Interview Question: How do you keep up with the current methods in data science?
Don’t Say: Nothing
Let’s Code!
- You are going to hit SHIFT + ENTER thousands of times.
- We don’t usually source our scripts.
- Think of Python Interactive like a graphing calculator or Excel on steroids.
- You code in pieces.
- Rewrite for clarity!
# Pause: can you explain what this code is doing?
import pandas as pd

df = pd.DataFrame(
    {"a" : [5, 4, 6, 2, 3],
     "b" : [7, 8, 9, 10, 11],
     "c" : [10, 11, 12, 101, 0]})
Use the cheat sheet to find the functions you would need to implement the following steps.
Group 1
- sort my table by column a, then
- only use the first 2 rows, then
- calculate the mean of column b.
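One possible way to walk through the Group 1 steps, shown one piece at a time so each result can be inspected with SHIFT + ENTER. This is only a sketch: the intermediate names (sorted_df, first_two, mean_b) are made up for illustration, and the cheat sheet lists other functions that would work too.

```python
import pandas as pd

# Same small table as in the pause exercise.
df = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

# Step 1: sort the table by column a (ascending is the default).
sorted_df = df.sort_values("a")

# Step 2: only use the first 2 rows.
first_two = sorted_df.head(2)

# Step 3: calculate the mean of column b.
mean_b = first_two["b"].mean()
print(mean_b)
```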
Group 2
- rename column a to duck, then
- subset to only have the duck and b columns, then
- keep all rows where b is less than 9, then
- find the min of duck
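The Group 2 steps could look something like this. Again a sketch, not the only answer: the intermediate names are hypothetical, and query() is just one of several ways to keep rows that meet a condition.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

# Step 1: rename column a to duck.
renamed = df.rename(columns={"a": "duck"})

# Step 2: subset to only the duck and b columns.
subset = renamed[["duck", "b"]]

# Step 3: keep all rows where b is less than 9.
small_b = subset.query("b < 9")

# Step 4: find the min of duck.
min_duck = small_b["duck"].min()
print(min_duck)
```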
Pandas is built to allow for method chaining. Here is a great resource on how to use method chaining: How to write neat pandas code.
- plotly.express creates a chart object
- pandas creates a DataFrame object
- We usually include () around our entire method chain so we can show it in steps.
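To make the parentheses idea concrete, here is a small sketch of a method chain written one step per line. It reuses the practice table from earlier; the name mean_b is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

# The outer parentheses let the chain span several lines,
# so each step can sit on its own line with its own comment.
mean_b = (
    df
    .sort_values("a")   # sort by column a
    .head(2)            # keep the first 2 rows
    ["b"]               # pull out column b as a Series
    .mean()             # average it
)
print(mean_b)
```

Commenting out a line in the middle of the chain is an easy way to see what each step contributes.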
Project 1 - Intro
Understanding your data
You should be able to introduce your data sets to people, the same way you introduce a friend!
- What does each row represent? If you don’t know, then you don’t understand what groups you can analyze.
- What does each column represent? If you don’t know, then you don’t understand what information you can evaluate for each group.
Being able to explain your data out loud to someone else follows the same principles as rubber duck debugging.
Introduction to pandas “DataFrame”
What is a pandas DataFrame? We can read the official documentation. I also like the video in this tutorial.
DataFrames come with attributes and built-in functions that can help us get a feel for our data.
Run the code below one line at a time (or use other functions of your choice) to explore the names data. What do you learn?
my_data.columns
my_data.shape
my_data.size
my_data.head()
my_data.describe()
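If you want to try these before the names data is loaded, the sketch below runs them on the small practice table instead. Here my_data is a stand-in, not the real data set. Notice that columns, shape, and size are attributes (no parentheses), while head() and describe() are methods.

```python
import pandas as pd

# Stand-in table so the calls can be tried without downloading anything.
my_data = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

print(my_data.columns)     # the column labels
print(my_data.shape)       # (number of rows, number of columns)
print(my_data.size)        # rows times columns
print(my_data.head())      # the first five rows
print(my_data.describe())  # summary statistics for the numeric columns
```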
Setup for Project 1
Create the folder and files to get prepared.
DS250 > project_1 >
- names.py
- names.qmd
- data.csv (just in case the internet is down)
“How should we start each file?”
I would do this process for every project.
- names.py: Every file starts with the same cells 1) import packages, 2) load data.
- names.qmd: Let’s start with the course template
- notes.qmd: Keep project notes on the readings and things you learn.
- my_cheat_sheet.qmd: Update your own cheat sheet
Read in the data.
#%%
# load packages
import pandas as pd
import plotly.express as px
#%%
# load data
url = "https://github.com/byuidatascience/data4names/raw/master/data-raw/names_year/names_year.csv"
names = pd.read_csv(url)
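Since the setup keeps a local data.csv in case the internet is down, one way to use it is a small fallback loader. This is a sketch under assumptions: load_names is a hypothetical helper (not part of pandas or the course template), and it assumes data.csv is a saved copy of the same file.

```python
import pandas as pd


def load_names(url, fallback_path="data.csv"):
    """Try to read from the URL; fall back to the local copy if that fails."""
    try:
        return pd.read_csv(url)
    except OSError:
        # Network errors from urllib and missing-file errors are OSError subclasses.
        return pd.read_csv(fallback_path)


# Example call (left as a comment so nothing is downloaded just by running the cell):
# names = load_names(url)
```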
1. How many unique names does the names dataframe contain? Work with a partner to find the answer. You might want to look at this pandas cheat sheet.
- Pull the name column out as a series
- Use the pandas unique function pd.unique()
- Find the size of the series
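The three steps for question 1 might be sketched like this. To keep it runnable offline, it uses a tiny made-up table with the name and year columns; the rows are invented for illustration and are not from the real data, so the count printed here is not the answer to the question.

```python
import pandas as pd

# Hypothetical stand-in rows; swap in the real names DataFrame in class.
names = pd.DataFrame({
    "name": ["Ada", "Ada", "Grace", "Alan", "Grace"],
    "year": [1910, 1911, 1910, 1912, 1913],
})

# Step 1: pull the name column out as a Series.
name_series = names["name"]

# Step 2: pd.unique() returns an array of the distinct values.
unique_names = pd.unique(name_series)

# Step 3: count how many distinct values there are.
n_unique = len(unique_names)
print(n_unique)
```

As a cross-check, names["name"].nunique() should give the same count in one call.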
2. What is the range of years in the names dataframe? Again, work with a partner and use the pandas cheat sheet.
- Pull the year column out as a series
- Find the max
- Find the min
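Question 2 follows the same pattern. The sketch below again uses a tiny invented table with a year column, so the years printed are placeholders rather than the real range.

```python
import pandas as pd

# Hypothetical stand-in rows; swap in the real names DataFrame in class.
names = pd.DataFrame({
    "name": ["Ada", "Ada", "Grace", "Alan", "Grace"],
    "year": [1910, 1911, 1910, 1912, 1913],
})

# Pull the year column out as a Series.
year_series = names["year"]

# Find the max and the min.
max_year = year_series.max()
min_year = year_series.min()
print(min_year, max_year)
```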