Welcome to Class!

Announcements

Data Science Society Kickoff! Wednesday at 6 in STC 394
The data science lab

Completing Last Week

Quarto - “out of the frying pan and into the fire”
Finishing the Introduction Project
- Use the QMD project template
- Render as HTML and upload in Canvas

What was that data science community portion of our grade?

The syllabus has a section that says:

Data science community

To earn credit for the DS Community element you must complete two different tasks from the list below. At the end of the semester, you will be asked to report on which tasks you completed and what you learned from them.

Attend Data Science Society at least once.

Sign up for an email newsletter that will teach you more about data science. Data Science Weekly or Data Elixir are good options.
Listen to a podcast episode about data science. Build a Career in Data Science has some excellent episodes.
Watch a professional presentation on YouTube about data science. Be prepared to share the link and a summary of the video.
Reach out to someone who works in a data-related field and ask them for 15 minutes of their time. Use this time to conduct an “informational interview” and learn more about their responsibilities and career path.
Research and apply to at least 5 data-related jobs or internships.

Interview Question: How do you keep up with the current methods in data science?

Don’t Say: Nothing

Let’s Code!

DS 250 workflow

You are going to hit SHIFT + ENTER thousands of times.
We don’t usually source our scripts.
Think of Python Interactive like a graphing calculator or Excel on steroids.
You code in pieces.
Rewrite for clarity!

Can you figure out the functions of pandas?

Pandas Cheat Sheet and Basics Blog Post

# Pause: can you explain what this code is doing?
import pandas as pd

df = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

Use the cheat sheet to find the functions you would need to implement the following steps.

Group 1

sort my table by column a then
only use the first 2 rows then
calculate the mean of column b.
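
Here is one possible way to carry out the Group 1 steps, shown as a minimal sketch on the small df defined above. The intermediate names (sorted_df, first_two, mean_b) are illustrative, and sort_values, head, and mean are only one choice among the functions on the cheat sheet.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

# Step 1: sort the table by column a (ascending by default)
sorted_df = df.sort_values("a")

# Step 2: only use the first 2 rows
first_two = sorted_df.head(2)

# Step 3: calculate the mean of column b
mean_b = first_two["b"].mean()
print(mean_b)
```

Run each step with SHIFT + ENTER and look at the intermediate table before moving on.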

Group 2

rename column a to duck then
subset to only have duck and b columns then
keep all rows where b is less than 9 then
find the min of duck
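
The Group 2 steps could be sketched the same way, one step per line. The variable names (renamed, subset, small_b, min_duck) are made up for illustration, and the cheat sheet lists other functions that would also work.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

# Step 1: rename column a to duck
renamed = df.rename(columns={"a": "duck"})

# Step 2: subset to only the duck and b columns
subset = renamed[["duck", "b"]]

# Step 3: keep all rows where b is less than 9
small_b = subset[subset["b"] < 9]

# Step 4: find the min of duck
min_duck = small_b["duck"].min()
print(min_duck)
```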

What is method chaining?

Pandas is built to allow for method chaining. Here is a great resource on how to use method chaining: How to write neat pandas code.

plotly.express creates a chart object
pandas creates a DataFrame object
We usually wrap the entire method chain in parentheses so we can put each step on its own line.
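
As a rough illustration, the Group 2 steps from earlier can be written as a single chain. This is a sketch rather than the only way to do it: filter and query are one choice of methods, and min_duck is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

# The outer parentheses let the chain span several lines,
# so each method sits on its own line and reads top to bottom.
min_duck = (
    df
    .rename(columns={"a": "duck"})   # rename column a to duck
    .filter(["duck", "b"])           # keep only the duck and b columns
    .query("b < 9")                  # keep rows where b is less than 9
    ["duck"]                         # pull out the duck column as a Series
    .min()                           # find the min of duck
)
print(min_duck)
```

Commenting out one line at a time from the bottom of the chain is a quick way to see what each step returns.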

Project 1 - Intro

Understanding your data

You should be able to introduce your data sets to people the same way you would introduce a friend!

What does each row represent? If you don’t know, then you don’t understand what groups you can analyze.
What does each column represent? If you don’t know, then you don’t understand what information you can evaluate for each group.

Being able to explain your data out loud to someone else follows the same principles as rubber duck debugging.

Introduction to pandas “DataFrame”

What is a pandas DataFrame? We can read the official documentation. I also like the video in this tutorial.

DataFrames come with attributes and built-in functions that can help us get a feel for our data.

Run the code below one line at a time (or use other functions of your choice) to explore the names data. What do you learn?

my_data.columns
my_data.shape
my_data.size
my_data.head()
my_data.describe()
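
To try these before the real data is loaded, here is a small sketch on a made-up table. The my_data values and the column names (name, year, Total) are assumptions for illustration, not the actual names data.

```python
import pandas as pd

# Hypothetical stand-in for the names data; the values are invented
my_data = pd.DataFrame({
    "name": ["Ada", "Ada", "Ben"],
    "year": [1990, 1991, 1990],
    "Total": [12, 15, 40],
})

print(my_data.columns)     # the column labels
print(my_data.shape)       # (number of rows, number of columns)
print(my_data.size)        # rows * columns
print(my_data.head())      # the first rows (up to 5 by default)
print(my_data.describe())  # summary statistics for the numeric columns
```

Note that columns, shape, and size are attributes (no parentheses), while head() and describe() are methods.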

Setup for Project 1

Create the folder and files to get prepared.

DS250 > project_1 >
- names.py
- names.qmd
- data.csv (just in case the internet is down)

“How should we start each file?”

I would do this process for every project.

names.py: Every file starts with the same cells: 1) import packages, 2) load data.
names.qmd: Let’s start with the course template
notes.qmd: Keep project notes on the readings and things you learn.
my_cheat_sheet.qmd: Update your own cheat sheet

Read in the data.

#%%
# load packages
import pandas as pd
import plotly.express as px

#%%
# load data
url = "https://github.com/byuidatascience/data4names/raw/master/data-raw/names_year/names_year.csv"
names = pd.read_csv(url)
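
The local data.csv copy is there for when the internet is down. One way to use it is a small helper that tries the URL first and falls back to the file. This is only a sketch: load_names is a hypothetical helper, not part of pandas or the course template, and it assumes data.csv holds the same table as the URL.

```python
import pandas as pd


def load_names(url, local_path="data.csv"):
    """Read the CSV at url; if that fails, read the local copy instead."""
    try:
        return pd.read_csv(url)
    except OSError:
        # Most network failures (and missing files) are OSError subclasses,
        # so fall back to the copy saved in the project folder.
        return pd.read_csv(local_path)
```

With the url variable from the cell above, the load step would become names = load_names(url).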

1. How many unique names does the names dataframe contain? Work with a partner to find the answer. You might want to look at this pandas cheat sheet.

Hint

Pull the name column out as a series
Use the pandas unique function pd.unique()
Find the size of the series
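
The three hint steps might look like this on a tiny made-up table. The toy names values are invented, so the count here is not the answer for the real data; try the same steps on the names you loaded.

```python
import pandas as pd

# Hypothetical toy table standing in for the real names data
names = pd.DataFrame({
    "name": ["Ada", "Ada", "Ben", "Cy"],
    "year": [1990, 1991, 1990, 1991],
})

# Step 1: pull the name column out as a series
name_series = names["name"]

# Step 2: use pd.unique() to drop the repeated names
unique_names = pd.unique(name_series)

# Step 3: find the size of the result
n_unique = unique_names.size
print(n_unique)
```

As a cross-check, names["name"].nunique() gives the same count in one call.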

2. What is the range of years in the names dataframe? Again, work with a partner and use the pandas cheat sheet.

Hint 2

Pull the year column out as a series
Find the max
Find the min
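
A sketch of the same idea for the year range, again on a made-up table (the values are invented and earliest and latest are illustrative names):

```python
import pandas as pd

# Hypothetical toy table standing in for the real names data
names = pd.DataFrame({
    "name": ["Ada", "Ada", "Ben", "Cy"],
    "year": [1990, 1991, 1990, 1993],
})

# Pull the year column out as a series
years = names["year"]

# Find the max and the min
latest = years.max()
earliest = years.min()
print(earliest, latest)
```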

Day 1: Exploring names with pandas
