Day 1: Exploring names with pandas

Welcome to Class!

Gratitude Journal

Announcements

Project 1

Import packages

import ??? as ???

Load the names data

my_data = pd.read_csv()
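Here is a minimal sketch of the import-and-load pattern, assuming the conventional `pd` alias. The path to the names file is not given here, so this example reads a tiny made-up CSV from an in-memory string instead. The column names (`name`, `year`, `Total`) and the values are hypothetical placeholders, not the real names data.

```python
import io

import pandas as pd  # "pd" is the conventional alias for pandas

# Hypothetical stand-in for the names file: a tiny CSV held in memory.
csv_text = """name,year,Total
Ada,1910,12
Ada,1911,15
Grace,1910,30
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

With a real file you would pass its path or URL as the first argument in place of the `StringIO` object.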

Understanding your data

You should be able to introduce your data sets to people, the same way you introduce a friend!

What does each row represent? If you don’t know, then you don’t understand what groups you can analyze.
What does each column represent? If you don’t know, then you don’t understand what information you can evaluate for each group.

Being able to explain your data out loud to someone else follows the same principles as rubber duck debugging.

Introduction to pandas “DataFrame”

What is a pandas DataFrame? We can read the official documentation. I also like the video in this tutorial.

DataFrames come with attributes and built-in functions that can help us get a feel for our data.

Run the code below one line at a time (or use other functions of your choice) to explore the names data. What do you learn?

my_data.columns
my_data.shape
my_data.size
my_data.head()
my_data.describe()
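If you do not have the names data loaded yet, the same exploration can be sketched on a small made-up table. The DataFrame `toy` and its columns are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy table standing in for the names data.
toy = pd.DataFrame({
    "name": ["Ada", "Ada", "Grace", "Linus"],
    "year": [1910, 1911, 1910, 1969],
    "Total": [12, 15, 30, 7],
})

print(toy.columns)     # the column labels
print(toy.shape)       # (number of rows, number of columns)
print(toy.size)        # rows * columns, the total number of cells
print(toy.head(2))     # the first two rows
print(toy.describe())  # summary statistics for the numeric columns
```

Notice that `columns`, `shape`, and `size` are attributes (no parentheses), while `head()` and `describe()` are methods.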

Let’s practice!

1. How many unique names does the names dataframe contain? Work with a partner to find the answer. You might want to look at this pandas cheat sheet.

Hint

Pull the name column out as a series
Use the pandas unique function pd.unique()
Find the size of the result
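The hint steps can be sketched on a small made-up table; try it on the real names data yourself. The DataFrame `toy` and its `name` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy table standing in for the names data.
toy = pd.DataFrame({
    "name": ["Ada", "Ada", "Grace", "Linus"],
    "year": [1910, 1911, 1910, 1969],
})

names = toy["name"]              # pull the column out as a Series
unique_names = pd.unique(names)  # distinct values, returned as a NumPy array
print(unique_names.size)         # how many distinct names there are

# A shortcut that counts distinct values in one call.
print(names.nunique())
```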

2. What is the range of years in the names dataframe? Again, work with a partner and use the pandas cheat sheet.

Hint

Pull the year column out as a series
Find the max
Find the min
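The same idea on a small made-up table, as a sketch only. The DataFrame `toy` and its `year` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy table standing in for the names data.
toy = pd.DataFrame({
    "name": ["Ada", "Ada", "Grace", "Linus"],
    "year": [1910, 1911, 1910, 1969],
})

years = toy["year"]      # pull the column out as a Series
year_max = years.max()   # latest year in the table
year_min = years.min()   # earliest year in the table
print(year_min, year_max)
```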

Extra Practice

CSE 250 workflow

You are going to hit SHIFT + ENTER thousands of times.
We don’t usually run (source) our whole script at once.
Think of Python Interactive like a graphing calculator or Excel on steroids.
You code in pieces.
Rewrite for clarity!

Can you figure out the functions of pandas?

Pandas Cheat Sheet and Basics Blog Post

# Pause: can you explain what this code is doing?
df = pd.DataFrame({
    "a": [5, 4, 6, 2, 3],
    "b": [7, 8, 9, 10, 11],
    "c": [10, 11, 12, 101, 0],
})

Use the cheat sheet to find the functions you would need to implement the following steps.

I want to:

sort my table by column a (low to high)
only keep the first two rows
calculate the mean of column b
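One possible way to do this, worth comparing against your own attempt. This sketch assumes the three steps are applied in sequence, so the mean is taken over the two rows that remain.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [5, 4, 6, 2, 3],
    "b": [7, 8, 9, 10, 11],
    "c": [10, 11, 12, 101, 0],
})

sorted_df = df.sort_values("a")  # sort by column a, low to high
first_two = sorted_df.head(2)    # keep only the first two rows
mean_b = first_two["b"].mean()   # mean of column b in those rows
print(mean_b)
```

`sort_values` sorts in ascending order by default; pass `ascending=False` to sort high to low.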

What is method chaining?

Pandas and Altair are built to allow for method chaining. Here is a great resource on how to use method chaining: How to write neat pandas code.

Altair creates a chart object
pandas creates a DataFrame object
We usually wrap the entire chain in parentheses () so we can put each step on its own line.

# read in data
flights_url = "https://github.com/byuidatascience/data4python4ds/raw/master/data-raw/flights/flights.csv"
flights = pd.read_csv(flights_url)
flights['time_hour'] = pd.to_datetime(flights.time_hour, format = "%Y-%m-%d %H:%M:%S")

# without method chaining
flights = flights.filter(['dep_time'])
flights = flights.assign(hour = lambda x: x.dep_time // 100)
flights = flights.assign(minute = lambda x: x.dep_time % 100)
flights.head(5)

# with method chaining
(flights
    .filter(['dep_time'])
    .assign(
      hour = lambda x: x.dep_time // 100,
      minute = lambda x: x.dep_time % 100
      )
    .head(5)
)
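To see that both styles give the same result without downloading anything, here is a small self-contained sketch. The table `flights_small` and its values are made up for illustration; only the `dep_time` column mirrors the flights data.

```python
import pandas as pd

# Hypothetical mini version of the flights table (departure times as HHMM).
flights_small = pd.DataFrame({
    "dep_time": [517, 533, 1542, 2359],
    "carrier": ["UA", "UA", "AA", "B6"],
})

# Without method chaining: one assignment per step.
step = flights_small.filter(["dep_time"])
step = step.assign(hour=lambda x: x.dep_time // 100)
step = step.assign(minute=lambda x: x.dep_time % 100)

# With method chaining: the parentheses let each step sit on its own line.
chained = (flights_small
    .filter(["dep_time"])
    .assign(
        hour=lambda x: x.dep_time // 100,
        minute=lambda x: x.dep_time % 100,
    )
)

print(chained)
print(chained.equals(step))  # True when both approaches produce the same table
```

The chained version never overwrites `flights_small`, so the original table stays available for other analyses.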

Your turn

df = pd.DataFrame({
    "a": [5, 4, 6, 2, 3],
    "b": [7, 8, 9, 10, 11],
    "c": [10, 11, 12, 101, 0],
})

Use method chaining to do all of the following steps:

rename column a to duck, then
subset to only have duck and b columns, then
keep all rows where b is greater than 9, then
find the min of duck
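One possible chain, offered as a sketch to check your own answer against rather than the only correct solution. Other methods (for example `loc` in place of `query`) would work just as well.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [5, 4, 6, 2, 3],
    "b": [7, 8, 9, 10, 11],
    "c": [10, 11, 12, 101, 0],
})

duck_min = (df
    .rename(columns={"a": "duck"})  # rename column a to duck
    .filter(["duck", "b"])          # keep only the duck and b columns
    .query("b > 9")                 # keep rows where b is greater than 9
    .duck                           # pull out the duck column as a Series
    .min()                          # smallest remaining duck value
)
print(duck_min)
```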