Welcome to Class!
Gratitude Journal
Announcements
The data science lab opens this week!
Project 1
Import packages
import ??? as ???
Load the names data
my_data = pd.read_csv()
Understanding your data
You should be able to introduce your data sets to people, the same way you introduce a friend!
- What does each row represent? If you don’t know, then you don’t understand what groups you can analyze.
- What does each column represent? If you don’t know, then you don’t understand what information you can evaluate for each group.
Being able to explain your data out loud to someone else follows the same principles as rubber duck debugging.
Introduction to pandas “DataFrame”
What is a pandas DataFrame? We can read the official documentation. I also like the video in this tutorial.
DataFrames come with attributes and built-in functions that can help us get a feel for our data.
Run the code below one line at a time (or use other functions of your choice) to explore the names data. What do you learn?
my_data.columns
my_data.shape
my_data.size
my_data.head()
my_data.describe()
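If the names data is not loaded yet, the same attributes and functions can be tried on a small stand-in table. This is only a sketch: the toy frame and its columns (name, year, count) are made up for illustration and are not the real names data.

```python
import pandas as pd

# Hypothetical stand-in table; the real names data has its own columns.
toy = pd.DataFrame({
    "name": ["Ada", "Ada", "Grace", "Linus"],
    "year": [1990, 1991, 1990, 1992],
    "count": [12, 15, 7, 9],
})

print(list(toy.columns))  # the column labels
print(toy.shape)          # (number of rows, number of columns) -> (4, 3)
print(toy.size)           # rows * columns -> 12
print(toy.head(2))        # the first two rows
print(toy.describe())     # summary statistics for the numeric columns
```

Notice that columns, shape, and size are attributes (no parentheses), while head() and describe() are functions that you call.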
Let’s practice!
1. How many unique names does the names dataframe contain? Work with a partner to find the answer. You might want to look at this pandas cheat sheet.
- Pull the name column out as a series
- Use the pandas unique function pd.unique()
- Find the size of the series
2. What is the range of years in the names dataframe? Again, work with a partner and use the pandas cheat sheet.
- Pull the year column out as a series
- Find the max
- Find the min
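The steps in both practice questions can be sketched on a toy table before you try them on the real data. The frame below is hypothetical (made-up names and years), so the numbers are not the answers to the questions above.

```python
import pandas as pd

# Hypothetical stand-in table, not the real names data.
toy = pd.DataFrame({
    "name": ["Ada", "Ada", "Grace", "Linus"],
    "year": [1990, 1991, 1990, 1992],
})

# Question 1 pattern: pull out a column, find its distinct values, count them.
names = toy["name"]               # a pandas Series
unique_names = pd.unique(names)   # array of distinct values
n_unique = len(unique_names)

# Question 2 pattern: pull out a column, then take its min and max.
years = toy["year"]
first_year = years.min()
last_year = years.max()

print(n_unique, first_year, last_year)
```

A Series also has a nunique() function that counts distinct values in one step, which is a handy way to check your answer.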
Extra Practice
- You are going to hit SHIFT + ENTER thousands of times.
- We don’t usually source our scripts.
- Think of Python Interactive like a graphing calculator or Excel on steroids.
- You code in pieces.
- Rewrite for clarity!
# Pause: can you explain what this code is doing?
df = pd.DataFrame(
{"a" : [5, 4, 6, 2, 3],
"b" : [7, 8, 9, 10, 11],
"c" : [10, 11, 12, 101, 0]})
Use the cheat sheet to find the functions you would need to implement the following steps.
I want to:
- sort my table by column a (low to high)
- only keep the first two rows
- calculate the mean of column b
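Here is one way the same three kinds of steps could look on a different table, one step per line. The frame and its column names (x, y) are hypothetical, chosen so the exercise above is still yours to solve; sort_values, head, and mean are one possible choice of functions from the cheat sheet.

```python
import pandas as pd

# Hypothetical table, different from df above.
scores = pd.DataFrame({
    "x": [3, 1, 2],
    "y": [10.0, 20.0, 30.0],
})

sorted_scores = scores.sort_values("x")  # sort by x, low to high (the default)
first_two = sorted_scores.head(2)        # keep the first two rows
mean_y = first_two["y"].mean()           # mean of y over those rows

print(mean_y)  # 25.0
```

Each step saves its result in a new variable, which makes it easy to inspect the table after every line.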
Pandas and Altair are built to allow for method chaining. Here is a great resource on how to use method chaining: How to write neat pandas code.
- Altair creates a chart object
- pandas creates a DataFrame object
- We usually wrap the entire chain in parentheses () so we can write it in steps, one method per line.
# read in data
flights_url = "https://github.com/byuidatascience/data4python4ds/raw/master/data-raw/flights/flights.csv"
flights = pd.read_csv(flights_url)
flights['time_hour'] = pd.to_datetime(flights.time_hour, format = "%Y-%m-%d %H:%M:%S")
# without method chaining
flights = flights.filter(['dep_time'])
flights = flights.assign(hour = lambda x: x.dep_time // 100)
flights = flights.assign(minute = lambda x: x.dep_time % 100)
flights.head(5)
# with method chaining
(flights
    .filter(['dep_time'])
    .assign(
        hour = lambda x: x.dep_time // 100,
        minute = lambda x: x.dep_time % 100
    )
    .head(5)
)
df = pd.DataFrame(
{"a" : [5, 4, 6, 2, 3],
"b" : [7, 8, 9, 10, 11],
"c" : [10, 11, 12, 101, 0]})
Use method chaining to do all of the following steps:
- rename column a to duck, then
- subset to only have duck and b columns, then
- keep all rows where b is greater than 9, then
- find the min of duck
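A parallel chain on a different table may help you get started. Everything in this sketch is hypothetical (the frame, the column names x, y, z, and the new name goose), and rename, filter, and query are just one possible set of functions for these kinds of steps.

```python
import pandas as pd

# Hypothetical table, different from df above.
birds = pd.DataFrame({
    "x": [1, 4, 2, 5],
    "y": [10, 20, 30, 40],
    "z": [0, 0, 0, 0],
})

smallest_goose = (birds
    .rename(columns={"x": "goose"})  # rename column x to goose
    .filter(["goose", "y"])          # keep only the goose and y columns
    .query("y > 15")                 # keep rows where y is greater than 15
    ["goose"]                        # pull out the goose column as a Series
    .min()                           # smallest remaining goose value
)

print(smallest_goose)  # 2
```

The outer parentheses are what allow each method to sit on its own line, so you can comment out one step at a time while debugging.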