Welcome to class!
Gratitude Journal
Announcements
You need to have Slack open today!
Loading JSON files into pandas
Let’s load in some practice data! Data link.
import pandas as pd # to load and transform data
import numpy as np # for math/stat calculations
# from url to pandas dataframe
url = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json"
cars = pd.read_json(url)
# or from file to pandas dataframe
cars = pd.read_json("mtcars_missing.json")
Look at the data for the first two cars. What is different about the format?
[
{
"car": "Mazda RX4",
"mpg": 21,
"cyl": 6,
"disp": 160,
"hp": 110,
"drat": 3.9,
"wt": 2.62,
"qsec": 16.46,
"vs": 0,
"am": 1,
"gear": 4,
"carb": 4
},
{
"car": "Mazda RX4 Wag",
"mpg": 21,
"cyl": 6,
"disp": 160,
"hp": 110,
"drat": 3.9,
"wt": 2.875,
"qsec": 17.02,
"am": 1,
"gear": 4,
"carb": 4
}
]
Your Turn: Transforming Data
With your group, research these functions and create an example using the cars
data. Post your example in Slack. Be prepared to teach the class about your functions.
You can use the Data Transformation textbook chapter and the pandas documentation to help you.
Group 1 - Working with rows
.query()
allows you to subset observations (rows).sort_values()
arranges rows in a particular order
Group 2 - Working with columns
.filter()
(as well as[]
and.loc[]
) allow you to select columns.assign()
is one way to add new columns to a dataframe
Group 3 - Counting items
.value_counts()
summarizes a column by counting the values inside.crosstab()
creates a “cross tabulation” of two or more variables
Group 4 - Summarizing data
- Using
.groupby()
and.agg()
together allows you to calculate group summaries
Your Turn: Summarizing the cars data
Write code to calculate the mean weight wt
for each cylinder type cyl
.
cars.groupby('cyl').agg(mean_weight = ('wt', np.mean)).reset_index()
Can you print the answer as a markdown table?
print(cars.groupby('cyl').agg(mean_weight = ('wt', np.mean)).reset_index().to_markdown(index = False))
Project 2 FAQs
One main reason:
You can create multiple columns within the same
assign()
where one of the columns depends on another one defined within the same assign. source: Documentation
Other resources:
- Why use pandas.assign rather than simply initialize new column?
- 3 Ways to Add New Columns to Pandas Dataframe
Not related, but also fun: Should you use “dot notation” or “bracket notation” with pandas?
Two ways to define the same function:
def square(x):
return x**2
square = lambda x:x**2
There are some difference between them as listed below.
- lambda is a keyword that returns a function object and does not create a ‘name’. Whereas def creates name in the local namespace
- lambda functions are good for situations where you want to minimize lines of code as you can create function in one line of python code. It is not possible using def
- lambda functions are somewhat less readable for most Python users.
- lambda functions can only be used once, unless assigned to a variable name.
What if you want to create a new column, whose values depend on another column? There are a lot of ways to accomplish this (see this stackoverflow answer). Some functions I use:
- isin() method
- where() method
- You can also use an if else statement inside a lambda function
[]
or .query()
.