Day 2: Transforming Data

Welcome to class!

Gratitude Journal

Announcements

You need to have Slack open today!


Loading JSON files into pandas

Let’s load in some practice data! Data link.

import pandas as pd   # to load and transform data
import numpy as np    # for math/stat calculations

# from url to pandas dataframe
url = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json" 
cars = pd.read_json(url)

# or from file to pandas dataframe
cars = pd.read_json("mtcars_missing.json")

Look at the data for the first two cars. What is different about the format?

[
  {
    "car": "Mazda RX4",
    "mpg": 21,
    "cyl": 6,
    "disp": 160,
    "hp": 110,
    "drat": 3.9,
    "wt": 2.62,
    "qsec": 16.46,
    "vs": 0,
    "am": 1,
    "gear": 4,
    "carb": 4
  },
  {
    "car": "Mazda RX4 Wag",
    "mpg": 21,
    "cyl": 6,
    "disp": 160,
    "hp": 110,
    "drat": 3.9,
    "wt": 2.875,
    "qsec": 17.02,
    "am": 1,
    "gear": 4,
    "carb": 4
  }
]

Your Turn: Transforming Data

With your group, research these functions and create an example using the cars data. Post your example in Slack. Be prepared to teach the class about your functions.

You can use the Data Transformation textbook chapter and the pandas documentation to help you.

Group 1 - Working with rows
  • .query() allows you to subset observations (rows)
  • .sort_values() arranges rows in a particular order
Group 2 - Working with columns
  • .filter() (as well as [] and .loc[]) allow you to select columns
  • .assign() is one way to add new columns to a dataframe
Group 3 - Counting items
  • .value_counts() summarizes a column by counting the values inside
  • .crosstab() creates a “cross tabulation” of two or more variables
Group 4 - Summarizing data
  • Using .groupby() and .agg() together allows you to calculate group summaries

Your Turn: Summarizing the cars data

Write code to calculate the mean weight wt for each cylinder type cyl.

cars.groupby('cyl').agg(mean_weight = ('wt', np.mean)).reset_index()

Can you print the answer as a markdown table?

print(cars.groupby('cyl').agg(mean_weight = ('wt', np.mean)).reset_index().to_markdown(index = False))



Project 2 FAQs

One main reason:

You can create multiple columns within the same assign() where one of the columns depends on another one defined within the same assign. source: Documentation

Other resources:

Not related, but also fun: Should you use “dot notation” or “bracket notation” with pandas?

Two ways to define the same function:

def square(x):
     return x**2

square = lambda x:x**2

There are some difference between them as listed below.

  1. lambda is a keyword that returns a function object and does not create a ‘name’. Whereas def creates name in the local namespace
  2. lambda functions are good for situations where you want to minimize lines of code as you can create function in one line of python code. It is not possible using def
  3. lambda functions are somewhat less readable for most Python users.
  4. lambda functions can only be used once, unless assigned to a variable name.

source

What if you want to create a new column, whose values depend on another column? There are a lot of ways to accomplish this (see this stackoverflow answer). Some functions I use:

We will learn how to identify and deal with missing data next week. For now, we can drop rows we don’t want using square brackets [] or .query().