Day 2: Transforming Data

Welcome to class!

Spiritual Thought

Announcements

  1. Code chunk options:
    • Locally using #| warning: false
    • Globally in the YAML using execute: warning: false

Flights Data Issues:

What are some of the data issues you discovered while getting to know your data?

Loading JSON files into pandas

Let’s load in some practice data! Data link.

Here’s a description of the data: Data Description.

import pandas as pd   # to load and transform data
import numpy as np    # for math/stat calculations

# from url to pandas dataframe
url = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json" 
cars = pd.read_json(url)

# or from file to pandas dataframe
cars = pd.read_json("mtcars_missing.json")

Look at the data for the first two cars. What is different about the format?

[
  {
    "car": "Mazda RX4",
    "mpg": 21,
    "cyl": 6,
    "disp": 160,
    "hp": 110,
    "drat": 3.9,
    "wt": 2.62,
    "qsec": 16.46,
    "vs": 0,
    "am": 1,
    "gear": 4,
    "carb": 4
  },
  {
    "car": "Mazda RX4 Wag",
    "mpg": 21,
    "cyl": 6,
    "disp": 160,
    "hp": 110,
    "drat": 3.9,
    "wt": 2.875,
    "qsec": 17.02,
    "am": 1,
    "gear": 4,
    "carb": 4
  }
]

Your Turn: Transforming Data

With your group, research these functions and create an example using the cars data. Post your example in Slack. Be prepared to teach the class about your functions.

You can use the Data Transformation textbook chapter and the pandas documentation to help you.

Recreate the following output to the best of your abilities:

Group 1: Working with rows
  • .query() allows you to subset observations (rows)
  • .sort_values() arranges rows in a particular order
Group 2: Working with columns
  • .filter() (as well as [] and .loc[]) allow you to select columns
  • .assign() is one way to add new columns to a dataframe
Group 3: Counting items
  • .value_counts() summarizes a column by counting the values inside
  • .crosstab() creates a “cross tabulation” of two or more variables
Group 4: Summarizing data
  • Using .groupby() and .agg() together allows you to calculate group summaries

Your Turn: Summarizing the cars data

Write code to calculate the mean weight wt for each cylinder type cyl.

cars.groupby('cyl').agg(mean_weight = ('wt', np.mean)).reset_index()

Can you print the answer as a markdown table?

print(cars.groupby('cyl').agg(mean_weight = ('wt', np.mean)).reset_index().to_markdown(index = False))

Project 2 FAQs

One main reason:

You can create multiple columns within the same assign() where one of the columns depends on another one defined within the same assign. source: Documentation

Other resources:

Not related, but also fun: Should you use “dot notation” or “bracket notation” with pandas?

Two ways to define the same function:

def square(x):
     return x**2

square = lambda x:x**2

There are some difference between them as listed below.

  1. lambda is a keyword that returns a function object and does not create a ‘name’. Whereas def creates name in the local namespace
  2. lambda functions are good for situations where you want to minimize lines of code as you can create function in one line of python code. It is not possible using def
  3. lambda functions are somewhat less readable for most Python users.
  4. lambda functions can only be used once, unless assigned to a variable name.

source

What if you want to create a new column, whose values depend on another column? There are a lot of ways to accomplish this (see this stackoverflow answer). Some functions I use:

We will learn how to identify and deal with missing data next week. For now, we can drop rows we don’t want using square brackets [] or .query().

API’s and JSON: A Primer

Application Programming Interfaces (APIs)

Representational State Transfer (REST APIs)

Over the course of the ’00s, another Web services technology, called Representational State Transfer, or REST, began to overtake [all other tools] for the purpose of transferring data. One of the big advantages of programming using REST APIs is that you can use multiple data formats — not just XML, but JSON and HTML as well. As web developers came to prefer JSON over XML, so too did they come to favor REST over SOAP. As Kostyantyn Kharchenko put it on the Svitla blog, “In many ways, the success of REST is due to the JSON format because of its easy use on various platforms.”
Today, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. ref


JavaScript Object Notation

Well, when you’re writing frontend code in Javascript, getting JSON data back makes it easier to load that data into an object tree and work with it. And JSON formats data in a more succinct way, which saves bandwidth and improves response times when sending messages back and forth to a server.
In a world of APIs, cloud computing, and ever-growing data, JSON has a big role to play in greasing the wheels of a modern, open web. ref


Other Resources