Day 7: Is JSON missing?

Let’s take five minutes and share our previous project with a partner

  1. Compare your charts
  2. Share your code and explain one part that was challenging
  3. Discuss a resource that was helpful for you to complete the challenge

Read the project overview and Questions for understanding.

Data has structure

  1. Was our baby names data tidy?

Connecting to Application Programming Interfaces (APIs)

Representational State Transfer (REST APIs)

Over the course of the ’00s, another Web services technology, called Representational State Transfer, or REST, began to overtake [all other tools] for the purpose of transferring data. One of the big advantages of programming using REST APIs is that you can use multiple data formats — not just XML, but JSON and HTML as well. As web developers came to prefer JSON over XML, so too did they come to favor REST over SOAP. As Kostyantyn Kharchenko put it on the Svitla blog, “In many ways, the success of REST is due to the JSON format because of its easy use on various platforms.”
Today, JSON is the de-facto standard for exchanging data between web and mobile clients and back-end services. ref

Graph Query Language (GraphQL APIs)

GraphQL on the other hand is a query language which gives the client the power to request specific fields and elements it wants to retrieve from the server. It is, loosely speaking, some kind of SQL for the Web. It therefore has to have knowledge on the available data beforehand which couples clients somehow to the server. ref and another reference
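
To make the contrast with REST a little more concrete, here is a minimal sketch of how a GraphQL request body could be built in Python. The `cars` collection and its `car` and `mpg` fields are hypothetical and do not describe any real server. The only convention assumed is that GraphQL queries are usually sent as a JSON body with a `query` key.

```python
import json

# Hypothetical schema: a `cars` collection exposing `car` and `mpg` fields.
# The client names exactly the fields it wants back.
query = """
{
  cars {
    car
    mpg
  }
}
"""

# GraphQL requests are conventionally sent as an HTTP POST with a JSON body.
graphql_body = json.dumps({"query": query})

# Round-trip the body to confirm it is valid JSON.
decoded = json.loads(graphql_body)
print(decoded["query"])
```

Nothing is sent over the network here; the point is only that the client, not the server, decides which fields appear in the response.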

JavaScript Object Notation

Well, when you’re writing frontend code in JavaScript, getting JSON data back makes it easier to load that data into an object tree and work with it. And JSON formats data in a more succinct way, which saves bandwidth and improves response times when sending messages back and forth to a server.
In a world of APIs, cloud computing, and ever-growing data, JSON has a big role to play in greasing the wheels of a modern, open web. ref

Handling JSON data in Python
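
As a small sketch of the basics, the standard-library `json` module converts between JSON text and Python lists and dictionaries. The two records below are a shortened, illustrative version of the cars data used later in these notes.

```python
import json

# A JSON document arrives as text, for example from a web response.
raw_text = '[{"car": "Mazda RX4", "mpg": 21, "vs": 0}, {"car": "Mazda RX4 Wag", "mpg": 21}]'

# json.loads parses text into Python lists and dictionaries.
records = json.loads(raw_text)
first_car = records[0]["car"]

# A key that is absent from a record is simply not in the dictionary.
# dict.get returns None instead of raising KeyError.
second_vs = records[1].get("vs")

# json.dumps goes the other way: Python objects back to JSON text.
pretty_text = json.dumps(records, indent=2)
print(first_car, second_vs)
print(pretty_text)
```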

Dealing with the World Wide Web

Web requests in Python

Internal Packages

External Packages

Our path

urllib3: https://urllib3.readthedocs.io/en/latest/user-guide.html

# json ships with Python; urllib3 is a third-party package (pip install urllib3)
import urllib3
import json

url = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json"

# %%
http = urllib3.PoolManager()
response = http.request('GET', url)
cars_json = json.loads(response.data.decode('utf-8'))

requests: https://requests.readthedocs.io/en/master/

# in the interactive window
import sys
!{sys.executable} -m pip install requests
# external package
import requests

url = "https://github.com/byuidatascience/data4missing/raw/master/data-raw/mtcars_missing/mtcars_missing.json"

# %%
resp_req = requests.get(url)
cars_json_req = resp_req.json()

JSON for data

  • What do we notice is different about the first two cars?
  • Why do JSON files seem optimal for data sharing?
  • Compare and contrast .csv and .json format benefits.
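
One way to start the .csv versus .json comparison is to write the same records in both formats. This is a sketch with two shortened, illustrative records. In JSON each record carries its own keys, so a key can simply be absent. In CSV every row shares one header, so the gap shows up as an empty field.

```python
import csv
import io
import json

records = [
    {"car": "Mazda RX4", "mpg": 21, "vs": 0},
    {"car": "Mazda RX4 Wag", "mpg": 21},  # no "vs" key in this record
]

# JSON: each record lists only the keys it actually has.
json_text = json.dumps(records)

# CSV: one shared header; restval fills in keys a record does not have.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["car", "mpg", "vs"], restval="")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```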

Motor Trend Car Road Tests data with missing values

[
  {
    "car": "Mazda RX4",
    "mpg": 21,
    "cyl": 6,
    "disp": 160,
    "hp": 110,
    "drat": 3.9,
    "wt": 2.62,
    "qsec": 16.46,
    "vs": 0,
    "am": 1,
    "gear": 4,
    "carb": 4
  },
  {
    "car": "Mazda RX4 Wag",
    "mpg": 21,
    "cyl": 6,
    "disp": 160,
    "hp": 110,
    "drat": 3.9,
    "wt": 2.875,
    "qsec": 17.02,
    "am": 1,
    "gear": 4,
    "carb": 4
  },
  {
    "car": "Datsun 710",
    "mpg": 22.8,
    "cyl": 4,
    "disp": 108,
    "hp": 93,
    "drat": 3.85,
    "wt": 2.32,
    "qsec": 18.61,
    "vs": 1,
    "am": 1,
    "gear": 999,
    "carb": 1
  }
]
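
The records above hide missing values in two different ways: the second car has no "vs" key at all, and the third car has a gear value of 999. The sketch below uses a trimmed copy of those records and treats 999 as a placeholder for "unknown". That reading is an assumption, not something the data file states.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the three records above (only a few keys kept).
cars_sample = [
    {"car": "Mazda RX4", "vs": 0, "gear": 4},
    {"car": "Mazda RX4 Wag", "gear": 4},          # no "vs" key
    {"car": "Datsun 710", "vs": 1, "gear": 999},  # 999 assumed to be a placeholder
]

cars_df = pd.DataFrame(cars_sample)

# The absent key becomes NaN, which turns the vs column into floats.
missing_vs = int(cars_df["vs"].isna().sum())

# Assumption: 999 encodes "unknown", so recode it as NaN.
cars_df["gear"] = cars_df["gear"].replace(999, np.nan)

print(cars_df)
print(cars_df.isna().sum())
```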

JSON to DataFrame

import pandas as pd

# cars = pd.DataFrame.from_dict(cars_json)
cars = pd.json_normalize(cars_json)  # also flattens nested JSON

pd.json_normalize also lets us work with nested JSON; max_level controls how many levels of nesting are flattened.

data = [{'id': 1,
         'name': "Cole Volk",
         'fitness': {'height': 130, 'weight': 60}},
        {'name': "Mose Reg",
         'fitness': {'height': 130, 'weight': 60}},
        {'id': 2, 'name': 'Faye Raker',
         'fitness': {'height': 130, 'weight': 60}}]
pd.json_normalize(data, max_level=0)
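
To see what max_level changes, this sketch runs json_normalize on the same records twice, once with max_level=0 and once with the default. It also shows that the record without an "id" key ends up with a missing value in that column.

```python
import pandas as pd

data = [
    {"id": 1, "name": "Cole Volk", "fitness": {"height": 130, "weight": 60}},
    {"name": "Mose Reg", "fitness": {"height": 130, "weight": 60}},
    {"id": 2, "name": "Faye Raker", "fitness": {"height": 130, "weight": 60}},
]

# max_level=0 keeps each nested fitness dict in a single column.
shallow = pd.json_normalize(data, max_level=0)

# The default flattens nested keys into dotted column names.
flat = pd.json_normalize(data)

print(list(shallow.columns))
print(list(flat.columns))

# The record with no "id" key gets NaN in the id column.
missing_ids = int(flat["id"].isna().sum())
print(missing_ids)
```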

What is missing data?

How pandas handles missingness

Read ‘Handling missing in pandas’

import numpy as np
import pandas as pd

df = (pd.DataFrame(
    np.random.randn(5, 3), 
    index=['a', 'c', 'e', 'f', 'h'],
    columns=['one', 'two', 'three'])
  .assign(
    four = 'bar', 
    five = lambda x: x.one > 0,
    six = [np.nan, np.nan, 2, 2, 1],
    seven = [4, 5, 5, np.nan, np.nan])
  )
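
A few common ways to inspect and handle NaN values are sketched below. To keep the results predictable, the example reuses only the `six` and `seven` columns from above rather than the random ones. The fill value of 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {
        "six": [np.nan, np.nan, 2, 2, 1],
        "seven": [4, 5, 5, np.nan, np.nan],
    },
    index=["a", "c", "e", "f", "h"],
)

# Count missing values per column.
missing_counts = scores.isna().sum()

# Keep only rows with no missing values.
complete_rows = scores.dropna()

# Replace missing values with a chosen constant (0 is arbitrary here).
filled = scores.fillna(0)

# Most reductions skip NaN by default.
mean_six = scores["six"].mean()

print(missing_counts)
print(complete_rows)
print(mean_six)
```

Whether to drop, fill, or keep missing values depends on the question being asked, so these calls are starting points rather than a recipe.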