Day 1: The war with Star Wars

Welcome to class!

Announcements

Gratitude Journal


The Star Wars data

What does the data look like?

Take the time to understand how the current data is organized.

Each group should answer these questions:

  1. Where are the column names?
  2. What does each row represent?
  3. What does each column represent?

What do we want the data to look like?

Each group should answer these questions:

  1. What is the goal of this project, and how does that affect what we want from the data?
  2. What do we want each row to represent?
  3. What do we want each column to look like? Pick a few columns from the dataset and try creating an example in Excel.

Cleaning data takes time

Maybe not 80% of your time, but it does take time!

Data science is frequently about doing bespoke analysis which means creating and labelling unique datasets. No matter how cleanly formatted or standardized a dataset is, it likely needs some work.

I would argue that spending time working with data to transform, explore and understand it better is absolutely what data scientists should be doing. This is the medium they are working in. Understand the material better and you’ll get better insights. ref


Structure your project, structure your thinking

Tableau on tidying data

  1. Think about your data holistically
  2. Know the basic structure of your data
  3. Keep track of your steps
  4. Spot check throughout
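
One way to practice "keep track of your steps" and "spot check throughout" is to write each cleaning step as a named line of code and follow it with a quick check. This is only a sketch on made-up data: the `respondent` and `seen_any` column names are hypothetical and do not come from the survey file.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
raw = pd.DataFrame({
    "respondent": [1, 2, 3],
    "seen_any": ["Yes", "No", "Yes"],
})

# Step 1: recode the Yes/No answers as booleans.
clean = raw.assign(seen_any=raw["seen_any"].map({"Yes": True, "No": False}))

# Spot check: no rows were lost and no answer failed to map.
assert len(clean) == len(raw)
assert clean["seen_any"].isna().sum() == 0

# Spot check: count the True answers and compare with the raw data by eye.
n_yes = int(clean["seen_any"].sum())
print(n_yes)
```

Keeping the untouched `raw` table next to `clean` makes it easy to compare before and after at every step.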

Compartmentalize and organize your scripts and data


Load the Star Wars data

What happens when you run this code?

# %%
import pandas as pd
import altair as alt
import numpy as np

url = 'https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv'

dat = pd.read_csv(url)


What are codecs and encodings?
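
A codec is a rule for turning text into bytes (encoding) and bytes back into text (decoding). Reading bytes with a different codec than the one that wrote them can fail or produce garbage. The sketch below uses a made-up word, not a value from the survey file, to show the idea.

```python
# Hypothetical sample text with one non-ASCII character.
text = "Fiancé"

# Encode the text to bytes with the Latin-1 codec.
raw = text.encode("latin-1")

# Try to read those bytes back as UTF-8, the codec pandas assumes by default.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the codec that wrote the bytes recovers the original text.
decoded = raw.decode("latin-1")
print(utf8_ok, decoded)
```

If the `pd.read_csv(url)` call above raises a `UnicodeDecodeError`, one thing to try is the `encoding` argument of `pd.read_csv`, for example `encoding="latin-1"`. Which codec the file actually uses is an assumption to verify, not something the slides state.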


The .str functions in pandas
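
The `.str` accessor applies a string method to every value in a pandas Series and skips missing values. Below is a minimal sketch on made-up answers; the values are hypothetical and only meant to show how the calls chain.

```python
import pandas as pd

# Hypothetical messy answers, including one missing value.
answers = pd.Series(["  Yes", "no ", "YES", None])

# Chain .str methods: trim whitespace, then lowercase.
cleaned = answers.str.strip().str.lower()

# Test each value for a substring; na=False treats missing values as False.
is_yes = cleaned.str.contains("yes", na=False)

# Replace a literal substring (regex=False means no pattern matching).
recoded = cleaned.str.replace("yes", "1", regex=False)

print(cleaned.tolist())
print(is_yes.tolist())
```

The same accessor works on column labels through `dat.columns.str`, which can help when column names need cleaning.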