Day 1: The war with Star Wars

Welcome to class!

Spiritual Thought

Announcements

Next Thursday, 3/14, at 11:30 in RKS 229, John Stevens, Chair of the Department of Mathematics and Statistics at Utah State University, will speak on “Graduate School: What/Why/How, including at USU”.

After that (from 12:45 to 1:45), Dr. Stevens is providing a free catered lunch in the Manwaring Center. There is room for 15 students who are interested in learning more about a graduate degree in mathematics or statistics at USU. If you can come to lunch, please sign up here. (There is no cost for the lunch unless you are unable to attend and don’t let us know at least 5 hours in advance, in which case your BYU-I account will be billed $17.)

Dr. Stevens has reserved some time between 2:00 and 3:00 to meet one-on-one with potential graduate students and talk about mathematics and statistics at USU. It would be a great time to meet with him, talk about the courses you’ve had and plan to take, and get tips on getting into graduate school. If you would like 10 minutes to chat with him, sign up here.

  1. Project 4 thoughts
    • Feature Importances - Sorted Bar Graph, not unsorted tables
    • And the winner is…

The Star Wars data

Load the Star Wars data

# %%
import pandas as pd
import altair as alt
import numpy as np

url = 'https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv'

dat = pd.read_csv(url)



What do the data look like?

Take the time to understand how the current data is organized.

First things first…

Each group should answer these questions:

  1. Where are the column names?
  2. What does each row represent?
  3. What does each column represent?
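One way to work through these questions is to look at the column labels and the first data row side by side. The sketch below uses a tiny made-up CSV (the `raw` text and its question wording are hypothetical, not the real survey file) in which part of the header has spilled into the first data row. The same three inspection calls can be run on `dat`.

```python
import io
import pandas as pd

# Hypothetical miniature of a survey export: the last header cell is empty,
# and the first data row holds sub-labels instead of a respondent's answers.
raw = io.StringIO(
    "RespondentID,Have you seen any of the films?,Which films have you seen?,\n"
    ",Response,Episode I,Episode II\n"
    "1,Yes,Episode I,\n"
    "2,No,,\n"
)
toy = pd.read_csv(raw)

columns = list(toy.columns)   # labels pandas took from the first line
first_row = toy.iloc[0]       # may contain more header text, not an answer
shape = toy.shape             # (rows, columns)

print(columns)
print(first_row)
print(shape)
```

In this toy frame pandas fills the empty header cell with a placeholder name, which is a hint that the real column label lives somewhere else.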

What do we want the data to look like?

Each group should answer these questions:

  1. What is the goal of this project, and how does that affect what we want from the data?
  2. What do we want each row to represent?
  3. What do we want each column to look like? Pick a few columns from the dataset and try creating an example in Excel.

Cleaning data takes time

Maybe not 80% of your time, but it does take time!

Data science is frequently about doing bespoke analysis which means creating and labelling unique datasets. No matter how cleanly formatted or standardized a dataset is, it likely needs some work.

I would argue that spending time working with data to transform, explore and understand it better is absolutely what data scientists should be doing. This is the medium they are working in. Understand the material better and you’ll get better insights. ref


Structure your project, structure your thinking

Tableau on tidying data

  1. Think about your data holistically
  2. Know the basic structure of your data
  3. Keep track of your steps
  4. Spot check throughout
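The last two tips can be sketched in code: keep the original frame, write each cleaning step as its own named result, and compare before and after. The frame and column names below (`respondent_id`, `seen_any`) are made up for illustration and are not taken from the survey file.

```python
import pandas as pd

# Hypothetical raw answers
before = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "seen_any": ["Yes", "No", "Yes"],
})

# Step 1: recode Yes/No as booleans, leaving `before` untouched
after = before.assign(seen_any=before["seen_any"].map({"Yes": True, "No": False}))

# Spot checks: same number of rows, no answers lost, counts still line up
assert len(after) == len(before)
assert after["seen_any"].isna().sum() == 0

counts_before = before["seen_any"].value_counts().to_dict()
counts_after = after["seen_any"].value_counts().to_dict()
print(counts_before, counts_after)
```

A value missing from the mapping would turn into NaN, so the second check catches answers the recode did not anticipate.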

Compartmentalize and organize your scripts and data


What are codecs and encodings?
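A codec is the rule for turning text into bytes and back again. The same bytes read with the wrong codec either fail or come out garbled. The sketch below shows this with a short made-up string, then passes the `encoding` argument to `pd.read_csv`. The choice of `latin-1` here is only an example for the toy bytes, not a statement about any particular file.

```python
import io
import pandas as pd

text = "café"
utf8_bytes = text.encode("utf-8")      # é becomes two bytes
latin1_bytes = text.encode("latin-1")  # é becomes one byte

# Decoding with the wrong codec can raise an error
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding with the codec that wrote the bytes recovers the text
recovered = latin1_bytes.decode("latin-1")

# read_csv takes the same kind of hint through its `encoding` argument
csv_bytes = "name\ncafé\n".encode("latin-1")
toy = pd.read_csv(io.BytesIO(csv_bytes), encoding="latin-1")
print(utf8_bytes, latin1_bytes, decoded_ok, recovered, toy["name"][0])
```

If loading a CSV raises a `UnicodeDecodeError`, trying a different `encoding` value is a reasonable first step.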


The .str functions in pandas
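The `.str` accessor applies a string method to every value in a Series and skips missing values. Here is a minimal sketch on made-up values; the answers and labels are illustrative, not copied from the survey.

```python
import pandas as pd

# Hypothetical messy answers, including a missing one
answers = pd.Series([" Yes ", "no", "YES", None])

# Chain .str methods: trim whitespace, then lowercase
cleaned = answers.str.strip().str.lower()
seen = cleaned.eq("yes")  # missing answers compare as False

# Hypothetical long labels shortened with replace and split
labels = pd.Series([
    "Star Wars: Episode I  The Phantom Menace",
    "Star Wars: Episode II  Attack of the Clones",
])
short = (
    labels.str.replace("Star Wars: ", "", regex=False)  # literal replacement
          .str.split("  ")                               # split on the double space
          .str[0]                                        # keep the first piece
)
print(cleaned.tolist(), seen.tolist(), short.tolist())
```

Other commonly used methods on the same accessor include `.str.contains`, `.str.startswith`, and `.str.extract`.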