Welcome to class!
Announcements
Gratitude Journal
The Star Wars data
What does the data look like?
Take the time to understand how the current data is organized.
Each group should answer these questions:
- Where are the column names?
- What does each row represent?
- What does each column represent?
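The questions above can also be explored in code. The sketch below uses a made-up miniature table (not the real survey file) and assumes a layout where the first data row holds sub-labels and some header cells are blank. The column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical miniature CSV: a header row with one blank cell, then a
# row of sub-labels, then two rows of answers.
csv_text = """id,Have you seen any films?,Which films have you seen?,
,Response,Episode I,Episode II
1,Yes,Episode I,
2,No,,
"""

mini = pd.read_csv(io.StringIO(csv_text))

# Where are the column names? pandas fills blank header cells with "Unnamed: N".
print(mini.columns.tolist())

# What does each row represent? Here row 0 holds sub-labels, not a respondent.
print(mini.iloc[0])

# Overall size: rows and columns.
print(mini.shape)
```

Looking at the column list and the first few rows side by side is usually enough to tell whether the header really lives in one row or is spread over two.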
What do we want the data to look like?
Each group should answer these questions:
- What is the goal of this project, and how does that affect what we want from the data?
- What do we want each row to represent?
- What do we want each column to look like? Pick a few columns from the dataset and try creating an example in Excel.
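As one possible answer to sketch out, suppose the goal calls for one row per respondent and numeric 0/1 columns. The example below is a minimal illustration with made-up data, and the names `seen_any`, `seen_ep1`, and `seen_ep2` are hypothetical.

```python
import pandas as pd

# Hypothetical raw answers: the film title if seen, missing otherwise.
raw = pd.DataFrame({
    "id": [1, 2],
    "seen_any": ["Yes", "No"],
    "seen_ep1": ["Episode I", None],
    "seen_ep2": [None, None],
})

# Target shape: one row per respondent, one 0/1 column per question.
tidy = raw.assign(
    seen_any=raw["seen_any"].eq("Yes").astype(int),
    seen_ep1=raw["seen_ep1"].notna().astype(int),
    seen_ep2=raw["seen_ep2"].notna().astype(int),
)
print(tidy)
```

Your group may settle on a different target layout. The point is to write down a small example of it before cleaning the full dataset.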
Cleaning data takes time
Maybe not 80% of your time, but it does take time!
Data science is frequently about doing bespoke analysis, which means creating and labelling unique datasets. No matter how cleanly formatted or standardized a dataset is, it likely needs some work.
I would argue that spending time working with data to transform, explore and understand it better is absolutely what data scientists should be doing. This is the medium they are working in. Understand the material better and you’ll get better insights. ref
Structure your project, structure your thinking
Tableau on tidying data
- Think about your data holistically
- Know the basic structure of your data
- Keep track of your steps
- Spot check throughout
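One way to keep track of steps and spot check along the way is to put each cleaning step in a small named function and follow it with a quick assertion. This is only a sketch with made-up data. The helper names `drop_label_row` and `encode_yes_no` are hypothetical, not part of any library.

```python
import pandas as pd

def drop_label_row(df):
    """Step 1: remove the first row, assumed here to hold sub-labels."""
    return df.iloc[1:].reset_index(drop=True)

def encode_yes_no(df, col):
    """Step 2: map Yes/No answers in `col` to 1/0."""
    out = df.copy()
    out[col] = out[col].map({"Yes": 1, "No": 0})
    return out

# Hypothetical raw table whose first row is a sub-label row.
raw = pd.DataFrame({"id": [None, 1, 2], "seen_any": ["Response", "Yes", "No"]})

step1 = drop_label_row(raw)
assert len(step1) == len(raw) - 1  # spot check: exactly one row dropped

step2 = encode_yes_no(step1, "seen_any")
assert step2["seen_any"].isin([0, 1]).all()  # spot check: no unmapped answers
```

Keeping the intermediate results (`step1`, `step2`) makes it easy to go back and see which step introduced a problem.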
Compartmentalize and organize your scripts and data
- Best practices for organizing data science projects
- How to organize your Python data science project
- Cookiecutter Data Science
- Data Science Project Folder Structure
Load the Star Wars data
What happens when you run this code?
# %%
import pandas as pd
import altair as alt
import numpy as np
# Raw CSV of the FiveThirtyEight Star Wars survey
url = 'https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv'
# Read the file with pandas' default settings
dat = pd.read_csv(url)
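If a read like this fails with a `UnicodeDecodeError`, one common cause is that the file is not UTF-8 encoded, which is what `pd.read_csv` assumes by default. The sketch below reproduces that situation with made-up bytes rather than the survey file. Whether the Star Wars file needs the same treatment, and which encoding it uses, is for you to check.

```python
import io
import pandas as pd

# Made-up CSV bytes (not the real survey file), encoded as ISO-8859-1.
raw = "name,seen\nPadmé,Yes\nLuke,No\n".encode("ISO-8859-1")

# Reading with the default UTF-8 encoding fails on the accented character.
try:
    pd.read_csv(io.BytesIO(raw))
    failed = False
except UnicodeDecodeError:
    failed = True
print("default read failed:", failed)

# Naming the encoding explicitly lets pandas decode the bytes.
fixed = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(fixed)
```

The same `encoding` argument works when the first argument to `pd.read_csv` is a URL or a file path.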