Day 4: Seeing names with Altair

A few questions

Why are we using Altair?

It is built on Vega-Lite, which sits on top of Vega and D3; these tools are fast and web-based.

Grammar of Graphics: Vega-Lite

Technical Paper
Website
Endorsement

What are we not learning in this course?

Indexing, `.loc[]` and `.iloc[]`

I may not be experienced enough to understand why I should teach you these. I think they all add complexity to what we are learning in the course, so we have elected to avoid them. We will use `reset_index()` a lot. I think MultiIndex features add complication. I have also elected to use `.filter()` instead of `.loc[]` because I like it.

Virtual Environments

Virtual environments appear to be an important tool as you continue to use Python. We will not teach or support them in this course.

matplotlib (and any tool leveraging it)

It feels old, has a bad API, and isn’t declarative.

Why does this feel hard?

Because learning new tools is almost always confusing. I want to make sure you don’t drown, but I also don’t want you to think that you get a floaty for the rest of your life.

Class Activity

Get your files and folders set up to start working on the project.

What should we name our analysis folder and what files should we create in the folder?

data_science_programming > birth_names >
- names.py
- names.md
- notes.md
- data.csv (just in case the internet is down)

Create the folder and files to get prepared.

How should we start each file?

I would do this process for every project.

names.py

Every file starts with the same cells: 1) import packages, 2) load data.
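A minimal sketch of how those two cells might look. The `# %%` cell markers, the column names (`name`, `year`, `Total`), the state columns shown, and the inline sample standing in for `data.csv` are all assumptions for illustration, not the course's actual file.

```python
# %%
# Cell 1: import packages
import io

import pandas as pd

# %%
# Cell 2: load data
# Hypothetical stand-in for data.csv: a tiny inline sample with assumed
# column names (name, year, a couple of state columns, Total).
sample_csv = io.StringIO(
    "name,year,UT,ID,Total\n"
    "Oliver,1910,1,2,3\n"
    "Oliver,1911,2,2,4\n"
    "Emma,1910,5,4,9\n"
)

# In the project this would read the local copy, e.g. pd.read_csv("data.csv").
names = pd.read_csv(sample_csv)

print(names.head())
```

Keeping the load step in its own cell means you can rerun the analysis cells below it without reloading the data each time.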

names.md

Let’s start with the course template

# Project Document Title

__<Author Name>__

## Elevator pitch

## TECHNICAL DETAILS

### GRAND QUESTION 1

### GRAND QUESTION 2

### GRAND QUESTION 3

...

## APPENDIX A (PYTHON SCRIPT)

notes.md

I would copy over the project information and then keep notes on the readings in that section.

Discovering a new data relationship.

Look at the names data and write a short paragraph in your notes describing it.

My answer

We have a row for each name-year. Excluding the name and year columns, we have a column for each state and DC. Finally, there is a Total column that sums over the other columns.

If you can’t describe what a row is in your table then you don’t understand what groups you can talk about with your data.
The columns tell you what information you will be able to evaluate on each ‘group’ or ‘observation’ in your data.
We want tidy data.
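One way to check that description, and to see what a tidier shape could look like, is sketched below. The column names and the two state columns are assumed for illustration, and `melt()` is only one possible reshaping; the course may handle this differently.

```python
import pandas as pd

# Hypothetical sample in the same layout as the names data:
# one row per name-year, one column per state, plus Total.
wide = pd.DataFrame({
    "name": ["Oliver", "Emma"],
    "year": [1910, 1910],
    "UT": [1, 5],
    "ID": [2, 4],
    "Total": [3, 9],
})

# Check the description: Total should equal the sum of the state columns.
state_cols = ["UT", "ID"]
totals_match = (wide[state_cols].sum(axis=1) == wide["Total"]).all()
print(totals_match)

# Reshape so each row is one name-year-state observation.
tidy = wide.melt(
    id_vars=["name", "year"],
    value_vars=state_cols,
    var_name="state",
    value_name="count",
)
print(tidy)
```

After the reshape, a row is a name-year-state combination, so "state" becomes a group you can talk about directly.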

How many unique names do we have in our data?

- Pull the name column out as a series.
- Use the pandas unique function, `pd.unique()`.
- Find the size of the result.
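The steps above can be sketched as follows. The sample DataFrame and the column name `name` are assumptions made so the example runs on its own.

```python
import pandas as pd

# Hypothetical sample; the real data has many more rows and columns.
names = pd.DataFrame({
    "name": ["Oliver", "Oliver", "Emma", "Ava"],
    "year": [1910, 1911, 1910, 1911],
    "Total": [3, 4, 9, 2],
})

name_series = names["name"]            # pull the name column out as a Series
unique_names = pd.unique(name_series)  # array of the distinct names
n_unique = unique_names.size           # how many distinct names there are

print(n_unique)
```

`names["name"].nunique()` should give the same count in one step.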

How many unique years do we have for your name?

- Write a query that filters to your name.
- Pull the year column as a series.
- Find the max.
- Find the min.
- Find the number of unique years.
- Write a short sentence describing your results.
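A minimal sketch of those steps, assuming columns called `name` and `year` and using "Oliver" as a placeholder for your own name:

```python
import pandas as pd

# Hypothetical sample data for illustration.
names = pd.DataFrame({
    "name": ["Oliver", "Oliver", "Oliver", "Emma"],
    "year": [1910, 1911, 1915, 1910],
    "Total": [3, 4, 6, 9],
})

# Filter to one name, then pull the year column as a Series.
my_rows = names.query("name == 'Oliver'")
years = my_rows["year"]

first_year = years.min()
last_year = years.max()
n_years = years.nunique()

# The short sentence describing the results.
print(f"Oliver appears in {n_years} distinct years, from {first_year} to {last_year}.")
```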

"In addition to being a more efficient computation, compared to the masking expression this is much easier to read and understand. Note that the query() method also accepts the @ flag to mark local variables." (jakevdp)
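To make the comparison concrete, here is a small sketch of the masking expression next to the equivalent `query()` call using `@`. The data and the variable `my_name` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
names = pd.DataFrame({
    "name": ["Oliver", "Oliver", "Emma"],
    "year": [1910, 1911, 1910],
    "Total": [3, 4, 9],
})

my_name = "Emma"  # an ordinary local Python variable

# Masking expression: build a boolean Series, then index with it.
masked = names[names["name"] == my_name]

# query(): the @ prefix tells pandas to look up my_name as a local variable.
queried = names.query("name == @my_name")

print(masked.equals(queried))
```

Both approaches return the same rows; the `query()` version reads closer to the question you are asking.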

Which name has been given the most and the least?

Let’s report it in a Markdown table.

- Sum all the years for each name (`groupby()`).
- Create a new DataFrame for the totals.
- Write a query that filters the total data to the max and min.
- Create a markdown table with the information.
  - A. `to_markdown()` requires the tabulate package.
  - B. `to_markdown()` with arguments `showindex` and `floatformat`
  - C. Guidance on `floatformat`
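The steps above can be sketched as follows. The sample data and column names are assumptions. The sketch also assumes a recent pandas, where the index is controlled with `index=` rather than `showindex`, and it falls back to a message if tabulate is not installed.

```python
import pandas as pd

# Hypothetical sample data.
names = pd.DataFrame({
    "name": ["Oliver", "Oliver", "Emma", "Emma", "Ava"],
    "year": [1910, 1911, 1910, 1911, 1911],
    "Total": [3, 4, 9, 8, 2],
})

# Sum all the years for each name; reset_index() turns name back into a column.
totals = names.groupby("name")["Total"].sum().reset_index()

# Filter the totals to the most and least given names.
most = totals["Total"].max()
least = totals["Total"].min()
extremes = totals.query("Total == @most or Total == @least")

# to_markdown() needs the tabulate package; floatformat is passed on to tabulate.
try:
    print(extremes.to_markdown(index=False, floatformat=".0f"))
except ImportError:
    print("to_markdown() requires tabulate: pip install tabulate")
```

The printed table can be pasted straight into names.md. `floatformat` only affects float columns, so it matters once your totals are stored as floats.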