Day 4: Seeing names with Altair

A few questions

Why are we using Altair?

It is built on Vega (by way of Vega-Lite) and D3, which are fast and web-based.

Grammar of Graphics: Vega-Lite

What are we not learning in this course?

Indexing, .loc[] and .iloc[]

I may not be experienced enough to understand why I should teach you these. I think they all add complexity to what we are learning in the course, so we have elected to avoid them. We will use reset_index() a lot. I think MultiIndex features create complications. I have also elected to use .filter() instead of .loc[] because I like it.

Virtual Environments

Virtual environments appear to be an important tool as you continue to use Python. We will not teach or support them in this course.

matplotlib (and any tool leveraging it)

It feels old, has a bad API, and isn’t declarative.

Why does this feel hard?

Because learning new tools is almost always confusing. I want to make sure you don’t drown, but I also don’t want you to think that you get a floaty for the rest of your life.

Class Activity

Get your files and folders set up to start working on the project.

  • data_science_programming > birth_names >
    • names.py
    • names.md
    • notes.md
    • data.csv (just in case the internet is down)

Create the folder and files to get prepared.

I would do this process for every project.

names.py

Every file starts with the same two cells: 1) import packages, 2) load data.
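As a rough sketch, the top of names.py could look like the block below. The `# %%` cell markers, the column names (name, year, ID, UT, Total), and the inline sample rows are all assumptions made so the sketch runs anywhere; in the project you would read the course data (or the local data.csv) instead.

```python
# %%
# Cell 1: import packages
import io

import pandas as pd
# import altair as alt  # add this once you start charting

# %%
# Cell 2: load data
# Assumption: a tiny made-up sample stands in for the course data.
# In the project, pass the course URL or "data.csv" to pd.read_csv().
sample_csv = io.StringIO(
    "name,year,ID,UT,Total\n"
    "Ada,1990,2.0,3.0,5.0\n"
    "Ada,1991,3.0,4.0,7.0\n"
    "Ben,1990,4.0,6.0,10.0\n"
    "Ben,1991,5.0,7.0,12.0\n"
)
dat = pd.read_csv(sample_csv)

# One row per name-year, one column per state, plus a Total column.
print(dat.head())
```

Keeping these two cells identical across projects means every script can be rerun from the top without hunting for where the data came from.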

names.md

Let’s start with the course template:

# Project Document Title

__<Author Name>__

## Elevator pitch

## TECHNICAL DETAILS

### GRAND QUESTION 1

### GRAND QUESTION 2

### GRAND QUESTION 3

...

## APPENDIX A (PYTHON SCRIPT)

notes.md

I would copy over the project information and then keep notes on the readings in that section.

Discovering a new data relationship.

  1. Look at the names data and write a short paragraph in your notes describing it.

We have a row for each name-year. Excluding the name and year columns, we have a column for each state and DC. Finally, there is a Total column that sums over the state columns.

  • If you can’t describe what a row is in your table then you don’t understand what groups you can talk about with your data.
  • The columns tell you what information you will be able to evaluate on each ‘group’ or ‘observation’ in your data.
  • We want tidy data.

Count the unique names:

  1. Pull the name column out as a series.
  2. Use the pandas unique function pd.unique().
  3. Find the size of the series.

Find the years in which your name appears:

  1. Write a query that filters to your name.
  2. Pull the year column as a series.
  3. Find the max.
  4. Find the min.
  5. Find the number of unique years.
  6. Write a short sentence describing your results.
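The steps above can be sketched as follows. The small DataFrame and its column names (name, year, Total) are made up for illustration; swap in the real names data and your own name.

```python
import pandas as pd

# Hypothetical stand-in for the names data.
dat = pd.DataFrame({
    "name": ["Ada", "Ada", "Ada", "Ben", "Ben", "Cy"],
    "year": [1990, 1991, 1993, 1990, 1992, 1991],
    "Total": [5.0, 7.0, 6.0, 10.0, 12.0, 3.0],
})

# How many unique names are there?
names = dat["name"]                      # pull the column out as a Series
unique_names = pd.Series(pd.unique(names))
n_unique = unique_names.size             # number of distinct names

# Which years does one name cover?
mine = dat.query("name == 'Ada'")        # filter to a single name
years = mine["year"]                     # pull the year column as a Series
first_year = years.min()
last_year = years.max()
n_years = years.nunique()                # number of distinct years

print(f"{n_unique} unique names; Ada appears in {n_years} years "
      f"between {first_year} and {last_year}.")
```

Note that a name can skip years, so the number of unique years may be smaller than the span between the min and the max.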

“In addition to being a more efficient computation, compared to the masking expression this is much easier to read and understand. Note that the query() method also accepts the @ flag to mark local variables.” (jakevdp)
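To make the comparison concrete, here is a minimal sketch of the same filter written as a masking expression and as a query() that uses @ to reference a local variable. The data and the variable name my_name are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the names data.
dat = pd.DataFrame({
    "name": ["Ada", "Ada", "Ben", "Cy"],
    "year": [1990, 1991, 1990, 1991],
    "Total": [5.0, 7.0, 10.0, 3.0],
})

my_name = "Ada"  # a local Python variable

# Masking expression: build a boolean Series, then index with it.
masked = dat[dat["name"] == my_name]

# query(): the same filter as a string; @my_name pulls in the local variable.
queried = dat.query("name == @my_name")

print(queried)
```

Both approaches return the same rows; the query() version reads closer to a sentence, which is why we lean on it in this course.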

Let’s report it in a Markdown table.

  1. Sum all the years for each name (groupby()).
  2. Create a new DataFrame for the totals.
  3. Write a query that filters the total data to the max and min.
  4. Create a markdown table with the information.
    A. to_markdown() requires the tabulate package.
    B. to_markdown() takes the arguments index and floatfmt (tabulate calls the first one showindex; current pandas expects index).
    C. Guidance on floatfmt
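The four steps above can be sketched like this. The sample data, the column names, and the ".0f" format are assumptions for illustration; the try/except is only there because to_markdown() fails with an ImportError when tabulate is not installed.

```python
import pandas as pd

# Hypothetical stand-in for the names data.
dat = pd.DataFrame({
    "name": ["Ada", "Ada", "Ada", "Ben", "Ben", "Cy"],
    "year": [1990, 1991, 1993, 1990, 1992, 1991],
    "Total": [5.0, 7.0, 6.0, 10.0, 12.0, 3.0],
})

# Steps 1 and 2: sum Total over all years for each name.
# reset_index() turns the grouped result back into a plain DataFrame.
totals = dat.groupby("name")["Total"].sum().reset_index()

# Step 3: keep only the names with the largest and smallest totals.
hi = totals["Total"].max()
lo = totals["Total"].min()
extremes = totals.query("Total == @hi or Total == @lo")

# Step 4: render a Markdown table (needs the tabulate package).
try:
    table = extremes.to_markdown(index=False, floatfmt=".0f")
    print(table)
except ImportError:
    table = None
    print("Install tabulate first: pip install tabulate")
```

The printed table can be pasted straight into names.md. floatfmt takes a Python format spec, so ".0f" drops the decimals and ".2f" would keep two.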