Day 1: Exploring names with pandas

Welcome to Class!

Announcements


Completing Last Week

  1. Quarto - “out of the frying pan and into the fire”
  2. Finishing the Introduction Project

The syllabus includes this section:

Data science community

To earn credit for the DS Community element you must complete two different tasks from the list below. At the end of the semester, you will be asked to report on which tasks you completed and what you learned from them.

  • Attend Data Science Society at least once.
  • Sign up for an email newsletter that will teach you more about data science. Data Science Weekly or Data Elixir are good options.
  • Listen to a podcast episode about data science. Build a Career in Data Science has some excellent episodes.
  • Watch a professional presentation on YouTube about data science. Be prepared to share the link and a summary of the video.
  • Reach out to someone who works in a data-related field and ask them for 15 minutes of their time. Use this time to conduct an “informational interview” and learn more about their responsibilities and career path.
  • Research and apply to at least 5 data-related jobs or internships.

Interview Question: How do you keep up with the current methods in data science?

Don’t say: “Nothing.”

Let’s Code!

  • You are going to hit SHIFT + ENTER thousands of times.
  • We don’t usually source our scripts (that is, run the whole file top to bottom in one go).
  • Think of Python Interactive like a graphing calculator or Excel on steroids.
  • You code in pieces.
  • Rewrite for clarity!

# Pause: can you explain what this code is doing?
import pandas as pd

df = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

Use the cheat sheet to find the functions you would need to implement the following steps.

Group 1

  1. sort my table by column a then
  2. only use the first 2 rows then
  3. calculate the mean of column b.

Group 2

  1. rename column a to duck then
  2. subset to only have duck and b columns then
  3. keep all rows where b is less than 9 then
  4. find the min of duck

Pandas is built to allow for method chaining. Here is a great resource on how to use method chaining: How to write neat pandas code.

  • plotly.express creates a chart object
  • pandas creates a DataFrame object
  • We usually wrap the entire method chain in parentheses () so that each step can sit on its own line.
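Here is one possible way to write the Group 1 and Group 2 steps as method chains, using the small df from the pause exercise. Treat it as a sketch, not the only answer: the method choices (sort_values, head, filter, query, loc) and the result names group1_mean and group2_min are assumptions made for illustration, and the cheat sheet lists other functions that would work just as well.

```python
import pandas as pd

# Same toy table as in the pause exercise
df = pd.DataFrame(
    {"a": [5, 4, 6, 2, 3],
     "b": [7, 8, 9, 10, 11],
     "c": [10, 11, 12, 101, 0]})

# Group 1: sort by a, keep the first 2 rows, take the mean of b
group1_mean = (
    df
    .sort_values("a")      # ascending order of column a
    .head(2)               # first two rows after sorting
    .loc[:, "b"]           # pull column b out as a Series
    .mean()
)

# Group 2: rename a to duck, keep duck and b, filter b < 9, min of duck
group2_min = (
    df
    .rename(columns={"a": "duck"})
    .filter(["duck", "b"])   # keep only these two columns
    .query("b < 9")          # keep rows where b is less than 9
    .loc[:, "duck"]
    .min()
)

print(group1_mean)
print(group2_min)
```

The outer parentheses are what let each step start on a new line with a leading dot. You can comment out one line at a time and re-run the cell to see what each step returns.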

Project 1 - Intro

Understanding your data

You should be able to introduce your data sets to people, the same way you introduce a friend!

  • What does each row represent? If you don’t know, then you don’t understand what groups you can analyze.
  • What does each column represent? If you don’t know, then you don’t understand what information you can evaluate for each group.

Being able to explain your data out loud to someone else follows the same principles as rubber duck debugging.


Introduction to pandas “DataFrame”

What is a pandas DataFrame? We can read the official documentation. I also like the video in this tutorial.

DataFrames come with attributes and built-in functions that can help us get a feel for our data.

Run the code below one line at a time (or use other functions of your choice) to explore the names data. What do you learn?

my_data.columns
my_data.shape
my_data.size
my_data.head()
my_data.describe()
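If you want to see what these attributes and functions return before the real data is loaded, here is a small sketch on a made-up table. The rows and the total column are invented for illustration only and are not taken from the names data.

```python
import pandas as pd

# Hypothetical stand-in for the names data (values are made up)
my_data = pd.DataFrame(
    {"name": ["Ava", "Ava", "Liam", "Noah"],
     "year": [2000, 2001, 2000, 2001],
     "total": [10, 12, 7, 9]})

print(my_data.columns)     # the column labels
print(my_data.shape)       # (rows, columns) as a tuple
print(my_data.size)        # rows * columns, the total number of cells
print(my_data.head())      # the first rows (up to 5 by default)
print(my_data.describe())  # summary statistics for the numeric columns
```

Notice that columns, shape, and size are attributes (no parentheses), while head() and describe() are methods that you call.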

Setup for Project 1

Create the folder and files to get prepared.

  • DS250 > project_1 >
    • names.py
    • names.qmd
    • data.csv (just in case the internet is down)

“How should we start each file?”

I would do this process for every project.

  • names.py: Every file starts with the same cells: 1) import packages, 2) load data.
  • names.qmd: Let’s start with the course template.
  • notes.qmd: Keep project notes on the readings and things you learn.
  • my_cheat_sheet.qmd: Update your own cheat sheet

Read in the data.

#%%
# load packages
import pandas as pd
import plotly.express as px

#%%
# load data
url = "https://github.com/byuidatascience/data4names/raw/master/data-raw/names_year/names_year.csv"
names = pd.read_csv(url)

1. How many unique names does the names dataframe contain? Work with a partner to find the answer. You might want to look at this pandas cheat sheet.

  1. Pull the name column out as a series
  2. Use the pandas unique function pd.unique()
  3. Find the size of the series
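The three steps above can be sketched on a tiny made-up table, so you can check your logic before running it on the full data. The toy_names table is hypothetical, and len() is used here as one way to get the size of the result.

```python
import pandas as pd

# Hypothetical stand-in for the names data (values are made up)
toy_names = pd.DataFrame(
    {"name": ["Ava", "Ava", "Liam", "Noah", "Liam"],
     "year": [2000, 2001, 2000, 2001, 2002]})

# 1. Pull the name column out as a Series
name_series = toy_names["name"]

# 2. Keep each distinct value once
unique_names = pd.unique(name_series)

# 3. Count how many distinct values there are
n_unique = len(unique_names)
print(n_unique)

# A common shortcut; unlike pd.unique, nunique() skips missing values by default
print(name_series.nunique())
```

On the real data, the same steps would start from names["name"] instead of the toy table.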

2. What is the range of years in the names dataframe? Again, work with a partner and use the pandas cheat sheet.

  1. Pull the year column out as a series
  2. Find the max
  3. Find the min
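The same pattern works for the year range. This is a sketch on a hypothetical toy table (the values are made up); on the real data you would start from names["year"].

```python
import pandas as pd

# Hypothetical stand-in for the names data (values are made up)
toy_names = pd.DataFrame(
    {"name": ["Ava", "Ava", "Liam", "Noah", "Liam"],
     "year": [2000, 2001, 2000, 2001, 2002]})

# 1. Pull the year column out as a Series
years = toy_names["year"]

# 2. and 3. Find the largest and smallest year
year_max = years.max()
year_min = years.min()

print(f"Years run from {year_min} to {year_max}")
```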