Day 3: Learning names with pandas

Completing Last Week

  1. Markdown Preview Enhanced Install & Manual
  2. Stop saving files everywhere! Structure your folders. Example 1, Example 2.
  3. Finishing the Introduction Project
    A. data_science_programming > introduction
    B. In introduction I will save my .py, .md, and any .png files that are created with my .py file.
    C. Let’s use the project template
    D. Now let’s use Markdown Preview Enhanced to finish our introduction project.
    E. Now submit it in Canvas.

The syllabus has this section, which says:

Data science community

  1. Attend data science society at least once during the semester.
  2. Register to get a regular email on topics related to data science.

Interview Question: What do you do to keep up with current methods in data science?

Don’t Say: Nothing

Register for a newsletter

Understanding the power of pandas

The data science workflow

  • You are going to hit SHIFT + ENTER thousands of times.
  • We don’t usually source our scripts.
  • Think of Python Interactive like a TI-86 or Excel on steroids.
  • You code in pieces.
  • Rewrite for clarity!
import pandas as pd

# Build a small three-column table to practice on.
df = pd.DataFrame({
    "a": [4, 5, 6],
    "b": [7, 8, 9],
    "c": [10, 11, 12]})
# Can someone read this code in English?

Use the cheat sheet to find the functions you would need to implement the following steps.

I want to:

  1. sort my table by column a then
  2. only use the first 2 rows then
  3. calculate the mean of column b.
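
One possible way to write those three steps is sketched below. It assumes the small a, b, c practice table from this section and picks sort_values, head, and mean from the cheat sheet; other functions would work too.

```python
import pandas as pd

# The small practice table used in these notes.
df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]})

mean_b = (df
    .sort_values("a")   # 1. sort my table by column a
    .head(2)            # 2. only use the first 2 rows
    .b                  # pick out column b
    .mean())            # 3. calculate the mean of column b

print(mean_b)
```

Reading the chain top to bottom should sound like the numbered list read out loud.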

I want to:

  1. rename column a to duck then
  2. subset to only have duck and b columns then
  3. keep all rows where b is less than 9 then
  4. find the min of duck
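
Here is a sketch of the second set of steps, again assuming the a, b, c practice table. The choice of rename, filter, and query is only one option; boolean indexing or loc would do the same job.

```python
import pandas as pd

# The small practice table used in these notes.
df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]})

min_duck = (df
    .rename(columns={"a": "duck"})  # 1. rename column a to duck
    .filter(["duck", "b"])          # 2. keep only the duck and b columns
    .query("b < 9")                 # 3. keep rows where b is less than 9
    .duck                           # pick out the duck column
    .min())                         # 4. find the min of duck

print(min_duck)
```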

pandas and Altair are built to allow for method chaining.

  • Altair works with a Chart object.
  • pandas works with a DataFrame object.
  • We usually wrap the entire chain in () so we can write it one step per line.
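
To see what the parentheses buy us, the sketch below does the same work two ways on the small practice table: once with a new name for every intermediate table, and once as a single chain. The total column and the step1, step2 names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]})

# Step by step, with a throwaway name for each intermediate table.
step1 = df.assign(total=lambda x: x.a + x.b)
step2 = step1.sort_values("total", ascending=False)
stepwise = step2.head(1)

# The same work as one chain. The outer () lets each method sit on its own line.
chained = (df
    .assign(total=lambda x: x.a + x.b)
    .sort_values("total", ascending=False)
    .head(1))

print(chained)
```

Both versions give the same table; the chain just skips the extra names.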
import pandas as pd

flights_url = "https://github.com/byuidatascience/data4python4ds/raw/master/data-raw/flights/flights.csv"
flights = pd.read_csv(flights_url)
# Parse the time_hour text column into real datetimes.
flights['time_hour'] = pd.to_datetime(flights.time_hour, format = "%Y-%m-%d %H:%M:%S")

# Keep only dep_time, then split it into hour and minute columns.
(flights
    .filter(['dep_time'])
    .assign(
      hour = lambda x: x.dep_time // 100,
      minute = lambda x: x.dep_time % 100
      ))
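
The // and % arithmetic in that chain is easier to follow on a few values. The sketch below assumes dep_time is a clock time written as HHMM (so 517 means 5:17) and uses three made-up times rather than the real flights file.

```python
import pandas as pd

# Made-up departure times in HHMM form, not taken from flights.csv.
sample = pd.DataFrame({"dep_time": [517, 1345, 5]})

parts = sample.assign(
    hour=lambda x: x.dep_time // 100,   # floor division drops the last two digits
    minute=lambda x: x.dep_time % 100,  # the remainder keeps the last two digits
)

print(parts)
```

So 517 splits into hour 5 and minute 17, 1345 into 13 and 45, and 5 into 0 and 5.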
import altair as alt
import pandas as pd

url = "https://github.com/byuidatascience/data4python4ds/raw/master/data-raw/mpg/mpg.csv"

mpg = pd.read_csv(url)

# Smooth hwy against displ with LOESS, then draw the fitted line.
chart_loess = (alt.Chart(mpg)
  .encode(
    x = "displ",
    y = "hwy")
  .transform_loess("displ", "hwy")
  .mark_line()
)

chart_loess