Pandas and LetsPlot
Skill Builder
Pandas and LetsPlot
For this skill builder, we are exploring some important functions in the package of pandas and LetsPlot. DS programming requires a lot of data wrangling. Using the proper functions, we can create concise and comprehensive codes. You should be exposed to a few functions through the readings this week.
You may want to at least scan the readings before beginning this task since this serves as an assessment of your understanding of the assigned readings. This should be able to be finished within 60 minutes. You should work through it on your own or in a group based in your professors instruction.
Data Import
Run the following code to import the data we need for this skill builder:
# package import
import numpy as np
import pandas as pd
from lets_plot import *
=True)
LetsPlot.setup_html(isolated_frame
# data import
= 'https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/AER/Guns.csv'
url = pd.read_csv(url) df
Make sure the variable df
is correctly assigned in your environment and finish the following exercises. You can read the documentation of the data on this page - https://vincentarelbundock.github.io/Rdatasets/doc/AER/Guns.html
Exercise 1
One of the first things we can do to a freshly imported data is to check its columns. This will help us understand the basic structure of the dataframe(table).
Using one line of code, select all the columns in dat, assign it to a variable called col_list.
Every dataframe has an attribute “columns”.
Accessing this attribute will give you a list of all column names
We often want to know the dimension of a dataframe. How many columns are in the dataset? How many rows are in the dataset?
Using one line of code, show the number of columns and rows in
df
.
Every dataframe has an attribute “shape”.
Accessing this attribute will give you the dimension of a datafarme
Now run df.head()
. It will print out the first 5 rows of data in df
.
Just from looking at the output, what column(s) seems to be redundant with the row number?
There is one column that serves as nothing but a row counter, that columns is redundant.
Exercise 2
After a brief investigation of the data, we will clean up the data. By cleaning up, we are trying to filter down df
so this only holds data we need. We will first get rid of the extra column we found in the previous excercise.
Using one line of code, drop the redundant column using the variable col_list (created in excercise 1)
Use drop()
.
Understand what “axis” is as a parameter of drop()
.
Your function should looks like this:
df.drop([col_list[_]], axis = _)
fill the “_“’s with the correct values and assign the output to df
.
Don’t forget to save the changes in df
. Run df.head()
to make sure the column is dropped in df
.
Exercise 3
We have filtered df
vertically by dropping a column. Now we will try to filter df
horizontally, meaning we will get rid of some the rows.
We can do that by applying a condition to df
. A condition is an expression that can be evaluated as True
/False
. For example, 8 > 5
is an expression that evaluates to be True. This is trivial because 8 will always be greater than 5.
Run the code below:
what is the difference between exp1 and exp2?
= 8 > 5
exp1 = df.violent < 300 exp2
Try type() on else variable OR calling else variable.
Run ths code below:
By putting
df.violent < 300
, and the violent column fromdf
into a dataframe, what is the relationship between the two columns?
= pd.DataFrame({"df.violent < 300" : exp2,
exp "violent value from dat" : df.violent})
exp
Try computing df.violent[n] < 300
and (df.violent < 300)[n]
where n is less than the number of row. The two expressions will always be the same as long as n is less than the number of rows.
Using
query()
to filter down thedf
so that it only contains the data for idaho
query() takes in expressions and filters down data.
Don’t forget to save the changes in df
. Run df.shape()
to make sure the there are 23 rows and 13 columns.
Exercise 4
Besides filtering, we can manipulate the data by adding new data to it. By adding a new column to the data, we assign a new value to each row.
Using
assign()
, create a new column that show the ratio between murder rate and violent rate.
Use assign()
You see get the ratio by computing this code:
df.murder/df.violent
Exercise 5
Create a scatter plot that shows the relationship between murder rate and violent rate for the state of Idaho. Your chart should show murder rate as the x-axis, violent as the y-axis.
Can you mimic this plot while using LetsPlot? https://lets-plot.org/
Because You’re Extra
Exercise 6
Using a line of code, filter down the data set so that it only shows the data in years between 1993 and 1997.
Exercise 7
Create a line chart that show prisoners numbers for the state of Idaho, Utah, and Oregon.
Your chart should show year as the x-axis, prisoner as the y-axis, states as different colours, along with an appropriate title.
Exercise 8
Without using
query()
, finshed the data wrangling in question 2,5 and 6.
See the script.