On this page
Project 2 WorkBook
Tutoring Lab Info
The data science lab is a resource you can use in person, online, and in Slack.
Data Manipulation
1. replace()
function
The replace()
function is used to substitute specified values in a DataFrame or Series, facilitating data cleaning and transformation by exchanging specific values with desired ones. In the example, old_value
denotes the value to be replaced, and new_value
signifies the replacement value. The presence of inplace=True
ensures that the modification is applied directly to the DataFrame.
2. min()
function
The min()
function is employed to determine the minimum value in a DataFrame or Series. It is particularly useful for extracting the smallest element from a numerical dataset.
3. max()
function
The max()
function is utilized to find the maximum value in a DataFrame or Series. This function is handy for retrieving the largest element from a numerical dataset.
4. groupby()
function
The groupby()
function is applied to split the data into groups based on some criteria. It is often used in conjunction with aggregate functions like sum()
to perform operations on each group independently.
5. sum()
function
The sum()
function is used to calculate the sum of values in a DataFrame or Series. It is commonly employed with the groupby()
function to find the total of specific groups in the data.
6. index
attribute
The index
attribute is employed to access the index (row labels) of a DataFrame. It is useful for extracting information related to the row labels and their positions.
7. round()
function
The round()
function is utilized to round off numeric values to a specified number of decimal places. It is helpful for controlling the precision of numerical data in a DataFrame.
8. Indexing a String
To index a string in a DataFrame, the square brackets []
are used along with the column name containing the string data. This allows for accessing and manipulating specific characters or substrings within the string.
9. to_json()
function
The to_json()
function is employed to convert a DataFrame to its JSON representation. The orient
parameter determines the format of the JSON output, such as ‘split’, ‘records’, ‘index’, etc.
10. isin()
function
The isin()
function is used to filter data based on a list of values. It returns a boolean Series indicating whether each element in the Series is contained in the specified list.
11. all()
function
The all()
function is employed to check if all elements in a boolean Series are True. It is commonly used in conjunction with conditions to evaluate whether a certain condition holds for all elements.
12. fillna()
function
The fillna()
function is used to fill missing (NaN) values in a DataFrame with a specified value or using a specified method. It helps in handling missing data by providing a way to replace or impute missing values.
13. mean()
function
The mean()
function calculates the mean (average) of values in a DataFrame or Series. It is useful for obtaining the central tendency of numerical data, providing insights into the data distribution.
Getting started with Lets-Plot
What makes a chart look good?
Note: The following code snippets use Plotly Express for demonstartion. The course uses Lets-Plot.
Plotly Chart Structure
Size of Chart
Width and Height
Show the code
The width
and height
parameters in the update_layout
method are used to set the width and height of the plot in pixels, respectively. In this example, the width is set to 600 pixels, and the height is also set to 600 pixels. Adjust these values according to your desired dimensions for the scatter plot.
Title and Subtitle
Title
Show the code
The title_text
parameter in the update_layout
method is used to set the title of the plot. In this example, the title is set to “Bar Chart Example”. You can customize the title by changing the value assigned to title_text
to better describe the content or purpose of your bar chart.
Title and subtitle - Title w/ Subtitle 1
Show the code
fig = px.bar(data, x='cty', y='hwy')
fig.update_layout(
title_text="City vs Highway MPG Bar Chart",
title_font=dict(color="red"),
title_font_size=18,
title_y=0.95,
title_x=0.5
)
fig.add_annotation(
text="Your Annotation Text",
font=dict(color="blue"), # Choose your desired font color
x=0.5,
y=0.9,
showarrow=False
)
fig.show()
The title_text
parameter in the update_layout
method is used to set the main title of the plot, and additional parameters like title_font
, title_font_size
, title_y
, and title_x
are used to customize the appearance and position of the title.
The add_annotation
method is used to add an annotation or subtitle to the plot. In this example, it adds a blue text annotation with the content “Your Annotation Text” at a specified position (x=0.5, y=0.9) relative to the plot.
Adjust the values of these parameters to customize the appearance and position of the title and annotation according to your preferences.
Title and subtitle - Title w/ Subtitle 2
Show the code
The title_text
parameter in the update_layout
method is used to set the main title of the plot, and additional parameters like title_font_size
, title_y
, and title_x
are used to customize the appearance and position of the title.
The add_annotation
method is used to add an annotation or subtitle to the plot. In this example, it adds a text annotation with the content “Your Annotation Text” at a specified position (x=0.5, y=0.9) relative to the plot.
Adjust the values of these parameters to customize the appearance and position of the title and annotation according to your preferences.
Group Variable
Show the code
The x
parameter in the px.bar
function is set to ‘manufacturer’, which means the bars will be grouped by the ‘manufacturer’ variable on the x-axis. The y
parameter is set to a list [‘cty’, ‘hwy’], indicating that two sets of bars will be plotted for each manufacturer, one for ‘cty’ and another for ‘hwy’.
The barmode='group'
parameter ensures that the bars are grouped for each ‘manufacturer’.
The update_layout
method is used to set the titles for the x-axis (xaxis_title)
, y-axis (yaxis_title)
, and the overall plot (title_text)
. In this example, the plot represents the average city and highway mileage by manufacturer.
Axis formatting - Axis Scale removing Zero
Show the code
fig = px.bar(data, x='manufacturer', y=['cty', 'hwy'], barmode='group')
fig.update_layout(
xaxis_title="Manufacturer",
yaxis_title="Mileage",
title_text="Average City and Highway Mileage by Manufacturer",
xaxis=dict(showline=True, showgrid=False),
yaxis=dict(zeroline=False, showline=True, showgrid=False),
)
fig.show()
The xaxis_title
and yaxis_title
parameters in the update_layout
method are used to set the titles for the x-axis
and y-axis
, respectively.
The xaxis and yaxis dictionaries in the update_layout
method provide additional formatting options for the x-axis and y-axis. In this example:
xaxis=dict(showline=True, showgrid=False)
ensures that the x-axis has a visible line but no grid lines. yaxis=dict(zeroline=False, showline=True, showgrid=False)
ensures that the y-axis has no zero line (zeroline=False)
, a visible line, and no grid lines. These settings help customize the appearance of the plot by controlling the visibility of axis lines and grid lines.
Axis formatting - Axis Domain Sizing
Show the code
The xaxis_title
and yaxis_title
parameters in the update_layout
method are used to set the titles for the x-axis and y-axis, respectively.
The xaxis
dictionary in the update_layout
method includes the domain
parameter, which is set to [0.1, 0.9]. This parameter controls the size of the x-axis domain, determining the portion of the total width of the plot that the x-axis occupies. In this example, the x-axis is set to span from 10% to 90% of the total width.
Adjust the values of the domain
parameter as needed to control the sizing of the x-axis in your plot.
Reference marks - Verticle Reference Line with Color
Show the code
fig = px.bar(data, x='manufacturer', y=['cty', 'hwy'], barmode='group')
fig.update_layout(
xaxis_title="Manufacturer",
yaxis_title="Mileage",
title_text="Average City and Highway Mileage by Manufacturer"
)
fig.add_shape(
dict(
type="line",
x0="Your_X_Value", # Specify the x-coordinate of the line
x1="Your_X_Value", # Specify the x-coordinate of the line
y0=0,
y1=1,
line=dict(color="red"), # Specify the color of the line
)
)
fig.show()
The add_shape
method is used to add a reference line to the plot. In this example, a vertical reference line is added to the x-axis at the specified x-coordinate value.
The type="line"
parameter specifies that the added shape is a line. The x0 and x1 parameters are set to “Your_X_Value” to specify the x-coordinates where the line starts and ends. Replace “Your_X_Value” with the actual x-coordinate value where you want the reference line.
The y0
and y1
parameters set the starting and ending points on the y-axis. In this example, they are set to 0 and 1, respectively.
The line
dictionary inside the shape specifies the attributes of the line, such as the color. In this case, the line color is set to red. Adjust the values as needed to customize the appearance of the reference line.
Reference marks - Verticle Reference Line with Text
Show the code
fig = px.bar(data, x='manufacturer', y=['cty', 'hwy'], barmode='group')
fig.update_layout(
xaxis_title="Manufacturer",
yaxis_title="Mileage",
title_text="Average City and Highway Mileage by Manufacturer"
)
fig.add_shape(
dict(
type="line",
x0="Your_X_Value", # Specify the x-coordinate of the line
x1="Your_X_Value", # Specify the x-coordinate of the line
y0=0,
y1=1,
line=dict(color="red"), # Specify the color of the line
)
)
fig.add_annotation(
text="Your Text",
x="Your_X_Value", # Specify the x-coordinate of the text
y=500, # Adjust the y-coordinate of the text as needed
showarrow=False
)
fig.show()
The add_shape
method is used to add a reference line to the plot. In this example, a vertical reference line is added to the x-axis at the specified x-coordinate value.
The type="line"
parameter specifies that the added shape is a line. The x0 and x1 parameters are set to “Your_X_Value” to specify the x-coordinates where the line starts and ends. Replace “Your_X_Value” with the actual x-coordinate value where you want the reference line.
The y0
and y1
parameters set the starting and ending points on the y-axis. In this example, they are set to 0 and 1, respectively.
The line
dictionary inside the shape specifies the attributes of the line, such as the color. In this case, the line color is set to red.
The add_annotation
method is used to add text annotation to the plot. The text parameter is set to “Your Text”, and the x and y parameters specify the coordinates where the text will be placed. Adjust the values of these parameters as needed to customize the appearance of the reference line and text.