State the additive model with harmonic seasonal component
Simulate a time series with harmonic seasonal components
Identify an appropriate function to model the trend in a given time series
Identify a parsimonious set of harmonic terms for use in a regression model
Fit the additive model with harmonic seasonal component to real-world data
Evaluate residuals using a correlogram and partial correlogram to ensure they meet the assumptions
Apply model selection criteria
Use AIC to aid in model selection
Preparation
Review Section 5.6
Class Activity: Monthly Average High Temperature in Rexburg (20 min)
Visualization of the Time Series
Consider the mean monthly high temperature in Rexburg.
Show the code
weather_df <- rio::import("https://byuistats.github.io/timeseries/data/rexburg_weather_monthly.csv") |>mutate(dates =my(date_text)) |>filter(dates >=my("1/2008") & dates <=my("12/2023")) |>rename(x = avg_daily_high_temp) |>mutate(TIME =1:n()) |>mutate(cos1 =cos(2* pi *1* TIME/12),cos2 =cos(2* pi *2* TIME/12),cos3 =cos(2* pi *3* TIME/12),cos4 =cos(2* pi *4* TIME/12),cos5 =cos(2* pi *5* TIME/12),cos6 =cos(2* pi *6* TIME/12),sin1 =sin(2* pi *1* TIME/12),sin2 =sin(2* pi *2* TIME/12),sin3 =sin(2* pi *3* TIME/12),sin4 =sin(2* pi *4* TIME/12),sin5 =sin(2* pi *5* TIME/12),sin6 =sin(2* pi *6* TIME/12)) |>as_tsibble(index = TIME)weather_df |>as_tsibble(index = dates) |>autoplot(.vars = x) +geom_smooth(method ="lm", se =FALSE, color ="#F0E442") +labs(x ="Month",y ="Mean Daily High Temperature (Fahrenheit)",title ="Time Plot of Mean Daily Rexburg High Temperature by Month",subtitle =paste0("(", format(weather_df$dates %>%head(1), "%b %Y"), endash, format(weather_df$dates %>%tail(1), "%b %Y"), ")") ) +theme_minimal() +theme(plot.title =element_text(hjust =0.5),plot.subtitle =element_text(hjust =0.5) )
Model Selection
Standardizing the Time Variable
Standardizing the Time Variable
To avoid serious floating point errors, we standardize the time variable. First, compute the sine and cosine terms using the original time variable, then transform the time variable by subtracting its mean and dividing by its standard deviation. (In other words, compute a \(z\)-score.)
The model is adjusted accordingly after fitting.
Warning
When the independent variable (the measure of time) is large, floating point errors in the computation of the regression coefficents can be substantial.
We will demonstrate standardizing the time variable below.
Computing the standardized time variable
Our time variable was a simple incremented value counting the months ranging from 1 (representing 1/2008) to 192 (representing 12/2023).
We standardize this variable by the transformation:
Now, we fit the trend components of the models using zTIME instead of TIME. We will start by modeling a cubic trend term.
Cubic Trends
Cubic Trend: Full Model
Visually, we can identify a positive linear trend in the data. It is possible that there are higher-order properties of the trend. We will include a quadratic and cubic term in our search for a model.
In addition to modeling the trend, we need to include terms for the seasonal component. We start with a full model that includes all six of the the sine and cosine terms from the summation above.
We look for the smallest value of the AIC, AICc, and BIC criteria. These methods do not have to agree with each other, and they provide different perspectives based on their various algorithms. Notice that the AIC and AICc criteria both suggest the model we titled “Reduced Quadratic 1:”
If there are two competing models that are both satisfactory, it is usually preferable to choose the more parsimonious (or simpler) model.
Sometimes there will be a model that does not attain the lowest value of these measures, yet you may still want to use it if the AIC, AICc, and BIC values are not too much larger than the smallest values. Why might you do this? If the model is particularly interpretable or makes logical sense in the context of the physical situation, it is better than a model with a lower AIC that is not readily interpretable.
You may even choose to include terms that are not statistically significant, if you determine they are practically important. For example, if a quadratic term is significant, but the linear term is not, it is a good practice to include the linear term anyway.
Notice that the linear models corresponding to the “Reduced Quadratic 1” and “Reduced Quadratic 5” models have AIC/AICc/BIC values that are not much larger than the minimum values. Given other external evidence related to global warming, it is unlikely that the second derivative of the function representing the Earth’s mean temperature is negative. In other words, it does not seem like the rate at which the earth is warming is decreasing. Even if there is a quadratic component to the trend, it is not visible to the eye in the time plot.
For these reasons, we will apply the “Reduced Linear 5” model. This model implies a linear trend in the mean temperature of the Earth. The BIC for this model is not much bigger than the BIC for the “Reduced Quadratic 5” model. This model is simpler than the “Reduced Quadratic 1,” “Reduced Quadratic 5,” and “Reduced Linear 1” models.
If we want, we could rewrite this by expanding out the polynomial in the first three terms, but it is not necessary.
Comparison of Fitted Values
Comparison of Fitted Values
This time plot shows the original temperature data superimposed with the fitted curves based on three models.
Show the code
num_months <- weather_df |>as_tibble() |> dplyr::select(TIME) |>tail(1) |>pull()df <-tibble( TIME =seq(1, num_months, 0.01) ) |>mutate(cos1 =cos(2* pi *1* TIME/12),cos2 =cos(2* pi *2* TIME/12),cos3 =cos(2* pi *3* TIME/12),cos4 =cos(2* pi *4* TIME/12),cos5 =cos(2* pi *5* TIME/12),cos6 =cos(2* pi *6* TIME/12),sin1 =sin(2* pi *1* TIME/12),sin2 =sin(2* pi *2* TIME/12),sin3 =sin(2* pi *3* TIME/12),sin4 =sin(2* pi *4* TIME/12),sin5 =sin(2* pi *5* TIME/12),sin6 =sin(2* pi *6* TIME/12)) |>mutate(zTIME = (TIME -mean(TIME)) /sd(TIME)) |>as_tsibble(index = TIME)quad1_ts <- reduced1_quadratic_lm |>forecast(df) |>as_tibble() |> dplyr::select(TIME, .mean) |>rename(value = .mean) |>mutate(Model ="Quadratic 1")quad5_ts <- reduced5_quadratic_lm |>forecast(df) |>as_tibble() |> dplyr::select(TIME, .mean) |>rename(value = .mean) |>mutate(Model ="Quadratic 5")linear5_ts <- reduced5_linear_lm |>forecast(df) |>as_tibble() |> dplyr::select(TIME, .mean) |>rename(value = .mean) |>mutate(Model ="Linear 5")data_ts <- weather_df |>as_tibble() |>rename(value = x) |>mutate(Model ="Data") |> dplyr::select(TIME, value, Model)combined_ts <-bind_rows(data_ts, quad1_ts, quad5_ts, linear5_ts) point_ts <- combined_ts |>filter(TIME ==floor(TIME))okabe_ito_colors <-c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")combined_ts |>ggplot(aes(x = TIME, y = value, color = Model)) +geom_line() +geom_point(data = point_ts) +labs(x ="Month Number",y ="Temperature (Fahrenheit)",title ="Monthly Average of Daily High Temperatures in Rexburg",subtitle =paste0("(", format(weather_df$dates %>%head(1), "%b %Y"), endash, format(weather_df$dates %>%tail(1), "%b %Y"), ")") ) +scale_color_manual(values = okabe_ito_colors[1:nrow(combined_ts |>as_tibble() |>select(Model) |>unique())], name ="" ) +theme_minimal() +theme(plot.title =element_text(hjust =0.5),plot.subtitle =element_text(hjust =0.5),legend.position ="top", # Position the legend at the toplegend.direction ="horizontal"# Set the legend direction to horizontal )
Small-Group Activity: Fall River Flow Rate (35 min)
The Fall River is a tributary of the Henrys Fork of the Snake River northeast of Rexburg, Idaho. The United States Geological Survey (USGS) provides data every fifteen minutes on the flow rate (in cubic feet per second, cfs) of this river.
This map shows the location of the monitoring station. On the left, you can see where Highway 20 intersects with Main St. in Ashton, Idaho.
Here is a glimpse of the raw data:
agency_cd
site_no
datetime
flow
code
USGS
13046995
1998-10-01 00:00 MDT
662
A:[91]
USGS
13046995
1998-10-01 00:15 MDT
655
A:[91]
USGS
13046995
1998-10-01 00:30 MDT
655
A:[91]
USGS
13046995
1998-10-01 00:45 MDT
662
A:[91]
⋮
⋮
⋮
⋮
⋮
USGS
13046995
2023-09-30 23:00 MDT
547
A
USGS
13046995
2023-09-30 23:15 MDT
547
A
USGS
13046995
2023-09-30 23:30 MDT
547
A
USGS
13046995
2023-09-30 23:45 MDT
552
A
The 2023 water year goes from October 1, 2022 to September 30, 2023. The data starts with the 1999 water year and goes through the 2023 water year.
We will average the flow rates across the first and last half of each month, so there are 24 values in a year. For various reasons, there are gaps in the data, even after averaging across the first and last half of the month. We impute the missing values by linear interpolation using the zoo package. This just means that when we encounter NAs, we fit a line between the most recent observed value and the next observed value. Then, we fill all the NAs in with the value given by the liner relationship between the two points.
fallriver_ts <- fallriver_ts0 |># compute additional variables needed for the regressionmutate(TIME =1:n()) |>mutate(cos1 =cos(2* pi *1* TIME /24),cos2 =cos(2* pi *2* TIME /24),cos3 =cos(2* pi *3* TIME /24),cos4 =cos(2* pi *4* TIME /24),cos5 =cos(2* pi *5* TIME /24),cos6 =cos(2* pi *6* TIME /24),cos7 =cos(2* pi *7* TIME /24),cos8 =cos(2* pi *8* TIME /24),cos9 =cos(2* pi *9* TIME /24),cos10 =cos(2* pi *10* TIME /24),cos11 =cos(2* pi *11* TIME /24),cos12 =cos(2* pi *12* TIME /24),sin1 =sin(2* pi *1* TIME /24),sin2 =sin(2* pi *2* TIME /24),sin3 =sin(2* pi *3* TIME /24),sin4 =sin(2* pi *4* TIME /24),sin5 =sin(2* pi *5* TIME /24),sin6 =sin(2* pi *6* TIME /24),sin7 =sin(2* pi *7* TIME /24),sin8 =sin(2* pi *8* TIME /24),sin9 =sin(2* pi *9* TIME /24),sin10 =sin(2* pi *10* TIME /24),sin11 =sin(2* pi *11* TIME /24),# sin12 = sin(2 * pi * 12 * TIME / 24) # zero for all integer values of t ) # plot the time seriesfallriver_ts |>autoplot(.vars = flow) +labs(x ="Date",y ="Flow (cubic feet per second, cfs)",title ="Fall River Flow Rate",subtitle ="Above the Yellowstone Canal near Squirrel, Idaho" ) +theme_minimal() +theme(plot.title =element_text(hjust =0.5),plot.subtitle =element_text(hjust =0.5) )
Check Your Understanding
Use the Fall River flow data to do the following.
Explore various linear models for the flow of the Fall River over time.
Choose a model to represent the data, and justify your decision.