This project outline and background information have been provided to assist you as you complete your project. You should assume the reader of your work has no knowledge or access to this information.
How much explosive material remains in the soil? The military is concerned about the amount of explosives that remain in the soils of training ranges because soldiers are out on the training range and explosive residues are carcinogens (they cause cancer). Collecting a representative sample from the training ranges is difficult, and researchers continue to develop, refine, and validate sampling methods. Source
Our work in this project relies on assumptions and on the use of probability distributions. We will (1) fit probability distributions to data, (2) create simulated samples from those fitted probability distributions, and (3) use the fitted probability distributions to provide probability information about explosive residue in the surface soil of a training range.
The Environmental Protection Agency (EPA) specifies regulatory levels of Nitroglycerin concentrations for human health. For this project, we will use 10 mg/kg¹ as the toxicity threshold for surface soil of this training range. Source
There are 4 probability models we’ll be exploring in this project. They are the uniform distribution, the normal distribution, the gamma distribution, and the exponential distribution.
The probability density function for the uniform distribution is
\[ f_0(x) = \frac{1}{b-a}, \quad a \le x \le b. \]
We have seen this probability model before in our Example Project.
The probability is the same (uniform) for all measurements between \(a\) and \(b\). The command ?dunif or ?runif in your R console will access the R documentation for the uniform distribution.
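As a small illustration of the "same for all measurements" idea, the sketch below evaluates the uniform density with dunif() and draws a simulated sample with runif(). The endpoints a = 2 and b = 6 are hypothetical values chosen only for this example.

```r
# Hypothetical endpoints, chosen only for illustration
a <- 2
b <- 6

# The density is constant at 1 / (b - a) inside [a, b] and 0 outside
inside  <- dunif(c(2.5, 4, 5.9), min = a, max = b)
outside <- dunif(c(1, 7), min = a, max = b)

# Simulated measurements always land between a and b
set.seed(1)
draws <- runif(1000, min = a, max = b)
range(draws)
```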
The probability density function for the normal distribution is
\[ f_1(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \]
The probability that measurements are near \(\mu\) is larger than the probability that measurements will be far from \(\mu\). This distribution is symmetric and defined for all values of \(x\) (both positive and negative values are possible). We have seen this probability model before in our Example Project and in Project 2. The command ?dnorm or ?rnorm in your R console will access the R documentation for the normal distribution.
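The sketch below compares the standard normal density formula, written in terms of the mean \(\mu\) and variance \(\sigma^2\), with R's dnorm(). Note that dnorm() takes the standard deviation (sd), not the variance. The values mu = 3 and sigma2 = 4 are hypothetical.

```r
# Hypothetical parameter values, chosen only for illustration
mu     <- 3
sigma2 <- 4   # variance; dnorm() expects sd = sqrt(variance)

x <- c(-1, 3, 7)   # -1 and 7 are equally far from mu

# Density written out from the formula
by_formula <- 1 / sqrt(2 * pi * sigma2) * exp(-(x - mu)^2 / (2 * sigma2))

# Density from R's built-in function
by_dnorm <- dnorm(x, mean = mu, sd = sqrt(sigma2))

# The density is largest at mu and symmetric about mu
by_dnorm
```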
The probability density function for the gamma distribution is
\[ f_2(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \, x^{\alpha-1} e^{-\beta x}, \quad x > 0. \]
The function \(\Gamma(x)\) is called the gamma function (NOT the gamma distribution). The gamma distribution and the gamma function are different functions. One way to think about the gamma function is as a generalization of the factorial to noninteger values. We have seen this probability model before in our Example Project. The command ?dgamma or ?rgamma in your R console will access the R documentation for the gamma distribution. The command ?gamma in your R console will access the R documentation pages about the gamma function.
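To make the distinction concrete, the sketch below first uses gamma() (the gamma function) as a generalized factorial, and then uses it inside the shape/rate form of the gamma density, which is compared with dgamma() (the gamma distribution). The values alpha = 2 and beta = 0.5 are hypothetical.

```r
# The gamma FUNCTION generalizes the factorial: gamma(n) equals factorial(n - 1)
g_integer    <- gamma(5)      # same value as factorial(4)
g_noninteger <- gamma(4.5)    # defined even though 3.5! is not a usual factorial

# The gamma DISTRIBUTION uses the gamma function in its normalizing constant.
# Hypothetical parameter values, chosen only for illustration.
alpha <- 2
beta  <- 0.5
x <- c(0.5, 2, 10)

by_formula <- beta^alpha / gamma(alpha) * x^(alpha - 1) * exp(-beta * x)
by_dgamma  <- dgamma(x, shape = alpha, rate = beta)
```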
The probability density function for the exponential distribution is
\[ f_3(x) = \lambda e^{-\lambda x}, \quad x \ge 0. \]
The exponential distribution is often thought of as a “waiting time” model. The command ?dexp or ?rexp in your R console will access the R documentation for the exponential distribution.
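The sketch below checks R's dexp() against the usual exponential density \(\lambda e^{-\lambda x}\) and computes one "waiting longer than t" probability with pexp(). The rate lambda = 0.5 and the time t = 2 are hypothetical values.

```r
# Hypothetical rate, chosen only for illustration
lambda <- 0.5
x <- c(0, 1, 4)

# Density from the formula and from R's built-in function
by_formula <- lambda * exp(-lambda * x)
by_dexp    <- dexp(x, rate = lambda)

# Probability of "waiting" longer than t = 2, two equivalent ways
t <- 2
tail_pexp    <- pexp(t, rate = lambda, lower.tail = FALSE)
tail_formula <- exp(-lambda * t)
```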
Create an R Markdown file.
Use the code below to read in the soil data.
# run this line once in the console to get the data4soils package
#devtools::install_github("byuidatascience/data4soils")
library(data4soils)
Ng <- cfbp_fpjuliet$ng
This code creates a vector called “Ng”. The vector contains the Nitroglycerin (Ng) concentrations, in mg/kg, measured in 100 soil samples.
Familiarize yourself with this Nitroglycerin data.
Type ?mean in your Console to access the R documentation for the mean() command.
Type ?var in your Console to access the R documentation for the var() command.
Familiarize yourself with the 4 probability models given before Task 1.
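A minimal sketch of mean() and var() is shown below. It uses a short made-up vector, ng_demo, in place of the real Ng data so that it runs without the data4soils package. Replace ng_demo with Ng in your own work.

```r
# Made-up stand-in values (NOT the real soil data), in mg/kg
ng_demo <- c(0.8, 1.5, 2.1, 3.4, 7.2)

n <- length(ng_demo)
m <- mean(ng_demo)   # arithmetic average
v <- var(ng_demo)    # sample variance (divides by n - 1)

# The same variance computed by hand, to show what var() does
v_by_hand <- sum((ng_demo - m)^2) / (n - 1)

summary(ng_demo)     # quick five-number summary plus the mean
```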
We previously explored models \(f_0\), \(f_1\) (now \(h=\mu\) and \(a=\sigma^2\)), and \(f_2\) (now \(h=0\), \(a = \alpha\), and \(b=\beta\)). Read the narrative for Example Project Task 2 to remind yourself what we noticed about how changing the parameters in these models changes the behavior of the functions \(f_0\), \(f_1\), and \(f_2\). In this project, we are exploring explosives in soil rather than brightness of light bulbs. Use the Desmos files below for \(f_0\), \(f_1\), and \(f_2\) in any way they are helpful as you complete this project.
The function \(f_3\) is new. Use the Desmos file below to dynamically explore how changing \(\lambda\) changes the behavior of \(f_3\).
As part of your analysis, create (in R) plots of at least 2 representative curves illustrating what you learned in your parameter exploration of \(f_3\). (You do not need to include plots of your parameter exploration for the other 3 models.) In your narrative, summarize your observations about the parameter \(\lambda\) in terms of transformations of functions (shifts, reflections, stretches) and the mathematical behavior of the functions (increasing, decreasing, constant, positive, negative, nonnegative).
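One possible way to draw two representative curves in base R is sketched below. The helper f3() and the rates 1 and 0.25 are illustrative choices, not required values. Pick the rates that best show what you observed.

```r
# Exponential density written as a function of x and lambda (illustrative helper)
f3 <- function(x, lambda) lambda * exp(-lambda * x)

x <- seq(0, 10, by = 0.05)
y_rate_1   <- f3(x, lambda = 1)
y_rate_025 <- f3(x, lambda = 0.25)

# Two curves on the same axes, distinguished by line type
plot(x, y_rate_1, type = "l",
     xlab = "x", ylab = "f3(x)",
     main = "Exponential density for two values of lambda")
lines(x, y_rate_025, lty = 2)
legend("topright", legend = c("lambda = 1", "lambda = 0.25"), lty = c(1, 2))
```

Both curves start at height \(\lambda\) when \(x = 0\) and decrease from there, which is one observation the narrative could build on.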
Visually fit \(f_2\) and \(f_3\) to the density histogram of the 100 Nitroglycerin measurements. Your plot should include a histogram (the data) and a curve (the model).
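A sketch of a density histogram with model curves overlaid is shown below. It assumes a simulated stand-in vector, ng_demo, instead of the real Ng data, and every parameter value is hypothetical. With the real data you would adjust the parameters until each curve follows the histogram.

```r
# Simulated stand-in data (NOT the real soil data)
set.seed(42)
ng_demo <- rgamma(100, shape = 2, rate = 0.5)

# freq = FALSE puts the histogram on the density scale (total area 1),
# which is the scale a probability density curve lives on
h <- hist(ng_demo, freq = FALSE, breaks = 20,
          main = "Density histogram with candidate models",
          xlab = "Nitroglycerin (mg/kg)")

# Overlay candidate curves; these parameter values are only examples
curve(dgamma(x, shape = 2, rate = 0.5), add = TRUE, lwd = 2)
curve(dexp(x, rate = 0.25), add = TRUE, lty = 2)
legend("topright", legend = c("gamma model", "exponential model"),
       lwd = c(2, 1), lty = c(1, 2))
```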
Use the parameter values of your visually fitted \(f_2\) model and the rgamma() command to simulate a sample of 25000 random measurements. Then use that sample to approximate how many measurements out of 25000 will have more than 10 mg/kg of explosive.
Use the set.seed() command in R to set the seed so your simulated samples and probability calculations are reproducible. Use your assigned seed.
#set.seed(2021)
#tmp2 <- rgamma(25000, shape = alpha, rate = beta)
#length(which(tmp2 > 10))
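The sketch below runs the gamma simulation end to end and compares the simulated proportion with the exact tail probability from pgamma(). The values alpha = 2 and beta = 0.5 and the seed 2021 are placeholders. Use your own fitted parameters and your assigned seed.

```r
# Placeholder parameter values (NOT fitted to the real data)
alpha <- 2
beta  <- 0.5

set.seed(2021)   # placeholder seed
sim_gamma <- rgamma(25000, shape = alpha, rate = beta)

# Count and proportion of simulated measurements above the 10 mg/kg threshold
n_over    <- sum(sim_gamma > 10)   # same count as length(which(sim_gamma > 10))
prop_over <- n_over / length(sim_gamma)

# Exact probability under the same model, for comparison
exact_over <- pgamma(10, shape = alpha, rate = beta, lower.tail = FALSE)

c(simulated = prop_over, exact = exact_over)
```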
Use the parameter values of your visually fitted \(f_3\) model and the rexp() command to simulate a sample of 25000 random measurements. Then use that sample to approximate how many measurements out of 25000 will have more than 10 mg/kg of explosive.
Use the set.seed() command in R to set the seed so your simulated samples and probability calculations are reproducible. Use your assigned seed.
#set.seed(2021)
#tmp3 <- rexp(25000, rate = lambda)
#length(which(tmp3 > 10))
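The sketch below illustrates why setting the seed matters: two runs with the same seed give identical samples, so the count above 10 mg/kg is reproducible. The rate lambda = 0.3 and the seed 2021 are placeholders, not fitted or assigned values.

```r
# Placeholder rate (NOT fitted to the real data)
lambda <- 0.3

# Same seed, same sample
set.seed(2021)
first_run <- rexp(25000, rate = lambda)
set.seed(2021)
second_run <- rexp(25000, rate = lambda)
same_sample <- identical(first_run, second_run)

# Simulated count above the threshold, and the count the model predicts:
# P(X > 10) = exp(-lambda * 10) for an exponential model
count_over    <- sum(first_run > 10)
expected_over <- 25000 * exp(-lambda * 10)

c(simulated = count_over, expected = expected_over)
```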
CHECK YOUR WORK:
Organize your work into a cohesive analysis and submit the html file on Canvas.
Create a new R Markdown file.
Answer the question, “What is the amount of explosives in the soil?”
CHECK YOUR WORK:
While it is possible to fit all four probability models to this data, explain why \(f_0\) and \(f_1\) should not be used as models for the amount of Nitroglycerin in the soil in this situation. How are these models inconsistent with the information we see in the density histogram of the Nitroglycerin data?
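One property that may be worth checking, among others, is how much probability each model places on negative concentrations, which cannot occur. The sketch below does this for hypothetical parameter values. It is only one possible line of reasoning, not the complete answer.

```r
# Hypothetical parameter values, chosen only for illustration
mu    <- 3
sigma <- 3

# Probability of a NEGATIVE concentration under each model
p_neg_normal <- pnorm(0, mean = mu, sd = sigma)
p_neg_gamma  <- pgamma(0, shape = 2, rate = 0.5)
p_neg_exp    <- pexp(0, rate = 0.25)

c(normal = p_neg_normal, gamma = p_neg_gamma, exponential = p_neg_exp)
```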
Describe in 4-6 sentences how the information (or answer) you get from the data depends on the general model you assume. Use results from your calculations above to illustrate this idea. Why is this an important concept to understand when working with models and data?
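To illustrate how the answer can depend on the assumed model, the sketch below compares the probability of exceeding 10 mg/kg under a gamma model and an exponential model that share the same mean. All parameter values are hypothetical. Your fitted values will give different numbers.

```r
# Two hypothetical models with the same mean of 4 mg/kg
alpha  <- 2      # gamma shape
beta   <- 0.5    # gamma rate, so the mean is alpha / beta = 4
lambda <- 0.25   # exponential rate, so the mean is 1 / lambda = 4

mean_gamma <- alpha / beta
mean_exp   <- 1 / lambda

# Probability that a measurement exceeds the 10 mg/kg threshold
p_over_gamma <- pgamma(10, shape = alpha, rate = beta, lower.tail = FALSE)
p_over_exp   <- pexp(10, rate = lambda, lower.tail = FALSE)

round(c(gamma = p_over_gamma, exponential = p_over_exp), 3)
```

With these example values the exponential model gives roughly twice the exceedance probability of the gamma model (about 0.08 versus 0.04), even though both models have the same mean.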
Organize your work into a cohesive analysis and submit the html file on Canvas. Your narrative should stand alone apart from the “project instructions” (meaning your reader should not need the instructions for the project to understand what you are doing or explaining) and separate from the individual Tasks (meaning you should not assume your reader has read any of your previous narratives). It is your job in the narrative to lead your reader from the background and question to the given data and the 4 general models, then to fitting those models and answering a question about the data using the fitted models.
Reflect on your work for this project. At the bottom of your report include the following in a brief (1-2 paragraph) reflection.
¹ This number is a simplified story for illustrative purposes only.