eesim
packageThis package allows you to simulate time series of environmental health data and perform simulation-based power analyses and other measures of model performance. The package includes four main parts:
The user has the option to customize different aspects of the simulation at each of these steps.
The package creates simulated time series data that are relevant for environmental epidemiology studies of ambient exposures (e.g., studies of acute mortality risks associated with daily air pollution concentration, daily temperature, or occurance of a community-wide extreme event like a heat wave). Simulated environmental datasets like those created by the package can be used in to assess the performance of statistical models meant to estimate the association between exposure level and outcome risk, to estimate power for a planned study, and to develop a better understanding of the data generating processes behind observed environmental datasets. Such time series are often characterized by both seasonal and long-term trends in both the exposure of interest and the outcome. For example, the following plot shows time series of daily ozone concentration (in parts per billion [ppb]) and cardiovascular deaths in Chicago, IL (1996–2000), with smoothed lines overlaid on the raw data to show patterns over time.
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The main function of this package is the eesim
function.
You can use the eesim
function to conduct all four steps of
the simulation process at once (generate exposure data, generate outcome
data, fit models to simulated data, and evaluate model performance).
The eesim
function requires inputs on:
n
: The desired number of observations per simulated
dataset (for a daily time series, this is the desired number of days in
the simulated dataset)n_reps
: The desired number of simulated datasetsexposure_type
: Whether the exposure is binary (e.g.,
occurence of an extreme event like a heat wave or wildfire) or
continuous (e.g., concentration of a pollutant)rr
: The relative rate of the outcome associated with
the exposure. For a binary exposure, this is the relative rate
associated with the exposure compared to a similar day without the
exposure. For a continuous exposure, this is the relative rate
associated with a one-unit increase in the exposure.model
: The model to be used to estimate the association
between exposure and outcome in the simulated datasets, either to
estimate power of a planned analysis or to otherwise evaluate the
performance (e.g., coverage, bias) of a model on the simulated
datasets.A number of optional inputs can also be specified, including arguments to adjust the shape of seasonal or long-term trends in the exposure or outcome data or custom arguments to use at different steps of the data generation.
The function returns a list with three elements. The first element is a list with all the simulated datasets. The second element gives simulation-specific results for each simulated dataset: the estimated effect, standard error, t- and p-values, and upper and lower 95% confidence bounds when a model was applied to each of the simulated datasets. The third element gives some measures of model assessment, assessed over all simulations, including the mean beta and relative risk estimates across simulations.
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
For example, in the observed data from Chicago, IL, shown in the plots above, daily ozone concentrations have a mean of about 20 ppb and standard deviation of about 7 ppb after removing seasonal and long-term trends. The average number of cardiovascular deaths per day is around 50. Here is the code, and a plot of the resulting data, for generating a dataset with similar characteristics for use in a power analysis or to evaluate model performance (later in the vignette, we will show how to use customization to further improve the simulation of data for this example, including avoiding negative values of ozone concentration in simulated data):
sim_chicago <- create_sims(n_reps = 1, n = 365 * 5, central = 20, sd = 7,
exposure_type = "continuous", exposure_trend = "cos1",
exposure_amp = -.6, average_outcome = 50,
outcome_trend = "cos1", outcome_amp = 0.2,
rr = 1.0005, start.date = "1996-01-01")
head(sim_chicago[[1]])
## date x outcome
## 1 1996-01-01 8.053697 56
## 2 1996-01-02 -2.333420 66
## 3 1996-01-03 4.255316 54
## 4 1996-01-04 9.810233 48
## 5 1996-01-05 14.909640 58
## 6 1996-01-06 10.322948 65
This simulated data can also be visualized using the
calendar_plot
function that comes with the package:
a <- calendar_plot(sim_chicago[[1]] %>% select(date, outcome), type = "continuous",
legend_name = "Outcome") +
ggtitle("Outcome")
## Warning: `mutate_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `mutate()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the eesim package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `group_by()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the eesim package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
b <- calendar_plot(sim_chicago[[1]] %>% select(date, x), type = "continuous") +
ggtitle("Exposure")
grid.arrange(a, b, ncol = 1)
You can use the eesim
function to generate multiple
similar simulated datasets and investigate the performance of a
specified model in estimating the association between ozone
concentration and the risk of cardiovascular death in 20 simulated
datasets. You must write a function with the code to fit the model you
desire to fit to the simulated data (more details for writing this
function are provided later in the vignette), and then you can use the
eesim
function to generate lots of simulated datasets, fit
that model, and assess its performance using a call like:
ex_sim <- eesim(n_reps = 100, n = 365 * 5, central = 20, sd = 7,
exposure_type = "continuous", exposure_trend = "cos1",
exposure_amp = -.6, average_outcome = 50,
outcome_trend = "cos1", outcome_amp = 0.2,
rr = 1.2, start.date = "1996-01-01",
custom_model = spline_mod, custom_model_args = list(df_year = 7))
The eesim
function returns a list with three
elements:
## [1] "simulated_datasets" "indiv_performance" "overall_performance"
The first element of the returned object is a list with all of the simulated datasets. For example, you can create a calendar plot of exposure in the first simulated dataset using the call:
The second element of the returned object gives the results of fitting the model to each of the simulated datasets. It can be used to explore the behavior of individual simulations:
## Estimate Std.Error t.value p.value lower_ci upper_ci
## 1 0.1822613 0.0005431742 335.5485 0 0.1811967 0.1833259
## 2 0.1818704 0.0004819506 377.3632 0 0.1809258 0.1828150
## 3 0.1813407 0.0005127487 353.6638 0 0.1803357 0.1823456
## 4 0.1824330 0.0005117458 356.4915 0 0.1814300 0.1834360
## 5 0.1835011 0.0005139293 357.0551 0 0.1824938 0.1845084
## 6 0.1819087 0.0005127441 354.7748 0 0.1809037 0.1829137
After running the simulation, you can look at the relative risk point
estimate and 95% confidence interval from each of the 100 simulations,
as well as which 95% confidence intervals include the true relative
rate, using the coverage_plot
function that comes with the
package:
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `arrange()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the eesim package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The third element of the list returned by a call to
eesim
gives the following overall summaries of model
performance across all simulations:
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Variable | Description |
---|---|
beta_hat |
Mean estimate: The mean of the estimated log relative rate over all simulations. |
rr_hat |
Mean estimated relative rate: The mean of the estimated relative rate over all simulations. |
var_across_betas |
Variance across estimates: Variance of the point estimates (estimated log relative risk) over all simulations. |
mean_beta_var |
Mean variance of estimate: The mean of the variances of the estimated effect (estimated log relative risk) across all simulations. |
percent_bias |
Relative bias: Difference between the estimated log relative risk and true log relative risk as a proportion of the true log relative risk. |
coverage |
95% confidence inverval coverage: Percent of simulations for which the 95% confidence interval estimate of log relative risk includes the true value of log relative risk. |
power |
Power: Percent of simulations for which the null hypothesis that the log relative risk equals zero is rejected based on a p-value of 0.05. |
For example, here are the overall results for the simulation fit above:
## beta_hat rr_hat var_across_betas mean_beta_var percent_bias coverage
## 1 0.1822211 0.1822211 3.011244e-07 2.780242e-07 0.05510361 0.93
## power
## 1 1
In later sections of this vignette, we will show how to customize steps in the generation of the simulated data to further improve this example simulation.
As another basic example, here is a plot of the dates of extreme heat days (defined as a day with temperature at or above the 98 percentile temperature in Chicago between 1987 and 2000) in the observed Chicago dataset (points are jittered along the y-axis to limit overlapping):
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
In this observed data, there is (unsurprisingly) a strong seasonal trend in this binary exposure of extreme heat days. The percent of days that are extreme heat days is 0% for all months expect June (about 5% of days in observed data were extreme heat days), July (about 12% of days), and August (about 2% of days). Similar exposure time series can be simulated with the call:
sim_chicago2 <- create_sims(n_reps = 1, n = 365 * 5, sd = 1,
central = c(0, 0, 0, 0, 0, 0.05, 0.12, 0.02, 0, 0, 0, 0),
exposure_type = "binary", exposure_trend = "monthly",
exposure_amp = -.6, average_outcome = 50,
outcome_trend = "cos1", outcome_amp = 0.2,
rr = 1.05, start.date = "1996-01-01")
Here is an example of the simulated exposure data:
Again, both the observed and simulated exposure data can also be
plotted using the calendar_plot
function:
a <- chicagoNMMAPS %>%
mutate(temp = temp >= quantile(temp, probs = 0.98)) %>%
tbl_df() %>%
filter(year >= 1996) %>%
select(date, temp) %>%
calendar_plot(type = "discrete", labels = c("Extreme heat day", "Other day")) +
ggtitle("Observed exposure data")
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
b <- sim_chicago2[[1]] %>%
select(date, x) %>%
calendar_plot(type = "discrete", labels = c("Extreme heat day", "Other day")) +
ggtitle("Simulated exposure data")
grid.arrange(a, b, ncol = 1)
The comparison of the observed and simulated data in this case suggests some clustering in the observed data that is not evident in the simulated data, suggesting that the probability of exposure may be higher on a day near other extreme heat days.
The eesim
function can be used to assess the performance
of a GLM in estimating relative risk of cardiovascular mortality for
extreme heat days compared to other days using:
ex_sim2 <- eesim(n_reps = 100, n = 365 * 5,
central = c(0, 0, 0, 0, 0, 0.05, 0.12, 0.02, 0, 0, 0, 0),
exposure_type = "binary", exposure_trend = "monthly",
exposure_amp = -.6, average_outcome = 50,
outcome_trend = "cos1", outcome_amp = 0.2,
rr = 1.05, start.date = "1996-01-01",
custom_model = spline_mod, custom_model_args = list(df_year = 7))
## This function may take a minute or two to run, especially if you are
## creating lots of replications (`n_reps`).
As before, a plot of CI coverage can be created with
coverage_plot
:
Here are the overall estimates in this case for model performance:
## beta_hat rr_hat var_across_betas mean_beta_var percent_bias coverage
## 1 0.04996982 0.04996982 0.0009712768 0.0009160301 -2.417805 0.92
## power
## 1 0.43
In this case, the expected power is low.
The power_calc
function in the package allows you to
extend on this simulation functionality to create power curves for an
analysis given an anticipated underlying process of data generation.
This function will create simulations for several different values of
number of days in the study (n
), average daily outcome
counts (average_outcome
), or expected association between
exposure and outcome (rr
).
For example, the following call generates a power curve that explores how expected power changes with an increasing number of days for the heat wave analysis example just presented (as a warning, this call takes a few minutes to run, since it’s simulating many datasets):
ex_power_calc <- power_calc(varying = "n", values = floor(365.25 * seq(1, 20, by = 5)),
n_reps = 100, rr = 1.05,
central = c(0, 0, 0, 0, 0, 0.05, 0.12, 0.02, 0, 0, 0, 0),
exposure_type = "binary", exposure_trend = "monthly",
exposure_amp = -.6, average_outcome = 50,
outcome_trend = "cos1", outcome_amp = 0.2,
custom_model = spline_mod, custom_model_args = list(df_year = 7),
plot = FALSE)
## This function may take a minute or two to run, especially with lots of
## replications (`n_reps`) or options for `values`.
To demonstrate how the eesim
function works, here is a
breakdown of each of the four main parts: generating exposure data,
generating outcome data, fitting models, and evaluating models. The
helper functions used for each step are described in detail in this
section.
The first task of the package is generating exposure data. This can
be done with the sim_exposure
function. In this function,
the user can specify whether he or she would like to generate exposure
data that is binary or continuous (exposure_type
). For
continuous exposure data, the user must specify the mean
(central
) and standard deviation (sd
) of the
exposure data. For example, the following call simulates a dataframe of
exposure data for an exposure that is normally distributed, with a mean
value of 50, a standard deviation of 5, and no long-term or seasonal
trends:
x_cont <- sim_exposure(n = 1000, central = 50, sd = 5, exposure_type = "continuous")
x_cont %>% slice(1:5)
## date x
## 1 2001-01-01 49.26604
## 2 2001-01-02 42.91050
## 3 2001-01-03 52.53690
## 4 2001-01-04 51.53205
## 5 2001-01-05 52.42711
You can plot a calendar plot of this simulated exposure time series
using the calendar_plot
function that comes with the
package. Within this function, the type of data (“continuous” or
“discrete”) must be specified:
You can similarly use the sim_exposure
function to
simulate a binary exposure (e.g., occurence of an extreme event). For
binary exposure data, the central
argument of
sim_exposure
must also be expressed, but in this case it
gives the probability of exposure on a study day:
## date x
## 1 2001-01-01 0
## 2 2001-01-02 0
## 3 2001-01-03 1
## 4 2001-01-04 0
## 5 2001-01-05 0
Again, the calendar_plot
function can be used to
visualize the generated time series. In the case of binary exposure
data, the labels to be used in the legend for each outcome level must
also be specified using the labels
argument:
So far, these sim_exposure
calls have been used to
simulate basic exposure data, without long-term or seasonal trends.
However, for environmental epidemiology applications, exposure data
often has a seasonal trend and / or long-term trend, and these temporal
trends can serve as confounders in assessing the association between
time-varying environmental exposures and health outcomes. The
sim_exposure
function therefore includes options to
generate exposure data with long-term and seasonal trends relevant to
environmental time series studies, through the trend
argument.
The default for sim_exposure
is to simulate the exposure
data without a time trend (trend = "no trend"
). However, we
have also built in several time trends from which a user can to choose
to simulate exposure data with a time trend, either seasonal or
long-term or both, based on trend patterns used in a simulation study of
case-crossover studies as a method of controlling for seasonal and
long-term trends in environmental epidemiology studies (Bateson and Schwartz 1999). These trend
patterns differ slightly depending on whether the user is simulating
binary or continuous data. Below are plots of the built-in trends for
continuous exposure data from which the user may choose.
You can use the amp
argument to adjust the seasonal
trend in any of the patterns with a seasonal trend. For example, here
are plots of trends using tren = "cos1linear"
for different
values of amp
:
The long-term trend in expected values can be changed in a similar
way with the exposure_trend
argument in
eesim
.
Here is an example of generating continuous exposure data with a “cos1linear” trend for an exposure with a mean value of 50 and a standard deviation of 10:
testexp <- sim_exposure(n = 365 * 3, central = 50, sd = 10, trend = "cos1linear",
exposure_type = "continuous")
a <- ggplot(testexp, aes(x = date, y = x)) +
geom_point(alpha = 0.5, size = 0.8) +
coord_cartesian(ylim = c(0,110)) +
labs(title = "Exposure with a 'cos1linear' trend", x = "Date", y="Exposure") +
theme_classic()
b <- calendar_plot(testexp, type = "continuous") +
ggtitle("Calendar plot of simulated exposure data") +
theme(legend.position = "bottom")
grid.arrange(a, b, ncol = 1)
Here is an example of changing the seasonal trend by changing the
value for amp
(the default value is 0.6) to simulate
exposure data for an exposure with a smaller seasonal trend and with
higher exposures typical in the summer than the winter:
small_amp <- sim_exposure(n = 365 * 3, central = 50, sd = 10, trend = "cos1linear",
amp = -0.3, exposure_type = "continuous")
a <- ggplot(small_amp, aes(x = date, y = x)) +
geom_point(alpha = 0.5, size = 0.8) +
coord_cartesian(ylim = c(0,110)) +
labs(title = "Exposure with a 'cos1linear' trend", x = "Date", y="Exposure") +
theme_classic()
b <- calendar_plot(small_amp, type = "continuous") +
ggtitle("Calendar plot of simulated exposure data") +
theme(legend.position = "bottom")
grid.arrange(a, b, ncol = 1)
The trend options are similar for binary exposure, but exclude
“curvilinear” and “cos1linear”. Further, binary exposures can also be
simulated using a “monthly” trend (trend = "monthly"
), in
which the probability of exposure can vary by month. When using this
“monthly” trend option, the cental
argument to
sim_exposure
should include a vector with 12 separate
probabilities (the first is for January, the second for February, etc.)
rather than a single probability. Here is an example of generating
binary exposure data with a monthly trend, starting from June 1, 2002,
with higher probability of the exposure in summer months than in winter
months:
testbin <- sim_exposure(n=1000, central = c(.05, .05, .1, .2, .4, .4, .5, .7, .5, .2, .1, .05),
trend = "monthly", exposure_type = "binary",
start.date = "2002-06-01")
a <- testbin %>%
mutate(x = factor(x, levels = c(0, 1), labels = c("Not exposed", "Exposed"))) %>%
ggplot(aes(x = date, y = x)) +
geom_jitter(alpha = 0.5, size = 0.7, fill = NA, width = 0, height = 0.1) +
theme_classic() +
labs(x = "Date", y = "Exposure")
b <- calendar_plot(testbin, type = "discrete", labels = c("Not exposed", "Exposed")) +
ggtitle("Calendar plot of simulated exposure data") +
theme(legend.position = "bottom")
grid.arrange(a, b, ncol = 1)
The sim_exposure
function works by first calculating the
expected exposure on any date in the simulated time series (figure
below, left). This expected value is a mean for a continuous exposure
and a probability for a binary exposure. The sim_exposure
function then draws random values from the appropriate distribution
(normal distribution for a continuous exposure, binomial distribution
for a binary exposure) based on this day-specific expected exposure
value and, in the case of continuous exposure, the standard deviation of
the exposure (figure below, right). For continuous exposure data, the
standard deviation specified in the call to eesim
should
measure the standard deviation of each point from its expected value
(i.e., from the expected line shown on the left below), not the overall
standard deviation of exposure values across all days in the simulated
data.
Later in this vignette, we show how you can further customize this step of generating exposure data through the use of a user-created function, allowing extensive further flexibility in simulating exposure data.
Next, the sim_outcome
function simulates outcome data.
The health data can have an underlying seasonal and / or long term trend
in its baseline value, and then that baseline is adjusted for the risk
associated with exposure, based on the generated exposure data for that
day. The baseline outcome count for a given day (Bt) are based on
a user-specified trend and user-specified average outcome per day over
the simulated time period. Further, the expected outcome count on a
given day is adjusted for exposure-related risk through a user-specified
relative rate per unit increase in exposure (RR) and the simulated
exposure for that day (Xt). The
eesim
function then uses the following equation to
calculate the expected outcome count (λ) on a given day in the simulated
time series, based on the expected baseline rate and exposure-related
risk for that day:
log (λt) = log (Bt) + log (RR) * Xt
For a binary outcome, the baseline count on a given day (Bt) is the expected outcome count for the day if there is not an event (e.g., in a heat wave study, a non-heat wave day). For a continuous exposure, the baseline count on a given day (Bt) is the expected outcome count for the day if exposure is at its mean value.
Once the expected count (λt) on each day of the simulated time series is calculated using this equation, the simulated count on each day is drawn as a random variable from a Poisson distribution with mean λt. Later we describe how customization can be used to simulate output counts in other ways.
Here is an example of generating health outcome data with an upward linear trend using exposure data with a “cos1” trend. In this case, there is a steady increase in the baseline outcome count over time, as well as a seasonal trend linked to the risk associated with the seasonally-varying exposure:
testexp2 <- sim_exposure(n = 1000, central = 100, sd = 10, trend = "cos1",
exposure_type = "continuous")
testout <- sim_outcome(exposure = testexp2, average_outcome = 22,
trend = "linear", rr = 1.01)
Here are plots of the resulting output:
As with the exposure simulation step, this step can also be extensively customized by using a user-created function. This customization will be demonstrated in a later section of the vignette.
Next, the eesim
package uses this process to generate
many simulated data sets and then to fit statistical models to these
generated datasets. This step allows tests of model performance. You
must create an R function that fits the model you’d like to fit to the
simulated dataset. This function needs to follow certain input / output
rules to work correctly in the eesim
framework. First, it
must input the simulated dataframe with the argument df
.
When writing the function, you should assume that this simulated
dataframe has at least the columns date
(in a Date format),
x
(numeric class, this gives a daily value for exposure,
with 0 for unexposed and 1 for exposed in the case of binary exposure),
and outcome
(non-negative integer, this gives the simulated
outcome count each day). Other arguments can also be passed to this
function if desired. The function should fit a desired model to the
simulated dataframe and then should return a numeric vector of length 6
with values, in this order, for the log relative risk point estimate
from the model (Estimate
), the standard error for this
point estimate (Std. Error
), the t-statistic for a
hypothesis test with the null hypothesis that this estimate is zero
(t value
), a p-value for that test
(Pr(>|t|)
), and the lower and upper 95% confidence
intervals for the point estimate (2.5 %
and
97.5 %
).
The spline_mod
function that comes with the package is
an example of such a function. In this case, the function fits a GLM to
the simulated data, with a natural cubic spline used to control for
long-term and seasonal trends in mortality. The function, in addition to
inputing the dataframe of simulated data (df
), also allows
an argument to use to set the smoothness of the time spline
(df_year
). Since the function is included in the package,
you can see its code by running the bare function name at the
console:
## function (df, df_year = 7)
## {
## dgrs_free <- df_year * as.numeric(diff(df[c(1, nrow(df)),
## "date"]))/365.4
## df$time <- scale(df$date, center = TRUE, scale = FALSE)
## mod <- stats::glm(outcome ~ x + splines::ns(time, round(dgrs_free)),
## data = df, family = stats::quasipoisson(link = "log"))
## out_1 <- summary(mod)$coef[2, ]
## out_2 <- stats::confint.default(mod)[2, ]
## out <- c(out_1, out_2)
## return(out)
## }
## <bytecode: 0x559ec235dd30>
## <environment: namespace:eesim>
Here are examples of applying this function to a simulated dataframe:
# Create simulated data
sims <- create_sims(n_reps = 10, n = 100, central = 100, sd = 10,
exposure_type="continuous", exposure_trend = "cos1",
exposure_amp = .6, average_outcome = 22,
outcome_trend = "no trend", outcome_amp = .6, rr = 1.01)
head(sims[[1]])
## date x outcome
## 1 2000-01-01 122.41840 37
## 2 2000-01-02 92.08674 24
## 3 2000-01-03 129.32211 41
## 4 2000-01-04 112.10240 32
## 5 2000-01-05 115.69492 27
## 6 2000-01-06 111.54103 21
## Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
## 0.0089715497 0.0023323847 3.8465137262 0.0002156036 0.0044001597 0.0135429397
## Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
## 0.0089715497 0.0023323847 3.8465137262 0.0002156036 0.0044001597 0.0135429397
The format_function
function can be used within the
modeling function to get the output in the correct format if running a
GLM or similar model.
Once you’ve created the function, you can input it in a call to
eesim
using the custom_model
argument. You can
pass any additional arguments (df_year
in our example)
through to the function using the custom_model_args
argument. This argument takes a list with the argument name and value
for each argument you wish to pass to the modeling function. For
example, the following call passes the spline_mod
function
shown above as the function to use for modeling the simulated data as
well as a value for its df_year
argument:
ex_sim2 <- eesim(n_reps = 100, n = 365 * 5,
central = c(0, 0, 0, 0, 0, 0.05, 0.12, 0.02, 0, 0, 0, 0),
exposure_type = "binary", exposure_trend = "monthly",
exposure_amp = -.6, average_outcome = 50,
outcome_trend = "cos1", outcome_amp = 0.2,
rr = 1.05, start.date = "1996-01-01",
custom_model = spline_mod, custom_model_args = list(df_year = 7))
The eesim
function does this by applying the modeling
function across all simulated datasets using a function called
fit_mods
. If you’d like, you can run that function
independently. The fit_mods
function outputs a data frame
with estimates of the log relative risk, p-values, and upper and lower
95% confidence bounds for each simulated data set.
Here is an example of fitting the spline model coded in the
spline_mod
function, with 7 degrees of freedom per year
used to model long-term and seasonal trends (df_year = 7
passed in a list to the model with the custom_model_args
argument):
fits <- fit_mods(data = sims, custom_model = spline_mod,
custom_model_args = list(df_year = 7))
fits
## Estimate Std.Error t.value p.value lower_ci upper_ci
## 1 0.008971550 0.002332385 3.846514 2.156036e-04 0.004400160 0.01354294
## 2 0.010804897 0.002454865 4.401422 2.789300e-05 0.005993450 0.01561634
## 3 0.010095605 0.002177084 4.637215 1.114356e-05 0.005828600 0.01436261
## 4 0.009087024 0.002045036 4.443455 2.373221e-05 0.005078827 0.01309522
## 5 0.009321388 0.002567388 3.630690 4.557402e-04 0.004289400 0.01435338
## 6 0.008666209 0.002175816 3.982969 1.324521e-04 0.004401688 0.01293073
## 7 0.007396508 0.002458316 3.008770 3.349955e-03 0.002578297 0.01221472
## 8 0.011146030 0.002463122 4.525164 1.729256e-05 0.006318400 0.01597366
## 9 0.008119633 0.001935722 4.194629 6.094218e-05 0.004325689 0.01191358
## 10 0.018574250 0.002534111 7.329690 7.238071e-11 0.013607483 0.02354102
As a note, the output of the fit_mods
function is the
output given as the second element of the list returned by a call to
eesim
.
The final step of the eesim
function is to evaluate
model performance across all simulations with several different
measures. Within the eesim
function, the
check_sims
function takes as inputs the true relative risk
as well as the results from fitting the modeling function to all the
simulations using the fim_mods
function. It returns values
for the mean effect estimate (log relative risk) and relative risk
estimates across all simulated data sets, variance of the estimates of
beta, the mean of the variances of each estimated log relative risk, the
relative bias of the mean of the log relative risks, the percent
coverage confidence intervals of the true log relative risk, and the
power of the test at the 5% significance level (see the table near the
beginning of the vignette).
Here is an example of the use of the check_sims
function:
## beta_hat rr_hat var_across_betas mean_beta_var percent_bias coverage
## 1 0.01021831 0.01021831 9.939583e-06 5.399292e-06 -2.693161 0.9
## power
## 1 1
In a run of eesim
, this output is given in the third
element of the returned list.
Internally, the functions used for this model assessment are:
beta_bias
beta_var
coverage_beta
mean_beta
power_beta
A few more details about how some of these assessments are measured are given below.
Two values are measured by the beta_var
function. First,
the variance across all estimates of log relative risk is measured
across all the simulations, using the equation:
$$ \text{variance of estimates} = E\left[\left(\hat{\beta_i} - \frac{1}{n}\sum_{i = 1}^n{\hat{\beta_i}}\right)\right] $$
where $\hat{\beta_i}$ is the estimated log relative risk for a single simulation out of n total simulations and E represents the expected value.
Second, the function measures the mean value of the variance estimated for β̂ for each simulation:
$$ \text{mean of estimate variances} = \frac{1}{n}\sum_{i = 1}^{n}{var(\hat{\beta_i})} $$
where $var(\hat{\beta_i})$ is the estimated variance of the estimated log relative risk for a single simulation out of n total simulations.
Here is the equation used by the beta_bias
function to
estimate relative bias in estimates from the simulated data:
$$ \text{relative bias} = 100*\frac{\beta - \frac{1}{n}\sum_{i = 1}^{n}{\hat{\beta}}}{\beta} $$ where β is the true log relative risk used to simulate the data and β̂ is the estimated log relative risk from simulation i (out of a total of n simulations).
The other main functionality of the eesim
package is to
run through simulations under varying data generating scenarios to
estimate expected power of an analysis under different scenarios. For
example, you can explore how expected power varies for different
expected effect sizes (relative risk of the outcome associated with a
change in exposure) or for different average daily number of outcomes.
This is run using the power_calc
function in the
package.
The power_calc
function allows you to put in varying
values for one of the following three specifications in the
simulations:
rr
): How you strongly expect the
exposure and outcome to be associatedn
): How long you expect
the study to last (or how many days of historical data you expect to be
able to collect)average_outcome
): For
the outcome of interest, how common it is on average on days in the
study (this will usually be strongly associated with the size of the
population being studied)For whichever of these you choose to vary, you can specify different
values to test. The power_calc
function then loops through
those values and runs eesim
for each of them. From this, it
can estimate the power for each value of the varying parameter.
Here is an example of running a power calculation with varying number
of days in the study (n
). The values
argument
is used to specify different values of n
we would like to
test (here, it’s testing power for studies with daily data for between 1
and 21 years):
pow <- power_calc(varying = "n", values = floor(365.25 * seq(1, 21, by = 5)), n_reps = 20,
central = 100, sd = 10, rr = 1.001, exposure_type = "continuous",
exposure_trend = "cos1", exposure_amp = .6, average_outcome = 22,
outcome_trend = "no trend", outcome_amp = .6,
custom_model = spline_mod, plot = TRUE)
## This function may take a minute or two to run, especially with lots of
## replications (`n_reps`) or options for `values`.
## Warning: Use of `dat$values` is discouraged.
## ℹ Use `values` instead.
## Warning: Use of `dat$power` is discouraged.
## ℹ Use `power` instead.
This call returns a dataframe with the estimated power for each of
the values of n
tested:
## values power
## 1 365 0.20
## 2 2191 0.60
## 3 4017 0.90
## 4 5844 0.95
## 5 7670 1.00
Because the argument plot
is set to TRUE
,
it also generates a power curve plot as a side effect, as shown
above.
Here is another example, but this time we assume that the study will have 4,000 days of daily data, but we explore estimated power as the average daily outcome count varies:
pow2 <- power_calc(varying = "average_outcome", values = c(1, 5, 10, 20, 30, 40),
n_reps = 20,
central = 100, sd = 10, rr = 1.001, exposure_type = "continuous",
exposure_trend = "cos1", exposure_amp = .6, n = 4000,
outcome_trend = "no trend", outcome_amp = .6,
custom_model = spline_mod, plot = TRUE)
## This function may take a minute or two to run, especially with lots of
## replications (`n_reps`) or options for `values`.
## Warning: Use of `dat$values` is discouraged.
## ℹ Use `values` instead.
## Warning: Use of `dat$power` is discouraged.
## ℹ Use `power` instead.
## values power
## 1 1 0.05
## 2 5 0.25
## 3 10 0.55
## 4 20 0.70
## 5 30 0.85
## 6 40 0.95
Because these power curves and calculations are based on simulated
data, there will be some randomness to results. Curves will be smoother
as more simulations are used for each run (n_reps
),
although this will also increase the time needed to run the
simulation.
An important feature of eesim
is that the user can
create and use custom functions for any part of the simulation process.
For example, the user may wish to generate exposure data with a custom
trend, then automate the process of generating outcomes, fitting models,
and evaluating performance using the built-in features of eesim.
Functions the user has the option to customize within the
eesim
framework are:
To use custom functions within eesim
, the user must
input the name of the custom function as well as a list of all arguments
for the custom function and their values (examples shown below). This
allows the user to pass the function and required arguments directly
within a call to the main eesim
function. When a custom
function is used, many inputs that are otherwise required for the
eesim
function may no longer be necessary, in which case
they can simply be left out of the eesim
call. As a note,
if extensive customize is required for several steps of the simulation
process, it may make more sense to code the full simulation by hand
rather than using the eesim
framework.
To take advantage of any of the customization options, you need to write a function that follows certain input and output (i.e., interface) rules. First, you can use a custom function for the underlying trend in expected exposure. This function must take the inputs:
n
(the number of days to simulate)mean
for a continuous exposure (the average
value of the outcome) or prob
for a binary exposure (the
average probability of exposure)The function can take any other additional inputs, as well, but any
such extra arguments (as well as mean
) will need to be
input to the eesim
function in a list for the
cust_expdraw_args
argument (example below). The value for
n
will pass through directly from the n
value
specified for the call to eesim
. The function must output a
numeric vector that gives the simulated exposure values for each day in
the simulated data.
For example, the following function creates a custom exposure trend
with a long-term and seasonal trend, similar to trends available through
the default package options. However, this function specifies a minimum
value that the exposure trend cannot fall below– if the
base
exposure value is every set below this
minimum
within the algorithm, the value is reset to the
minimum
before the final values are output. This function
can be useful in cases where the exposure cannot fall below a certain
value (for example, a pollution concentration could not be lower than
0). This custom exposure function can also be used to customize how
values are simulated from the expected exposure on each day (based on
the expected distribution of the exposure). In the case of the example
ozone concentration data from Chicago shown earlier in this vignette, we
may want to simulate exposure based on the assumption that the square
root of exposure is normally distributed, which will prevent negative
values and may also help to simulate occasional very high values.
above_min_trend <- function(n, mean, sd_of_sqrt, minimum = 0){
day <- c(1:n)
## Calculate a baseline exposure for each day
base <- mean + -10 * cos(2 * pi * (day / 365))
base[base < minimum] <- minimum # Reset any values below 0 to 0
## Simulate exposure values from the baseline
sqrt_base <- sqrt(base) # Transform to square root
sqrt_sim <- rnorm(n, mean = sqrt_base, sd = sd_of_sqrt)
sqrt_sim ^ 2 # Transform back
}
Here is an example of running this custom exposure simulation function over 5 years, with a smooth line added to the plot to help show the seasonal trend included:
above_min_trend(n = 365.25 * 5, mean = 20, minimum = 0, sd_of_sqrt = 0.9) %>%
tbl_df() %>%
mutate(day = 1:n()) %>%
ggplot(aes(x = day, y = value)) +
geom_point(alpha = 0.5, size = 0.8) +
theme_classic() +
geom_smooth(se = FALSE, span = 0.1, method = "loess", color = "red")
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
You can then pass this custom function into the eesim
function using the cust_exp_func
argument. The value for
n
input to the custom function will be the value you input
to eesim
for n
. For any other arguments you
want to pass to the function (in the function we just created, you’ll
want to pass values for mean
, minimum
, and
sd_of_sqrt
), you can include specifications for these as a
list for the cust_exp_args
argument of eesim
.
For example, the following call would run a simulation using this custom
function for exposure:
There are three ways to customize the simulated outcome data: creating a custom baseline for outcome values, customizing the relationship between outcome and exposure, and, as with the exposure values, customizing the randomization of the outcome values.
The outcome baseline (Bt) is comprised
of the values the user expects the outcomes to have on each day of the
simulated dataset without risk associated with the exposure factored in.
The user may write a function to specify the trend of the baseline, then
use it as an input in sim_outcome
or eesim
.
Here is an example of creating a custom baseline function and using it
in the eesim
function:
custombase <- function(n, slope, intercept){
day <- c(1:n)
baseline <- day * slope + intercept
return(baseline)
}
#Example:
custombase(n=5, slope = .3, intercept = 55)
## [1] 55.3 55.6 55.9 56.2 56.5
ex_sim3 <- eesim(n_reps = 3, n = 1000, central = 100, sd = 10,
exposure_type = "continuous", exposure_trend = "cos1",
exposure_amp = .6, average_outcome = 22, rr = 1.01,
cust_base_func = custombase,
cust_base_args = list(n=1000, slope = 5, intercept = 12),
custom_model = spline_mod, custom_model_args = list(df_year = 2))
## This function may take a minute or two to run, especially if you are
## creating lots of replications (`n_reps`).
ggplot(ex_sim3$simulated_datasets[[1]], aes(x=date, y=outcome))+ geom_point() + geom_point(alpha = 0.5, size = 0.8) +
theme_classic() +
geom_smooth(se = FALSE, span = 0.1, method = "loess", color = "red")
## `geom_smooth()` using formula = 'y ~ x'
The second way of customizing the outcome simulation is to use a
custom function to incorporate the added risk from the exposure when
calculating the expected daily outcome count for a day, λt, from the
inputs of exposure (Xt) and outcome
baseline (Bt) for the day.
Here is an example of creating a custom lambda, meaning a custom
function relating relative risk and exposure to outcomes, and using it
in eesim with the custom baseline function created above. The custom
lambda function must input arguments exposure
,
rr
, and baseline
and output a vector of lambda
values.
customlambda <- function(exposure, rr, constant, baseline){
log_lambda <- log(baseline) + log(rr) * exposure + constant
lambda <- exp(log_lambda)
return(lambda)
}
ex_sim4 <- eesim(n_reps = 3, n = 1000, central = 100, sd = 10,
exposure_type = "continuous", exposure_trend = "cos1",
exposure_amp = .6, average_outcome = 22, rr = 1.01,
cust_base_func = custombase,
cust_base_args = list(n=1000, slope = .5, intercept = 12),
cust_lambda_func = customlambda, cust_lambda_args = list(constant=10),
custom_model = spline_mod, custom_model_args = list(df_year = 2))
## This function may take a minute or two to run, especially if you are
## creating lots of replications (`n_reps`).
The third way to customize the outcome simulation is to customize the
randomization of the outcome values from the trend created by relating
the baseline outcomes and the exposure (what we have called lambda).
When the cust_outdraw
argument is not specified in the
eesim
function, the function draws outcome values from a
Poisson distribution with mean lambda. A custom function for outcome
draws must input values called n
and lambda
,
and any other arguments must be included in the
cust_outdraw_args
argument. Here is an example of using the
custom functions to specify a negative binomial distribution for outcome
randomization:
custnbinom <- function(n, lambda, prob){
out <- rnbinom(n=n, size=lambda, prob=prob)
return(out)
}
ex_sim5 <- eesim(n_reps = 3, n = 1000, central = 100, sd = 10,
exposure_type = "continuous", exposure_trend = "cos1",
exposure_amp = .6, average_outcome = 22, rr = 1.01,
cust_base_func = custombase,
cust_base_args = list(n=1000, slope = .5, intercept = 12),
cust_lambda_func = customlambda, cust_lambda_args = list(constant=10),
cust_outdraw = custnbinom, cust_outdraw_args = list(prob=.3),
custom_model = spline_mod, custom_model_args = list(df_year = 2))
## This function may take a minute or two to run, especially if you are
## creating lots of replications (`n_reps`).