Chetan Prajapati, Founder & Statistician at ISCON
This article explains the bootstrap concept as a whole and discerns the fundamental difference between the parametric and nonparametric bootstrap using R. We will use both for loops and the standard boot package.
Fundamentals of statistical inference
We first need to understand a few fundamental concepts, since the idea behind bootstrapping is closely aligned with them.
Here I have created a hypothetical study in which we are interested in finding out the average birth weight of the babies born in the UK at 37 weeks of gestation.
Experiment: To find out the average weight of babies born at 37 weeks in the UK
- Random sample: a random sample means each and every individual or object in the population has an equal probability of being chosen. In our study, the population is all births that happened in the UK at 37 weeks of gestation. For a true "random sample", we would need to make sure that every birth happening in any corner of the UK could participate in our study, and then choose some of them at random, i.e. a random sample.
- Sample data: the recorded observations or quantities from the sample of the population. Suppose we have gone to one local hospital and obtained the birth weights of 20 babies. Here is the (hypothetical) sample data.
[1] 3331.608 3549.913 2809.252 2671.465 3383.177 3945.721 3672.601 3922.416
[9] 4647.278 4088.246 3718.874 3724.443 3925.457 3112.920 3495.621 3651.779
[17] 3240.194 3867.347 3431.015 4163.725
- Sample statistic: the quantity of interest computed from the sample data. The quantity of interest in our study is the mean (average), so the mean is our sample statistic. The sample statistic from our sample data is 3618:
[1] 3617.653
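All of the code below works with these 20 observations stored in a numeric vector called df; a minimal setup sketch using the values printed above:
# store the 20 observed birth weights in the vector df used throughout
df <- c(3331.608, 3549.913, 2809.252, 2671.465, 3383.177, 3945.721,
        3672.601, 3922.416, 4647.278, 4088.246, 3718.874, 3724.443,
        3925.457, 3112.920, 3495.621, 3651.779, 3240.194, 3867.347,
        3431.015, 4163.725)
mean(df) # the sample statistic, 3617.653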
- Sampling distribution (of a statistic): now suppose we take a random sample of 20 babies at numerous hospitals (i.e. multiple times) across the UK. Each time, we calculate the mean birth weight and note down its value. This value will differ from hospital to hospital (in a statistical sense, the statistic is a random variable; it varies because variation is inherent). Here we went to 100 hospitals, took a sample of 20 babies born at each hospital, calculated the mean birth weight, and plotted the means in a histogram.
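The histograms below are drawn from a vector statistic holding the 100 hospital means. Since the population here is hypothetical, a minimal simulation sketch, assuming (purely for illustration) that birth weights are normal with mean 3600 and SD 460:
# simulate 100 hospitals, each contributing the mean of 20 birth weights;
# the normal(3600, 460) population is an assumption for illustration only
statistic <- replicate(100, mean(rnorm(20, mean = 3600, sd = 460)))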
ggplot(data.frame(statistic), aes(statistic)) + geom_histogram() + labs(x= "Mean", y= "Frequency") + theme_ISCON()
binw <- 2*(IQR(statistic)/length(statistic)^(1/3))
ggplot(data.frame(statistic), aes(statistic)) + geom_histogram(aes(y=..density..),binwidth = binw,fill="#cf0300",colour="white") + geom_density(aes(y=..density..),size=0.5) + labs(x= "Mean", y= "Density") + theme_ISCON()
It is better to plot the histogram with an optimal bin width (see the article from David Freedman) along with a density curve using aes(y = ..density..), so that the area under the density curve equals 1.
- Population parameter: what would the mean birth weight of babies born in the UK be if we had gone on collecting the birth weight of each and every baby born in the UK? This single value is what we are fundamentally interested in, and it is called the population parameter. The problem is that we want to estimate it from a limited number of observations (i.e. the sample data). The sample statistic helps us estimate the population parameter; it is also called a point estimator.
- Standard error (of a statistic): the standard deviation (SD) of the statistic over its sampling distribution. Note that the SD is a statistic that measures the variability in data around its mean.
- Confidence interval: although we are interested in the population parameter, i.e. a single value, our sample data may not pin it down exactly, so we need some measure of the uncertainty. This is achieved by building a confidence interval: we can say that the true value of the population parameter lies in some interval with some degree of confidence.
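One common construction, used in the for-loop code later in this article, is the normal-approximation interval \(\hat{\theta} \pm 1.96 \times \widehat{SE}(\hat{\theta})\), where 1.96 is the 0.975 quantile of the standard normal distribution, giving roughly 95% coverage.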
So, in essence, we want to estimate a population parameter from the sampling distribution of the sample statistic. Generating that sampling distribution exactly would require repeating the random sampling indefinitely.
Bootstrapping
The bootstrap method is a type of re-sampling method in which the sample data (our 20 birth weights) are treated as the "population". From this sample data, we re-sample with replacement a large number of times (in the 1000s). The resulting sampling distribution often (though not always) approximates the true sampling distribution of the statistic.
Please note that this sampling is done with replacement, so a single observation may be included many times (or left out completely). The sample statistic will therefore vary from one random sample of size n to the next.
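To see concretely what sampling with replacement looks like, here is a tiny sketch (the seed is arbitrary and only makes the illustration reproducible):
set.seed(1) # fix the RNG for a reproducible illustration
sample(1:10, size = 10, replace = TRUE) # some values appear more than once, others not at all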
From this bootstrap sampling distribution, we can estimate the parameter value and the standard error (the standard deviation of the sample statistic), and then calculate a confidence interval.
Let,
- \(\theta\) is the population parameter of interest, which belongs to the unknown population distribution \(F\)
- \(\hat{\theta}\) is the statistic that estimates \(\theta\); we are interested in the sampling distribution of \(\hat{\theta}\) under the fitted distribution function \(\hat{F}\)
- \(\hat{\theta}^*_1, \hat{\theta}^*_2, \hat{\theta}^*_3, \ldots\) are the bootstrap estimates (statistics) used to obtain the sampling distribution \(\hat{F}^*\) of \(\hat{\theta}^*\). This sampling distribution is a good approximation of \(\hat{F}\)
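From the \(B\) bootstrap replicates, the bootstrap standard error is simply the sample standard deviation of the replicates, which is exactly what sd(theta_star) computes in the code below:
\[ \widehat{SE}_{boot} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^*_b - \bar{\theta}^*\right)^2}, \qquad \bar{\theta}^* = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^*_b \]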
Nonparametric bootstrap
Suppose we observe the sample data but cannot determine which probability distribution may have generated it. In this situation, we use the empirical distribution function (which is based on the observed sample data) to simulate bootstrap samples (using Monte Carlo techniques).
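Sampling from the empirical distribution function is equivalent to drawing the observed values with replacement. The EDF itself can be inspected with base R; a minimal sketch:
F_hat <- ecdf(df) # empirical distribution function built from the 20 observations
plot(F_hat, main = "Empirical distribution of birth weights") # step function that jumps by 1/20 at each observation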
# for loop
theta_star <- vector()
for (i in 1:1000){
  # take a sample with replacement from the observed data, of the same size,
  # then calculate the sample statistic; repeat this 1000 times
  theta_star[i] <- mean(sample(df, size = length(df), replace = TRUE))
}
theta_boot <- mean(theta_star) # bootstrap estimate of theta_hat
theta_boot
[1] 3616.166
boot_se <- sd(theta_star) # standard error of the estimate
boot_se
[1] 102.6719
bias <- theta_boot - mean(df)
bias
[1] -1.487018
MSE <- mean((theta_boot - mean(df))^2) # note: with a single theta_boot this reduces to bias^2; mean((theta_star - mean(df))^2) would give the usual bootstrap MSE
MSE
[1] 2.211222
CI <- c(theta_boot - 1.96*boot_se, theta_boot + 1.96*boot_se)
CI
[1] 3414.929 3817.402
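The interval above relies on a normal approximation. Since the 1000 replicates are already stored in theta_star, a 95% percentile interval can also be read off directly from the bootstrap distribution; a minimal sketch (it should land close to the Percentile interval that boot.ci reports below):
quantile(theta_star, probs = c(0.025, 0.975)) # 95% percentile bootstrap interval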
We can use the boot function from the boot package in R. It requires a function to calculate the sample statistic (in its statistic argument). That function must take the observed data as its first argument and indices (or weights) as its second argument.
require(boot)
theta_star_function <- function(x, i) mean(x[i])
# run the nonparametric bootstrap with 1000 replicates
B <- boot(data = df, statistic = theta_star_function, R = 1000, stype = "i")
B
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = theta_star_function, R = 1000, stype = "i")
Bootstrap Statistics :
original bias std. error
t1* 3617.653 0.9443187 98.78257
We can use the default plots to check whether the bootstrap samples have been sampled correctly.
plot(B)
Or, one can use ggplot to get the same figure.
binw <- 2*(IQR(B$t)/length(B$t)^(1/3))
ggplot(data.frame(B$t), aes(B.t)) + geom_histogram(fill="grey", colour="black", binwidth = binw) + geom_vline(xintercept = B$t0, linetype="dashed", size=1) + labs(x="Bootstrap sample statistic", title="Bootstrap sampling distribution of sample mean") + theme_ISCON()
We can get confidence intervals by:
boot.ci(B)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = B)
Intervals :
Level Normal Basic
95% (3423, 3810 ) (3421, 3819 )
Level Percentile BCa
95% (3416, 3814 ) (3407, 3804 )
Calculations and Intervals on Original Scale
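One practical note: bootstrap re-sampling is random, so all of the numbers above will change slightly from run to run. To make the results reproducible, fix the RNG seed before re-sampling; a minimal sketch (the seed value is arbitrary):
set.seed(2023) # any fixed seed works; chosen arbitrarily here
B <- boot(data = df, statistic = theta_star_function, R = 1000, stype = "i")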
Parametric bootstrap
Once we have sample data, we may think that the observed data came from some known probability distribution (e.g. normal, gamma, or Poisson).
For example, for our sample of birth weights, we may assume that the observed birth weights come from a normal distribution with parameters \(\mu\) = 3617.7 and \(\sigma\) = 464.6 (estimated from the observed data). See the figure below:
ggplot(data.frame(df), aes(df)) + geom_density() + labs(x= "birth weight") + theme_ISCON()
So, instead of re-sampling the observed data (as in the nonparametric bootstrap), we can draw bootstrap samples from a normal distribution with plausible parameter estimates (most likely the maximum likelihood estimates).
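The plug-in values can be computed directly from the observed data; a minimal sketch (mean(df) and sd(df) reproduce the 3617.7 and 464.6 used above, though strictly the MLE of \(\sigma\) divides by n rather than n - 1):
mu_hat <- mean(df) # 3617.7
sigma_hat <- sd(df) # 464.6; the strict MLE would be sqrt(mean((df - mu_hat)^2))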
Let's first do it with the help of for loops in R.
theta_star <- vector()
for (i in 1:1000){
theta_star[i] <- mean(rnorm(length(df),mean = 3617.7 ,sd = 464.6))
}
theta_boot <- mean(theta_star) # bootstrap estimate of theta_hat
theta_boot
[1] 3613.883
boot_se <- sd(theta_star) # standard error of the estimate
boot_se
[1] 101.4233
bias <- theta_boot - mean(df)
bias
[1] -3.769489
MSE <- mean((theta_boot - mean(df))^2) # as before, with a single theta_boot this reduces to bias^2
MSE
[1] 14.20905
CI <- c(theta_boot - 1.96*boot_se, theta_boot + 1.96*boot_se)
CI
[1] 3415.094 3812.673
Now, using the boot function:
For the parametric bootstrap, one has to specify a function in the ran.gen argument, which tells boot how each random sample should be generated (i.e. from which distribution and with which parameters to generate the re-samples). Its first argument is the observed data and its second argument is the vector of parameter estimates. This function should return random samples according to your assumed distribution.
gen_function <- function(x, mle) {
  data <- rnorm(length(x), mle[1], mle[2])
  return(data)
}
# function to calculate the sample statistic
# (with sim = "parametric", boot calls this with the generated data only;
# x[i] with i missing simply returns the whole of x)
theta_star_function <- function(x, i) mean(x[i])
The mle values (the parameter estimates) are passed to the random generator function.
B <- boot(data = df, sim = "parametric", ran.gen = gen_function, mle = c(3617.7,464.6), statistic = theta_star_function, R=1000)
B
PARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = theta_star_function, R = 1000, sim = "parametric",
ran.gen = gen_function, mle = c(3617.7, 464.6))
Bootstrap Statistics :
original bias std. error
t1* 3617.653 0.05774845 101.8872
plot(B)
boot.ci(B,type = "perc")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = B, type = "perc")
Intervals :
Level Percentile
95% (3413, 3813 )
Calculations and Intervals on Original Scale