Chetan Prajapati, Founder & Statistician at ISCON
This article explains the bootstrap concept as a whole and discerns the fundamental difference between the parametric and nonparametric bootstrap using R. We will use both for loops and the standard boot package.
Fundamentals of statistical inference
We first need to understand a few fundamental concepts, since the idea behind bootstrapping is closely aligned with them.
Here I have created a hypothetical study in which we are interested in finding out the average birth weight of the babies born in the UK at 37 weeks of gestation.
Experiment: To find out the average weight of babies born at 37 weeks in the UK
- Random sample: a random sample means each and every individual or object in the population has an equal probability of being chosen. In our study, the population is all births that happened in the UK at 37 weeks of gestation. For a true "random sample", we would need to make sure that every birth happening in any corner of the UK could participate in our study, and then choose some of them at random, i.e. a random sample.
- Sample data: the recorded observations or quantities from the sample of the population. Suppose we have gone to one local hospital and obtained the birth weights of 20 babies. Here is the (hypothetical) sample data.
[1] 3331.608 3549.913 2809.252 2671.465 3383.177 3945.721 3672.601 3922.416
[9] 4647.278 4088.246 3718.874 3724.443 3925.457 3112.920 3495.621 3651.779
[17] 3240.194 3867.347 3431.015 4163.725
- Sample statistic: the quantity of interest computed from the sample data. The quantity of interest in our study is the mean (average), so the mean is our sample statistic. The sample statistic from our sample data is 3618:
[1] 3617.653
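All of the code below works with these 20 observations stored in a numeric vector called df; a minimal setup sketch using the values printed above:
# store the 20 observed birth weights in the vector df used throughout
df <- c(3331.608, 3549.913, 2809.252, 2671.465, 3383.177, 3945.721,
        3672.601, 3922.416, 4647.278, 4088.246, 3718.874, 3724.443,
        3925.457, 3112.920, 3495.621, 3651.779, 3240.194, 3867.347,
        3431.015, 4163.725)
mean(df) # the sample statistic, 3617.653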
- Sampling distribution (of a statistic): now suppose we take a random sample of 20 babies at numerous hospitals (i.e. multiple times) across the UK. Each time, we calculate the mean birth weight and note down its value. This value will differ from hospital to hospital (in a statistical sense, the statistic is a random variable; it varies because variation is inherent). Here we went to 100 hospitals, took a sample of 20 babies born at each hospital, calculated the mean birth weight, and plotted the means in a histogram.
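The histograms below are drawn from a vector statistic holding the 100 hospital means. Since the population here is hypothetical, a minimal simulation sketch, assuming (purely for illustration) that birth weights are normal with mean 3600 and SD 460:
# simulate 100 hospitals, each contributing the mean of 20 birth weights;
# the normal(3600, 460) population is an assumption for illustration only
statistic <- replicate(100, mean(rnorm(20, mean = 3600, sd = 460)))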
ggplot(data.frame(statistic), aes(statistic)) + geom_histogram() + labs(x= "Mean", y= "Frequency") + theme_ISCON()
binw <- 2*(IQR(statistic)/length(statistic)^(1/3))
ggplot(data.frame(statistic), aes(statistic)) + geom_histogram(aes(y=..density..),binwidth = binw,fill="#cf0300",colour="white") + geom_density(aes(y=..density..),size=0.5) + labs(x= "Mean", y= "Density") + theme_ISCON()
It is better to plot the histogram with an optimal bin width (see the article from David Freedman) along with a density curve using aes(y = ..density..), so that the area under the density curve equals 1.
- Population parameter: what would the mean birth weight of babies born in the UK be if we had gone on collecting the birth weight of each and every baby born in the UK? This single value is what we are fundamentally interested in, and it is called the population parameter. The problem is that we want to estimate it from a limited number of observations (i.e. the sample data). The sample statistic helps us estimate the population parameter; it is also called a point estimator.
- Standard error (of a statistic): the standard deviation (SD) of the statistic over its sampling distribution. Note that the SD is a statistic that measures the variability in data around its mean.
- Confidence interval: although we are interested in the population parameter, i.e. a single value, our sample data may not pin it down exactly, so we need some measure of the uncertainty. This is achieved by building a confidence interval: we can say that the true value of the population parameter lies in some interval with some degree of confidence.
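One common construction, used in the for-loop code later in this article, is the normal-approximation interval \(\hat{\theta} \pm 1.96 \times \widehat{SE}(\hat{\theta})\), where 1.96 is the 0.975 quantile of the standard normal distribution, giving roughly 95% coverage.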
So, in essence, we want to estimate a population parameter from the sampling distribution of the sample statistic. Generating that sampling distribution exactly would require repeating the random sampling indefinitely.
Bootstrapping
The bootstrap method is a type of re-sampling method in which the sample data (our 20 birth weights) are treated as the "population". From this sample data, we re-sample with replacement a large number of times (in the 1000s). The resulting sampling distribution often (though not always) approximates the true sampling distribution of the statistic.
Please note that this sampling is done with replacement, so a single observation may be included many times (or left out completely). The sample statistic will therefore vary from one random sample of size n to the next.
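To see concretely what sampling with replacement looks like, here is a tiny sketch (the seed is arbitrary and only makes the illustration reproducible):
set.seed(1) # fix the RNG for a reproducible illustration
sample(1:10, size = 10, replace = TRUE) # some values appear more than once, others not at all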
From this bootstrap sampling distribution, we can estimate the parameter value and the standard error (the standard deviation of the sample statistic), and then calculate a confidence interval.
Let,
- \(\theta\) is the population parameter of interest, which belongs to the unknown population distribution \(F\)
- \(\hat{\theta}\) is the statistic that estimates \(\theta\); we are interested in the sampling distribution of \(\hat{\theta}\) under the fitted distribution function \(\hat{F}\)
- \(\hat{\theta}^*_1, \hat{\theta}^*_2, \hat{\theta}^*_3, \ldots\) are the bootstrap estimates (statistics) used to obtain the sampling distribution \(\hat{F}^*\) of \(\hat{\theta}^*\). This sampling distribution is a good approximation of \(\hat{F}\)
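From the \(B\) bootstrap replicates, the bootstrap standard error is simply the sample standard deviation of the replicates, which is exactly what sd(theta_star) computes in the code below:
\[ \widehat{SE}_{boot} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^*_b - \bar{\theta}^*\right)^2}, \qquad \bar{\theta}^* = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^*_b \]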
Nonparametric bootstrap
Suppose we observe the sample data but cannot determine which probability distribution may have generated it. In this situation, we use the empirical distribution function (which is based on the observed sample data) to simulate bootstrap samples (using Monte Carlo techniques).
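Sampling from the empirical distribution function is equivalent to drawing the observed values with replacement. The EDF itself can be inspected with base R; a minimal sketch:
F_hat <- ecdf(df) # empirical distribution function built from the 20 observations
plot(F_hat, main = "Empirical distribution of birth weights") # step function that jumps by 1/20 at each observation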
# for loop
theta_star <- vector()
for (i in 1:1000){
  # take a sample with replacement from the observed data, of the same size,
  # then calculate the sample statistic; repeat this 1000 times
  theta_star[i] <- mean(sample(df, size = length(df), replace = TRUE))
}
theta_boot <- mean(theta_star) # bootstrap estimate of theta_hat
theta_boot
[1] 3616.166
boot_se <- sd(theta_star) # standard error of the estimate
boot_se
[1] 102.6719
bias <- theta_boot - mean(df)
bias
[1] -1.487018
MSE <- mean((theta_boot - mean(df))^2) # note: with a single theta_boot this reduces to bias^2; mean((theta_star - mean(df))^2) would give the usual bootstrap MSE
MSE
[1] 2.211222
CI <- c(theta_boot - 1.96*boot_se, theta_boot + 1.96*boot_se)
CI
[1] 3414.929 3817.402
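The interval above relies on a normal approximation. Since the 1000 replicates are already stored in theta_star, a 95% percentile interval can also be read off directly from the bootstrap distribution; a minimal sketch (it should land close to the Percentile interval that boot.ci reports below):
quantile(theta_star, probs = c(0.025, 0.975)) # 95% percentile bootstrap interval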
We can use the boot function from the boot package in R. It requires a function to calculate the sample statistic (in its statistic argument). That function must take the observed data as its first argument and indices (or weights) as its second argument.
require(boot)
theta_star_function <- function(x, i) mean(x[i])
# run the nonparametric bootstrap with 1000 replicates
B <- boot(data = df, statistic = theta_star_function, R = 1000, stype = "i")
B
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = theta_star_function, R = 1000, stype = "i")
Bootstrap Statistics :
original bias std. error
t1* 3617.653 0.9443187 98.78257
We can use the default plots to check whether the bootstrap samples have been sampled correctly.
plot(B)
Or, one can use ggplot to get the same figure.
binw <- 2*(IQR(B$t)/length(B$t)^(1/3))
ggplot(data.frame(B$t), aes(B.t)) + geom_histogram(fill="grey", colour="black", binwidth = binw) + geom_vline(xintercept = B$t0, linetype="dashed", size=1) + labs(x="Bootstrap sample statistic", title="Bootstrap sampling distribution of sample mean") + theme_ISCON()
We can get confidence intervals by:
boot.ci(B)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = B)
Intervals :
Level Normal Basic
95% (3423, 3810 ) (3421, 3819 )
Level Percentile BCa
95% (3416, 3814 ) (3407, 3804 )
Calculations and Intervals on Original Scale
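One practical note: bootstrap re-sampling is random, so all of the numbers above will change slightly from run to run. To make the results reproducible, fix the RNG seed before re-sampling; a minimal sketch (the seed value is arbitrary):
set.seed(2023) # any fixed seed works; chosen arbitrarily here
B <- boot(data = df, statistic = theta_star_function, R = 1000, stype = "i")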
Parametric bootstrap
Once we have sample data, we may think that the observed data came from some known probability distribution (e.g. normal, gamma, or Poisson).
For example, for our sample of birth weights, we may assume that the observed birth weights come from a normal distribution with parameters \(\mu\) = 3617.7 and \(\sigma\) = 464.6 (estimated from the observed data). See the figure below:
ggplot(data.frame(df), aes(df)) + geom_density() + labs(x= "birth weight") + theme_ISCON()
So, instead of re-sampling the observed data (as in the nonparametric bootstrap), we can draw bootstrap samples from a normal distribution with plausible parameter estimates (most likely the maximum likelihood estimates).
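The plug-in values can be computed directly from the observed data; a minimal sketch (mean(df) and sd(df) reproduce the 3617.7 and 464.6 used above, though strictly the MLE of \(\sigma\) divides by n rather than n - 1):
mu_hat <- mean(df) # 3617.7
sigma_hat <- sd(df) # 464.6; the strict MLE would be sqrt(mean((df - mu_hat)^2))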
Let's first do it with the help of for loops in R.
theta_star <- vector()
for (i in 1:1000){
theta_star[i] <- mean(rnorm(length(df),mean = 3617.7 ,sd = 464.6))
}
theta_boot <- mean(theta_star) # bootstrap estimate of theta_hat
theta_boot
[1] 3613.883
boot_se <- sd(theta_star) # standard error of the estimate
boot_se
[1] 101.4233
bias <- theta_boot - mean(df)
bias
[1] -3.769489
MSE <- mean((theta_boot - mean(df))^2) # as before, with a single theta_boot this reduces to bias^2
MSE
[1] 14.20905
CI <- c(theta_boot - 1.96*boot_se, theta_boot + 1.96*boot_se)
CI
[1] 3415.094 3812.673
Now, using the boot function:
For the parametric bootstrap, one has to specify a function in the ran.gen argument, which tells boot how each random sample should be generated (i.e. from which distribution and with which parameters to generate the re-samples). Its first argument is the observed data and its second argument is the vector of parameter estimates. This function should return random samples according to your assumed distribution.
gen_function <- function(x, mle) {
  data <- rnorm(length(x), mle[1], mle[2])
  return(data)
}
# function to calculate the sample statistic
# (with sim = "parametric", boot calls this with the generated data only;
# x[i] with i missing simply returns the whole of x)
theta_star_function <- function(x, i) mean(x[i])
The mle values (the parameter estimates) are passed to the random generator function.
B <- boot(data = df, sim = "parametric", ran.gen = gen_function, mle = c(3617.7,464.6), statistic = theta_star_function, R=1000)
B
PARAMETRIC BOOTSTRAP
Call:
boot(data = df, statistic = theta_star_function, R = 1000, sim = "parametric",
ran.gen = gen_function, mle = c(3617.7, 464.6))
Bootstrap Statistics :
original bias std. error
t1* 3617.653 0.05774845 101.8872
plot(B)
boot.ci(B,type = "perc")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = B, type = "perc")
Intervals :
Level Percentile
95% (3413, 3813 )
Calculations and Intervals on Original Scale