Python: Bootstrapping and Resampling

Bootstrapping

Bootstrapping uses sampling with replacement to empirically estimate the distribution of a statistic.

This Notebook closely follows NCL's example.

NCL (6.4.0) currently has four bootstrap statistic functions and one bootstrap utility function:
bootstrap_correl
bootstrap_diff
bootstrap_estimate ; utility
bootstrap_regcoef
bootstrap_stat
Some NCL bootstrap functions allow for subsampling (n<N) and for sequential blocks (sequences) of values. More involved sampling strategies may require custom code. A user could extract one of the above core functions from the contributed.ncl library and modify it as needed.

Here we will implement replacements for those NCL functions.

The examples covered are:

The functions that are included here are:

Functionality from Python and libraries that is used:

Data:

Always check the [NCL data list](https://www.ncl.ucar.edu/Applications/Data/) to see if the example data is available.

resampling_1.ncl

These distributions of sampling indices illustrate properties of resampling with replacement from a uniform distribution using generate_sample_indices. Clearly, a reasonably large N is needed to 'reasonably' sample all combinations.

NCL's resampling_1.ncl example is slightly confusing, at least partly because NCL has some idiosyncratic defaults.

The NS array is not just the sample size: it is also used to specify the interval from which integers are drawn. Think of it as drawing indices from a sample with NS entries. This can be a little confusing because, in the loop, NCL generates arrays of NS indices drawn from the range [0, NS-1].

The NCL function generate_sample_indices uses random_uniform. In our implementation, we will use the modern NumPy Generator infrastructure, whose integers method draws integer random numbers from a discrete uniform distribution. Another advantage here is that the Generator also has a choice method, allowing us to replicate NCL's functionality with no effort.
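As a sketch of that idea, a drop-in stand-in for generate_sample_indices can be written around Generator.choice; the function name and signature here mirror NCL's but are our own choice:

```python
import numpy as np

def generate_sample_indices(nsamples, method=1, rng=None):
    """Sketch of NCL's generate_sample_indices: draw nsamples indices
    from [0, nsamples-1], with (method=1) or without (method=0) replacement."""
    rng = np.random.default_rng() if rng is None else rng
    # Generator.choice covers both cases directly
    return rng.choice(nsamples, size=nsamples, replace=bool(method))

# Example: indices for resampling an array of length 10 with replacement
rng = np.random.default_rng(42)
iw = generate_sample_indices(10, 1, rng=rng)
```

With method=0 this degenerates to a random permutation, matching NCL's sampling-without-replacement option.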

The NCL example produces 3 plots that are just realizations of the same process. While I guess that shows that you don't get the same answer every time, I don't think it is necessary. Our example will make one instance of the 8-panel plot.

A simple example of using sampling with replacement to derive a bootstrapped mean and 95% confidence limits. Here x(N):

   nBoot =   10000                           
   xBoot = new (nBoot, typeof(x))

   do ns=0,nBoot-1                        ; generate multiple estimates
      iw = generate_sample_indices(N,1)  ; indices with replacement
      xBoot(ns) = dim_avg_n( x(iw), 0 )  ; compute average 
   end do

   xAvgBoot = dim_avg_n(xBoot,0)         ; Averages of bootstrapped samples
   xStdBoot = dim_stddev_n(xBoot,0)      ; Std Dev  "        "        "
   xStdErrBoot = xStdBoot/nBoot           ; Std. Error of bootstrapped estimates

   ia = dim_pqsort_n(xBoot, 2, 0)        ; sort bootstrap means into ascending order

   n025     = round(0.025*(nBoot-1),3)    ; indices for sorted array
   n500     = round(0.500*(nBoot-1),3)                             
   n975     = round(0.975*(nBoot-1),3)

   xBoot_025= xBoot(n025)                 ;  2.5% level
   xBoot_500= xBoot(n500)                 ; 50.0% level  (median)
   xBoot_975= xBoot(n975)                 ; 97.5% level

Note: since 'x(N)' is rank one, the following functions could have been used in place of the dim_*_n functions: avg, stddev and qsort. Of course, the arguments would have to change accordingly.
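A possible NumPy translation of the NCL loop above, vectorized so that all nBoot resamples are drawn at once; the synthetic x here stands in for whatever data the example uses:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=100)  # synthetic stand-in for x(N)
N = x.size
nBoot = 10000

# each row of idx is one sample of N indices drawn with replacement
idx = rng.integers(0, N, size=(nBoot, N))
xBoot = x[idx].mean(axis=1)           # nBoot bootstrapped means

xAvgBoot = xBoot.mean()               # average of bootstrapped means
xStdBoot = xBoot.std(ddof=1)          # std dev of bootstrapped means

# percentiles replace NCL's sort-and-index step for the confidence limits
xBoot_025, xBoot_500, xBoot_975 = np.percentile(xBoot, [2.5, 50.0, 97.5])
```

Using np.percentile avoids sorting and indexing by hand, and the (nBoot, N) index array replaces the explicit do loop.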

bootstrap_stat_1.ncl

These illustrate various properties of the mean (opt@stat=0) using different bootstrap sampling parameters.

Plots

I won't include the NCL code for the plots, but it is found here. They make three versions using the two different data sets, but the plotting part is all the same.

For the "legend", I decided to use a multiline string. Those are triple-quoted strings in Python. These are f-strings so that variables can be evaluated inside the strings (this is called literal string interpolation). There is some trial and error in placing text like this, but I decided to put the text at the same height in each axes by getting the vertical axis extent and positioning the text at a specified fraction within that range.

One place where Matplotlib does not do a great job by default is text sizes. Here I show how to make the text bigger than default using fontsize and the title is bold using fontweight. I also changed the tick label font size using tick_params which is pretty handy.
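A minimal sketch of those pieces together, using placeholder data and illustrative label text (none of the specific numbers or names come from the original notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(1).normal(size=500)  # placeholder data
xAvg, xStd = data.mean(), data.std(ddof=1)

fig, ax = plt.subplots()
ax.hist(data, bins=30)

# triple-quoted, multiline f-string used as a poor-man's legend
legend = f"""Mean = {xAvg:.2f}
Std Dev = {xStd:.2f}
N = {data.size}"""

# place the text at a fixed fraction of the vertical axis extent
ymin, ymax = ax.get_ylim()
ax.text(data.min(), ymin + 0.85 * (ymax - ymin), legend, fontsize=12)

ax.set_title("Sample distribution", fontsize=14, fontweight="bold")
ax.tick_params(labelsize=12)  # enlarge the tick labels
```

Computing the text position from get_ylim keeps the label at the same relative height even when different panels have different data ranges.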

bootstrap_stat_2.ncl

Bootstrap January monthly mean temperatures (degF; left histogram pair) and May monthly precipitation totals (inches; right histogram pair) for Boulder, CO for the period 1897-2014. Estimate various properties of the mean. The total sample size for each month is N=114. Here, 30-year subsamples (n=30) are generated nBoot=1000 times.

The data used for this example is on the NCL website, but this isn't an OPeNDAP server, so we need to download the file. Just for completeness, we will do that right here. There are a TON of ways to do it, but we will use requests, which is one of the most popular Python libraries in the world.
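One simple pattern, sketched here with a hypothetical helper (the actual file URL comes from the NCL data page and is not filled in):

```python
import pathlib
import requests  # third-party; install with `pip install requests`

def download(url, dest):
    """Download url to dest, skipping the request if dest already exists."""
    dest = pathlib.Path(dest)
    if not dest.exists():
        resp = requests.get(url)
        resp.raise_for_status()  # fail loudly on HTTP errors
        dest.write_bytes(resp.content)
    return dest

# hypothetical usage; substitute the real URL from the NCL data page:
# download("https://www.ncl.ucar.edu/Applications/Data/...", "boulder.nc")
```

Caching on existence is a small nicety so re-running the notebook doesn't re-fetch the file.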

At this point there does not seem to be anything different about the precipitation example compared to the LSAT one. If I were doing analysis that repeated code like in these examples, I would write a function to reduce the work. Let's write a function that will take our data and our bootstrapped distribution and make the plot we made before.

This function is a bit clumsy, but gets the job done. I'd usually try to separate the computation into one function and the plotting into another. If we were going to be doing this operation frequently, it might be worth making a class that would hold all the statistical data, making it easier to let the plot know what the sample size is etc.
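Following that separation-of-concerns idea, the computation half might look like this sketch; the function name, return structure, and defaults are all our own invention, with subsampling (n<N) included to match the 30-year subsamples used above:

```python
import numpy as np

def bootstrap_mean(x, nboot=1000, n=None, rng=None):
    """Bootstrap the mean of x using nboot subsamples of size n (default: len(x)).

    Returns the bootstrapped means plus 2.5/50/97.5 percentile levels.
    Illustrative helper, not code from the original notebook."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = x.size if n is None else n
    idx = rng.integers(0, x.size, size=(nboot, n))  # resample with replacement
    boot = x[idx].mean(axis=1)
    low, med, high = np.percentile(boot, [2.5, 50.0, 97.5])
    return {"boot": boot, "low": low, "median": med, "high": high, "n": n}

stats = bootstrap_mean(np.arange(100.0), nboot=2000, n=30,
                       rng=np.random.default_rng(7))
```

A plotting function can then take this dictionary, keeping the statistics reusable independent of any figure.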

bootstrap_correl_1.ncl

These estimate the correlation coefficient between the 82-school LSAT and GPA using classical statistics and via the bootstrap method.
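The bootstrap side of that can be sketched with np.corrcoef and paired resampling; the synthetic LSAT/GPA arrays below are stand-ins for the real 82-school data:

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic stand-ins for the 82-school LSAT and GPA values
lsat = rng.normal(600.0, 30.0, size=82)
gpa = 0.005 * lsat + rng.normal(0.0, 0.1, size=82)

N, nBoot = lsat.size, 5000
r_boot = np.empty(nBoot)
for b in range(nBoot):
    # resample index pairs together so each (LSAT, GPA) pair stays intact
    iw = rng.integers(0, N, size=N)
    r_boot[b] = np.corrcoef(lsat[iw], gpa[iw])[0, 1]

r_low, r_high = np.percentile(r_boot, [2.5, 97.5])  # bootstrap 95% CI
```

Resampling the index array once per iteration, rather than each variable independently, is what preserves the pairing that the correlation depends on.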

bootstrap_correl_2.ncl

Discrepancy with NCL?

This picture does not match NCL's. What's going on?

To investigate, I re-ran the NCL example. First, the file we used from NOAA is definitely different from the original. In the original example, the data were stored as type short, and in the new file they are float. Also, the new file is [time, lat, lon] while the NCL example had another dimension (lev?).

When I re-ran the NCL script with the versions of the files used here, the difference is harder to discern. One issue with the visual inspection is that the colormap used for NCL is very strange (precip4_diff_19lev). The colors change inconsistently through the colormap, with adjacent levels sometimes being very similar and sometimes very different. To correct for this, I switched the NCL colors to MPL_RdBu and changed the colors above to RdBu. Visually inspecting, the two versions seem qualitatively the same, though it looks like some wiggles and contours still differ.

The slight differences are not surprising, especially in the details of capturing contour shapes.

Additional checks on some of the statistics show consistent results when using the same samples. Again, note that the NCL script seems to use 1948-2011, but the figures that it shows are for 1948-1986. That makes a big difference.

Timing of NCL vs. Python

Timing-wise, since I re-ran the NCL version, it is worth noting that it took 81 s to run on a 2016 MacBook Pro.

Timing the Python cells above with simple %%time cell magics gives a rough point of comparison. There could be some things that aren't fair about this comparison, so I don't take a near 2x difference seriously. But it is worth noting that the Python timing includes getting the files from the internet (though when I timed it, the download might have been cached, which could make a difference). For the NCL example, I separately downloaded the file from NOAA.

My NCL version is 6.6.2.

bootstrap_regcoef_1.ncl

These estimate the linear regression coefficient between the LSAT and GPA. Additional information based upon the original sample is provided via regline_stats.

The NCL example uses bootstrap_regcoef. That function is a bootstrapping wrapper for regCoef_n.

There are so many approaches to linear regression that it is difficult to decide how to try to reproduce NCL's functionality. See [Real Python](https://realpython.com/linear-regression-in-python/) for a good overview. If statsmodels is installed, it appears to be a comprehensive replacement for regCoef, regline, and regline_stats for the 1-dimensional case. Since regline_stats actually only provides the 1-d case, this is likely the most complete approach. For this example, I didn't want to add an additional dependency, so we will use SciPy again, and it will be adequate for all the needs here.
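A sketch of the SciPy route, pairing scipy.stats.linregress for the classical fit with a bootstrap loop over the slope; the synthetic LSAT/GPA arrays are again stand-ins for the real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# synthetic stand-ins for the 82-school LSAT and GPA values
lsat = rng.normal(600.0, 30.0, size=82)
gpa = 0.005 * lsat + rng.normal(0.0, 0.1, size=82)

# classical fit, roughly comparable to regline_stats for the 1-d case
res = stats.linregress(lsat, gpa)
slope, intercept, r = res.slope, res.intercept, res.rvalue

# bootstrap the regression coefficient, in the spirit of bootstrap_regcoef
nBoot, N = 2000, lsat.size
slopes = np.empty(nBoot)
for b in range(nBoot):
    iw = rng.integers(0, N, size=N)  # paired resampling with replacement
    slopes[b] = stats.linregress(lsat[iw], gpa[iw]).slope
s_low, s_high = np.percentile(slopes, [2.5, 97.5])  # 95% CI for the slope
```

linregress also reports a standard error and p-value for the slope, which is what makes it a reasonable 1-d substitute for regline_stats here.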