I was wondering if anybody knows of a reliable and fast analytic HDI calculation, preferably for beta distributions.
The definition of the HDI is given in this question, where it is called the "Highest Posterior Density Region".
I am looking for a function that has the following I/O:
Input
- credMass - credible interval mass (e.g., 0.95 for a 95% credible interval)
- a - shape parameter (e.g., number of HEADS coin tosses)
- b - shape parameter (e.g., number of TAILS coin tosses)
Output

- ci_min - minimum bound of the mass (value between 0. and ci_max)
- ci_max - maximum bound of the mass (value between ci_min and 1.)
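In other words, the target interface would be something like this hypothetical stub (the name beta_hdi is purely illustrative):

def beta_hdi(credMass, a, b):
    """Hypothetical signature only. Should return (ci_min, ci_max): the
    narrowest interval containing `credMass` of the Beta(a, b) probability
    mass, with 0. <= ci_min < ci_max <= 1."""
    raise NotImplementedError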
One way to go about it is to speed up this script, which I found in the same question (adapted from R to Python and taken from the book Doing Bayesian Data Analysis by John K. Kruschke). I have used this solution, which is quite reliable, but it's a bit too slow for multiple calls. A speedup of 100X or even 10X would be very helpful!
from scipy.optimize import fmin
from scipy.stats import *

def HDIofICDF(dist_name, credMass=0.95, **args):
    # freeze distribution with given arguments
    distri = dist_name(**args)
    # initial guess for HDIlowTailPr
    incredMass = 1.0 - credMass

    def intervalWidth(lowTailPr):
        return distri.ppf(credMass + lowTailPr) - distri.ppf(lowTailPr)

    # find lowTailPr that minimizes intervalWidth
    HDIlowTailPr = fmin(intervalWidth, incredMass, ftol=1e-8, disp=False)[0]
    # return interval as array([low, high])
    return distri.ppf([HDIlowTailPr, credMass + HDIlowTailPr])
Usage

print(HDIofICDF(beta, credMass=0.95, a=5, b=4))
Word of caution! Some solutions confuse this HDI with the equal-tailed interval (which the said question calls the "Central Credible Region"), which is far easier to compute but does not answer the same question. (E.g., see Kruschke's Why HDI and not equal-tailed interval?)
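For reference, the equal-tailed interval for the same Beta(5, 4) example can be computed directly from the ppf by cutting (1 - credMass)/2 of the mass off each tail; since the HDI is by definition the narrowest interval with the given mass, the equal-tailed interval is never narrower than it:

from scipy.stats import beta

# Equal-tailed 95% interval for Beta(5, 4): chop 2.5% off each tail.
# For a skewed distribution this generally differs from the HDI.
eti = beta.ppf([0.025, 0.975], a=5, b=4)
print(eti)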
Also, this question does not concern the MCMC approaches I have seen in PyMC3 (pymc3.stats.hpd(a), where a is a sample of the random variable), but rather an analytical solution.
Thank you!
Not analytical, but just in case you still find this useful: instead of evaluating the ppf (which can be very slow for many distributions), you can find an approximate solution by evaluating the pdf on a grid of points, something like this.
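A minimal sketch of that grid-based idea, assuming grid points are visited from highest to lowest density and their mass accumulated until the requested amount is covered (the name hdi_grid and the implementation details are my reconstruction, not the answer's exact code):

import numpy as np
from scipy.stats import beta

def hdi_grid(dist, x_vals, mass=0.95):
    # density at each grid point and the mass of each grid cell
    # (the grid is assumed to be equally spaced)
    pdf = dist.pdf(x_vals)
    cell_mass = pdf * (x_vals[1] - x_vals[0])
    # visit grid points from highest to lowest density, accumulating
    # mass until the requested amount is covered
    order = np.argsort(pdf)[::-1]
    n_keep = np.searchsorted(np.cumsum(cell_mass[order]), mass) + 1
    kept = x_vals[order[:n_keep]]
    # for a unimodal pdf the kept points form one contiguous interval
    return kept.min(), kept.max()

# usage: 95% HDI of Beta(a=5, b=4) on a 10,000-point grid
x_vals = np.linspace(0, 1, 10_000)
print(hdi_grid(beta(a=5, b=4), x_vals, mass=0.95))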
Where mass is a number between 0 and 1, and x_vals is a vector of equally spaced values in the support of the distribution. So for the beta, x_vals = np.linspace(0, 1, size) should do the work. You can control the trade-off between accuracy and speed by setting a proper value of size for your problem. How much faster this solution is compared with the other one (using the ppf and optimization) will depend on the details of SciPy's ppf and pdf methods (which differ between distributions), and could even depend on the parameters of the distribution.