I'm using the Hamiltonian Monte Carlo (HMC) implementation in Edward, a probabilistic programming library built on top of TensorFlow. One of the hyperparameters of HMC is the step size:
inference = ed.HMC({w: qw, b: qb}, data={X: X_train, y: y_train})
inference.initialize(step_size=0.5 / float(N))
I know there's an adaptive version of the algorithm, the No-U-Turn Sampler (NUTS), but how should the step size be initialized for plain HMC?
Best Answer
There's a section in Radford Neal's chapter of the Handbook of Markov Chain Monte Carlo that discusses how to set the step size $\epsilon$ and the number of leapfrog steps $L$ appropriately: http://www.mcmchandbook.net/HandbookChapter5.pdf.
I'll summarise the key points here:
The overall distance moved is $\epsilon L$, so both have to be considered together.
Setting $\epsilon$:
Proposals in HMC are rejected purely because of discretisation error: the exact Hamiltonian dynamics conserve energy (and preserve probability density), so without discretisation every proposal would be accepted.
If $\epsilon$ is too large, there will be large discretisation error and low acceptance; if $\epsilon$ is too small, more (expensive) leapfrog steps will be required to move large distances.
Ideally we want the largest possible value of $\epsilon$ that gives a reasonable acceptance probability. Unfortunately, the appropriate value may vary across different regions of the target distribution.
A simple heuristic is to do a preliminary run with fixed $L$, gradually increasing $\epsilon$ until the acceptance probability reaches an appropriate level (see the sketch below).
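As an illustration (this is not Edward's API), here is a self-contained NumPy sketch of that preliminary-run heuristic on a toy standard-Gaussian target: short HMC chains are run with fixed $L$, and $\epsilon$ is doubled until the acceptance rate drops below a target value. The target rate of 0.65, the chain length, and the toy target are all arbitrary illustrative choices.

```python
import numpy as np

def log_prob(x):
    """Log density of the toy target: a standard Gaussian."""
    return -0.5 * np.sum(x ** 2)

def grad_log_prob(x):
    return -x

def hmc_accept_rate(epsilon, L, n_iter=500, dim=10, seed=0):
    """Run a short HMC chain and return the empirical acceptance rate."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    accepts = 0
    for _ in range(n_iter):
        p = rng.standard_normal(dim)          # resample momentum
        x_new, p_new = x.copy(), p.copy()
        # Leapfrog integration: L steps of size epsilon.
        p_new += 0.5 * epsilon * grad_log_prob(x_new)
        for _ in range(L - 1):
            x_new += epsilon * p_new
            p_new += epsilon * grad_log_prob(x_new)
        x_new += epsilon * p_new
        p_new += 0.5 * epsilon * grad_log_prob(x_new)
        # Metropolis correction: any rejection here is purely due to
        # discretisation error in the leapfrog integrator.
        log_alpha = (log_prob(x_new) - 0.5 * p_new @ p_new) \
                  - (log_prob(x) - 0.5 * p @ p)
        if np.log(rng.uniform()) < log_alpha:
            x, accepts = x_new, accepts + 1
    return accepts / n_iter

# Preliminary runs: keep doubling epsilon while the acceptance rate
# stays above a target (0.65 is a commonly quoted sweet spot for HMC).
epsilon = 0.01
while hmc_accept_rate(2 * epsilon, L=20) > 0.65:
    epsilon *= 2
print("chosen step size:", epsilon)
```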
Setting $L$:
Neal says:
"Setting the trajectory length by trial and error therefore seems necessary. For a problem thought to be fairly difficult, a trajectory with L = 100 might be a suitable starting point. If preliminary runs (with a suitable ε; see above) show that HMC reaches a nearly independent point after only one iteration, a smaller value of L might be tried next. (Unless these “preliminary” runs are actually sufficient, in which case there is of course no need to do more runs.) If instead there is high autocorrelation in the run with L = 100, runs with L = 1000 might be tried next"
It may also be advisable to randomly sample $\epsilon$ and $L$ from suitable ranges, to avoid trajectories that are close to periodic, as this would slow mixing (see the sketch below).
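A minimal sketch of that randomisation, continuing the toy example above (the $\pm 20\%$ window around the tuned step size and the uniform range for $L$ are arbitrary illustrative choices, not Neal's prescription):

```python
import numpy as np

rng = np.random.default_rng(1)
epsilon0, L0 = 0.5, 20                            # values found in preliminary runs
for t in range(5):
    epsilon = epsilon0 * rng.uniform(0.8, 1.2)    # jitter the step size by +/-20%
    L = int(rng.integers(1, 2 * L0 + 1))          # draw L uniformly from {1, ..., 2*L0}
    # ...run one HMC transition with this (epsilon, L) pair...
    print(f"iteration {t}: epsilon={epsilon:.3f}, L={L}")
```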