Cross posted from csexchange:
Most versions of simulated annealing I've seen are implemented similar to what is outlined in the wikipedia pseudocode below:
Let s = s0
For k = 0 through kmax (exclusive):
T ← temperature( 1 - (k+1)/kmax )
Pick a random neighbour, snew ← neighbour(s)
If P(E(s), E(snew), T) ≥ random(0, 1):
s ← snew
Output: the final state s
I am having trouble understanding how this algorithm does not get stuck in a local optima as the temperature cools. If we jump around at the start while the temp is high, and eventually only take uphill moves as the temp cools, then isn't the solution found highly dependent on where we just so happened to end up in the search space as the temperature started to cool? We may have found a better solution early on, jumped off of it while the temp was high, and then be in a worse-off position as the temp cools and we transition to hill climbing.
An often listed modification to this approach is to keep track of the best solution found so far. I see how this change mitigates the risk of "throwing away" a better solution found in the exploratory stage when the temp is high, but I don't see how this is any better than simply performing repeated random hill-climbing to sample the space, without the temperature theatrics.
Another approach that comes to mind is to combine the ideas of keeping track of the "best so far" with repeated hill climbing and beam search. For each temperature, we could perform simulated annealing and track the best 'n' solutions. Then for the next temperature, start from each of those local peaks.
UPDATE: It was rightly pointed out I didn't really ask a specific question, I've updated in the comments but want to reflect it here:
I don't need a transcription of the pseudo-code into essay form, I understand how it works - how the temperature balances exploration vs exploitation, and how this (theoretically) helps avoid local optima.
My specific questions are:
- Without modification to keep track of the global best solution, wouldn't this algorithm be extremely prone to local optima, despite the temperature component? Yes, I understand the probability of taking a worse move declines as the temperature cools (e.g. it transitions to pure hill-climbing), but it would be possible that you had found a better solution early on in the exploitation portion (temp hot), jumped off of it, and then as the temperature cools you have no way back because you're on a path that leads to a new local peak.
- With the addition of tracking the global optima, I can absolutely see how this mitigates getting stuck at local peaks avoiding the problem described above. But how does this improve upon simple random search? One specific example being repeated random hill climbing. If you're tracking the global optima and you happen to hit it while in the high-temp portion, then that essentially is a random-search.
- In what circumstances would this algorithm be preferable to something like repeated random hill-climbing and why? What properties does a problem have that makes it particularly suited to SA vs an alternative.
By my understanding, simulated annealing is not guaranteed to not get stuck in a local maxima (for maximization problems), especially as it "cools" where later in the cycle as k -> kmax.
So as you say, we "jump all over the place" at the start, but we are still selective in whether or not we accept that jump, as the
P()function that determines the probability of acceptance is a function of the goal,E(). Later in the same wikipedia article, they describe theP()function a bit, and suggest that if e_new > e, then perhaps P=1, always taking the move if it is an improvement, but perhaps not always if it isn't an improvement. Later in the cycle, we aren't as willing to take random leaps to lesser results, so the algorithm tends to settle into a max (or min), which may or may not be the global.