We basically had two problems. One we found quite easily (although this was not documented anywhere). The other one is still mystifying us, although we found a work-around that we don't completely understand. Let me discuss them in turn.
try-catch and tail calls
When you write a process loop function, it must be tail-recursive in order to avoid memory leaks. So you typically write something like:
server_loop(State) ->
    receive
        SomeMessage ->
            ...
            server_loop(NewState);
        ...
    end.
As you can see, the call to server_loop is in tail position. This is equivalent to a GOTO with parameters: the call does not consume any stack space. To make this server support hot code swapping, we just need to add a clause for a special message, and fully qualify the recursive call with the module name:
server_loop(State) ->
    receive
        switch_code ->
            ?MODULE:server_loop(State);
        SomeMessage ->
            ...
            server_loop(NewState);
        ...
    end.
So you have to make sure that all the recursive calls to the server loop are in tail-call position. In our application, we had something like this:
try
    NewState = doSomething(State),
    server_loop(NewState)
catch
    Error ->
        server_loop(State)
end
We assumed that both calls to server_loop were in tail-call position. We were wrong. Of course, the try-catch form installs an exception handler for the whole dynamic extent of its body statements, so the call inside the body is not a tail call. This means that some stack space is consumed on every iteration, probably with references to the current code. Even if we add support for the switch_code message, this process will not switch to the new code and will eventually be killed.
But this was easy to fix, once we figured all this out.
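One way to fix this, shown here as a minimal sketch rather than as the code we actually shipped, is to let the try expression compute only the next state and to make the recursive call after it has returned. The module name safe_loop, the do_something/2 helper and the message shapes are all made up for the illustration.

```erlang
-module(safe_loop).
-export([start/0, server_loop/1, do_something/2]).

%% Hypothetical stand-in for the application's doSomething function.
do_something(State, {add, N}) when is_integer(N) -> State + N;
do_something(_State, Msg) -> error({bad_message, Msg}).

start() -> spawn(?MODULE, server_loop, [0]).

server_loop(State) ->
    receive
        switch_code ->
            %% Fully qualified call: jumps to the newest loaded version.
            ?MODULE:server_loop(State);
        {get, From} ->
            From ! {state, State},
            server_loop(State);
        stop ->
            ok;
        Msg ->
            %% The try expression only computes the next state. Its
            %% exception handler is gone by the time it returns a value.
            NewState =
                try
                    do_something(State, Msg)
                catch
                    _Class:_Reason -> State   % keep the old state on failure
                end,
            %% The recursive call is outside the try, so it is a real tail call.
            server_loop(NewState)
    end.
```

With this shape, a failing message leaves the state unchanged and the loop keeps running in constant stack space.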
Use of an after clause
The other problem we had, we simply fixed it without understanding why. (If you know, please leave me a comment!!) We have a very simple process loop, coded exactly as I explained above. Unfortunately, the process gets killed when we reload the code 3 times, even if we send the switch_code message. The fix, suggested here but without a satisfactory explanation, consisted in adding an after clause to the receive statement:
server_loop(State) ->
    receive
        switch_code ->
            ?MODULE:server_loop(State);
        SomeMessage ->
            ...
            server_loop(NewState);
        ...
    after
        10 * 1000 ->
            ?MODULE:server_loop(State)
    end.
Now we can update our whole application without service interruption. But this was not as simple as it first seemed (and advertised ;-).
Update: I posted a follow-up explaining the problem we had.
4 comments:
If the module is getting killed after a couple of reloads, it sounds like the process isn't really getting updated.
Try adding a special version/0 function that returns a unique ID string for each module you compile. Then at the top of server_loop/1, have the code send the current module version to a log file. That way you can see exactly what code you're running.
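A minimal sketch of that suggestion, assuming a hand-maintained version string and plain io:format output instead of a real log file (the module name versioned_loop and the message shapes are invented for the example):

```erlang
-module(versioned_loop).
-export([start/0, version/0, server_loop/1]).

%% Hypothetical version tag; change it by hand on every compile.
version() -> "v1".

start() -> spawn(?MODULE, server_loop, [0]).

server_loop(State) ->
    %% Local call on purpose: version() reports the code this process is
    %% actually executing, while ?MODULE:version() would always report
    %% the newest loaded version.
    io:format("~p is running ~p ~s~n", [self(), ?MODULE, version()]),
    receive
        switch_code ->
            ?MODULE:server_loop(State);
        {version, From} ->
            From ! {version, version()},
            server_loop(State);
        stop ->
            ok;
        _Other ->
            server_loop(State + 1)
    end.
```

A process that keeps printing the old string after a reload has not gone through the fully qualified call yet.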
Erlang only allows two versions of any given module (the current one and the old one) to be loaded at any one time.
After you load the third version, it kills any running instance on an older version.
My guess is that for some reason it's not ever getting the switch_code message.
How many processes are you running?
It sounds like you have another (lost?) process running code from that module, running version 1, that never receives the switch_code message.
All processes need to be moved from version 1 to version 2 (or killed) before you can swap to version 3.
Thanks for your help, guys! I finally noticed that we were leaking processes somewhere. And the 'switch_code' message was only sent to the processes that we were tracking, not the others. So the 'switch_code' message was properly handled, but only by a subset of all the processes for the given module. It's the others that prevented us from upgrading code (of course, we refuse to upgrade code if erlang:check_process_code returns true for some process). I'll post a follow-up on these issues soon.
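For reference, a check of this kind can be sketched as follows. This is an illustration, not our actual code: the module name upgrade_check and both function names are made up, and it scans every local process, which may be too slow on a node with a very large number of processes.

```erlang
-module(upgrade_check).
-export([stale_processes/1, safe_to_load/1]).

%% Pids of all local processes still executing old code of Module.
%% erlang:check_process_code/2 returns true for such processes.
stale_processes(Module) ->
    [Pid || Pid <- erlang:processes(),
            erlang:check_process_code(Pid, Module)].

%% True when no process is stuck on the old version, i.e. when loading
%% yet another version would not kill anything.
safe_to_load(Module) ->
    stale_processes(Module) =:= [].
```

Comparing the result of stale_processes/1 with the list of tracked pids is one way to spot leaked processes like the ones described above.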