Thursday, September 11, 2008

Hot code swapping pitfalls in Erlang

Hot code swapping is a very nice feature of Erlang. (Please, don't bother commenting that this can be done in other languages too... I already know.) But there is little documentation about it, which leads everybody to believe that it is trivial to support in every application. François and I had a hard time today trying to get hot code swapping to run smoothly in one of our applications.

We basically had two problems. One we found quite easily (although this was not documented anywhere). The other one is still mystifying us, although we found a work-around that we don't completely understand. Let me discuss them in turn.

try-catch and tail calls

When you write a process loop function, it must be tail-recursive; otherwise the stack grows with every message and the process leaks memory. So you typically write something like:
    server_loop(State) ->
        receive
            SomeMessage ->
                ...
                server_loop(NewState);
            ...
        end.

As you can see, the call to server_loop is in tail position. This is equivalent to a GOTO with parameters: the call does not consume any stack space. To make this server support hot code swapping, we just need to add a clause for a special message and fully qualify the recursive call with the module name (which requires server_loop/1 to be exported):
    server_loop(State) ->
        receive
            switch_code ->
                ?MODULE:server_loop(State);
            SomeMessage ->
                ...
                server_loop(NewState);
            ...
        end.
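
A complete, compilable version of the loop above might look like the following sketch. The module name swap_demo, the counter state, and the increment, get and stop messages are made up for illustration; only the switch_code clause comes from our code.

```erlang
-module(swap_demo).
-export([start/0, server_loop/1]).

%% Spawn the loop with an initial counter of 0 as its state.
start() ->
    spawn(?MODULE, server_loop, [0]).

%% server_loop/1 has to be exported: a fully qualified call such as
%% ?MODULE:server_loop(Count) can only reach exported functions.
server_loop(Count) ->
    receive
        switch_code ->
            %% Fully qualified tail call: continues in the newest
            %% loaded version of this module.
            ?MODULE:server_loop(Count);
        increment ->
            %% Local tail call: stays in the version currently running.
            server_loop(Count + 1);
        {get, From} ->
            From ! {count, Count},
            server_loop(Count);
        stop ->
            ok
    end.
```

From the shell, one would start it with Pid = swap_demo:start(), recompile and load the module (for example with c(swap_demo)), and then send Pid ! switch_code so that the process jumps to the new version.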

So you have to make sure that all the recursive calls to the server loop are in tail-call position. In our application, we had something like this:
    try
        NewState = doSomething(State),
        server_loop(NewState)
    catch
        Error ->
            server_loop(State)
    end
We assumed that both calls to server_loop were in tail-call position. We were wrong. The try-catch form installs an exception handler for the whole dynamic extent of its body, so the call to server_loop(NewState) inside the body is not a tail call. Each iteration consumes some stack space, probably with references to the current code. Even if we add support for the switch_code message, this process will not switch to the new code and will eventually be killed.

But this was easy to fix, once we figured all this out.
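
One way to make such a fix, sketched under assumptions: protect only the risky call, let the try expression return the new state, and recurse after the try has finished. The module name loop_try, the {add, N} and get messages, and do_something/2 (a stand-in for doSomething above) are hypothetical.

```erlang
-module(loop_try).
-export([start/0, server_loop/1]).

start() ->
    spawn(?MODULE, server_loop, [0]).

%% Hypothetical stand-in for the doSomething call in the post:
%% adds N to the state, or throws on anything that is not an integer.
do_something(State, N) when is_integer(N) -> State + N;
do_something(_State, Other) -> throw({bad_request, Other}).

server_loop(State) ->
    receive
        switch_code ->
            ?MODULE:server_loop(State);
        {get, From} ->
            From ! {state, State},
            server_loop(State);
        {add, N} ->
            %% Only the risky call sits inside the try body; the try
            %% expression just yields the state to continue with.
            NewState =
                try
                    do_something(State, N)
                catch
                    throw:_Reason -> State
                end,
            %% The exception handler is gone at this point, so the
            %% recursive call is a real tail call.
            server_loop(NewState)
    end.
```

The design choice here is simply to keep the recursive call out of the protected part; on a thrown exception the loop carries on with the previous state, as in the original snippet.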

Use of an after clause

The other problem we had, we simply fixed without understanding why. (If you know, please leave me a comment!!) We have a very simple process loop, coded exactly as I explained above. Unfortunately, the process gets killed when we reload the code 3 times, even if we send the switch_code message.

The fix, suggested here but without a satisfactory explanation, consisted in adding an after clause to the receive statement:

    server_loop(State) ->
        receive
            switch_code ->
                ?MODULE:server_loop(State);
            SomeMessage ->
                ...
                server_loop(NewState);
            ...
        after
            10 * 1000 ->
                ?MODULE:server_loop(State)
        end.
Now we can update our whole application without service interruption. But this was not as simple as it first seemed (or as advertised ;-).

Update: I posted a follow-up explaining the problem we had.
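
For anyone chasing the same symptom, here is a minimal sketch of a check one could run before loading a new version of a module. It relies on the erlang:processes/0 and erlang:check_process_code/2 BIFs; the module name swap_check and both function names are made up for illustration.

```erlang
-module(swap_check).
-export([stale_processes/1, safe_to_load/1]).

%% Pids of all local processes still executing old code of Module.
stale_processes(Module) ->
    [Pid || Pid <- erlang:processes(),
            erlang:check_process_code(Pid, Module) =:= true].

%% True when no local process is still running old code of Module,
%% i.e. purging the old version would not kill anything.
safe_to_load(Module) ->
    stale_processes(Module) =:= [].
```

Calling swap_check:stale_processes(my_server) (my_server being a placeholder for your own module) right after sending switch_code shows which processes, if any, did not make the jump.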

4 comments:

Anonymous said...

If the module is getting killed after a couple of reloads, it sounds like the process isn't really getting updated.

Try adding a special version/0 function that returns a unique ID string for each module you compile. Then at the top of server_loop/1, have the code send the current module version to a log file. That way you can see exactly what code you're running.

Paul Bonser said...

Erlang only allows two versions of any given module (the current one and the old one) to be loaded at any one time.

When you load yet another version, it kills any process still running the oldest one.

My guess is that for some reason it's not ever getting the switch_code message.

Philip Robinson said...

How many processes are you running?

It sounds like you have another (lost?) process running code from that module, running version 1, that never receives the switch_code message.

All processes need to be moved from version 1 to version 2 (or killed) before you can swap to version 3.

Dominique Boucher said...

Thanks for your help, guys! I finally noticed that we had a process leak somewhere. And the 'switch_code' message was only sent to the processes that we were tracking, not the others. So the 'switch_code' message was properly handled, but only by a subset of all the processes for the given module. It's the others that prevented us from upgrading code (of course, we refuse to upgrade code if erlang:check_process_code returns true for some process). I'll post a follow-up on these issues soon.