Wednesday 23 August 2006

Sometimes where code runs is more important than what it is

Sometimes, to get the best performance from some code, you have to change the architecture.

Our application server product, Meteor Server is a complex beast. To be able to handle requests from clients concurrently, the main MeteorServer.exe process farms out those requests to a pool of worker processes. (Yes, we could switch to a single multi-threaded worker process even with VB6, but it’s a lot of effort and many existing applications may not be threadsafe, so we’d have to offer both schemes, and that’s even more effort.)

We can’t multi-thread the main MeteorServer.exe process because it’s written in VB6, and while you can make an out-of-process COM server process (a local server in COM parlance, an ‘ActiveX EXE’ in VB6 terminology) multi-threaded, you can’t make a ‘Standard EXE’ multithreaded. Oh, there are hacks, but I’m of the firm opinion that you shouldn’t subvert a technology to make it do something it wasn’t designed to do – when it goes wrong you will get no support.

A Meteor application is a COM object which exposes two methods through a dispatch (Automation) interface – well, one property and one method. The property, called VersionString, is simply to allow Meteor to pick up and display version information for the application. Every other piece of interaction is done through the TerminalEvent method, which receives a couple of interface pointers to allow it to call back into Meteor, a flag indicating whether this is a new client, a numeric event type indicating what the user’s last action was, and a string representing any event data. The application then calls methods on the interface to accumulate a batch of commands to be sent to the client – things like clearing the screen, setting the text colour, displaying text at a given location, sending a menu of options, defining an entry field. When one of these methods is called, it’s turned into an on-the-wire format, with an operation code and a wire-representation of the parameters. When the application returns from TerminalEvent, the server sends the complete batch to the client.

When I first started working on Meteor Server, when the application called a command-generating method, the stub of code in the interface made a call back into the MeteorServer.exe process to perform the wire-format translation. This meant it had to wait for the server process to finish whatever it was doing and go back to waiting for a window message. This made the server process a serious bottleneck – it was a ‘chatty’ interface, which is really not advisable across process boundaries. About three years ago, I looked at the code and realised it actually had no dependencies on any data in the server process, and had the idea to move this formatting code into the worker process that the application object was running in, to improve both performance and scalability. The commands would be batched up in the worker process then only sent across to the server when the batch was complete.

About a year later I actually made the change – we were seeking a significant performance improvement at the time. I don’t have a record of the performance change but I think it was some decent multiple – 3x or so the transaction rate.

About this time last year, or a little before, I was asked to add a new feature. Meteor provides a session state storage object which can store arbitrary strings that the application sets. The new feature was to allow an application to copy the session state data from another session to its own – this allows a user to resume their work on a different client, for example if the hardware is damaged or otherwise fails. I initially put the extra code directly in the batch-retire method that the worker process calls on completing a request, adding a new parameter, but when testing for performance, discovered that the simple test to see if the session should be transferred caused a regression of about 10%, and that would mean the difference between exceeding and failing to meet the performance requirement for a different customer.

The solution was to make this a separate method that the worker process would call if required, and to take it out of the mainline, returning the batch-retire method to its previous implementation. On doing this, the regression had gone – we were back up to almost exactly the same performance level we’d had before.

Know the environment in which your code has to run.

No comments: