Saturday 27 November 2010

The mysterious case of the slow Alpha

One of the questions I like to ask candidates during the interview process is, "What is the most memorable problem you have had to solve?" Often I am disappointed in the answers I get, and that's with candidates with 10+ years of experience who I would have expected to have come across some gnarly problem along the way.

Thinking about the question from my own experience, I have several answers, but one in particular, now over 10 years old, I thought was worth documenting before it fades from my memory.

In the late 90's and early 00's I worked in a startup in the telephony space, or more correctly the computer telephony integration space. It was a niche that was growing fast and one we were doing very well in. We had secured a contract with DEC (just as they merged with Compaq, which has itself since merged with HP) to port our software onto DECUnix (later Tru64). We had been running on Intel-based systems for some time using SCO Unix, Linux and Solaris. I had joined the team just after the port had been done; we had a couple of systems live and a couple running in labs.

Some concerns about performance were raised during integration testing with a client, and we embarked on some benchmarking. We pitted a brand new DS-20 on loan from DEC against our latest and greatest PIII-500. In theory the DS-20 should have left the P-III in its dust. But it didn't. It didn't even come close to matching it; in fact, in some tests it came in almost 50% slower.

There was much gnashing of teeth. We were relying on the higher performance of the DS-20 to drive our SS7/ISUP solution into new markets at a scale our competitors couldn't match.

I was still pretty early in my career, wet behind the ears if you like. I set about looking at what changes had been made during the port, and after a few days could not find any significant ones; the changes that had been made I micro-benchmarked, and couldn't find any issue. We tried tuning the DECUnix kernel, we spoke to Compaq and adjusted everything to what was considered optimal for our situation. No real change.

Tempers were rising, the client was threatening to pull the plug, and my boss and I were spending hours on end stepping through code, testing and tuning, all to no avail. Eventually our CTO rang Compaq and gave them a barrage about how **** their Alphas were and how we were so disappointed with their performance that we were going to drop our port and withdraw from the contract. After some to-ing and fro-ing, Compaq flew an engineer out to take a look at our benchmarks and see if they could help us out.

So the engineer arrived. I walked him through our tests, ran the benchmark and showed him the outputs. He looked over the tuning we'd done and agreed we'd done all the right things. I asked if he'd like to see the code; he said not yet, let's profile it first. Profile? I'd not really thought of doing that on the Alpha (we'd profiled on the Intel), so how could we do that on DECUnix? He spent a few minutes showing me how, I did a fresh build with profiling enabled, and then we re-ran the benchmark with profiling on. As expected it was slower still, but we got a profile out of it.

Right, let's take a look.

The engineer looked at the results and within 30 seconds said, 'Any reason you are opening /dev/null 11 million times during that test?'... huh, WTF?!

I looked through the application code: no references to '/dev/null', so it had to be coming from somewhere else. Some find, strings and grep later, we found the culprit: a shared library with not much in it, and it wasn't a DEC library. A search through our CVS repository turned up a small module written by my predecessor, who had left a few days after I started, just after he'd finished setting up the new server.

In that module was a single function, snprintf... and a lot of comments abusing DEC for not including this function in their standard libraries.

And in that function was the reference to /dev/null.

He had implemented his own snprintf, and thought he'd done a smart job. He was using the fact that fprintf returns the length of what it has written: to work out how long the formatted output would be, he would open /dev/null, fprintf to it, take the returned length, close /dev/null, and then use that length to decide whether the output needed truncating before calling sprintf. Oh crap.
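
For anyone who hasn't seen the trick, here is roughly what the function looked like. This is a reconstruction rather than the original code (the name is mine, and I've used the v* variants of the printf family to forward the varargs), but the /dev/null dance is as described:

    /* Hypothetical reconstruction of the homegrown snprintf described above.
     * Not the original code: the name and structure are my guesses, but the
     * /dev/null trick is as described. Error handling is minimal. */
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int my_snprintf(char *buf, size_t size, const char *fmt, ...)
    {
        va_list ap;
        int len;

        /* Open /dev/null on every single call, purely to learn how long the
         * formatted output would be. This is where the 11 million opens
         * came from. */
        FILE *devnull = fopen("/dev/null", "w");
        if (devnull == NULL)
            return -1;

        va_start(ap, fmt);
        len = vfprintf(devnull, fmt, ap);   /* fprintf reports the length written */
        va_end(ap);
        fclose(devnull);

        if (len < 0)
            return len;

        if ((size_t)len < size) {
            /* It fits: format straight into the caller's buffer. */
            va_start(ap, fmt);
            vsprintf(buf, fmt, ap);
            va_end(ap);
        } else if (size > 0) {
            /* It would overflow: format into a scratch buffer, then truncate. */
            char *tmp = malloc((size_t)len + 1);
            if (tmp == NULL)
                return -1;
            va_start(ap, fmt);
            vsprintf(tmp, fmt, ap);
            va_end(ap);
            memcpy(buf, tmp, size - 1);
            buf[size - 1] = '\0';
            free(tmp);
        }
        return len;
    }

Functionally it even gives the right answers. The problem is that every formatted string in the hot path pays for an fopen(), a formatted write to /dev/null and an fclose(), and that benchmark formatted a very large number of strings.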

Some quick hacking, a fresh build, and bang: our Alpha was now screaming along at just under twice the speed of the Intel box.

Red-faced, I thanked the engineer and slunk off to explain to the CTO where the problem had been. A day or so later we shipped a new version to our client using the GNU library instead of our insane one.

Contracts saved, faces red and a lot learnt.

3 comments:

Olle Kullberg said...

Thanks Glen. Nice to hear a real-life story about the essence of IT work: begging for forgiveness :-)

I recognize this pattern:

* Before you know the reason, the behavior of the system is "impossible".

* After you know the reason, the behavior is "obvious".

And as soon as "impossible" has turned into "obvious", the whole thing is forgotten (could have something to do with beer consumption at the release party), i.e. no lesson learned.

Glen said...

Hi Olle, yes, it is an all too common pattern.

I think that's one of the reasons that Dan North's concepts around Deliberate Discovery resonate so well with many people.

Christian Catchpole said...

Once we wrote an HTTPS session proxy which worked all sorts of magic. Except that for some reason we could never proxy Amazon. Weeks of debugging to discover... a POST request with ? GET parameters. Our proxy was TOO compliant. But my favourite, and quick to fix, was a junior dev who asked me "why are my 5 lines of code taking 20 seconds to open a 100K file?". It was a character-by-character read loop with: string = string + character. Sigh... :)
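
(Editor's aside: the trap Christian describes is that string = string + character copies everything accumulated so far on every append, so reading an N-byte file costs O(N^2) work. A rough sketch of the slow and fast versions, rendered in C for illustration since the original code isn't shown; the names and buffer sizes here are made up:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Slow: the C equivalent of string = string + character per byte.
     * Each append re-measures and re-copies the whole string so far. */
    char *read_file_slow(FILE *f)
    {
        char *s = calloc(1, 1);              /* start with "" */
        int c;
        while ((c = fgetc(f)) != EOF) {
            size_t len = strlen(s);          /* walk the whole string... */
            char *bigger = malloc(len + 2);
            memcpy(bigger, s, len);          /* ...then copy it all again */
            bigger[len] = (char)c;
            bigger[len + 1] = '\0';
            free(s);
            s = bigger;
        }
        return s;
    }

    /* Fast: grow one buffer geometrically and append in place. */
    char *read_file_fast(FILE *f)
    {
        size_t len = 0, cap = 4096;
        char *s = malloc(cap);
        int c;
        while ((c = fgetc(f)) != EOF) {
            if (len + 1 >= cap) {
                cap *= 2;
                s = realloc(s, cap);
            }
            s[len++] = (char)c;
        }
        s[len] = '\0';
        return s;
    }

The fix in a managed language is the same idea: append into a growable buffer, e.g. a StringBuilder, instead of rebuilding the string each time.)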