“To infinity and beyond” and how to avoid a whoopsie.

I was reminded of how hard it is to predict the capacity needed for a workload when I read the news about the UK government website where you could enter your postcode and it told you what level of lockdown you were in.  It crashed under the workload.  It should have been clear that within minutes of the website being announced, a few million people would try to use it.  (It might have been better to have a static web page with all the information rather than try to provide a database lookup.)

Someone told me of a US company whose marketing company had booked a 1 minute commercial in the interval of the US Super Bowl.  They told the IT department about it a week before the game!   The audience for the game is about 100 million people.  If 1% of these people click on a web site within 2 minutes of the advert, this is roughly 10,000 hits per second (1,000,000 hits in 120 seconds is about 8,300 a second)!  The typical web activity for the company was about 50 hits a second.   After the initial cries of disbelief, the IT department, with the help of a large multinational IT company, got the additional capacity and the hardware to do load balancing, and got through the night.
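The back-of-the-envelope sum for the advert can be written as a few lines of Python. This is only a sketch of the arithmetic, using the figures quoted in the story; the function name and the assumption that the clicks are spread evenly over the two minutes are mine, not anything the company used.

```python
def hits_per_second(audience, click_fraction, window_seconds):
    """Average arrival rate if a fraction of the audience clicks within the window.

    Assumes the clicks are spread evenly over the window, which is optimistic:
    in practice most of them would arrive in the first few seconds.
    """
    return audience * click_fraction / window_seconds


# Figures from the story: 100 million viewers, 1% click within 2 minutes.
advert_rate = hits_per_second(100_000_000, 0.01, 120)

# The company's typical load was about 50 hits a second.
normal_rate = 50
ratio = advert_rate / normal_rate

print(f"{advert_rate:,.0f} hits a second, about {ratio:.0f} times the normal load")
```

Even with the generous assumption of an even spread, the advert needs over a hundred times the usual capacity, which explains the cries of disbelief.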

One bank said they took the average transaction rate and tested to twice this.  There was a discussion about what the average rate means.   Over a period you have highs and lows.  There is a rule of thumb (I don’t know whose thumb) which says that the peak is typically 3 times the average.  Within a peak period (for example 1 hour), looking second by second, there will be peaks within peaks.  The rule of thumb said you should plan to support 3 * 3 = 9 (call it 10) times the sustained average.
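To see how far apart "twice the average" and the rule of thumb are, here is a small sketch. The two factors of 3 come from the rule of thumb above; the average of 100 transactions a second is a made-up figure for illustration, not the bank's real number.

```python
def planned_capacity(sustained_average, peak_factor=3, burst_factor=3):
    """Target rate under the rule of thumb.

    peak_factor:  how much busier the peak period is than the average.
    burst_factor: how much busier the worst second is than the peak period.
    """
    return sustained_average * peak_factor * burst_factor


average_tps = 100  # hypothetical sustained average, transactions a second

tested_at = 2 * average_tps                     # the bank's "twice the average" test
rule_of_thumb = planned_capacity(average_tps)   # 3 * 3 = 9 times the average

print(f"Tested at {tested_at} a second; rule of thumb says {rule_of_thumb} a second")
```

With these numbers the test covers less than a quarter of what the rule of thumb says the system should be able to handle.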

This bank then worked with IBM to replicate the environment within IBM and run a test to see where the bottlenecks and snags were, and ramped up the workload until they met their targets.  I remember looking at the MQ data and finding the same snags in MQ as we had spotted when we looked at their MQ system a couple of years before.  Logs were not striped, and they had badly tuned buffer pools.

Another customer’s test system had more capacity than the production system, so they tested weekly at production volumes + 25%. Many customers’ test systems are much smaller than production and operate on the pray system, where they hope and pray they will not have a problem.
