Fine-tuning the daily grind of computer systems can be the fix they need
At some stage, if you’re lucky, you get to work for an organisation that has an on-site coffee machine. A real coffee machine that grinds the coffee, uses real milk and is a place to gather and talk about the day’s events. Even if the coffee isn’t great, there is still the spirit of ‘if you build it, they will come’, making these coffee machines well used. You’re probably wondering what an article about performance – aside from the fact that performance analysts pretty much run on coffee – has to do with coffee machines. Stay with me.
Because the machine is easily accessible and the coffee is okay, we have a problem: there are typically more coffee addicts than coffee machines. Doubly concerning, our coffee machine has had a few issues and now takes considerably more time to grind and brew than it used to.
Invariably, as with anything else, if there is more demand than there is capability to service that demand, a queue forms – and as the old adage goes, ‘time is money’. In this case, we’re waiting for a coffee machine, but it equally applies to waiting for a checkout in a supermarket, on hold with a call centre or waiting at the doctors.
This process isn’t too dissimilar to what you can see in computer systems. When you have a limited resource with too many things to process, you get a bottleneck – which usually manifests itself as ‘slowness’. In the case of our coffee machine at peak, you’re waiting for about eight minutes in the queue – and trust me, when you need coffee, that is a very long wait.
Our experience in the supermarket shows us that the more checkouts are open, the less time you spend queuing. Introducing an express lane or self-service checkouts also decreases the time you spend in a queue, because the time it takes to process each order is reduced. Essentially, you've added a different class of queue (supermarkets typically have multiple queues and multiple checkouts so the behaviour is actually a little different).
So if you are looking at reducing the time staff like me spend queuing, then the natural instinct is to either slowly wean us off caffeine or buy more coffee machines. If you are going to buy more machines, how many is enough? Is there a point at which buying more stuff starts to pay off less and less?
So as a bit of a challenge one day, while queuing for my morning brew, I decided to take down a few observations.
Firstly, I wondered what the queue looks like to someone joining it. Well, we have one queue and potentially multiple machines. Based on that we get the following diagram:
Then we can make a few measurements:
| Measure | Symbol / Formula | Value | Description |
| --- | --- | --- | --- |
| Arrival Rate | X | 0.5300 | People per minute joining the queue |
| Service Time | S | 1.5900 | Minutes to make a coffee |
| Utilisation | U = XS | 0.8427 | Probability of a machine being used |
| Service Centres | m | 1 | Coffee machines |
| Load | p = U/m | 0.8427 | Utilisation per coffee machine |
| Approximate Residence Time | R = S/(1 - p^m) | 10.1081 | Minutes to wait for a machine to become free and make a coffee |
| Wait Time | W = R - S | 8.5181 | Minutes spent in the queue |
The main things that we observe while waiting are the Arrival Rate (denoted by X), the Service Time (S) and the number of queuing centres, aka coffee machines (m).
(For a detailed treatment of how to attain and derive the queuing formulae, refer to: Gunther, Neil J., The Practical Performance Analyst (2000), McGraw-Hill.)
By the Utilisation Law (a close cousin of Little’s Law), Utilisation is throughput multiplied by service time; dividing that by the number of queuing centres gives the load, the probability of any one queuing centre being busy. The other concept to introduce is Residence Time: the total time you spend in the system. In this case, that is the time you spend waiting to get to the coffee machine plus how long it takes to make the coffee. All figures used in the calculations are based on averages taken at peak load.
According to the calculations, we are waiting (W) for 8.5181 minutes, which is fairly close to the observation of spending ‘about eight minutes’ in the queue.
Two things are in play here. First, the larger the value of m becomes, the lower the load on each machine – that is, the less chance there is that the coffee machine you are going to will be in use. Second, the lower the probability of a coffee machine being used, the lower your Residence Time and, with it, your Wait Time.
It’s tempting to assume then that the more coffee machines you have, the less time you’ll spend queuing. And that is true, up to a point. Fortunately, implementing the calculations in Excel is pretty simple and quickly shows that the relationship between the number of coffee machines and the residence time isn’t linear:
Looking at the actual numbers from calculating out the variables, the biggest drops in our Residence Time occur when we go from one to two coffee machines and then from two to three. Every additional machine after that shows a diminishing return:
| Coffee Machines (m) | Load (p = U/m) | R (minutes) | W = R - S (minutes) |
| --- | --- | --- | --- |
| 1 | 0.8427 | 10.1081 | 8.5181 |
| 2 | 0.4214 | 1.9332 | 0.3432 |
| 3 | 0.2809 | 1.6260 | 0.0360 |
| 4 | 0.2107 | 1.5931 | 0.0031 |
| 5 | 0.1685 | 1.5902 | 0.0002 |
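Running the same model in Python rather than Excel makes the diminishing returns easy to see, one line per machine count:

```python
# Residence and wait time as the number of coffee machines grows.
X, S = 0.53, 1.59           # peak arrival rate and service time from above
U = X * S                   # utilisation

for m in range(1, 6):       # one to five coffee machines
    p = U / m               # load per machine
    R = S / (1 - p**m)      # approximate residence time
    W = R - S               # time spent queuing
    print(f"m={m}: load={p:.4f}, R={R:.4f} min, W={W:.4f} min")
```

The big drops come between one and two machines and between two and three; after that the residence time is already hugging the 1.59-minute service time.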
The more Service Centres (m) you have, the closer your Residence Time gets to your Service Time, until eventually there is no queuing at peak. Practically, though, when you are looking at dropping several thousand on a coffee machine, it’s a good idea to look at where the return in saved time is outweighed by the cost. Adding more hardware to a problem will help, but only up to a point. With our coffee machines, adding one more will help considerably. After that, you’re probably better off making the coffee-making process itself more efficient: reduce the time it takes to grind, brew and pour and you cut down on the 1.59 minutes that we are always going to spend in the system (implement the model in Excel and change the Service Time up and down; the results are quite surprising).
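That service-time experiment doesn’t need Excel either. Holding two machines fixed and varying S (the grind-brew-pour time) in the same Python sketch shows how strongly the residence time responds; the candidate service times here are illustrative, not measured:

```python
# Sensitivity of residence time to the service time, with two machines.
X, m = 0.53, 2                      # peak arrival rate, two coffee machines

for S in (1.00, 1.25, 1.59, 2.00):  # hypothetical service times in minutes
    p = (X * S) / m                 # load per machine
    R = S / (1 - p**m)              # approximate residence time
    print(f"S={S:.2f} min -> R={R:.4f} min")
```

Shaving even a fraction of a minute off the brew pays back on every single coffee, which is why tuning the service time often beats buying another machine.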
The same applies to computer systems. Sure, adding more hardware will probably help in the sense that it will buy you more time before you hit contention for resources. But there is a trade-off in terms of cost and efficiency. A poorly written query, a badly indexed database or a host of expensive out-of-process calls to different components will benefit more from tuning than throwing more servers, CPU and memory at the problem – and it will be cheaper and hopefully more scalable in the long run.
As a final thought, we did get another coffee machine and, sure enough, the Residence Time in the system dropped considerably, just as the model predicted. Until, that is, we realised that it makes really, really bad coffee.