What should I monitor for MQ on z/OS – buffer pool

For the monitoring of MQ on z/OS, there are a couple of key metrics you need to keep an eye on for the buffer pools, as you sit watching the monitoring screen.

A quick overview of MQ buffer pools

An inefficient queue manager could have all messages stored on disk, in page sets.  An efficient queue manager would cache hot pages in memory so they can be accessed without doing any I/O.

The MQ Buffer Manager component does this caching, by using buffers  (so no surprise there).

A simplified view of the operation is as follows

  1. Messages are broken down into 4KB pages.
  2. When the contents of a page set page are needed and the page is not already in a buffer, the buffer manager reads it from disk into a buffer.
  3. If a page is requested but its contents are not required (for example it is about to be overwritten as part of a put), the buffer manager does not need to read it from disk.
  4. If the page is updated, for example a new message is put, or a field in the message is updated during get processing, the page is not usually written to disk immediately.  The write to disk is deferred, using a process called the Deferred Write Process (another non-surprise in the naming convention).  This has the advantage that there can be many updates to a page before it is written to disk.
  5. Any buffer page which has not been changed, and is not currently being used, is a free page.

If the system crashes, non persistent messages are thrown away, and persistent messages can be rebuilt from the log.

To reduce the restart time, pages containing persistent data are written out to the page set periodically.   This is driven by the log data set filling up, which causes a checkpoint.  Updates which have been around for more than 2 checkpoints are written to disk.  At restart the queue manager then knows how far back in the log it needs to go.

In both cases, a checkpoint or the buffer pool filling up (less than 15% free, that is more than 85% in use), once a page has been successfully written to the page set the buffer is marked as free.

Pages for non-persistent messages can be written out to disk.

If the buffer pool is critically short of free buffers, with less than 5% free buffers, then pages are written to the page set immediately rather than using the deferred write process.  This allows some application work to be done while the buffer manager is trying to make more free pages available.
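As a rough illustration, the decision looks something like the sketch below.  This is not the real buffer manager code, just a simplified model of the 15% and 5% free-page boundaries described above; the function name and the sample numbers are mine.

#include <stdio.h>

/* Simplified model of the thresholds described in the text.            */
/* 15% and 5% free come from the text; names and values are made up.    */
static const char *writeStrategy(int freePages, int totalPages) {
    double pctFree = 100.0 * freePages / totalPages;
    if (pctFree > 15.0)
        return "no action needed - only checkpoint writes";
    else if (pctFree > 5.0)
        return "deferred write process writes changed pages out";
    else
        return "immediate (synchronous) writes to the page set";
}

int main(void) {
    int total = 50000;                     /* buffer pool size in pages          */
    int samples[] = {20000, 6000, 1500};   /* free pages at three points in time */
    for (int i = 0; i < 3; i++)
        printf("%6d free of %d : %s\n", samples[i], total,
               writeStrategy(samples[i], total));
    return 0;
}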

What is typical behaviour?

The buffer pool is working well.

When a message is got, the page is usually already in the buffer pool so there is no I/O to read it from the page set.

The buffer pool is less than 85% full (it has more than 15% free pages).  There is periodic I/O to the page set because pages with persistent data are written to the page set at checkpoint.

The buffer pool fills up and has more than 85% in-use pages.

This can occur if the number and size of the messages being processed is bigger than the size of the buffer pool. This could be a lot of messages in a unit of work,  big messages using lots of pages, or lots of transactions putting and getting messages.  It can also occur when there are lots of puts, but no gets.

If the buffer pool has between 85% and 95% in-use pages (between 15% and 5% free pages), the buffer manager is working hard to keep free pages available.

There will be intermittent I/O at checkpoints, and steady I/O as the buffer manager writes pages to the page set.

If messages are being read from disk, there will be read activity from the page set, but a buffer can be reused as soon as the data has been copied out of it.

The buffer pool has less than 5% free pages.

The buffer manager is in overdrive.  It is working very hard to keep free pages in the buffer pool.   There will be pages written out to the page set as it tries to increase the number of free pages.  Gets may require read I/O from the page set.  All of this I/O can cause contention, the page set I/Os slow down, and so MQ API requests using this buffer pool slow down.

What should be monitored

Most cars these days have a "low fuel" light and a "take me to a garage" light.  For monitoring we can provide something similar.

Monitor the % buffer pool full (a minimal sketch of this classification follows the list).

  • If it is below 85%, things are OK.
  • If it is between 85% and 95%, this needs to be monitored; it may be "business as usual".
  • If it is >= 95%, this needs to be alerted.  It can be caused by applications or channels not processing messages.
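Here is the minimal sketch mentioned above, assuming you already have the % in-use figure from your monitor; the function name and the wording of the messages are my own.

#include <stdio.h>

/* Map "% of buffer pool pages in use" onto the traffic-light levels     */
/* described in the list above.  Thresholds are from the text.           */
static const char *bufferPoolAlert(double pctInUse) {
    if (pctInUse < 85.0) return "green - OK";
    if (pctInUse < 95.0) return "amber - monitor, may be business as usual";
    return "red - alert: applications or channels may not be processing messages";
}

int main(void) {
    double samples[] = {40.0, 88.0, 97.5};     /* made-up sample readings */
    for (int i = 0; i < 3; i++)
        printf("%5.1f%% in use -> %s\n", samples[i], bufferPoolAlert(samples[i]));
    return 0;
}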

Monitoring the number of pages written does not give much information.

It could be caused by checkpoint activity, or because the buffer pool is filling up.

Monitoring the number of pages read from the page set can provide some insight.

If you compare today with this time last week you can check the profile is similar.

If the buffer pool is below 85% used,

  • Messages being got are typically in the buffer pool so there is no read I/O.
  • If there is read I/O this could be for messages which were not in the buffer pool – for example reading the queue after a queue manager restart.

If the buffer pool is more than 85% in-use and less than 95% in-use, this can be caused by a large message workload coming in, and MQ being "elastic" and queuing the messages.  Even short lived messages may be read from disk.  The number of read I/Os gives an indication of throughput.  Compare this with the previous week to see if the profile is similar.

If the buffer pool is more than 95% in-use this will have an impact on performance, as every page processed is likely to need I/O to the page set, with an elongated I/O response time due to the contention.

What to do

You may want “operators notes” either on paper or online which describe the expected status of the buffer pools on individual queue managers.

  1. PROD QMPR
    1. BP 1 green less than 85% busy
    2. BP 2 green less than 85% busy
    3. BP 3 green, except for Friday night when it goes amber with a read I/O rate of 6000 I/Os per minute.
  2. TEST QMTE
    1. BP 1 green less than 85% busy
    2. BP 2 green less than 85% busy
    3. BP 3 usually amber – used for bulk data transfer

What do the buffer statistics mean?

There are statistics on the buffer pool usage; a small sketch after the list shows how the key fields can be used.

  1. Buffer pool number.
  2. Size of the buffer pool at the time the data was produced.
  3. Low – the lowest number of free pages in the interval.  100 * (Size – Low)/Size gives you the peak % full.
  4. Now – the current number of free pages in the interval.
  5. Getp – the number of requests ‘get page with contents’.  If the page is not in the buffer pool then it is read from the page set.
  6. Getn.   A new page is needed.   The contents are not relevant as it is about to be overwritten.  If the page is not in the buffer pool, just allocate a buffer, and do not read from the page set.
  7. STW – set write intent.  This means the page was got for update (I have not seen a use for this counter).  For example:
    1. A put wants to put part of a message on the page
    2. A get is being done and it wants to set the “message has been got” flag.
    3. The message has been got, and so pointers to a page need to be updated.
  8. RIO – the number of read requests to the page set.  If this is greater than zero, then either:
    1. The request is for messages which have not been used since the queue manager started
    2. The buffer pool had reached 85%, pages had been moved out to the page set, and the buffer has been reused.
  9. WIO – the number of write I/Os that were done.  These writes could be due to a checkpoint, or because the buffer pool filled up.
  10. TPW – total pages written, a measure of how busy the buffer pool was.   These writes could be due to a checkpoint, or because the buffer pool filled up.
  11. IMW – immediate write.  I have not used this value; sometimes I observe it is high, but it has not been useful.  It can be caused by
    1. the buffer pool being over 95% full, so all application write I/O is synchronous,
    2. or a page was being updated during the last checkpoint, and it needs to be written to the page set when the update has finished.  This should be a small number.  Frequent checkpoints (eg every minute) can increase this value.
  12. DWT – the number of times the Deferred Write processor was started.  This number has little value.
    1. The DWP could have started and been so busy that it never ended, so this counter is 1.
    2. The DWP could have started, written a few pages and stopped – and done this repeatedly, in this case the value could be high.
  13. DMC.   The number of times the buffer pool crossed the 95% limit.  If this is > 0 it tells you the buffer pool crossed the limit
    1. This could have crossed just once, and stayed over 95%
    2. This could have gone above 95%, then below 95% etc.
  14. SOS – the buffer pool was totally short on storage – there were no free pages.  This is a critical indicator.  You may need to do capacity planning, and make the buffer pool bigger, or see if there was a problem where messages were not being got.
  15. STL – the number of times a "free buffer" was reused.  A buffer was associated with a page of a page set; the buffer has been reused and is now for a different page.  If STL is zero, it means all the pages that were used were already in the buffer pool.
  16. STLA – A measure of contention when pages are being reused.  This is typically zero.
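Here is the small sketch mentioned above, showing how a few of these fields could be turned into the numbers worth looking at.  The struct, field names and values are my own and are only illustrative; the real SMF record layout is different.

#include <stdio.h>

/* Illustrative subset of the buffer pool statistics listed above. */
struct BPStats {
    int size;   /* buffer pool size in pages when the data was produced */
    int low;    /* lowest number of free pages in the interval          */
    int sos;    /* times the pool had no free pages at all              */
    int dmc;    /* times the pool crossed the 95% in-use limit          */
};

int main(void) {
    struct BPStats bp = {50000, 4000, 0, 2};   /* made-up sample values */

    /* peak usage in the interval: 100 * (Size - Low) / Size */
    double pctFull = 100.0 * (bp.size - bp.low) / bp.size;
    printf("Peak usage in interval: %.1f%% full\n", pctFull);

    if (bp.sos > 0)
        printf("SOS > 0: the pool ran out of free pages - critical\n");
    if (bp.dmc > 0)
        printf("DMC > 0: the pool crossed the 95%% limit %d time(s)\n", bp.dmc);
    return 0;
}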

Summary

Now you know as much as I do about buffer pools, you'll see that the % full (or % free) is the key measure.  If the buffer pool is > 85% in-use pages, then the I/O rate is a measure of how hard the buffer manager is running.

What should I monitor for MQ on z/OS – logging statistics

For the monitoring of MQ on z/OS, there are a couple of key metrics you need to keep an eye on for the logging component, as you sit watching the monitoring screen.

I’ll explain how MQ logging works, and then give some examples of what I think would be key metrics.

Quick overview of MQ logging

  1. MQ logging has a big (sequential) buffer for logging data, which wraps.
  2. An application does an MQPUT of a persistent message.
  3. The queue manager updates lots of values (eg queue depth, last put time) as well as moving the message data into the queue manager address space.  This data is written to log buffers; a 4KB page can hold data from many applications.
  4. An application does an MQCOMMIT.  MQ takes the log buffers up to and including the current buffer and writes them to the current active log data set.  Meanwhile other applications can write to other log buffers.
  5. The I/O finishes and the log buffers just written can be reused.
  6. MQ can write up to 128 pages in an I/O.  If there are more than 128 buffers to write, there will be more than one I/O (see the sketch after this list).
  7. If application 1 commits and the I/O starts, and then application 2 commits, the I/O for the commit in application 2 has to wait for the first set of disk writes to finish before its write can occur.
  8. Eventually the active log data set fills up.  MQ copies this active log to an archive data set.  This archive can be on disk or tape.   The archive data set may never be used again in normal operation; it may be needed for recovery of transactions or after a failure.   The active log which has just been copied can now be reused.
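Here is the sketch mentioned in item 6: the number of log write I/Os for a given number of filled pages is just a ceiling division by 128.  The page counts are made-up examples.

#include <stdio.h>

/* Up to 128 pages go out in one log write I/O. */
int main(void) {
    int pagesToWrite[] = {1, 100, 128, 129, 1000};
    for (int i = 0; i < 5; i++) {
        int nIO = (pagesToWrite[i] + 127) / 128;   /* ceiling of pages/128 */
        printf("%4d pages -> %d log write I/O(s)\n", pagesToWrite[i], nIO);
    }
    return 0;
}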

What is interesting?

Displaying how much data is logged per second.

Today       XXXXXXXXXXXXXXXXXXXX
Last week XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Yesterday XXXXXXXXXX          
      0                     100MB/Sec    200 MB/Sec

This shows that the logging rate today is lower than last week.   This could be caused by

  1. Today is just quieter than last week
  2. There is a problem and there are fewer requests coming into MQ.   This could be caused by problems in another component, or a problem in MQ.    When using persistent messages the longest part of a transaction is the commit and waiting for the log disk I/O.  If this I/O is slower it can affect the overall system throughput.
  3. You can get the MQ log I/O response times from the MQ log data (see the next section), to check whether slow log I/O is the cause.

Displaying MQ log I/O response time

You can break down where time is spent in doing I/O into the following areas:

  1. Scheduling the I/O – getting the request into the I/O processor on the CPU
  2. Sending the request down to the disk controller (eg 3990)
  3. Transferring data
  4. The I/O completes and sends an interrupt to z/OS; z/OS has to catch this interrupt and wake up the requester.

 Plotting the I/O time does not give an entirely accurate picture, as the time to transfer the data depends on the amount of data to transfer.  On a well run system there should be enough capacity so the other times are constant.    (I was involved in a critical customer situation where the MQ logging performance “died” every Sunday morning.   They did backups, which overloaded the I/O system).

In the MQ log statistics you can calculate the average I/O time (a small calculation sketch follows the list below).  There are two sets of data for each log:

  1. The number of requests, and the sum of the times of the requests, to write 1 page.  This should be pretty constant, as the data is for when only one 4KB page was transferred.
  2. The number of requests, and the sum of the times of the requests, to write more than 1 page.  The average I/O time will depend on the amount of data transferred.
  • When the system is lightly loaded, there will be many requests to write just one page. 
  • When big messages are being processed (over 4 KB) you will see multiple pages per I/O.
  • If an application processes many messages before it commits you will get many pages per I/O.   This is typical of a channel with a high achieved batch size.
  • When the system is busy you may find that most of the I/Os write more than one page, because many requests to write a small amount of data fill up more than one page.
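Here is the calculation sketch mentioned above.  The variable names are my own, and the sample counts and the use of microseconds are assumptions; the real statistics may be in different units.

#include <stdio.h>

/* Average log write I/O times from the two pairs of counters described above. */
int main(void) {
    long onePageCount = 40000, onePageTimeUs = 12000000;   /* sample values */
    long multiCount   = 15000, multiTimeUs   =  9000000;

    double avgOne   = (double)onePageTimeUs / onePageCount;
    double avgMulti = (double)multiTimeUs   / multiCount;
    double avgAll   = (double)(onePageTimeUs + multiTimeUs)
                    / (onePageCount + multiCount);

    printf("1-page writes : %.0f us average (should be fairly constant)\n", avgOne);
    printf(">1-page writes: %.0f us average (varies with data transferred)\n", avgMulti);
    printf("all writes    : %.0f us average (loses the 1-page detail)\n", avgAll);
    return 0;
}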

I think displaying the average I/O times would be useful.   I haven't tried this in a customer environment (as I don't have a customer environment to use).    So if the data looks like

Today         XXXXXXXXXXXXXXXXXXXXXXXX
Last week     XXXXXXXXXXXXXXXXXXXXXXXXXXXXX  
One hour ago XXXXXXXXXXXXXXXXXXX
time in ms 0 1 2 3

it gives you a picture of the I/O response time.

  • The dark green is for I/O with just one page, the size of the bar should be constant.
  • The light green is for I/O with more than one page, the size of the bar will change slightly with load.  If it changes significantly then this indicates a problem somewhere.

Of course you could just display the total I/O response time = (total duration of I/Os) / (total number of I/Os), but you lose the extra information about the writing of 1 page.

Reading from logs

If an application using persistent messages decides to roll back:

  • MQ reads the log buffers for the transaction’s data and undoes any changes.
  • It may be the data is old and not in the log buffers, so the data is read from the active log data sets.
  • It may be that the request is really old (for example half an hour or more), in which case MQ reads from the archive logs (perhaps on tape).

Looking at an application doing a roll back, and having to read from the log (a small classification sketch follows this list):

  • Reading from buffers is OK.   A large number indicates an application problem or a Db2 deadlock type problem.  You should investigate why there is so much rollback activity.
  • Reading from active logs should be infrequent.  It usually indicates an application coding issue where the transaction took too long before committing, perhaps due to a database deadlock, or bad application design (where there is a major delay before the commit).
  • Reading from archive logs is really bad news.  This should never happen.
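Here is the classification sketch mentioned above.  The function name, the wording, and the idea of feeding it three counters are my own.

#include <stdio.h>

/* Turn the three log-read counters into the "buffer OK / active log   */
/* suspicious / archive never" interpretation from the list above.     */
static const char *logReadStatus(long fromBuffer, long fromActive, long fromArchive) {
    if (fromArchive > 0) return "red    - reads from archive logs: investigate now";
    if (fromActive  > 0) return "orange - reads from active logs: long running units of work?";
    if (fromBuffer  > 0) return "green  - rollbacks satisfied from the log buffers";
    return "green  - no log reads";
}

int main(void) {
    printf("%s\n", logReadStatus(25, 0, 0));   /* made-up sample counters */
    printf("%s\n", logReadStatus(25, 3, 0));
    printf("%s\n", logReadStatus(25, 3, 1));
    return 0;
}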

Displaying reads from LOGS

Today         XXXXXXXXXXXXXXXXXXXXXXXX
Last week     X
One hour ago  XXXXX
rate          0        10    20     40

Where green is "read from buffer", orange is "read from active log", and red is "read from archive log".  Today looks like a bad day.

Should I monitor MQ – if so, what for?

I’ve been talking to someone about using the MQ SMF data, and when would it be useful. There is a lot of data. What are the important things to watch for, and what do I do when there is a problem?

Why monitor?

From a performance perspective there are a couple of reasons why you should monitor

Today’s problems

Typical problems include

  1. “Transaction slow down”, people using the mobile app are timing out.
  2. Our new marketing campaign is a great success – we have double the amount of traffic, and the backend system cannot keep up.
  3. The $$ cost of the transactions has gone up.   What's the problem?

With problems like transaction slow down, the hard part is often to determine which is the slow component.  This is hard when there may be 20 components in the transaction flow, from front end routers, through CICS, MQ, DB2, IMS, and WAS, and the occasional off-platform request.

You can often spot problems because work is building up (there are transactions or messages queued up), or response times are longer.  This can be hard to interpret because "the time to put a message and get the reply from MQ is 10 seconds" may at first glance be an MQ problem – but MQ is just the messenger, and the problem is beyond MQ.  I heard someone say that the default position was to blame MQ, unless the MQ team could prove it wasn't them.

Yesterday’s problem

Yesterday/last week you had a problem and the task force is looking into it.  They now want to know how MQ/DB2/CICS/IMS etc were behaving when there was a problem.  For this you need historical data.  You may have summary data recorded on a minute by minute basis, or you may have summary data recorded over an hour.   If the data is averaged over an hour you may not see any problems.  A spike in workload may be followed by no work, and so on average everything is OK.
It is useful to have “maximum value in this time range”. So if your maximum disk I/O time was 10 seconds in this interval at 16:01:23:45 and the problem occurred around this time, it is a good clue to the problem.

Tomorrow’s problem.

You should be able to get trending information.  If your disk can sustain an I/O rate of 100MB a second, and you can see that every week at peak time you are logging an extra 5MB/second, this tells you that you need to do something to fix it: either get faster disks, or move some of the work to a different queue manager.
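As a back-of-the-envelope sketch of that trending calculation: the 100 MB/second limit and the 5 MB/second weekly growth are the numbers from the example above, and the 60 MB/second current peak is an assumed starting point.

#include <stdio.h>

/* How many weeks until the sustainable logging rate is reached? */
int main(void) {
    double capacityMBs   = 100.0;   /* sustainable logging rate             */
    double currentPeak   = 60.0;    /* assumed current weekly peak, MB/sec  */
    double growthPerWeek = 5.0;     /* observed weekly growth, MB/sec       */

    double weeksLeft = (capacityMBs - currentPeak) / growthPerWeek;
    printf("At +%.0f MB/sec per week, the %g MB/sec limit is reached in about %.0f weeks\n",
           growthPerWeek, capacityMBs, weeksLeft);
    return 0;
}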

Monitoring is not capacity planning.

Monitoring tells you how the system is performing in its current configuration.  Monitoring may show a problem, but it is up to the capacity and tuning team to fix it.  For example, how big a buffer pool do we need is a question for the capacity team.  You could make the buffer pools use GB of buffers – or keep the buffer pools smaller and let MQ "do the paging to and from disk".

How do you know when a 'problem' occurs?

I remember going to visit a customer because they had a critical problem with the performance on MQ.  There were so many things wrong it was hard to know where to start.  The customer said that the things I pointed out were always bad – so they were not the current problem.  Eventually we found the (application) problem.  The customer was very grateful for my visit – but still did not fix the performance problems.

One thing to learn from this story is that you need to compare a bad day with a good day, and see what is different.  This may mean comparing it with the data from the same time last week, rather than from an hour ago.  I would expect that last week’s profile should be a good comparison to this week.   One hour ago there may not have been any significant load.

With MQ, there is a saying "A good buffer pool is an empty buffer pool".  Does a buffer pool which has filled up and is causing lots of disk I/O mean there is a problem?  Not always.  It could be MQ acting as a queuing system: if you wait half an hour for the remote system to restart, all of the messages will flow and the buffer pool will become empty.  If you know this can happen, it is good to be told it is happening, but the action may be "watch it".  If this is the first time it has happened, you may want to do a proper investigation, and find out which queues are involved, which channels are not working, and which remote systems are down.

What information do I need?

It depends on what you want.  If you are sitting in the operations room watching the MQ monitor while sipping a cup of your favourite brew, then you want something like modern cars: if there is a problem, a red light on the dashboard lights up, meaning "You need to take the car to the garage".   The garage can then put the diagnostic tools onto the engine and get the reason.

You want to know if there is a problem or not.  You do not need to know you have a problem to 3 decimal places – yes, maybe, or no is fine.

If you are investigating a problem from last week, you, in the role of the garage mechanic, need the detailed diagnostics.

When do you need the data?

If you are getting the data from SMF records you may get a record every 1 minute, or every half an hour.  This may not be frequent enough while there is a problem.  For example if you have a problem with logging, you want to see the data on a second by second basis, not once every 30 minutes.

Take the following scenario.  It is 10:59, 29 minutes into the interval at the end of which you get the SMF (or online monitor) data.

So far in this interval, there have been 100,000 I/Os.   The total time spent doing I/Os is 100 seconds.  By calculation the average time for an I/O is 1 millisecond.  This is a good response time.

You suddenly hit a problem, the I/O response time goes up to 100 ms, and 10 more I/Os are done.

The total number of I/Os is now 100,010, and the time spent doing I/Os is now 101 seconds.  By calculation the average I/O time is now 1.009899 milliseconds.  This does not show there is a problem, as this is within typical variation.

If you can get the data from a few seconds ago and now you can calculate the differences

  1. number of IOs 100,010 – 100,000 = 10
  2. time spent doing I/O 101 -100 = 1 second
  3. Average I/O time 100 ms – wow this really shows the problem, compared to calculating the value from the 30 minute set of statistics which showed the time increasing from 1 ms to 1.01 ms.
This shows you need granular data perhaps every minute – but this means you get a lot of data to manage.
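Here is the same scenario worked through in code, using the numbers from the text.  The struct and variable names are my own; the point is that the delta between two snapshots shows the problem while the cumulative average hides it.

#include <stdio.h>

/* SMF-style counters are cumulative, so the interesting number is the   */
/* difference between two snapshots, not the running average.            */
struct Snapshot { long ios; double seconds; };

int main(void) {
    struct Snapshot at1059 = {100000, 100.0};  /* 29 minutes into the interval */
    struct Snapshot now    = {100010, 101.0};  /* after the 10 slow I/Os       */

    double cumulativeAvgMs = 1000.0 * now.seconds / now.ios;
    double deltaAvgMs      = 1000.0 * (now.seconds - at1059.seconds)
                           / (now.ios - at1059.ios);

    printf("average over the whole interval : %.3f ms (problem hidden)\n", cumulativeAvgMs);
    printf("average over the last snapshot  : %.1f ms (problem obvious)\n", deltaAvgMs);
    return 0;
}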
 

You have to be brave to climb back up the slippery slope.

It is interesting that you notice things when you are sensitive to them.  Someone bought a new car in "Night Blue" because she wanted a car that no one else had – she had not seen any cars of that colour.  Once she got it, she noticed many other cars of the same make and colour.

I was sliding down a slippery slope, and realised that another project I was working on had also gone down a slippery slope.

My slippery slope.

I wanted a small program to do a task.  It worked, and I realised I needed to extend it, so with lots of cutting, pasting, and editing, the file soon got to 10 times the size.  I then realised the problem was a bit more subtle and started making even more changes.  I left it, and went to have dinner with a glass of wine.

After the glass of wine I realised that, now I understood the problem, there were easier (and finite) solutions.  Should I continue down the slippery slope or, now that I understood the problem, start again?

I tried a compromise: I wrote some Python code to process a file of data and generate the C code, which I then used.  As this worked, I kept this solution; so yes, it was worth stopping, climbing back up the slippery slope, and finding a different approach.

I had a cup of tea to celebrate, and realised that I could see the progress down a slippery path for another project I was working on.

The slippery slope of a product customisation.

I was  trying to configure a product, and thought the configuration process was very complex.    I could see the slippery slope the development team had taken with the product to end up with a very complex solution to a simple problem.

I looked into the configuration expecting to see a complex product which needed a complex configuration tool, but no, it looked just like many other products.

For many products (including MQ on z/OS), configuration consists of

  1. Define some VSAM files and initialize them
  2. Define some non VSAM files
  3. Create some system definitions
  4. Specify parameters, for example MQ's TCP/IP port number.

The developer of the product I was trying to install had realised that there were many parameters, for example the high level qualifier of data sets, as well as product-specific parameters such as the TCP/IP port number, and so developed some dialogs to make it easy to enter and validate the parameters.  This was good – it means that the values can be checked before they are used.

The next step down the slippery slope was to generate the JCL for the end user.  This was OK, but instead of having a job for each component, they had one job with "If configuring for component 1 then create the following datasets" etc.  In order to debug problems with this, they then had to capture the output and save it in a dataset.   They then needed a program to check this output and validate it.  By now they were well down the slippery slope.

The same configuration parameter was needed in multiple components, and rather than use one file, used by all the JCL, they copied the parameter into each component.

During configuration it looks as if it copied files from the SMP target libraries to intermediate libraries, then to instance-specific libraries.  I compared the contents of the SMP target libraries with the final libraries and they were 95% common.  It meant each instance had its own self-contained set of libraries.

I do not want to rerun the configuration in case it overwrites the manual tweaking I had to do.

I would much rather have more control over the configuration, for example put JCL overrides such as where to allocate the volumes, in the JCL, so it is easy to change.

A manager said to me once, the first thing you should do every day, once you have your first coffee is to remind yourself of the overall goal, and ask if the work you are doing is for this goal, and not a distraction.  There is a popular phrase  – when you’re up to your neck in alligators, it’s hard to remember that your initial objective was to drain the swamp.

 

If the facts do not match your view – do not ignore the facts.

It is easy to ignore facts if you do not like them; sometimes this can have major consequences.    I recently heard two similar experiences of this.

We were driving along in dense fog trying to visit someone who lived out in the country.  We had been there in the daylight and thought we knew the way.  The conversation went a bit like the following

  • It is along here on the right somewhere – there should be a big gate
  • There’s a gate.  Oh they must have painted it white since last time we were here
  • The track is a bit rough, I thought it was better than this.
  • Ah here’s another gate.   They must have installed it since we were here last.
  • Round the corner and here we are – oh, where's the house gone?

Of course we had taken the wrong side road.  We had noticed that the facts didn't match our picture and so we changed the facts.  Instead of thinking "that gate is the wrong colour" we thought "they must have painted the gate".  Instead of "we were not expecting that gate" we thought "they must have installed a new gate".  It was "interesting" backing the car up the track to the main road in the dark.

I was trying to install a product and having problems.  I had already experienced a few problems where the messages were a bit vague.  I had another message which implied I had mis-specified something.  I checked the 6 characters in a file and thought “The data is correct, the message must be wrong, I’ll ignore it”.   I gave up for the day.  Next day I looked at the problem, and found I had been editing the wrong file.  The message was correct and I had wasted 3 hours.

Do not restart with the fire hose set to maximum.

There is an article in The Register about an outage at the Tokyo stock exchange.  One of the problems was that they did not have a process for restarting the environment.  The impact of restarting a system is often overlooked, and in the panic of "get it started as quickly as possible" things can go wrong.   The fire brigade slowly increases the pressure in a fire hose to stop the fire crew from being knocked down by the sudden flow.

TCP/IP is good because it has a “slow start” protocol.  Once a connection has been established, and is working well, the exchange can use bigger buffers, and send more buffers before waiting for the acknowledgement.   This boosts the throughput.  If the back-end is slow to process the data, TCP  slows down the traffic, and then increases the throughput again if the connection can handle it.  If the connection stops and restarts, the rate starts slowly and builds up, rather than use the rate just before the outage.

You cannot expect WAS/CICS/DB2/MQ/IMS to restart at maximum speed; it has to work up to it.  Transactions may have to warm up. There can be many reasons:

  1. Data may need to be read from page sets into buffers, for example reading hot Db2 data into memory.
  2. Java code needs to warm up to become more efficient (JITed).
  3. The systems need to establish a working set, for example making a buffer pool larger.
  4. Establishing connections may have some serialisation delays.

Restarting faster than a system can cope with can cause a domino effect.  A transaction server is restarted and the fire hose of data is turned on.   The transaction server is still warming up, and cannot cope with the volume of requests.    Work for this system is then routed to another transaction server, which could have handled the workload if the volume had increased gradually.  If it gets this additional work all at once, this instance slows down too, and the work is routed to yet another transaction server, and so on.

MQ can be seen as the bad guy here.  When you restart MQ, it can go to fire hose mode immediately.   You should start the output channels first to start draining messages, then gradually start the input channels.   If you start the input channels before the output channels, you may get queues and page sets filling up, before the output channels can process the messages.

If you have a policy that all client connections must disconnect and reconnect after a random time of between 15 and 45 minutes, this should help spread the load, and gradually you should get a balanced environment.
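Purely as an illustration of that policy (this is not an MQ feature, just a sketch of a client picking its own reconnect time):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Each client picks a random reconnect time between 15 and 45 minutes, */
/* so the reconnection load is spread out rather than arriving at once. */
int main(void) {
    srand((unsigned)time(NULL));
    for (int client = 1; client <= 5; client++) {
        int minutes = 15 + rand() % 31;    /* roughly uniform in 15..45 */
        printf("client %d reconnects after %d minutes\n", client, minutes);
    }
    return 0;
}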

Do we still need the wine maker's nose, the mechanic's ear and the performance analyst's glasses?

When I left university, one of my university friends went into the wine industry.  We met up a few years later and he said that his nose was more useful than his PhD in Chemistry.  Although they had moved towards gas chromatography (which gives you a profile of all of the chemicals in the wine), this was good at telling you if there were bad chemicals in the brew, but not if it would be a good vintage; for that they needed the human nose.

My father would tune his motorbike by listening to it.  He said the bike would tell you when you had tuned it just right, and got it “in the sweet spot”.  These days you plug the computer in and the computer tells you what to do.  A friend of mine had an expensive part replaced, because the computer said so.  A week later he took the car back to the garage because the computer “knew” there was a problem with the same expensive part, and said it should be replaced.    This time the more experienced mechanic cleaned a sensor and solved the problem.  Computers do not always know best.

When I first started in the performance role, the RMF performance reports were bewildering.   These reports were lots of numbers in a small font (so you needed your glasses).  Worse than that, they had several reports on the same page, and to a novice there was a blur of numbers.  Someone then helped me with comments like, you can ignore all the data, except for this number  3 inches in and 4 inches down.  That should be less than 95%.  On this other page – check this column is zeros, and so on.  As you gain more experience in performance, you get to know the “smell” of the data.  It just needs a quick sniff test to check things are OK.  If not, then it takes more time to dig into the data.

There are many tools for processing the SMF data and printing out reports full of numbers, but they add little value.    "The disconnect time is 140 microseconds" – is this good or is it bad?  Is it better than a disconnect time of 100?    If the tools were smart enough to say "The disconnect time is 140 microseconds. This value should typically be zero" then this gives you useful information instead of just data.

If you think that they could control the Starship Enterprise from one operations desk, they clearly did not have all of the raw data displayed.  It must have been smart enough to report “The impulse engines are running hot: colour red, suggest you reduce power”, because that is what Scottie the engineer kept saying.

If there were smart reports of the problems rather than just displaying data, it would reduce the skill needed to interpret reports, and the need for the performance analyst’s glasses.  Producing these smart reports is difficult and needs experience to know what is useful, and what is just confusing.

Sometimes it feels like the statistics produced have not been thought through.  One example I recently experienced: there is a counter of the number of reads+writes to disk rather than to cache.  For reads, there should be no reads from disk.  For writes, it may be good to write directly to disk and not flood the cache.  Instead of one number for reads and one number for writes, there is one number for both.  So if I had 10 disk reads, 10 disk writes and 10 disk accesses – is this good or bad?  I don't know.   This is not a head-banging problem, as you usually have only reads or only writes, not both.  I just had to use my nose: 10 million would be a problem, just 10 is not a problem, and I'll still need my glasses.

 

What data set is my C program using?

I wanted to know what data set my C program was using.  There is a facility BPXWDYN: a text interface to dynamic allocation and dynamic output designed for REXX users, but callable from C.

This is not very well documented, so here is my little C program based on the sample IBM provided.

The documentation says to use RTARG dsname = {45,"rtdsn"}; but this is for alloc.  With "info" it gives the error message IKJ56231I TEXT UNIT X'0056' CONTAINS INVALID KEY, which basically means rtdsn is not valid.     I had to use RTARG dsname = {45,"INRTDSN"};

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

int main(int argc, char * argv[]) {
  /* BPXWDYN is fetched dynamically and called with OS linkage */
  typedef int EXTF();
  #pragma linkage(EXTF,OS)
  EXTF *bpxwdyn = (EXTF *)fetch("BPXWDY2 ");

  int i, j, rc;

  /* BPXWDYN return arguments: a halfword length followed by the data */
  typedef struct s_rtarg {
    short len;
    char  str[260];
  } RTARG;

  /* the request: return information about the DD name APF1 */
  char *info = "info DD(APF1) ";

  RTARG dsname = {45,"INRTDSN"};  /* returned data set name - not rtdsn as the doc says */
  RTARG ddname = {9,"INRTDDN"};   /* returned DD name       - not rtddn as the doc says */
  RTARG volser = {7,"INRTVOL"};   /* returned volume serial */
  RTARG msg    = {3,"MSG "};      /* number of messages returned in m[] */
  RTARG m[4]   = {258,"msg.1",258,"msg.2",258,"msg.3",258,"msg.4"};

  rc = bpxwdyn(info, &dsname, &ddname, &volser,
               &msg, &m[0], &m[1], &m[2], &m[3]);
  if (rc != 0) printf("bpxwdyn rc=%X %i\n", rc, rc);

  /* print whatever came back */
  if (*ddname.str) printf("ddname=%s\n", ddname.str);
  if (*dsname.str) printf("dsname=%s\n", dsname.str);
  if (*volser.str) printf("volser=%s\n", volser.str);

  /* the program treats msg.str as the count of messages in m[] */
  for (i = 0, j = atoi(msg.str); i < j && i < 4; i++)
    printf("%s\n", m[i].str);

  return 0;
}

 

“To infinity and beyond” and how to avoid a whoopsie.

I was reminded of how hard it is to predict the capacity needed for a workload when I read the news about the UK government web site which allowed you to enter your post code, and it told you what level of lock down you were in.  It crashed under the workload.  It should have been clear that within minutes of the web site being announced, a few million people would try to use it.  (It might have been better to have a static web page with all the information rather than try to provide a data base lookup).

Someone told me of a US company whose marketing company had a 1 minute commercial in the interval of the US Super Bowl.  They told the IT department about this a week before the game!   The audience of the game is about 100 million people.  If 1% of these people click on a web site within 2 minutes of the advert, that is roughly 8,000 hits per second!  The typical web activity for the company was about 50 hits a second.   After the initial cries of disbelief, the IT department, with the help of a large multinational IT company, got the additional capacity and hardware to do load balancing, and got through the night.
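The arithmetic behind that estimate, as a small sketch; all the numbers come from the story above.

#include <stdio.h>

/* 1% of a 100 million audience clicking within 2 minutes of the advert. */
int main(void) {
    double audience   = 100e6;   /* viewers                       */
    double clickRate  = 0.01;    /* 1% click on the web site      */
    double windowSecs = 120.0;   /* within 2 minutes              */
    double typical    = 50.0;    /* normal load, hits per second  */

    double peak = audience * clickRate / windowSecs;
    printf("Peak: about %.0f hits/second, versus a typical %.0f hits/second (%.0fx)\n",
           peak, typical, peak / typical);
    return 0;
}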

One bank said they took the average transaction rate and tested to twice this.  There was a discussion about what the average rate means.   Over a period you have highs and lows.  There is a rule of thumb (I don't know whose thumb) which says the peak is typically 3 times the average.  Within a peak period (for example 1 hour), looking second by second, there will be peaks within peaks.  The rule of thumb says you should plan to support 3 * 3 = 9, call it 10, times the sustained average.

This bank then worked with IBM to replicate the environment within IBM and run a test to see where the bottlenecks and snags were, and ramped up the workload till they met their targets.  I remember looking at the MQ data, and finding the same snags in MQ as we had spotted when we looked at their MQ system a couple of years before.  Logs were not striped, and they had badly tuned buffer pools.

Another customer's test system had more capacity than the production system, so they tested weekly at production volumes + 25%.  Many customers' test systems are much smaller than production and operate on the pray system, where they hope and pray they will not have a problem.

How hard is it to delete lots of data sets? – Easy!

I was configuring a product and had some problems, so I needed to clean up.  I had hundreds of VSAM clusters.  I started using ISPF 3.4 and using the delete line command, but you have extra typing to do for VSAM files, so I gave up.

I had a faint memory of using a MASK to delete things, and a quick search gave me

//S1 EXEC PGM=IDCAMS 
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DELETE COLIN.O.RTE.RK* MASK
/*

Which deleted all my data sets.

Wasn’t this easy!