Stackoverflow: What throughput can a standalone Java program achieve?

There was a question on the MQ section on StackOverflow

I have a standalone multi threaded java application which listen messages from IBM MQ.
Current system take around 500ms for processing of 1 message after it read from queue and till it commit.
I want to know how many messages I can consume

  • Concurrently:
  • Max number of messages can be processed? or throttle limit

A good meaty performance question I thought.  Let me break this into pieces.

Current system take around 500ms for processing of 1 message after it read from queue and till it commit.

Processing one messages and commit should take about 10 milliseconds or less( say 30 ms for a two phase commit).    There is clearly something else going on.  Fix this first.

  1. A long database call.   This could be due to database locking, or a badly designed statement, for example a query which needs to access thousands or millions of rows.
  2. A request to a server far far away
  3. A file system with the speed of writing an illuminated letter to parchment

How many messages I can consume: Concurrently:

Take the worst case of using persistent messages, which require log IO during commit.

For one thread, processing multiple messages before doing a commit means the thread can do more work.  Consider a get taking 1 millisecond, and a commit taking 10 ms. This is one message processed every 11 ms.  If you did 50 gets – taking 50 ms and a commit taking 10 ms, this is 50 messages in 50 + 10 ms which equates to one message every 1.2 milliseconds almost 10 times faster.    This is how channels can send messages efficiently.   There is a “sweet spot” of messages per commit to give you maximum data processed per second.   This depends on the message size, logging rates and other factors.  For a 100MB message it is one message per commit.  For 10KB messages,  this may be 1000 messages per commit.

This may be selfish

This is clearly a great improvement, but possibly selfish.  If the application logic is a get followed by a database insert, followed by a commit, then doing 50 gets, 50 inserts and a commit, will work much faster.  The down side is that the database requests will keep locks until the commit.  These locks may prevent other applications from accessing data, either the recently inserted  records, page locks, or index locks. So overall MQ throughput goes up – but the business transaction suffers.    You need to understand the database and find the optimum number of requests per commit for your business transaction.

How long before the data is visible?

Rather than have one thread process 1000 messages per commit (taking 1010 ms) you may want to have multiple threads processing 10 messages per commit – taking 20 ms.  This means that the data in the database (or replies etc) are visible earlier.    This may be important to your business transaction if you have to worry about response time.

Parallel  threads

  1. Using more threads should improve throughput, unless this is delayed by external factors – such as database locks.
  2. One customer found one thread was optimum because there was no database delays.

How many messages I can consume: Max number of messages can be processed? or throttle limit

There are papers written on this but here is a one minute overview

As fast as the queue manager can process data

  1. The rate at which MQ can write its logs
  2. Keep queue data in memory – ( buffer pools on z/OS, queue buffer on midrange), so few messages on the queue.

Threads

  1. Having parallel threads gives you better throughput than one thread.  You get overlapped writing to the log, the units of work are shorter in duration, you can get parallel IO.
  2. You may be limited by the network.   Having multiple threads from an application means the network can be better utilized.  One thread can be receiving data down the wire, while another thread is waiting in commit.
  3. You may be limited by where your programs run – eg short of CPU, or slow IO (for your System.out.println statements)

Application design

  1. You may get delays due to serialization if all thread are using the same queue.
  2. Remove the debug printf or System.out.println statements.
  3. Using a queue per business application is better than all applications sharing the same queue
  4. Using one reply to queue per web server may be better than a shared reply to queue – especially if you use Apache Camel.
  5. Use get first if possible.  Avoid scans of the queues.

 

The short answer….

You should be able to get thousands of 1KB messages a second through your Java application when using multiple threads.

 

Making MQ swing with Spring. Tuning Camel and Spring (to get rid of the 1 second wait in every message).

Spring is a framework which sits on top of Apache Camel, which runs on Oracle WebLogic web server.   It simplifies writing java applications for processing messages.

I was involved in tracking down some performance problems when the round trip time for a simple application was over 1 second – coming from a z/OS background I thought 50 milliseconds was a long time – so over 1 second is an age!

The application is basically a Message Drive Bean.  It does

  1. Listening applications get a message from a queue.  There  can be multiple listening threads
  2. Get an application thread
  3. Pass it via an internal queue to the application running on another thread – the listening thread is blocked while the application is running.
  4. This application sends (puts) a message to an MQ queue for the backend server and waits for the response.
  5. Return to the listening application
  6. Free the application thread.

As this is essentially the same as a classic MDB, we had configured the number of application threads in the thread pool the same number as the listener thread pool.

Shortage of threads

The symptoms of the problem looked like a shortage of threads problem.

When we increased the number of threads in the application pool ( we gave it 4* the number of listener threads) The response time dropped – good news.  I dont know how many threads you need – is it n+1 or 2* n. I’ll leave finding the right number  as an exercise for the reader!

The hard coded 1 second wait before get

One symptom we saw was the queue depth on the replyTo queue on the server was deeper than normal.

For the reply coming back from the server, I believe there is one thread getting from the queue.

When the reply to queue is not shared

The application thread has sent the request off, and is now waiting.  This getting thread does an MQ destructive GET with wait.  When the message arrives, it looks at the content, and decides which application thread is waiting for the reply, and wakes it up.

When the reply queue is shared between instances

For example you have configured two instances for availability.  The above logic is not used.

Instance1 cannot destructively get a message because the message could be destined for instance2.  Similarly instance2 cannot get destructively get the message because it could be destined for instance1.

One way to solve this would be to do a get next browse of the message, and if it is for the instance do a get_message_under_cursor.  This works great for MQ, but not other message providers which do not have this capability.

The implementation used  is to use polling!

If there are 3 applications tasks waiting for a replies, reply1, reply3, reply4.  The logic is as follows

  1. For each reply id, use MQ message selectors to try to get reply1, reply3, reply4.   This is not a get by messageID or correlID – for which the queue manager has indexes to quickly locate the message, this is a get with message selector, which means look in every message on the queue.
  2. For any messages found – pass them to the waiting applications
  3. Do an MQGET with a message selector which does not exist – with a wait time of receiveTimeout (defaults to 1 second).  Every message is scanned looking for the message selector string, and it is not found.

Looking at a time line – in seconds and tenths of seconds
0.1 send request1, wait for reply
0.2 getting task does MQGET  wait for 1 second with non exising selector
0.2 send request 2, wait for reply
0.3 reply1 comes back
0.4 send request3, wait for reply
0.5 reply2 comes back
0.6 reply3 comes back
nothing happens
1.2 getting task waiting for 1 second times out
1.2 getting tasks gets reply1 and posts task1
1.2 getting tasks gets reply2 and posts task2
1.3 getting tasks gets reply3 and posts task3

So although  reply1 was back within 0.2 seconds (0.3 – 0.1) it was not got until time 1.2.  The message had been waiting on the queue for 0.9 seconds.

  • Total wait time for reply1 was  1.1 seconds.
  • Total wait time for reply2 was  1.2 – 0.2 = 1.0 seconds
  • Total wait time for reply3  was 1.3 – 0.4 = 0.9 seconds

Wow – what a response time killer!

You can tune this time by specifying receiveTimeout.  If you make it smaller, the wait time will be shorter, so messages will be processed faster, but the CPU cost will go up as more empty gets are being done.

This solution does not scale.

You have had a slow down, and there are now 1000 messages on this queue. (990 of these are unprocessed messages, due to timeout .  There is no task waiting for them – they never get processed – nor do they expire!)

  • MQGET for reply1.  This scans all 1000 messages – looking in each message for the message with the matching selector.  This takes 0.2 seconds.
  • MQGET for reply2. This scans all 1000 messages – looking in each message for the message with the matching selector.  This takes 0.2 seconds.
  • You have 10 threads waiting for messages, so each message has to wait for typically 10 * 0.2 seconds = 2 seconds a message!

 

What can you do about it.

See the Camel documentation Request-Reply over JMS, parameters, Exclusive, concurrentConsumers, and Request-Reply over JMS Using an Exclusive Fixed Reply Queue 

  1. Avoid sharing the queue.  Give each instance its own queue.  Set the Exclusive flag
  2. Tune the ReceiveTime out – making it a shorter time interval can increase the CPU as you are doing more empty gets.  You might want to set it, to a value which is 95% percentile time between the send to the server, and the reply comes back.  So if the average time is 40 ms, set it to 60 ms or 80 ms.
  3. If you are going to share the queue, make sure you clean out old messages from the queue – for example use expiry, or have a process which periodically scans the queue and moves messages older than 1 minute.
  4. Did I mention avoid sharing the queue.
  5. If you get into a vicious spiral where the response time gets longer and longer, and the reply queue from the server gets deeper – be brave and purge server reply queue.
  6. Avoid sharing the queue.

 

Are my digital certificates still valid and are they slowing down my channel start?

Digital certificates are great. They allow program to program communication so each end can get information to identify the other end, and the programs can then communicate securely, with encryption, or just checking the payload has not changed.

A certificate is basically a file with two parts (or two files)  – a public certificate and a private key. You can publicize the public part to any one who wants it (which is why is is called the public part). Anyone with the private key can use it to say they are you. (If you can get access to the private key, then you can impersonate the identity)

There are times when you want to say this certificate is not longer valid. For example when I worked at IBM, I had a certificate on my laptop to access the IBM mail servers.

  • If my laptop was stolen, IBM would need to revoke the certificate.
  • When I retired from IBM, IBM revoked my certificate to prevent me from trying to access my IBM mail using my old certificate from my personal laptop.

Managing these certificate is difficult. There could be billions of certificates in use today.

Your server should validate every connection request to ensure the certificate sent from the client is still valid.

A client should validate the certificate sent by the server to ensure that it is connecting to a valid server.

In the early days of certificates, there was a big list of revoked certificates – the Certificate Revocation List(CRL). If a certificate is on the list then it has been revoked. You tend to have an LDAP server within your firewall which contains these lists of revoked certificates.

This was a step in the right direction, but it is difficult to keep these lists up to date, when you consider how many certificates are in use today, and how many organizations generate certificates. How often do I need to refresh my list? If the CRL server was to refresh it every day, it may be up to one day out of date, and report “this certificate is ok” – when it had been revoked.

These days there is a technique called Online Certificate Status Protocol (OCSP). Basically this says go and ask the site which issued this certificate if it is still valid. This is a good idea – and they say the simple ideas are the best.
How do you know who to ask? A certificate can have url information within in it Authority Information Access: OCSP – URI:http://ColinsCert.Checker.com/, or you can specify URL information in the queue manager configuration for those certificates without the Authority Information Access(AIA) information.

Often the URL in the certificate is outside of your organization, and outside of your firewall. To access the OCSP site you may need to have an SSL Proxy server which has access through the firewall.

You can configure MQ to use a (one) OCSP server for those certificates not using AIA information.  If your organization is a multinational company, you may be working with other companies who use different Certificate Authorities.   If you have certificates from more than one CA, you will not be able use MQ to check all of them to see if they are still valid.  You may want to set up an offline job which runs periodically and checks the validity of the certificates.

Starting the MQ channel can be slow

When an TLS/SSL MQ channel is started, you can use OCSP or CRL to check that a certificate is valid. This means sending a request to a remote server and waiting for a response.   The LDAP server for CRL requests is likely to be within your domain,  as your organization manages it. The OCSP  server could be outside of your control, and in the external network.  If this server is slow, or the access to the server is slow, the channel will be slow to start.  For many customers the network time within a site is under 1 millisecond, between two sites could be 50 ms. Going to an external site the time could be seconds – and dependent on the traffic to the external site.
This time may be acceptable when starting the channel, first thing in the morning, but restarting a channel during a busy period can cause a spike in traffic because no messages flow while the channel is starting. For example

ChannelwithRestart

No messages will flow while the channel is starting, and this delay will add to the round trip time of the messages.

How do I check to see if I have a problem

This is tough. MQ does not provide any information. I used MQ internal trace when debugging problems, but you cannot run with trace enabled during normal running.

There are two parts to the validation request. The time to get to and from the server, and the time to process your request once it has got to the server.

You can use TCP Ping to get to the server (or to the proxy server).  If you are using a proxy server you cannot “ping” through the proxy server.

Openssl provides a many functions for creating and managing certificates.

You can use the command

time openssl s_client -connect server:port
or
time openssl s_client -connect -proxy host:port server:port

The “time” is a linux command which tells your the duration in milliseconds of the command following it.

The openssl s_client is a powerful ssl client program. The -connect… tries connecting to server:port. You can specify -proxy host:port to use the proxy.

The server at the remote end may not recognize the request – but you will get the response time of the request.

Running this on my laptop I got

time openssl s_client -connect 127.0.0.1:8888
CONNECTED(00000005)
140566846001600:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../ssl/record/ssl3_record.c:332:

no peer certificate available

No client certificate CA names sent

SSL handshake has read 5 bytes and written 233 bytes

Verification: OK

real 0m0.012s

The request took 0.012 seconds.

My openssl OCSP server reported

OCSP Response Data:
OCSP Response Status: malformedrequest (0x1)

You can use openssl ocsp …. to send a certificate validation request to the OCSP server and check the validity of the system – but you would have to first extract the certificates from the iKeyman keystore.

How do I test this?

I’ve created the instructions I used:

More information about OCSP.

There is a good article in the IBM Knowledge Centre here .

The article says
To check the revocation status of a digital certificate using OCSP, IBM MQ determines which OCSP responder to contact in one of two ways:

  1. Using a URL specified in an authentication information object or specified by a client application.
  2. Using the AuthorityInfoAccess (AIA) certificate extension in the certificate to be checked.

Configure a QM for OSCP checking for certificates without AIA information.

You can configure a queue manager to do OCSP checking, for those certificates without AIA information within the certificate.

There is a queue manager attribute SSLCRLNL ( SSL Certificate RevocationList Name List) which points to a name list.   This name list has a list of AUTHINFO object names.
The name list can have up to one AUTHINFO object for OCSP checking and up to 10 AUTHINFO objects for CRL checking.

You define an AUTHINFO object to define the URL of an OCSP server.

Define AUTHINFO(MYOSCP) AUTHTYPE(OCSP) OCSPURL(‘HTTP://MyOSCP.server’)

Create a name list, and add the AUTHINFO to it.

Use alter qmgr  SSLCRLNL(name) and refresh security type(SSL)
You need to change the qm.ini file SSL stanza of the queue manager configuration file see here.

OCSPAuthentication=REQUIRED|WARN|OPTIONAL
OCSPCheckExtensions= YES|NO
SSLHTTPProxyName= string

If you have a fire wall around your network, you can use SSLHTTPProxyName to get through your fire wall.

There is some good information here.

Configure client OSCP checking for certificates without AIA information.

You need the CCDT created by a queue manager rather than a JSON CCDT.
When you configure a queue manager with the AUTHINFO objects and the queue manager SSLCRLNL attributes, the information is copied to the CCDT.

This CCDT is in the usual location, for example the /prefix/qmgrs/QUEUEMANAGERNAME/@ipcc directory.

You can use a CCDT from one queue manager, when accessing other queue managers.

You need to make the CCDT file available to the client machines, for example email or FTP,  or use URL access to the CCDT.

You also should configure the mqclient.ini file see here.

How do I check to see if my certificates have AIA information.

You can use the iKeyman GUI to display details about the certificate, or a command line like

/opt/mqm/bin/runmqckm -cert -details -db key.kdb -pw password -label CLIENT

This gives output like

Label: CLIENT
Key Size: 2048
Version: X509 V3
Serial Number: 01
Issued by: ….

Extensions:
– AuthorityInfoAccess: ObjectId: 1.3.6.1.5.5.7.1.1 Criticality=false
AuthorityInfoAccess [
[accessMethod: ocsp
accessLocation: URIName: http://ocsp
server.my.host/
]
]

Can I turn this OCSP checking off ?

For example if you think you have a problem with OCSP server response time.

In the mqclient.ini you can set

  • ClientRevocationChecks = DISABLED No attempt is made to load certificate revocation configuration from the CCDT and no certificate revocation checking is done
  • OCSPCheckExtensions = NO   This says ignore the URL in the AIA information within a certificate.

See SSL stanza of the client configuration file.

In the qm.ini you can set

  • OCSPCheckExtensions=NO  This says ignore the URL in the AIA information within a certificate.
  • alter qmgr SSLCRLNL(‘ ‘) and refresh security type(SSL)

See SSL stanza of the queue manager configuration file and Revoked certificates and OCSP.

How do I tell if I have a problem with OCSP?

There are no events or messages which tell you the response time of requests.

You may get message AMQ9716

AMQ9716
Remote SSL certificate revocation status check failed for channel …
Explanation
IBM MQ failed to determine the revocation status of the remote SSL certificate for one of the following reasons:
(a) The channel was unable to contact any of the CRL servers or OCSP responders for the certificate.
(b) None of the OCSP responders contacted knows the revocation status of the certificate.
(c) An OCSP response was received, but the digital signature of the response could not be verified.
You can change the queue manager configuration to not produce these messages, by setting ClientRevocationChecks = OPTIONAL
From this message you cannot tell if the request got to the server.
The easiest way may be to ask the network people to take a packet trace to the URL(s) and review the time of the requests and the responses.

Using the AuthorityInfoAccess (AIA) certificate extension in the certificate.

You can create certificates containing the URL needed to validate the certificate. Most of the IBM MQ documentation assumes you have already have a certificate with this information in it.
You can use openssl to create a certificate with AIA information, and import it into the iKeyman keystore.   See here.

You cannot use IBM GSKIT program iKeyman to generate this data because it does not support it. You can use iKeyman to display the information once it is inside the keystore.

Timing the validation request

Openssl has a command to validate a certificate for example

openssl ocsp -CAfile cacert.pem -issuer cacert.pem -cert servercert.pem -url http://OCSP.server.com:port -resp_text

You can use the linux time command, for example

time openssl ocsp -CAfile cacert.pem -issuer cacert.pem -cert servercert.pem -url http://OCSP.server.com:port -resp_text

you get

Response verify OK
servercert.pem: good
real 0m0.040s

The time taken to go to a OCSP server on the same machine is 40 milliseconds. The time for a ping to 127.0.0.1 was also 40 ms.

Thanks…

Thanks to Morag of MQGEM, and Gwydion at IBM for helping me get my head round this topic.