Non functional requirements: error messages

This blog post is part of a series on non functional requirements, and how they take most of the effort.

The scenario

You want a third party to implement an application package that lets people buy and sell widgets from their phones. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.

At the back-end is a web server.

Requirements you have been given.

  • We expect this application package to be used by all the major banks in the world.
  • For the UK we expect about 10 million people to have an account.
  • We expect about 1 million trades a day.
  • The banks want the messages from the web server to be in their national language, for example Japanese banks want the messages to come out in Japanese.

See start here for additional topics.

What standards do you need to specify for these web server messages?

Consider the following code to issue a database request

EXECUTE_SQL(returnCode, reasonCode, "SELECT FROM…", pReturnedData, &length)

Where

  • return code is 0, 4, 8,12 or 16
    • 0 all worked
    • 4 is warning, perhaps no records found
    • 8 is error, perhaps invalid table specified
    • 12 is severe error – you do not have access to the table
    • 16 critical or terminating – major problem so bad the system is shutting down
  • Reason code. This gives an error-specific code which allows you to identify the warning or error. There may be thousands of these.
  • “SELECT FROM…” is the string passed to the database
  • pReturnedData is where the returned data is stored
  • &length is the length of the buffer on input, and is set to the length of the data on return.
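The return code bands above can be captured in a small helper so that every caller interprets them the same way. This is a minimal sketch: the names `severity_name` and `can_continue` are hypothetical, and the policy that a warning lets the caller carry on is an assumption, not something the database interface defines.

```c
/* Return code bands taken from the list above. */
enum severity {
    SEV_OK       = 0,
    SEV_WARNING  = 4,
    SEV_ERROR    = 8,
    SEV_SEVERE   = 12,
    SEV_CRITICAL = 16
};

/* Hypothetical helper: map a return code to a short description. */
const char *severity_name(int rc)
{
    switch (rc) {
    case SEV_OK:       return "ok";
    case SEV_WARNING:  return "warning";
    case SEV_ERROR:    return "error";
    case SEV_SEVERE:   return "severe";
    case SEV_CRITICAL: return "critical";
    default:           return "unknown";
    }
}

/* Assumed policy: only "ok" and "warning" let the caller continue. */
int can_continue(int rc)
{
    return rc == SEV_OK || rc == SEV_WARNING;
}
```

A caller could test `can_continue(rc)` after each request and only report and return when it is false.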

You could have some code to report the error.

if (rc != 0)
{
    printf("Hey Dude, Database error!");
    return (rc);
}

This is wrong in so many ways.

  • It provides very little useful information.
  • It does not report the return and reason code, which you need to identify the problem
  • You do not know which source module reported this message
  • It is hard to look up on the internet to see if this problem has been reported before.
  • It does not display the message in the National Language. Adding code If language=Japanese printf(….) is not practical, because the message content is embedded within the program source.
  • If you have 10,000 transactions hitting this problem you will get flooded with messages, and it will be hard to see any other messages.

What can you do?

Rather than put the printf inline, call a message function, and pass the variable data. This message function looks up the message boilerplate and substitutes the variables. Different languages are handled by having a different file of messages.

In IBM on the mainframe there are standards for messages.

  • Each product is allocated a three-character prefix. This is used in the source code names, for messages etc. By looking at the first three characters of a message, I know which product it came from.
  • The next character is for the major component within the product. xxxC… may be the command processor, xxxS… may be for the statistics component.
  • Then comes a 3 or 4 digit number.
  • The last character is I, W, E, S or T to represent Information, Warning, Error, Serious error, Terminating error. With this scheme you can have automation which ignores ‘I’ and ‘W’ messages, and only takes action for ‘E’, ‘S’ and ‘T’. For example if you get a ‘T’ message then page someone.
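The scheme above is regular enough that automation can split a message identifier mechanically. The sketch below assumes identifiers of exactly the shape described (three-character prefix, one component character, three or four digits, one severity letter). The names `parse_msg_id`, `needs_action` and `should_page` are made up for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Parsed form of an identifier such as "ABCD1234W". */
struct msg_id {
    char prefix[4];   /* three-character product prefix, NUL terminated */
    char component;   /* major component within the product */
    int  number;      /* the 3 or 4 digit message number */
    char severity;    /* one of I, W, E, S, T */
};

/* Returns 0 on success, -1 if the identifier does not fit the scheme. */
int parse_msg_id(const char *id, struct msg_id *out)
{
    size_t len = strlen(id);
    int number = 0;

    if (len != 8 && len != 9)                  /* 3 + 1 + (3 or 4) + 1 */
        return -1;
    if (strchr("IWEST", id[len - 1]) == NULL)  /* unknown severity letter */
        return -1;
    for (size_t i = 4; i < len - 1; i++) {
        if (!isdigit((unsigned char)id[i]))
            return -1;
        number = number * 10 + (id[i] - '0');
    }
    memcpy(out->prefix, id, 3);
    out->prefix[3] = '\0';
    out->component = id[3];
    out->number = number;
    out->severity = id[len - 1];
    return 0;
}

/* Automation rule from the text: ignore I and W, act on E, S and T. */
int needs_action(char severity)
{
    return severity == 'E' || severity == 'S' || severity == 'T';
}

/* Example rule from the text: page someone for a T message. */
int should_page(char severity)
{
    return severity == 'T';
}
```

Because the rule depends only on the last character, the automation never has to understand the message text, which also means it works unchanged for every national language.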

You may want to reuse the same message, for example because you issue the database request in 100 source files, or you use a macro to generate the code. It makes sense to use the same error message number, but you need to provide something to identify which of the 100 source files, and which of the several instances within one source file, issued it.

You could give the source file name, but this may give away confidential information about your product structure. Instead, you could give each source file a number. You combine the source file identifier with the line number in the file. You then report it as a hex code ‘00050028’ so this would be for module ‘5’ line 40 (0x28).
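One way to build such a location code is to pack the module number into the high 16 bits and the line number into the low 16 bits. This is a sketch only: `MODULE_ID`, `HERE` and the helper names are hypothetical, and the 16/16 split is an assumption based on the ‘00050028’ example.

```c
#include <stdint.h>

/* Hypothetical per-file identifier; each source file would define its own. */
#define MODULE_ID 5u

/* Pack the module number (high 16 bits) and line number (low 16 bits). */
uint32_t make_location(uint32_t module, uint32_t line)
{
    return ((module & 0xFFFFu) << 16) | (line & 0xFFFFu);
}

/* Expands to the location code for the line where it is used. */
#define HERE() make_location(MODULE_ID, (uint32_t)__LINE__)

/* Support staff can split a reported code back into its parts. */
uint32_t location_module(uint32_t location)
{
    return location >> 16;
}

uint32_t location_line(uint32_t location)
{
    return location & 0xFFFFu;
}
```

With this layout `make_location(5, 40)` gives 0x00050028. A source file longer than 65,535 lines would wrap, so a real scheme might pick a different split.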

Your code then calls a function

MSG_MODULE("ABCD1234W", 0x00050028, rc, reason, pString);

Where pString points to useful information like the table name which caused the error.

In your msg_module, if the language is English, you locate the external constant ABCD1234W in the English file, if the language is Japanese you locate the external constant ABCD1234W in the Japanese file.
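The lookup inside msg_module might be sketched as follows. To keep the example self-contained the catalogues are in-memory tables rather than external files, only English and French are present, and falling back to English when a translation is missing is an assumption. `find_template` and the table names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* One catalogue entry: message identifier and its boilerplate text. */
struct msg_entry {
    const char *id;
    const char *text;
};

/* In a real product these would be loaded from one file per language. */
static const struct msg_entry catalog_en[] = {
    { "ABCD1234W",
      "ABCD1234W Database problem. Return code %1$d, Reason code %2$d, "
      "location %3$8.8x, table %4$s" },
};

static const struct msg_entry catalog_fr[] = {
    { "ABCD1234W",
      "ABCD1234W Il y a une database problem avec table %4$s. "
      "Return code %1$d, Reason code %2$d, location %3$8.8x" },
};

static const char *search(const struct msg_entry *catalog, size_t count,
                          const char *id)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(catalog[i].id, id) == 0)
            return catalog[i].text;
    }
    return NULL;
}

/* Returns the boilerplate for id in the requested language, falling back
   to English (an assumed policy), or NULL if the id is unknown. */
const char *find_template(const char *lang, const char *id)
{
    const char *text = NULL;

    if (strcmp(lang, "fr") == 0)
        text = search(catalog_fr, sizeof catalog_fr / sizeof catalog_fr[0], id);
    if (text == NULL)
        text = search(catalog_en, sizeof catalog_en / sizeof catalog_en[0], id);
    return text;
}
```

The calling code passes only the identifier and the variable data, so adding a language means adding a catalogue, not changing any program source.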

This string may be something like

“ABCD1234W Database problem. Return code %1$d, Reason code %2$d, location %3$8.8x, table %4$s”

Where

  • %1$d says convert the first parameter to a decimal
  • %2$d says convert the second parameter to a decimal
  • %3$8.8x says convert the third parameter to an eight-digit, zero-padded hexadecimal number
  • %4$s says treat the 4th parameter as a string.

You should use %1$d instead of %d because the parameters may be displayed in a different order, such as

“ABCD1234W Il y a une database problem avec table %4$s. Return code %1$d, Reason code %2$d, location %3$8.8x”
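Numbered conversions such as %1$d are a POSIX extension to printf rather than part of ISO C, so the sketch below assumes a POSIX C library such as glibc. The helper name `format_db_msg` is hypothetical, and its arguments are passed in the order the template numbers them (return code, reason code, location, table).

```c
#include <stddef.h>
#include <stdio.h>

/* Substitute the four inserts into a template that uses %n$ conversions.
   Returns whatever snprintf returns. */
int format_db_msg(char *buf, size_t size, const char *tmpl,
                  int rc, int reason, unsigned location, const char *table)
{
    /* The argument order here is fixed; the template decides the display order. */
    return snprintf(buf, size, tmpl, rc, reason, location, table);
}
```

Calling it with the template "table %4$s rc %1$d reason %2$d loc %3$8.8x" and the values 8, 144, 0x00050028 and "WIDGETS" should produce "table WIDGETS rc 8 reason 144 loc 00050028", even though the table name is the last argument. Every numbered argument has to appear in the template, because the library cannot skip over an argument whose type it has not been told.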

What information is useful?

You need to decide who the information is for: the end user or the operators.

For operators

For each message you need to decide what information is useful. A message such as

“ABCS555E Security violation”

Could be improved by adding the userid causing the violation, and what resource was being accessed, the time the event occurred and the server instance.

For the end user

With security messages you need to be careful not to give information away to people breaking into your system. A message

Userid or password invalid

is better than

Password invalid

because the second message tells the hacker the userid chosen is valid, and it is just the password which is wrong. With “Userid or password invalid” you do not know if the userid is invalid, or the password is invalid, or both are invalid.

Make it searchable on the internet

If people get an error message, they will look in the documentation or search the internet. If someone enters the message number, and the inserts, they should be able to find any references to this, perhaps in user groups. If you have messages in different languages, such as English and Japanese, you should just need the message number and inserts. The message text may be helpful, but not needed.

You need to be careful to be consistent in the use of decimal numbers and hex numbers. Some people may treat 16 as decimal 16, and others may treat it as 0x16 or decimal 22.
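The ambiguity is easy to demonstrate with the standard strtol function, and one way to avoid it is to always print hex values with a 0x prefix. The helper names `parse_code` and `format_hex_code` are hypothetical, and the "hex always carries 0x" convention is a suggestion rather than an established standard.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a code as a reader following the convention would: with base 0,
   strtol treats a "0x" prefix as hex and anything else as decimal
   (note that a leading 0 on its own means octal). */
long parse_code(const char *text)
{
    return strtol(text, NULL, 0);
}

/* Always print hex values with an explicit prefix and fixed width. */
void format_hex_code(char *buf, size_t size, unsigned long code)
{
    snprintf(buf, size, "0x%08lX", code);
}
```

Here `strtol("16", NULL, 10)` is 16 but `strtol("16", NULL, 16)` is 22, which is exactly the confusion described above. With the prefix convention, `parse_code("16")` is 16 and `parse_code("0x16")` is 22, so the text alone says which is meant.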

How to stop a flood of messages

If you are running 1000 transactions a second and there is a database problem, you will get at least 1000 messages a second reporting it.

You might want to do some processing to summarise information. For example, display a message the first time it occurs. If the same message is produced again shortly afterwards, do not display it, but accumulate the count of messages, then report a summary such as

FLOOD MESSAGE. 404 instances in the last minute of “ABCD1234E Database problem return code 8, reason code 144 identifier 003304AB”

FLOOD MESSAGE. 19 instances in the last minute of “ABCE333S Database contact lost problem return code 12, reason code 26 identifier 0033025C”
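A minimal sketch of this kind of suppression is shown below. It keeps one slot per distinct message (message number plus inserts), displays the first instance in each one-minute window and only counts the rest. The names, the fixed-size table and the 60 second window are all assumptions. The code is not thread safe, and it only hands back the count when the next instance arrives after the window has ended, where a real server would more likely use a timer.

```c
#include <string.h>
#include <time.h>

#define FLOOD_WINDOW_SECS 60
#define FLOOD_SLOTS       32
#define FLOOD_KEY_LEN     96

struct flood_slot {
    char     key[FLOOD_KEY_LEN];  /* message number plus inserts */
    time_t   window_start;        /* when the current window began */
    unsigned suppressed;          /* instances hidden in this window */
};

static struct flood_slot slots[FLOOD_SLOTS];

/* Find the slot for key, or the first free slot, or NULL if the table is full. */
static struct flood_slot *find_slot(const char *key)
{
    struct flood_slot *free_slot = NULL;

    for (int i = 0; i < FLOOD_SLOTS; i++) {
        if (slots[i].key[0] == '\0') {
            if (free_slot == NULL)
                free_slot = &slots[i];
        } else if (strncmp(slots[i].key, key, FLOOD_KEY_LEN - 1) == 0) {
            return &slots[i];
        }
    }
    return free_slot;
}

/* Returns 1 if the message should be displayed now, 0 if it was only counted.
   When a new window starts, *previous is set to the number of instances
   suppressed in the old window so the caller can print a FLOOD MESSAGE line. */
int flood_check(const char *key, time_t now, unsigned *previous)
{
    struct flood_slot *s = find_slot(key);

    *previous = 0;
    if (s == NULL)
        return 1;                      /* table full: show the message anyway */
    if (s->key[0] == '\0') {           /* first time this message is seen */
        strncpy(s->key, key, FLOOD_KEY_LEN - 1);
        s->key[FLOOD_KEY_LEN - 1] = '\0';
        s->window_start = now;
        s->suppressed = 0;
        return 1;
    }
    if (difftime(now, s->window_start) < FLOOD_WINDOW_SECS) {
        s->suppressed++;               /* same window: count it, stay quiet */
        return 0;
    }
    *previous = s->suppressed;         /* window over: report and restart */
    s->window_start = now;
    s->suppressed = 0;
    return 1;
}
```

The message function would call `flood_check` with the formatted message as the key, print the message only when it returns 1, and print a FLOOD MESSAGE line first whenever `*previous` is non-zero.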

Provide useful information

You should provide one message for each unique problem or situation, and list the actions to take to resolve any problems. As your product gets used, you may find there are more causes for each problem, and so you need to update the messages to reflect this. I think it would be great to allow users to vote on solutions, so if there are 3 solutions listed, and one has 100 votes, one has 2 votes and the other has no votes, you try the popular solution first.

For each message you need to provide

  • A longer description of the message
  • What the system action was (did it do anything like close a database)
  • What the end user/administrator should do

For example the real message

CSQ9016E 'cmd' command request not authorized

The documentation page for this message identifies the product and the component that issued it. The trailing E shows it is an Error message.

The message has sections for

  • Explanation
  • System action
  • System programmer response

Another example

  • CSQ5007E csect-name RRSAF function function failed for plan plan-name, RC=return-code reason=reason syncpoint code=sync-code
  • Explanation: A non-zero or unexpected return code was returned from an RRSAF request. The Db2 plan involved was plan-name.
  • System action: If the error occurs during queue manager startup or reconnect processing, the queue manager might terminate with completion code X'6C6' and reason code X'00F50016'. Otherwise, an error message is issued and processing is retried.
  • System programmer response: Determine the cause of the error using the RRS return and reason code from the message. See Db2 codes in the Db2 for z/OS documentation for an explanation of the codes and attempt to resolve the problem.
