z/OS System SSL strange behaviour with environment variables

I was trying to use System SSL to write a program using the native z/OS TLS facilities. I wasted a couple of hours because it said it could not find my keyring. Then, when I collected a trace, it sometimes did not find the file – which did exist, as I could list it.

If I used

//START1   EXEC PGM=GSKMAIN,REGION=0M, 
//* PARM='4000'
// PARM=('ENVAR("_CEE_ENVFILE=DD:STDENV")/4000')
//STDENV DD PATH='/u/ibmuser/gskparms'

When the USS file had

GSK_TRACE_FILE=/tmp/zzztrace.file 
GSK_TRACE=0xff
GSK_KEYRING_FILE=START1/TN3270

This worked fine.

When I used

//START1   EXEC PGM=GSKMAIN,REGION=0M, 
//* PARM='4000'
// PARM=('ENVAR("_CEE_ENVFILE=DD:STDENV")/4000')
//STDENV DD *
GSK_TRACE_FILE=/tmp/zzztrace.file
GSK_TRACE=0xff
GSK_KEYRING_FILE=START1/TN3270
/*

This failed to work.

When I looked in the trace file I found

ENTRY gsk_open_keyring(): ---> Keyring 'START1/TN3270                       ' 

It had taken the whole length of the input line – and so START1/TN3270 padded with blanks was not found.

The trace file was not /tmp/zzztrace.file, it was /tmp/zzztrace.file padded with lots of blanks!

The answer is to use an environment file in USS, not in-line JCL or a data set.
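An alternative, if you want to keep the variables in-stream or in a data set, is the Language Environment variable _CEE_ENVFILE_S, which works like _CEE_ENVFILE but strips trailing white space from each NAME=VALUE line read in; with it, the in-stream data should work. For example

// PARM=('ENVAR("_CEE_ENVFILE_S=DD:STDENV")/4000')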

Destination unreachable, Port unreachable. Which firewall rule is blocking me?

I was trying to connect an application on z/OS through a server to my laptop – so three systems involved.

On the connection from the server to my laptop, using Wireshark I could see no traffic from the application.

When I used Wireshark on the z/OS to server connection I got

   No  Source    Destination  Port  Protocol  Info
>  1   10.1.1.2  10.1.0.2     2175  TCP       ..
<  2   10.1.1.1  10.1.1.2     2175  ICMP      Destination unreachable (Port unreachable)

This means

  1. There was a TCP packet from 10.1.1.2 (z/OS) to 10.1.0.2 (my laptop) port 2175.
  2. The response was Destination unreachable (Port unreachable).

This was a surprise because I could ping from z/OS through the server to the laptop.

Looking in the firewall log (/var/log/ufw.log) I found

[UFW BLOCK] IN=tap0 OUT=eno1 MAC=... SRC=10.1.1.2 DST=10.1.0.2 ... PROTO=TCP SPT=1050 DPT=2175 ...

This says

  • The packet was blocked. (When the ufw firewall is in use, all of its messages and definitions contain “ufw”.)
  • From 10.1.1.2
  • To 10.1.0.2
  • Source port 1050
  • Destination port 2175

With the command

sudo ufw route allow in on tap0 out on eno1

This allows traffic to be routed through this node from interface tap0 to interface eno1, and solved my problem.

What caused the problem?

iptables allows the systems administrator to define rules (or chains of rules – think subroutines) to control the flow of packets through the Linux kernel. For example

  • control input: packets destined for this system
  • control output: packets sent from this system
  • control forward: packets flowing through (routed by) this system.

ufw is an interface to iptables which makes it easier to define rules.

You can use

sudo ufw status

to display the ufw definitions, for example

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
Anywhere on eno1           ALLOW       Anywhere
Anywhere on tap0           ALLOW       Anywhere (log)   # ‘colin-ethernet’

You can use

sudo iptables -L -v

to display the iptables rules. The -v option shows you how many times each rule has been used.

sudo iptables-save reports on all of the rules. For example (a very small subset of my rules)

-A FORWARD -j ufw-before-forward
-A ufw-before-forward -j ufw-user-forward
-A ufw-user-forward -i tap0 -o eno1 -j ACCEPT
-A ufw-user-forward -i eno1 -o tap0 -j ACCEPT

-A ufw-skip-to-policy-forward -j REJECT --reject-with icmp-port-unreachable

Where

  • -A FORWARD … says when doing forwarding, use the rule (subroutine) called ufw-before-forward. You can have many of these statements.
  • -A ufw-before-forward -j ufw-user-forward: at the end of subroutine ufw-before-forward, call (-j, jump to) subroutine ufw-user-forward.
  • -A ufw-user-forward -i tap0 -o eno1 -j ACCEPT: in subroutine ufw-user-forward, if the input interface is tap0 and the output interface is eno1, then ACCEPT the traffic and pass it on to interface eno1.
  • -A ufw-user-forward -i eno1 -o tap0 -j ACCEPT: in subroutine ufw-user-forward, if the input interface is eno1 and the output interface is tap0, then ACCEPT the traffic and pass it on to interface tap0.
  • -A ufw-skip-to-policy-forward -j REJECT --reject-with icmp-port-unreachable: in this subroutine, do not allow the packet to pass through, but send back an icmp-port-unreachable response. This is the response I saw in Wireshark.

With -j REJECT you can specify

icmp-net-unreachable
icmp-host-unreachable
icmp-port-unreachable
icmp-proto-unreachable
icmp-net-prohibited
icmp-host-prohibited
icmp-admin-prohibited

The processing starts at the top of the tree and goes into each relevant “subroutine” in sequence until it finds an ACCEPT or REJECT.

If you use sudo iptables -L -v it lists all the rules and the use count. For example

Chain FORWARD (policy DROP 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
...
259 16364 ufw-before-forward all -- any any anywhere anywhere

Chain ufw-before-forward (1 references)
pkts bytes target prot opt in out source destination
...
77 4620 ufw-user-forward all -- any any anywhere anywhere

Chain ufw-user-forward (1 references)
pkts bytes target prot opt in out source destination
0 0 ACCEPT all -- eno1 tap2 anywhere anywhere
0 0 ACCEPT all -- tap2 eno1 anywhere anywhere
9 540 ACCEPT all -- tap0 eno1 anywhere anywhere
0 0 ACCEPT all -- eno1 tap0 anywhere anywhere

Chain ufw-reject-forward (1 references)
pkts bytes target ...
45 2700 REJECT ... reject-with icmp-port-unreachable

  • For the packet forwarding a number of “rules” were processed:
    • 259 packets were processed by subroutine ufw-before-forward.
  • Within ufw-before-forward, there were several calls to subroutines:
    • 77 packets were processed by subroutine ufw-user-forward.
  • Within ufw-user-forward, the tap0/eno1 line shows 9 packets, which were forwarded when the input interface was tap0 and the output was eno1.
  • Within the subroutine ufw-reject-forward, 45 packets were rejected with icmp-port-unreachable.

The ufw-reject-forward chain was the only instance of icmp-port-unreachable with a packet count greater than 0. This was the rule which blocked me.

Log file

In /var/log/ufw.log there was a [UFW BLOCK] entry for the address and port.

Non functional requirements: backups

This blog post is part of a series on non functional requirements, and how they take most of the effort.

The scenario

You want a third party to implement an application package to allow people to buy and sell widgets from their phone. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.

At the back-end is a web server.

Requirements you have been given.

  • We expect this application package to be used by all the major banks in the world.
  • For the UK, we expect about 10 million people to have an account
  • We expect about 1 million trades a day.

See start here for additional topics.

Why backup?

You need to take backups (and, more importantly, be able to restore them) for various reasons:

  • To recover from media failures.
  • To recover from human error. You may have mirrored disks, but if an operator deletes a file or table, it will be reliably deleted on all the mirrored copies.
  • You may be asked for historical information: 10 years ago, did this person have an account with you, and can you show the transactions on the account?

How to backup

For a simple file, it is easy to take a backup.

For a database, or a file which is continually being updated, you need a more sophisticated approach. If a transaction is taking funds from one account and adding funds to a different account, you need to ensure that the backup has consistent data.

With databases you can back up an “in-flight” database. If you need to restore it, the database replays the transaction log and reapplies any transactions.

Another solution is to have the main database read-only, and do updates in a small database in front of the main database.

You could also partition the database, for example an “A” partition for surnames beginning with A, and so on. Each partition is smaller than one large database, and so quicker to back up; a sketch of the idea is below.
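A minimal sketch of the partitioning idea (the function name is hypothetical): route a customer to one of 26 partitions by the first letter of the surname, so each partition can be backed up and restored independently.

#include <ctype.h>

/* Hypothetical helper: pick a partition from the first letter of the
   surname. Non-alphabetic names fall into partition 0. */
int partition_for(const char *surname)
{
    unsigned char first = (unsigned char)surname[0];
    return isalpha(first) ? (toupper(first) - 'A') : 0;
}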

What do you backup?

You need to think about what you back up. For example, people’s names and addresses do not change very much, but their current balance may change every day.

How long to keep the backup for?

You may have to keep backups for 10 years, depending on your industry regulator.

How much does it cost ?

When you are specifying the project there will be many unknowns, so you need to make assumptions.

For example, the brief says there will be 10 million users and 1 million trades a day.

Non functional requirements: do not create a straitjacket.

This blog post is part of a series on non functional requirements, and how they take most of the effort.

The scenario

You want a third party to implement an application package to allow people to buy and sell widgets from their phone. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.

At the back-end is a web server.

Requirements you have been given.

  • We expect this application package to be used by all the major banks in the world.
  • For the UK, we expect about 10 million people to have an account
  • We expect about 1 million trades a day.

See start here for additional topics.

The functional straitjacket

You may decide to use a facility like a particular database because you can use a function that only that particular database provides. If you later want to move from this database supplier, you are tied because of this function. It may be expensive to write this function yourself, to allow you to move without disrupting ongoing usage.
You may decide to use a standard level of function, such as SQL – but there are different standards; for example, some levels of the SQL standard support JSON and others do not.

To avoid the straitjacket, you might decide on a subset of functions, and require that applications do not use functions outside this subset.

Testing for this

You might want to do most of your testing on one platform/environment, but include tests on different environments. For example, run some instances of your Java web server on Windows and some on Linux, and have two different back-end databases.

This may increase the development costs – but this is cheaper than trying to escape from the straitjacket when running in production.

Designing for this

Rather than scattering database calls throughout your code, consider a component or domain which does all of the database calls. If you need to change your database, you only have to change that component, not the whole product. It also allows the mapping of different return codes, and of parameters, to be done once; a sketch of the idea is below.
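A minimal sketch of this idea, with hypothetical names: all SQL sits behind one small component, and vendor return codes are mapped to the product’s own codes in one place.

/* Hypothetical data-access component: no other part of the product
   issues SQL directly, so a database change touches only this file. */
typedef struct { int rc; int reason; } db_status;

/* Map a vendor-specific SQLCODE onto the product's own scheme once,
   here, instead of in every caller. */
static int map_vendor_code(int sqlcode)
{
    if (sqlcode == 0)   return 0;   /* OK               */
    if (sqlcode == 100) return 4;   /* warning: no rows */
    return 8;                       /* error            */
}

/* One wrapper per logical operation; callers never see raw SQL. */
db_status db_get_balance(const char *account, long *cents_out)
{
    (void)account;
    int sqlcode = 0;   /* would be set by the real database call */
    *cents_out = 0;    /* would be filled from the returned row  */
    return (db_status){ map_vendor_code(sqlcode), sqlcode };
}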

With some databases (such as DB2) you run a program, runstats, which tells DB2 to update its meta-knowledge of the tables. This helps the database manager make the best decision for accessing the data when there are multiple paths to it – for example, whether to use a sequential scan to access rows, or an index. If this metadata changes, you need to rebind your applications to pick up the changes. If your SQL is isolated in one component, you just have to rebind that component, and not the whole product.

Other straitjackets

  • Use of certificate authority certificates. Can you change the CA chain?
  • Use of a computer language, and deprecated functions within that language.
  • Use of a particular compiler. Your code does not compile with a different compiler or on a different operating system.
  • Use of packages for TLS – and using unsupported cipher specs.
  • Use of a virtualisation system.
  • Use of a cloud provider.

Non functional requirements: do you want immediate database consistency or eventual consistency?

This blog post is part of a series on non functional requirements, and how they take most of the effort.

The scenario

You want a third party to implement an application package to allow people to buy and sell widgets from their phone. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.

At the back-end is a web server.

Requirements you have been given.

  • We expect this application package to be used by all the major banks in the world.
  • For the UK, we expect about 10 million people to have an account
  • We expect about 1 million trades a day.

See start here for additional topics.

What is consistency?

After an end user sells some widgets, if you display the status you should see that the number of widgets owned has gone down, and the money in the user’s account has increased. The number of system-wide available widgets has gone up, and the amount of money in the central account has gone down. All the numbers should be consistent. If there is a problem with the system, such as a database outage or a power cut, the data should still be consistent when the system recovers.

Wikipedia says

In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps.[1] In the context of databases, a sequence of database operations that satisfies the ACID properties (which can be perceived as a single logical operation on the data) is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.

If there is one big database this is pretty “obvious”.

There are databases with “eventual consistency”. These databases are distributed and highly available, and it takes a short time (seconds) for updates to be propagated to the other instances. Eventually all instances become consistent.

You may make an update on your phone, but when you look with your laptop’s browser, it takes a few seconds to reflect the update – because a different server and database instance were used.

Distributed databases

A single database is a single point of failure. You can have databases which are distributed: for example, you have many sites and a database instance at each site. Whenever a user makes a trade, the local database is updated, and at commit time the updates are sent to the remote sites and applied to the other databases. Immediately after the trade the databases are inconsistent; a short time later (seconds) all databases are consistent.

This looks like a pretty simple design. However, it gets more complex when there are updates occurring on all instances at the same time.

For this to work, the update sent to the other systems must reflect the changes made, such as “10 widgets sold” or “credit account with $100”, rather than absolute values such as “current balance $400”; see the sketch below.
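A minimal sketch (the structure is hypothetical) of shipping deltas rather than absolute values: deltas can be re-applied at each site even when sites receive concurrent updates in different orders, whereas an absolute balance would overwrite a concurrent update.

/* Hypothetical replication record: ship what the transaction did,
   not the resulting balance. */
struct replicated_update {
    char account[16];
    long widgets_delta;   /* e.g. -10: "10 widgets sold"             */
    long cents_delta;     /* e.g. +10000: "credit account with $100" */
};

/* Additions commute, so sites applying the same set of deltas in
   different orders still converge to the same balance. */
void apply_update(long *balance_cents, const struct replicated_update *u)
{
    *balance_cents += u->cents_delta;
}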

What to think about?

You need to consider your availability targets. 100% is achievable, but you need to remember that you will need to shut down and reboot machines to apply fixes, or to move a machine instance.

Can you tolerate an eventually consistent environment, or do you need a totally consistent image?

How will it scale?

Will you partition your data so that groups of users, such as those with names in the range A–C, go to one server, D–F to another server, and so on?

Consider having name and address information in one database, as this information does not change frequently, and dynamic information, such as the number of widgets and the account balance, in another database.

If you have an eventually consistent database, how do you stop people from having multiple sessions all simultaneously trying to transfer your money to an external bank account – exploiting the propagation delay of eventual consistency?

Non functional requirements: the additional expense of cheap solutions

This blog post is part of a series on non functional requirements, and how they take most of the effort.

The scenario

You want a third party to implement an application package to allow people to buy and sell widgets from their phone. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.

At the back-end is a web server.

Requirements you have been given.

  • We expect this application package to be used by all the major banks in the world.
  • For the UK, we expect about 10 million people to have an account
  • We expect about 1 million trades a day.

See start here for additional topics.

The cost of change.

In the press you can find references to organisations who want to move off an existing platform or solution, but find that doing so is too expensive. The moving cost is many times the original cost of implementation, and is more than the “cost savings” – but they decide to move anyway. I’ve been reading about a council whose cost of moving went from an initial figure of £3 million to over £25 million – and they haven’t completed the move yet.

An analogy

Someone described migrating an application from one platform to another as a bit like changing the engines of an aeroplane while the plane is in the air: the new engine will not be a drop-in replacement, and you have to keep flying to your destination.

Upgrading software from one version to a different version should work, but there may be small differences.

Replacing a core component, such as using a different database, will need a lot of work. This may be due to:

  • a different performance profile,
  • SQL which is not entirely compatible,
  • a facility you use which is not in the new database, so you need to change your application,
  • error messages and codes which are dissimilar.

Looking at the costs

What is the cost?

As part of the discussions with a supplier of a service (such as cloud, CPU, disk space, or network capacity), you may have got a good initial deal. After the honeymoon period the cost of these services increases, and you may not be prepared for this. For example, the cost of disk space could be an amount per GB per day:

  • The cost is 10 cents per GB per day.
  • Your database is 10GB, so costs $1 a day.
  • You also back up your database daily and keep each copy for 10 years.
    • After 1 day you have one 10GB backup, costing $1 a day.
    • On the 10th day you have 10 backup copies, costing $10 a day.
    • On the 1000th day you have 1000 backups, costing $1000 a day.
    • After n days the accumulated cost of the backups is $n*(n+1)/2, so 1000 days costs about half a million dollars (see the sketch after this list).
    • You also have multiple copies of the backups for Disaster Recovery processes etc. After 1000 days, the accumulated cost is $1 million! This is despite “the cost is only 10 cents per GB per day”.
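A minimal sketch of that arithmetic: on day n you pay $n for backup storage, so the accumulated cost is the sum 1 + 2 + … + n = n*(n+1)/2 dollars.

#include <stdio.h>

/* Each daily 10GB backup costs $1 per day to keep (10GB at 10 cents
   per GB per day), so on day n you pay $n for backups, and the
   accumulated cost after n days is n*(n+1)/2 dollars. */
int main(void)
{
    long days = 1000;
    double daily_cost_per_backup = 1.0;
    double accumulated = daily_cost_per_backup * days * (days + 1) / 2.0;
    printf("Accumulated backup cost after %ld days: $%.0f\n",
           days, accumulated);   /* prints $500500 */
    return 0;
}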

The charging changes

The terms, conditions and costs may change – basically the price goes up more than expected. As I write this, Broadcom is increasing the cost of VMware licences by a factor of 10, and people are now moving to different virtualisation technologies.

Do not lock yourself in

You may find that you are using a facility in your environment which is not available in other environments, for example with other cloud providers. It may be worth using common facilities rather than platform-unique facilities, as this makes it easier to change, but at a higher initial cost.

Non functional requirements: security

This blog post is part of a series on non functional requirements, and how they take most of the effort.

The scenario

You want a third party to implement an application package to allow people to buy and sell widgets from their phone. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.

At the back-end is a web server.

Requirements you have been given.

  • We expect this application package to be used by all the major banks in the world.
  • For the UK, we expect about 10 million people to have an account
  • We expect about 1 million trades a day.

See start here for additional topics.

What security?

Security covers

  • Application users – for example using their mobile phone to authenticate.
  • What userid will be used on the web server to run the transactions? Is this related to the end user’s id?
  • What fields are visible to the application user?
  • What fields are available to the help desk staff? For example, can they see the full date of birth – or do they type the DOB into a field and have it validated?
  • Are you going to provide audit information for any changes to the database; for all fields, or only for some fields?
  • Are you going to report on read-only access to some fields?
  • How are you going to report violations?
  • Are you going to use encryption on fields? How do you protect the keys?
  • Is your database going to be encrypted – so that if someone copies the database file they are unable to read it – or are you going to rely on the fields being encrypted?
  • What encryption are you going to use? Some encryption is weak (quantum computers will be able to decrypt some ciphers in an instant).
  • Are your backups encrypted?
  • Are your backup and disaster recovery sites able to restore from backups? Do they have the correct certificates?
  • If someone phones in and says they have forgotten their password, how do you validate the request – bearing in mind the phone may have been stolen?

Non functional requirements: metrics and checking your product is performing within spec

This blog post is part of a series on non functional requirements, and how they take most of the effort.

The scenario

You want a third party to implement an application package to allow people to buy and sell widgets from their phone. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.

At the back-end is a web server.

Requirements you have been given.

  • We expect this application package to be used by all the major banks in the world.
  • For the UK, we expect about 10 million people to have an account
  • We expect about 1 million trades a day.

See start here for additional topics.

Why measure?

Your management team want this product to be a success. How do you know if you are achieving required performance, and how do you avoid a twitter storm of people complaining that your product is slow? Some businesses, like those buying and selling on the stock exchange, get fined if they do not meet externally specified performance targets.

Three areas you may be interested in:

  1. Will there be a problem next week, next month? Can we see trends in the existing data, such as CPU usage is going to max out, or the end user response time at peak time will be out of spec.
  2. Is there a performance problem now? If so, can you identify the problem area?
  3. Last month when you had a performance problem – had you kept enough data to identify it? The problem may not have shown up in your real-time monitoring.

What do you want to measure?

The people specifying the product have said the average end user response time needs to be under 50 milliseconds. This needs some clarification:

  • You cannot control or influence the phone network, so you do not want to be blamed for a slow network. A better metric might be “time in your system” must be less than 40 milliseconds.
    • This means you need to capture the entry and exit time, and report the transaction duration on exit (see the sketch after this list).
    • You need a dashboard to display this information.
  • The “average response time”: if you get excellent response times at night, when no one is using the system, and poor response times during the lunch break, the average of these may still be under 40 milliseconds. This way of reporting the data is useless. You might want to report the information at a granularity of 1 minute.
  • You might want to display a chart of the count of transactions taking over 40 milliseconds, and display the maximum count value in the interval. This should always be zero.
  • You might want to report the average response time per hour, and report this over the last 3 months (or the last 13 months). This should allow you to see if there are any trends in response time, and gives you time to take action before there is a problem.
  • Your application could record the time spent doing database work, and if this time is over 20 milliseconds, report and plot it.
  • If you have any exceptions, such as a long response time, you could send an event to a thread which captures these and writes a database record. You do not want a database write in the transaction path, just before it returns to the end user.
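A minimal sketch of capturing the entry and exit times (assuming a POSIX clock_gettime; this is not the product’s actual code): flag any transaction that exceeds the 40 millisecond in-system target.

#include <stdio.h>
#include <time.h>

/* Milliseconds between two timestamps. CLOCK_MONOTONIC is used so
   clock adjustments do not distort the measurement. */
static double elapsed_ms(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1.0e6;
}

/* Wrap a transaction: record entry and exit time, and report any
   transaction that is over the 40 ms target. */
void run_transaction(void (*work)(void))
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    work();                                  /* the real transaction */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ms = elapsed_ms(start, end);
    if (ms > 40.0)
        fprintf(stderr, "slow transaction: %.1f ms\n", ms);
}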

How to capture data

You can have the transaction collect the per-transaction response time within your application server. You could also have a mobile phone that does a transaction every minute, so you get an early notification if there is a network problem.

How to look at the data

You need to have a profile of expected/historical data. For example, during the day the response time is 40 ms; overnight it is 10 ms. If you start getting a response time of 20 ms overnight, this should be investigated: it is still below 40 ms, but it does not match the profile, and that is a good indicator of a problem.

Different days have different profiles, perhaps Monday is always a peak day, and Christmas day is a quiet day.
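A minimal sketch of such a profile check (the helper and the factor of two are assumptions): compare a measurement against the expected value for that time of day, not just against the absolute target.

/* Hypothetical profile check: flag a measurement that deviates from
   the expected value for this time of day, even if it is below the
   absolute 40 ms target (e.g. 20 ms overnight when 10 ms is normal). */
int out_of_profile(double observed_ms, double expected_ms)
{
    return observed_ms > 2.0 * expected_ms;
}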

Non functional requirements: availability

This blog post is part of a series on non functional requirements, and how they take most of the effort.

The scenario

You want a third party to implement an application package to allow people to buy and sell widgets from their phone. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.

At the back-end is a web server.

Requirements you have been given.

  • We expect this application package to be used by all the major banks in the world.
  • For the UK, we expect about 10 million people to have an account
  • We expect about 1 million trades a day.

See start here for additional topics.

Why plan for availability

These days people expect online applications to be available 24*7, which includes the middle of the night and Christmas day (people of non-Christian faiths do not treat Christmas day as special). A common availability target is to have no more than 5 minutes of down time per year – roughly 99.999%, or “five nines”, availability. This description is a bit vague: does it mean 5 minutes for 100% of attempts to access your system, or 10 minutes for 50% of attempts to access your system?

You need to keep your systems current with fixes, and it may take a day to apply fixes and make a system current – or to introduce new hardware. You need a solution which can tolerate this.

You need to allow for site loss – for example a power cut takes out your site, or someone puts a digger through the network cabling to your building.

Be clear what your availability targets mean.

If you have a failover system, with primary and backup, when your primary system is unavailable, and you are running on the backup system, you do not have a backup system!

What you need is a primary, backup, and an in-reserve backup system which can be quickly activated when the primary system is down.

If you run with a backup system, you may have a lot of resources allocated but doing nothing. This can increase the cost.

You may have multiple instances all running workload. If one system is taken down, work should be able to flow to the other systems. You need enough spare capacity to handle the workload if one or more systems are taken down.

If you are going to run multiple instances you need to consider where requests are routed to. For example can a request from any user go to any server?

Is there a mapping between which users can use which servers? Do account numbers ending in ‘1’ go to the SERVERA1, SERVERB1, … servers, and so on?

Where is the weakest link?

You need to go through your planned configuration and ask “what happens if…”.

For example you may have 100 boxes running web servers processing requests, and this is spread across two sites. If you lose a site you should still have half your servers available to you.

Your applications access the database remotely. How available is the database? Can half of the database machines be taken down? If you lose access to the database disks at one site, can the database still operate?
I worked on the IBM mainframe, where DB2 could be spread across different machines and the disks mirrored across sites. In the event of a disaster, a remote site hundreds of miles away could be used to run DB2.

You need to test availability, for example by taking components offline.

I remember one customer who had excellent procedures for recovery. There was an online document that was carefully maintained. Once, they had a problem and lost the machine holding this online document, and so could not restart the main machine because they did not have the instructions. They fixed this by printing out a copy of the document once a week.

You need to check

  • CPUs
  • operating system images
  • networking
    • DNS server
    • external firewalls
    • external routers
    • certificate expiry
  • disks
  • databases
  • people (what happens if ‘the expert’ is not present)

At another customer, some key machines were kept in a room locked with a physical key. The shift manager had the key. This was fine until the shift manager went for a coffee – and they needed to get into the server room. The switch-over took much longer than expected because they had to find the shift manager. You need to consider whether enough people have access to the resources. This could be physical access, or logon access.

CPU availability

You need to be able to handle peaks in workload. This can mean

  • As you need more CPU, you go and get it – bearing in mind that if you are charged for service by your cloud provider, changing usage bands can be expensive.
  • As your production workload increases, you use the same amount of resource overall, but reduce testing or other workload activities.

One bank I was involved with had two of the largest mainframes that IBM made, for production and test. Production work had first call on the CPU; any spare CPU was used by the test teams (they got a lot of work done overnight). Once the decision was made, they could switch all the production work from one mainframe to the other in seconds. If this happened, they then brought up production images on another mainframe (the system programmers’ sandbox), in case it was needed.

Their normal peak time production usage was over 100 times the overnight production usage.

Backup and recovery of data

Backup

You may have mirrored disks, so in the event of a disk failure the data is still available. If an operator makes a mistake and deletes a table, mirrored disks do not help you, as the delete will occur on all copies. You need a backup to be able to recover from this.

You may be required by law to recover a database to a known point in time. “Did this person have a banking account with you – and how much was in it?”.

You need to have a process to backup and restore data.

Your database needs the capability to back up tables while the database is in use. If you update two rows in one transaction, only one of the updates may be in the backup. Databases handle this by using transaction logs: if you restore a backup, the database will use the transaction log and replay any updates.

Taking a backup can cause a lot of I/O to the disks. You need to allow for this in your capacity planning.

Your backups need to be stored in a different location to your main data. A university lost many years of data because the backups of their system were stored in a rack next to the computer. They had a fire, and the computer building burnt down, losing all of their data and backups.

Recovery

The important word in “Backup and Recovery” is recovery. You need to test your recovery procedures – perhaps at a remote, isolated site. Recovery problems I have known include:

  • The backups were of an “old” database – taken before the database had been extended. Most of the customer data was not available.
  • In the days of physical tapes, the tape drives in the recovery site were not able to read the tapes from the production site.
  • People running the restore did not have the authority to restore the data.
  • There was a problem with one part of a table, and the data could not be restored. All backups had the same problem.

Once you have restored the tables, it may take a long time for the indexes to be created or refreshed.

You will need to use the database recovery process to replay any updates made since the backup was taken. This could take a long time if there are a large number of updates. The logs need to be available in real time at the recovery system.

You may not have the very latest updates, which occurred just before the failure – for example, because they had not yet been asynchronously copied to the backup site.

Networking and workload balancing

You may need a networking device to do workload balancing across your servers. You need to consider if you want

  • Any server can process a request from any client.
  • Work is routed to a server depending on the client – for example the first letter of the client’s name.

If you lose a site, can you quickly switch traffic to the backup site? I’ve worked with a customer who did this switch once a day.

Switching sites

Most big customers have a main site, a backup site, and a disaster recovery (DR) site. It may take a couple of hours to bring up the DR systems, for example by restoring from backups. These systems are the key ones for providing business continuity. Only production systems are provided – no test systems.

20 years ago a customer said that for every minute they were down, it cost them a million dollars, and if they were down for a day they would be out of business. When the stakes are this high, you need to have backup and disaster recovery systems – and these must be tested regularly.

What does all this mean?

As well as building for recovery, you need to have smart applications. For example, for every database update you insert a row into a table with the date, time, person identifier, before data, after data and the change made (see the sketch after this list).

  • You can then use this to replay from this table and update the database.
  • You have an audit trail of every update made by the transactions.
  • You can do analysis on the data and extract useful information – such as 5% of people do 90% of the updates.
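A minimal sketch (the layout is hypothetical) of such an audit record:

/* Hypothetical audit record: one row is inserted for every database
   update, giving both a replay capability and an audit trail. */
struct audit_record {
    char date[11];      /* YYYY-MM-DD                          */
    char time[13];      /* HH:MM:SS.mmm                        */
    char user_id[9];    /* person identifier                   */
    char before[256];   /* data before the update              */
    char after[256];    /* data after the update               */
    char change[64];    /* the change, e.g. "credit with $100" */
};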

Non functional requirements: error messages

This blog post is part of a series on non functional requirements, and how they take most of the effort.

The scenario

You want a third party to implement an application package to allow people to buy and sell widgets from their phone. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.

At the back-end is a web server.

Requirements you have been given.

  • We expect this application package to be used by all the major banks in the world.
  • For the UK, we expect about 10 million people to have an account
  • We expect about 1 million trades a day.
  • The banks want the messages from the web server to be in their national language, for example Japanese banks want the messages to come out in Japanese.

See start here for additional topics.

What standards do you need to specify for these web server messages?

Consider the following code to issue a database request

EXECUTE_SQL(returnCode, reasonCode, "SELECT FROM…", pReturnedData, &length)

Where

  • returnCode is 0, 4, 8, 12 or 16:
    • 0 – all worked
    • 4 – warning, perhaps no records found
    • 8 – error, perhaps an invalid table was specified
    • 12 – severe error – you do not have access to the table
    • 16 – critical or terminating – a problem so bad the system is shutting down
  • reasonCode gives an error-specific code which allows you to identify the warning or error; there may be thousands of these.
  • "SELECT FROM…" is the string passed to the database.
  • pReturnedData is where the returned data is stored.
  • &length is the length of the buffer on input, and is set to the length of the data on return.

You could have some code to report the error.

if (rc != 0)
{
    printf("Hey Dude, Database error!");
    return (rc);
}

This is wrong in so many ways.

  • It provides very little useful information.
  • It does not report the return and reason code, which you need to identify the problem
  • You do not know which source module reported this message
  • It is hard to look up on the internet to see if this problem has been reported before.
  • It does not display the message in the national language. Adding code like “if language=Japanese printf(…)” is not practical, as the message content is embedded within the program source.
  • If you have 10,000 transactions hitting this problem you will get flooded with messages, and it will be hard to see any other messages.

What can you do?

Rather than putting the printf inline, call a message function and pass it the variable data. This message function looks up the message boilerplate and substitutes the variables. Different languages are handled by having a different file of messages for each language.

IBM mainframe products follow standards for messages.

  • Each product is allocated a three-character prefix. This is used in source code names, for messages, etc. By looking at the first three characters of a message, I know which product it came from.
  • The next character is for the major component within the product. xxxC… may be the command processor, xxxS… may be for the statistics component.
  • Then comes a 3 or 4 digit number.
  • The last character is I, W, E, S or T, representing Information, Warning, Error, Serious error, or Terminating error. With this scheme you can have automation which ignores ‘I’ and ‘W’ messages, and only takes action for ‘E’, ‘S’ and ‘T’ – for example, if you get a ‘T’ message, page someone (see the sketch after this list).
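A minimal sketch (the helper is hypothetical) of automation keying off the severity suffix:

#include <string.h>

/* Decide from the last character of a message identifier whether
   automation should take action (E, S, T) or stay quiet (I, W). */
int needs_action(const char *msgid)
{
    size_t len = strlen(msgid);
    return len > 0 && strchr("EST", msgid[len - 1]) != NULL;
}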

You may want to reuse the same message – for example, you issue the database request in 100 source files, or you use a macro to generate the code. It makes sense to use the same error message number, but you need to provide something to identify which of the 100 source files it came from, and which of the several instances within one source file.

You could give the source file name, but this may give away confidential information about your product structure. Instead, you could give each source file a number, and combine the source file identifier with the line number in the file. You then report it as a hex code, ‘00050028’, so this would be module ‘5’, line 40 (0x28).

Your code then calls a function

MSG_MODULE('ABCD1234W',0x00050028,rc,reason,pString); 

Where pString points to useful information like the table name which caused the error.

In your msg_module, if the language is English you locate the external constant ABCD1234W in the English file; if the language is Japanese you locate ABCD1234W in the Japanese file.

This string may be something like

“ABCD1234W Database problem. Return code %1$d, Reason code %2$d, location %3$8.8x, table %4$s”

Where

  • %1$d says convert the first parameter to a decimal
  • %2$d says convert the second parameter to a decimal
  • %3$8.8x says convert the third parameter to a hexadecimal number
  • %4$s says treat the fourth parameter as a string.

You should use %1$d rather than %d because in another language the parameters may be displayed in a different order, such as

“ABCD1234W Il y a une database problem avec table %4$s. Return code %1$d, Reason code %2$d, location %3$8.8x”
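A minimal sketch of the lookup (the message table and the table name PAYROLL are hypothetical; it assumes a printf that supports the POSIX %n$ positional conversions, as z/OS and Linux do): the caller passes the same arguments whatever the language, and the positional conversions pick them out in the right order.

#include <stdio.h>

/* Hypothetical lookup: a real product would read the boilerplate for
   ABCD1234W from a per-language message file. */
static const char *lookup(const char *lang)
{
    if (lang[0] == 'f')   /* French: inserts in a different order */
        return "ABCD1234W Il y a une database problem avec table %4$s. "
               "Return code %1$d, Reason code %2$d, location %3$8.8x\n";
    return "ABCD1234W Database problem. Return code %1$d, "
           "Reason code %2$d, location %3$8.8x, table %4$s\n";
}

int main(void)
{
    printf(lookup("en"), 8, 144, 0x00050028, "PAYROLL");
    printf(lookup("fr"), 8, 144, 0x00050028, "PAYROLL");
    return 0;
}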

What information is useful?

You need to decide who to produce the information for: the end user, or the operators.

For operators

For each message you need to decide what information is useful. A message such as

“ABCS555E Security violation”

could be improved by adding the userid causing the violation, the resource being accessed, the time the event occurred, and the server instance.

For the end user

With security messages you need to be careful not to give information away to people breaking into your system. A message

Userid or password invalid

is better than

Password invalid

because the second message tells the hacker that the userid chosen is valid; it is just the password which is wrong. With “Userid or password invalid” you do not know if the userid is invalid, or the password is invalid, or both are invalid.

Make it searchable on the internet

If people get an error message, they will look in the documentation or search the internet. If someone enters the message number and the inserts, they should be able to find any references to it, perhaps in user groups. If you have messages in different languages, such as English and Japanese, you should need only the message number and inserts; the message text may be helpful, but is not needed.

You need to be careful to be consistent in the use of decimal and hexadecimal numbers. Some people may treat 16 as decimal 16, and others may treat it as 0x16, which is decimal 22.

How to stop a flood of messages

If you are running 1000 transactions a second, and there is a database problem you will get at least 1000 messages reporting the database problem.

You might want to do some processing to summarise the information. For example, display the first message; if the same message is produced again quickly, do not display it, but accumulate a count of the instances, then report:

FLOOD MESSAGE. 404 instances in the last minute of “ABCD1234E Database problem return code 8, reason code 144 identifier 003304AB”

FLOOD MESSAGE. 19 instances in the last minute of “ABCE333S Database contact lost problem return code 12, reason code 26 identifier 0033025C”
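A minimal sketch of such suppression (hypothetical, with a fixed one-minute window): show the first instance of a message, swallow identical repeats, and report a count when the window closes.

#include <stdio.h>
#include <string.h>
#include <time.h>

static char   current[256];   /* message being suppressed */
static long   count;          /* instances in this window */
static time_t window_start;

/* Route all message output through here. */
void log_message(const char *msg)
{
    time_t now = time(NULL);
    if (count > 0 && strcmp(msg, current) == 0 && now - window_start < 60) {
        count++;              /* repeat inside the window: just count it */
        return;
    }
    if (count > 1)            /* close the previous window with a summary */
        printf("FLOOD MESSAGE. %ld instances in the last minute of \"%s\"\n",
               count, current);
    printf("%s\n", msg);      /* show the new (or first) message */
    strncpy(current, msg, sizeof current - 1);
    current[sizeof current - 1] = '\0';
    count = 1;
    window_start = now;
}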

Provide useful information

You should provide one message for each unique problem or situation, and list the actions to take to resolve it. As your product gets used, you may find there are more causes for each problem, and so you need to update the messages to reflect this. I think it would be great to allow users to vote on solutions: if there are three solutions listed, one with 100 votes, one with 2 votes and one with none, try the popular solution first.

For each message you need to provide

  • A longer description of the message
  • What the system action was (did it do anything like close a database)
  • What the end user/administrator should do

For example the real message

CSQ9016E ‘cmd’ command request not authorized

The message number tells you which product and which component it came from; the final E shows it is an Error message.

The message has sections for

  • Explanation
  • System action
  • System programmer response

Another example

  • CSQ5007E csect-name RRSAF function function failed for plan plan-name, RC=return-code reason=reason syncpoint code=sync-code
  • Explanation: A non-zero or unexpected return code was returned from an RRSAF request. The Db2 plan involved was plan-name.
  • System action: If the error occurs during queue manager startup or reconnect processing, the queue manager might terminate with completion code X'6C6' and reason code X'00F50016'. Otherwise, an error message is issued and processing is retried.
  • System programmer response: Determine the cause of the error using the RRS return and reason code from the message. See Db2 codes in the Db2 for z/OS documentation for an explanation of the codes and attempt to resolve the problem.