This blog post is part of a series on non-functional requirements, and how they take most of the effort.
The scenario
You want a third party to implement an application package that allows people to buy and sell widgets from their phone. Once the package has been developed, they will hand it over to you to sell, support, maintain and upgrade, and you will be responsible for it.
At the back-end is a web server.
Requirements you have been given:
- We expect this application package to be used by all the major banks in the world.
- For the UK, we expect about 10 million people to have an account.
- We expect about 1 million trades a day.
See start here for additional topics.
Why plan for availability
These days people expect online applications to be available 24x7, including the middle of the night and Christmas day (people of non-Christian faiths do not treat Christmas day as special). A common target is no more than 5 minutes of downtime per year. This target is a bit vague: does it mean 5 minutes of downtime for 100% of attempts to access your system, or 10 minutes for 50% of attempts?
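To put that 5 minutes in context, here is a minimal Python sketch (an illustration only) which converts an availability percentage into downtime per year. No more than about 5 minutes a year works out at roughly 99.999% ("five nines") availability.

```python
# Illustration only: convert an availability target into allowed downtime,
# assuming a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (99.9, 99.99, 99.999):
    downtime = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% available -> about {downtime:.1f} minutes down per year")
```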
You need to keep your systems current with fixes, and it may take a day to apply the fixes – or to introduce new hardware. You need a solution which can tolerate this.
You need to allow for site loss – for example a power cut takes out your site, or someone puts a digger through the network cabling to your building.
Be clear what your availability targets mean.
If you have a failover configuration with a primary and a backup: when your primary system is unavailable and you are running on the backup, you no longer have a backup system!
What you need is a primary, backup, and an in-reserve backup system which can be quickly activated when the primary system is down.
If you run with a backup system, you may have a lot of resources allocated but doing nothing. This increases the cost.
You may have multiple instances all running workload. If one system is taken down, work should be able to flow to the other systems. You need enough spare capacity to handle the workload if one or more systems are taken down.
If you are going to run multiple instances, you need to consider where requests are routed. For example, can a request from any user go to any server? Or is there a mapping between users and servers – do account numbers ending in ‘1’ go to the SERVERA1, SERVERB1, … servers, and so on? See the sketch below.
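As a minimal sketch of that style of routing (the SERVERA/SERVERB names come from the example above; the health check is an assumption and only a placeholder):

```python
# Sketch only: route a request based on the last digit of the account number.
# SERVERA<d> is the primary for that digit, SERVERB<d> the backup.
SERVER_GROUPS = {str(d): [f"SERVERA{d}", f"SERVERB{d}"] for d in range(10)}

def is_available(server: str) -> bool:
    # Placeholder health check; a real system would ask the load balancer
    # or probe the server itself.
    return True

def route(account_number: str) -> str:
    """Pick a server for this account, falling back to the backup if needed."""
    for server in SERVER_GROUPS[account_number[-1]]:
        if is_available(server):
            return server
    raise RuntimeError("no server available for this account range")

print(route("12345671"))   # -> SERVERA1
```

The design choice matters for availability: with this scheme a failure only affects the accounts mapped to that group, but every group needs its own spare capacity.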
Where is the weakest link?
You need to go through your planned configuration and ask “what happens if…”.
For example, you may have 100 boxes running web servers processing requests, spread across two sites. If you lose a site you should still have half your servers available to you.
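It is worth turning that into a rough capacity check. A minimal sketch, where the 100 servers across two sites come from the example above, and the peak requirement is a made-up number:

```python
# Rough "what happens if we lose a site" check. The peak figure is made up.
servers_total = 100           # web servers, spread evenly across two sites
sites = 2
servers_needed_at_peak = 60   # servers required to carry the peak workload

servers_left = servers_total - servers_total // sites
if servers_left < servers_needed_at_peak:
    print(f"short by {servers_needed_at_peak - servers_left} servers after losing a site")
else:
    print("the surviving site can carry the peak workload")
```

Half your servers may not be enough if your peak needs more than half of them.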
Your applications access the database remotely. How is the database kept available? Can half of the database machines be taken down? If you lose access to the database disks at one site, can the database still operate?
I worked on the IBM mainframe, where DB2 could be spread across different machines and the disks mirrored across sites. In the event of a disaster, a remote site hundreds of miles away could be used to run DB2.
You need to test availability, for example by taking components offline.
I remember one customer who had excellent procedures for recovery, held in an online document that was carefully maintained. Then they had a problem and lost the machine which held this online document, and so could not restart the main machine because they did not have the instructions. They fixed this by printing out a copy of the document once a week.
You need to check:
- CPUs
- operating system images
- networking
- DNS server
- external firewalls
- external routers
- certificates (what happens when one expires)
- disks
- databases
- people (what happens if ‘the expert’ is not present)
At another customer, some key machines were kept in a room locked with a physical key. The shift manager had the key. This was fine until the shift manager went for a coffee – and they needed to get into the server room. The switchover took much longer than expected because they had to find the shift manager. You need to consider whether enough people have access to the resources. This could be physical access, or logon access.
CPU availability
You need to be able to handle peaks in workload. This can mean
- As you need more CPU you go and get it, bearing in mind that if you are charged for service by your cloud provider, changing usage bands can be expensive.
- As your workload increases you use the same amount of resource overall, but reduce testing or other lower-priority work.
One bank I was involved with had two of the largest mainframes that IBM made, for production and test. Production work had first call on the CPU; any spare CPU was used by the test teams (they got a lot of work done overnight). Once the decision was made, they could switch all the production work from one mainframe to the other in seconds. If this happened, they then brought up production images on another mainframe (the system programmers' sandbox), in case it was needed.
Their normal peak-time production usage was over 100 times the overnight production usage.
Backup and recovery of data
Backup
You may have mirrored disks, so in the event of a disk failure the data is still available on the other copy. If an operator makes a mistake and deletes a table, mirrored disks do not help you, as the delete occurs on all copies. You need a backup to be able to recover from this.
You may be required by law to recover a database to a known point in time. “Did this person have a banking account with you – and how much was in it?”.
You need to have a process to backup and restore data.
Your database needs the capability to back up tables while the database is in use. If you update two rows in one transaction, only one of the updates may be in the backup. Databases handle this by using transaction logs: if you restore a backup, the database will use the transaction log and replay any updates.
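The "restore the backup, then replay the log" idea can be sketched in a few lines of Python. This is only an illustration of the principle – a real database does all of this for you:

```python
# Illustration only: point-in-time recovery = backup image + replay of the
# transaction log up to the requested time. Not how a real database stores data.
import copy
from datetime import datetime

def recover(backup: dict, log: list, until: datetime) -> dict:
    """Rebuild the table as it was at 'until' from a backup and a transaction log."""
    table = copy.deepcopy(backup)                # start from the backup image
    for record in log:                           # log records are in time order
        if record["timestamp"] > until:
            break                                # stop at the requested point in time
        table[record["key"]] = record["after"]   # reapply the logged update
    return table

backup = {"acct-1": 100}
log = [{"timestamp": datetime(2024, 1, 2), "key": "acct-1", "after": 75}]
print(recover(backup, log, until=datetime(2024, 1, 3)))   # {'acct-1': 75}
```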
Taking a backup can cause a lot of I/O to the disks. You need to allow for this in your capacity planning.
Your backups need to be stored in a different location from your main data. A university lost many years of data because the backups of their system were stored in a rack next to the computer. They had a fire, the computer building burnt down, and they lost all of their data and the backups.
Recovery
The important word in "backup and recovery" is recovery. You need to test your recovery procedures – perhaps at a remote, isolated site. Recovery problems I have known include:
- The backups were of an "old" database – taken before the database had been extended – so most of their customer data was not available.
- In the days of physical tapes, the tape drives in the recovery site were not able to read the tapes from the production site.
- People running the restore did not have the authority to restore the data.
- There was a problem with one part of a table, and the data could not be restored. All backups had the same problem.
Once you have restored any tables, it may take a long time for any indexes to be created or refreshed.
You will need to use database recovery to replay any updates made since the backup was taken. This could take a long time if there is a large number of updates. These logs need to be available in real time at the recovery system.
You may not have the very latest updates – those which occurred just before the failure – for example because they had not yet been copied to the backup site.
Networking and workload balancing
You may need a networking device to do workload balancing across your servers. You need to consider whether you want:
- Any server can process a request from any client.
- Work is routed to a server depending on the client – for example the first letter of the client’s name.
If you lose a site, can you quickly switch traffic to the backup site? I’ve worked with a customer who did this switch once a day.
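A minimal sketch of the "any server can process any request" style, with failover built in (the server names and the health check are made up):

```python
# Sketch only: round-robin across whichever servers are currently healthy.
# If a whole site is lost, its servers simply drop out of the rotation.
import itertools

SERVERS = ["site1-web1", "site1-web2", "site2-web1", "site2-web2"]
rotation = itertools.cycle(SERVERS)

def healthy(server: str) -> bool:
    # Placeholder health check: pretend site 1 has just been lost.
    return not server.startswith("site1")

def next_server() -> str:
    for _ in range(len(SERVERS)):
        server = next(rotation)
        if healthy(server):
            return server
    raise RuntimeError("no healthy servers - time to invoke disaster recovery")

print(next_server())   # a site2 server
```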
Switching sites
Most big customers have a main site, a backup site, and a disaster recovery (DR) site. It may take a couple of hours to bring up the DR systems, for example restoring from backups. These systems are the key ones for providing business continuity: they provide only the production systems, with no test systems.
Twenty years ago a customer told me that for every minute they were down it cost them a million dollars, and if they were down for a day they would be out of business. When the stakes are this high you need backup and disaster recovery systems – and they need to be tested regularly.
What does all this mean?
As well as building for recovery, you need to have smart applications. For example, for every database update you also insert a row into a table with the date, time, person identifier, before data, after data, and the change (see the sketch after this list).
- You can then replay the updates from this table to update the database.
- You have an audit trail of every update made by the transactions.
- You can do analysis on the data and extract useful information – such as 5% of people do 90% of the updates.
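A minimal sketch of the idea, using SQLite so it is self-contained. The table and column names are just one way of capturing the date, time, person, before data, after data and change described above:

```python
# Sketch only: do the business update and write the audit row in the same
# transaction, so the audit trail can never disagree with the data.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("""CREATE TABLE audit (
    at TEXT, person TEXT, account_id INTEGER,
    before INTEGER, after INTEGER, change INTEGER)""")
conn.execute("INSERT INTO account VALUES (1, 100)")

def update_balance(account_id: int, change: int, person: str) -> None:
    with conn:   # one transaction covers both the update and the audit row
        (before,) = conn.execute(
            "SELECT balance FROM account WHERE id = ?", (account_id,)).fetchone()
        after = before + change
        conn.execute("UPDATE account SET balance = ? WHERE id = ?", (after, account_id))
        conn.execute("INSERT INTO audit VALUES (?, ?, ?, ?, ?, ?)",
                     (datetime.now(timezone.utc).isoformat(), person,
                      account_id, before, after, change))

update_balance(1, -25, "colin")
print(conn.execute("SELECT * FROM audit").fetchall())
```

Writing the audit row in the same transaction as the update is what makes the replay and audit-trail points above work: the audit table can never be ahead of, or behind, the real data.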