A Tale of Computer System Disaster and Recovery

Our story begins quietly on the Tuesday before Thanksgiving, the eve of the "silly season," that period between Thanksgiving and Christmas when most of us are more focused on the holidays than on business, and weekends are reserved for decorating and shopping. The setting is a medium-sized manufacturing company whose operations depend on a custom business application, hosted on a UNIX server and managing customer data, orders, inventory, work-in-process, and more.

When the Problem Was Small

The first hint of a problem appears late Tuesday evening as a failure of the verification stage of the nightly backups. No one is there to notice, but on Wednesday morning the system administrator reads the backup log sent in an email. The message indicates that although verification failed, the backup stage completed successfully, so he decides that the problem doesn't demand an immediate response. This is a mistake.

Wednesday night the backup process starts at the usual time, but recognizes a condition that the administrator missed: The backup from the previous night did not terminate cleanly. The new backup process stops immediately with the message: "Previous backup is still running." No backup is made for the Wednesday business day. Everyone leaves for a long Thanksgiving weekend.

The following Monday the system administrator receives a rush assignment from management to analyze and correct inventory data. No action is taken to address the backup problem, even though the risk of data loss is increasing daily as the error continues to occur. The backup problem drops down the list of priorities, and after all, the system seems to be running just fine.

In fact, the backup problem should have top priority. Not only is the system not running "just fine," but a set of permanent, non-recoverable errors has appeared on the system volume.
At least one important component of the UNIX operating system is no longer readable, but since the copy in memory is intact and functioning, this is not noticed. Like a running car with a dead battery, it will take you where you want to go, as long as you don't turn it off.

Disaster Strikes

Because Wednesday is the last day of November, the system administrator follows his usual procedure and performs a monthly reboot. This is unnecessary for a reliable time-sharing system like UNIX, but it's a habit learned from years of dealing with less-reliable systems that must be rebooted often to remain healthy. In this case, because of the errors on the system volume, reboot has a fatal result: The startup fails. Now there is a serious problem.

The administrator looks for the problem until late in the evening. Unable to get the system started, he goes home, returning early on Thursday, December 1. All morning he works frantically to get the system running, but without success. By now, the entire company knows the system is unavailable, including top management. After a brief meeting with the administrator, they decide to call us.

The Fire Department Arrives

Just after lunch we receive their urgent call for help and leave immediately for their location, arriving an hour later. We expect to diagnose the problem that day and have the machine running again before the end of the business week. If only we knew what was coming...

Since they are a new customer, our knowledge of their systems and operations is limited, so we begin asking questions. They have an eight-year-old UNIX system, a Hewlett-Packard 9000 server, vintage 1997, with two hard drives plus one spare and a CD-ROM drive. The operating system is HP-UX 10.2, also from 1997. We ask about "early warning signals" of a problem and are shown the backup logs. Now we're starting to appreciate how much data is at risk: Four business days of work, eight calendar days, have elapsed since the last backup.
We search through a box of manuals, CDs, and many pages of information from various sources, trying to understand what tools are available. Finding a diagnostic CD that should be bootable, we slide it in the drive and turn on the power. The computer starts and runs the diagnostics system. With it we test the disk drives and find the errors on the system volume.

Now we know that the primary disk has failed, taking part of UNIX with it. We understand why it won't boot, and we know that recovery from backup is needed. But there's no current backup. We also find that there is a logical drive structure overlaying and spanning the two disks, and we don't know how to deal with it. In addition, some of the logical drive utility programs are lost, the system recovery shell does not see a file system, and there's no documentation for the logical drive configuration.

Gathering Resources

We spend Friday at our offices trying to locate recovery procedures, spare disks, and a recovery expert. Progress is slow. As usual we receive many "we'll call you back" responses, followed by a long wait. No one wants to work over the weekend. We consider replacing the entire system and search for a machine we can borrow. The application vendor suggests that they can provide a system, but not until Tuesday, and only if their software is upgraded, and the cost is more than $15,000.

By the end of the day on Friday we have a consultant scheduled to arrive at 9:30 am Saturday, the system administrator is loading a new UNIX install on the spare disk, and we all plan to meet Saturday morning and get it fixed.

We arrive Saturday morning to find that the administrator has been there all night installing a new operating system on the spare drive, followed by patches and fixes. The process takes more than five hours. If only there were a bootable disk image ready to put on a disk, or, even better, a spare bootable disk!
After the install completed he overlaid a backup, but the resulting system won't run. And, no image was made of this installation, so he did it all again, from the beginning! The second time he used the wrong sizes for the physical disks, but that's not discovered until later. Finally he copied an image of the second installation to a network storage appliance.

We discuss restoring the last backup over the new one-disk system and decide that it will just fit. This is a relief because the two original drives have at least one logical volume that spans both drives, so we can't easily replace just one of them. Later, if we have time, we can try to recover data from the second drive, data that was never backed up.

We describe our plan to management, telling them that we feel confident that they will have a running system on Monday morning, although perhaps with five days of data loss, and set to work. But the surprises keep coming.

We Have Backup, But No Restore

The backup program is third-party, so it needs to be installed. We find the original media and discover, to our chagrin, that it's magnetic tape. There's no mag tape drive in the computer! The system administrator says it came with one, but it failed, was removed, and never replaced. We "borrow" a tape drive from another network server, install it, and configure UNIX with a driver for it.

We install the backup program from tape, run it, point it at the backup archive file, and Presto! it stops with the message "Error 15, call us for a licensed copy of this program." After an hour or so of searching for a license file and trying to find the original license key, we decide that there's no problem with the license at all, only with the error message. Error 15 is translated in the manual as "Backup archive is incomplete or corrupt," and we decide that the license portion of the message is just a tagline, appended to every message. So the backup program won't restore its own files.
We consider the possibility that the backup version installed from magnetic tape is not as new as the one that was running on the system, and decide to get those files restored. The backup programs are backed up as well, and by now the entire backup file is on a UNIX laptop brought by our recovery consultant, so he pulls the program files from his machine and copies them to the new system drive. We run this installation, but it gives the same error message.

We examine the backup files and decide that they are standard UNIX tar (tape archive) files, so instead of using the backup programs we use tar to put files back on the newly-built disk. We're puzzled by the absence of compression on the tar archive, but shrug it off and continue. This takes many hours because the backup files are located on the network and the HP has only a slow network interface. While we're waiting we do more research about retrieving data from the second disk drive.

Secret Formats

Once the restore completes we check the system and get a surprise: Some restored files are compressed. Closer review of the backup file format and the backup program documentation reveals that there was compression, but in an unusual way. A typical UNIX way of making a backup archive is to collect a group of files into a single tar archive, then compress the entire archive file, naming it in a standard way that signals what was done. The third-party backup program, on the other hand, was compressing files defined as not small, then assembling the resulting files into a tar archive. So some of the files we put on the system are still compressed, and we don't know which ones. We cannot discover how the backup program keeps track of that, either.

It's Saturday evening, the system administrator has been here for 36 hours, the rest of us for nearly 12, and it's time to stop for the day.
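
One way to find out which members of such an archive are still compressed is to look at the first few bytes of each file, since common UNIX compression formats begin with a recognizable signature. The following is only a minimal sketch in Python, not the method used during this recovery. The function names are hypothetical, and it assumes the backup program used an ordinary format such as gzip, compress (.Z), or bzip2, which the story does not confirm.

```python
import tarfile

# Leading "magic" bytes of some common UNIX compression formats.
# Assumption: the backup tool used one of these; the article does not say.
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"\x1f\x9d": "compress",  # classic UNIX compress (.Z)
    b"BZh": "bzip2",
}


def sniff(header):
    """Return the format suggested by the leading bytes, or 'plain'."""
    for magic, name in MAGIC.items():
        if header.startswith(magic):
            return name
    return "plain"


def classify_members(tar_path):
    """Map each regular file in an uncompressed tar archive to a format name."""
    result = {}
    # Mode "r:" opens the archive without any whole-archive decompression.
    with tarfile.open(tar_path, "r:") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and device nodes
            stream = archive.extractfile(member)
            result[member.name] = sniff(stream.read(3))
    return result
```

Calling `classify_members("backup.tar")` (a hypothetical file name) returns a dictionary such as `{"etc/passwd": "plain", "data/orders.db": "gzip"}`. A signature match is only a heuristic: an uncompressed file can begin with the same bytes by coincidence, so each flagged file would still need a trial decompression before being replaced.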
We make a plan for the next day: The UNIX consultant will take the backup archive with him and work on it off-site, teasing it apart, uncompressing the files, and building a standard-format archive. The rest of us will sleep and reconvene on Sunday.

Back at the customer location on Sunday, we talk with the administrator by phone, suggesting that he relax at home until we need him. A call to our UNIX expert confirms that he's progressing with the reconstruction of the data. We review all our previous mistakes and false starts and decide to carefully follow documented installation procedures.

We have an operating system with patches installed, and a problem: The user licenses and other software are not installed. The manual says that a new install includes a two-user license, and you must install the application software to apply other licenses. We need a license key and code. We find a key set and try to use it with the CD set used to install the OS, but it doesn't work; the key is invalid. After some research we conclude that the keycodes are CD-specific. We search for a CD matching our key set, but cannot find it. So much for licenses; we'll work on this again later.

Failures On Top of Failures

We decide to copy the system disk image to tape. We run the low-level utility to do that, but it doesn't recognize the tape drive because the drive is newer than the utility. Luckily, we find another tape drive that is compatible with the original, install that in the server, and try again. We load the utility CD-ROM, boot the computer, and the CD drawer opens. No matter what we do, if it has power, the CD drawer opens. The CD-ROM drive has failed!

We check the company's PC morgue and find a SCSI CD-ROM drive in a retired server. We remove it, blow off a cloud of dust, and install it into the HP server. Miraculously, it works! We begin an image backup of the system disk. We continue looking for a license key and matching CD set.
After a long search we find another set of license keys and later, matching CDs. Meanwhile our UNIX guru has finished a standard tar archive that we can install using standard UNIX utilities, but there's a small problem: It's on a DVD at his house, about 60 miles away. We try an Internet file transfer, but a bandwidth test indicates that it will take seven hours, and it's already 6 pm. One of us makes the 2.5-hour round trip, then we begin copying the new archive to the network.

We now feel confident (but not for the first time) that we can have a functioning system by morning, but with the loss of five days' data. It's now time to pursue the more desirable option of getting it all back.

Finally Some Good News

We're on the Internet searching for hints about how to access logical volumes from a recovery shell and hit the jackpot: A comprehensive document from the vendor with commands and explanations! Following the instructions we get to the point where we understand the possibilities and the risks. We run the documented commands and access both of the original drives! Checks of the file systems show damage only to the boot volume; the remaining volumes are intact and accessible.

We immediately back up the precious data to another disk. That process takes four painful hours, but now we have a complete backup of all the data up to the moment the computer failed. It's the wee hours of Monday morning, we're exhausted, but we have a brief celebration.

Then we realize that we may be able to read all the other volumes and restore the system to the way it was. That would be far more desirable since all the system patches, drivers, subsystems, applications, and licenses would be included. The new tar archive is now on the network appliance, but we need it on tape because our recovery shell won't access network devices. We boot up the new disk and copy the archive file from the network to tape. That takes two more hours.
Now we must make a crucial decision: If we want to restore the system to the state it was in when it failed, we need a good drive to replace the failed one. We only have one spare, and it's the only bootable drive we have; we'll need to overwrite it. Since we believe that we have at least a 50-50 chance of success, we decide to do it.

We copy an image of the original system disk to tape, errors and all, and then copy that image to the spare disk. Then we access the logical volumes from the recovery shell using "magic" utilities, running a file system check on each volume. The primary data volume has a few minor errors, but we recover it. We have a complete backup in any case. The volume that contains UNIX is not readable by the file system checker, so it's lost and will have to be rebuilt.

We create a new file system on the system volume and populate it with files from the tar archive. The device files, however, are not usable; they have lost critical attributes. How to replace them? We can't get them from the network because the network is not available from the recovery shell. The rebuilt OS is gone. We're tired and don't want to spend six hours to image the existing drives, build a new OS, copy the devices, etc. But do we have a choice?

We take a leap of faith and copy the devices from the recovery shell file system to the system volume. If this doesn't work the system won't boot, and we will have to do it the hard way. We also know there's some risk of damaging the contents of the drives, forcing us to repeat the entire copy process, but we decide it's worth the risk.

Success!

We push the button and wait, staring at the CRT to catch every boot message, trying to read every error. The system complains a lot, but it boots up! We mount the network appliance and copy the original devices from the backup. When that process finishes the system is working rather well, almost exactly like it was the day it crashed. It's 9 pm Monday, time to go home and sleep!
Looking Back

We've built a lot of systems, so we thought we understood enough about operating systems to know what problems we would have. We were wrong. This exercise was a nightmare, with every turn bringing yet another obstacle. Each one was not only difficult to overcome, but often resulted in negating progress that had been made before. Here's a brief view of issues that affected the recovery, the good, the bad, and the ugly:

The Good

The customer had an identical spare drive, all installation media, and a complete backup. Although it was a week old, at least it was complete and verified.

The Bad

The server was old and parts are not readily available. The system had no hardware or software support agreements with the manufacturer or the vendor of the custom application. The application vendor does not offer support for versions as old as the customer's. It would take at least two business days to get a compatible server, load a current version of the software, update the database, and (perhaps) get a functioning system. The server, OS, and all underlying products would have to be updated. The cost was over $15,000, not including support or rights to the application software upgrade, nor any guarantee of recovery.

The Ugly

The system was nearly unrecoverable because of the lack of documentation, lack of vendor support, and poor installation and maintenance choices.

The Moral

Ask yourself this: If I lost my computer system, how would I recover? Could I recover? Would my business survive? The seriousness of a failure is directly proportional to the business value of the system. For a business-critical system, failure is a business-critical issue.

If you have a system that is critical to your business, don't neglect it. Keep it properly maintained and under proper support. Assume it will fail completely, at the worst possible moment (year-end closing, for example). Make sure that if it fails you will have competent assistance in a timeframe you can tolerate.
Document and test your recovery procedures. Consider investing in a fully simulated system failure and restore exercise. There will be a cost, but the experience will be invaluable if the system fails, especially if the exercise uncovers flaws in the recovery process.

The Costs

For the customer in our story, the cost of the recovery was high. Most of the recovery effort took place during a weekend and at night, increasing the costs, which included:
The End
Copyright
© Friday, 21-Nov-2008 01:38:22 CST Commercial Computer Services, Inc.
sales@ccs4vms.com