A Linux crash, burn and recover experience…

One of my trusty RHEL servers, which had been up (virtually continuously) for 4+ years and had served thousands of users over the web, recently crashed. As the data on the server was critical, we had to undertake a salvage operation before re-installing. This post describes my experience of recovering the server’s data, along with some of the lessons learned and the new strategies we evolved for configuring more resilient servers in future…

Since 2002, I have been volunteering some of my free time to help out the local Red Cross unit in Singapore. The story begins somewhere in 2004, when we decided to procure a server for running some of the internal IT services used for blood donor recruitment, recognition, retention and recall. Over time, the server was loaded with more and more services, and it began to look as though a crash could mean a lot of downtime.

It is not that we had no backup. Within the IT budget that was available, I had configured software RAID level 1 (mirroring) for two of the server’s critical partitions, /home and /var. That was all we had until last year, when we began taking regular backups of important configuration files and storing them remotely, in line with business continuity best practices. Late last year, I decided to take local full backups as well.

The first sign of trouble seems to have appeared last year. As part of the backup process, we wanted to use an external USB hard disk. However, the server did not create the corresponding /dev/ entries for the disk, although dmesg showed the USB device being recognized. With no /dev/ entry, the hard disk could not be mounted. I concluded at the time that the cause could be a mismatch between the old USB 1.x port and the new USB 2 hard disk. I wasn’t very convinced by that hypothesis, but I didn’t have any alternative theories either. Now, in hindsight, I know the hypothesis was wrong.

One fine day, the system console stopped responding normally and there were huge delays between keystrokes and their echo. I suspected a runaway process with a potential memory leak, so I just hit reboot. On the POST screen, the BIOS reported that one of the hard disks was predicting a future failure (a SMART feature of newer-generation hard disks). Following this, the kernel refused to boot and dropped me into the maintenance prompt. After fsck had run for well over an hour, with hundreds of orphaned inodes being added to lost+found, the server finally started booting…

During startup, two daemons, httpd and mysqld, did not come up owing to some missing files. These two services are key to all our web apps, and thus the server came to a standstill. On top of this, the server stopped at the console login and did not start X, although runlevel reported that the server was in runlevel 5. I decided it was not worth continuing to use this server: what was broken could not even be listed in full, let alone fixed.

I didn’t even want to reboot, since my intuition was that the server might never come up again. With the local external hard disk not recognized, the only option was a remote backup. Luckily, ssh was up and I was able to connect to the server from outside. However, I soon found out that scp was neither in the PATH nor anywhere on the filesystem. I ran which sftp and it was found. With a sigh of relief, I started to connect from outside. The relief was short-lived: although the control connection could be established, the data connection was never created. There was no firewall and no other network blockage. I thought the missing scp could be the cause. It therefore seemed that the only option was to dismantle the server and recover the data.

Luckily, one of my friends (Kumar c/o thoughtsware) accidentally discovered that although the pull technique was not working, a push technique, using ftp from the server after ssh-ing in, was working. Armed with this, I scrambled to set up an FTP server as the receiving end. I do mean set up from scratch, as I don’t usually use FTP because it is insecure. I started to move data, one file at a time, from the server to this FTP site. The data was flowing, but at a very slow rate. After a few days of data transfer, I decided it was time for surgical data recovery on site. Another good friend and mentor, Harish from Red Hat, offered to help out with the recovery.

First, I wanted to test whether a flash drive was readable on the server. To my surprise, the crippled server recognised the FAT32-formatted drive and created the necessary /dev/ entries. At this stage, I concluded that there was nothing wrong with the server hardware and that something in the software had gone wrong when I earlier tried an ext3-formatted disk. I started copying the data onto the thumb drive, but very soon the transfer halted, with the process becoming non-responsive as well. We guessed the process may have hit a bad sector. We decided it was time to shut down the broken OS and try our last alternative.

Our plan was to boot the server using a live CD, Fedora 10 in this case, then mount an external hard disk and the internal hard disks, dd out the partitions, and attempt a system recovery using the binary image files. The only unknown was that the partitions were software RAID members, and we had not done a recovery on this kind of partition before.

We booted the server using the Fedora 10 live CD and successfully mounted the internal hard disks. Next, when we plugged the same external hard disk into the USB port, Fedora mounted it without any fuss. At this point, I inferred that the earlier attempts to mount it may have failed because the root filesystem had developed bad sectors at exactly the spot where the /dev/ entries are created.

Without any delay, we used dd to copy all 4 raw partitions (2 raw partitions per RAID 1 array) to the external hard disk. Our attempts to copy out the boot and root partitions did not succeed, as dd complained of missing superblock information. After copying the RAID partitions out, we decided to try our luck at booting the server back into the original, now broken OS and copying out configuration files by mounting a thumb drive as before. However, on reboot the server didn’t get anywhere beyond the grub prompt. The server was totalled.
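The imaging step can be sketched in code. We used dd itself; the Python below is only an illustration of the same idea (read the raw partition block by block and write it to an image file), with one addition that dd also offers through conv=noerror,sync: an unreadable block is replaced with zeros so that the copy carries on and later offsets stay aligned. The function name image_device, the block size and the paths in the usage comment are all hypothetical.

```python
import os


def image_device(src_path, dst_path, block_size=64 * 1024):
    """Copy src_path to dst_path block by block.

    Blocks that raise a read error are written as zeros so the image keeps
    the same length and layout as the source.
    Returns (bytes_written, list_of_bad_block_offsets).
    """
    bad_offsets = []
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file or block device.
        size = os.lseek(src_fd, 0, os.SEEK_END)
        offset = 0
        with open(dst_path, "wb") as dst:
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    chunk = os.pread(src_fd, want, offset)
                except OSError:
                    # Unreadable block: pad with zeros and remember where.
                    chunk = b"\x00" * want
                    bad_offsets.append(offset)
                if not chunk:
                    break  # source ended earlier than its reported size
                dst.write(chunk)
                offset += len(chunk)
    finally:
        os.close(src_fd)
    return offset, bad_offsets


# Hypothetical usage (device and target paths are made up for illustration):
#   written, bad = image_device("/dev/sda5", "/mnt/usb/home_a.img")
```

Keeping a list of the bad offsets makes it possible to tell afterwards which parts of an image are padding rather than real data.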

Armed with the partition image files created by dd, I connected the hard disk containing them to another Fedora 10 machine. I then used losetup to attach the image files as loopback devices, used mdadm to combine the component partitions of each mirror into an md device (md0, for example), and mounted that md device under a directory, where I was able to see the files. Finally, I performed a recursive copy of the data in these mounted partitions into a clean ext3 partition for restoring. In the end, we could not recover a few configuration files, such as those for Apache and networking. However, the majority of the data was recovered.
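For anyone who wants to repeat the losetup, mdadm and mount sequence, here is a rough sketch of it as a small Python helper that only builds and prints the commands, so it can be read and checked before anything is run as root. It is not a transcript of what I typed: the image file names, the loop device numbers (which assume /dev/loop0 and /dev/loop1 are free), the mount point and the read-only options are all assumptions added for illustration.

```python
import shlex


def plan_raid1_recovery(images, md_device="/dev/md0", mount_point="/mnt/recovered"):
    """Return the commands that attach RAID 1 member images and mount the array.

    Nothing is executed here; each command is a list of arguments that could
    be passed to subprocess.run() by someone running as root.
    """
    loops = [f"/dev/loop{i}" for i in range(len(images))]
    commands = []
    for loop, image in zip(loops, images):
        # Attach each member image to a loop device, read-only to protect it.
        commands.append(["losetup", "--read-only", loop, image])
    # Assemble the mirror from the loop devices; --run starts it even if
    # mdadm considers the array incomplete.
    commands.append(["mdadm", "--assemble", "--run", "--readonly", md_device, *loops])
    # Mount the assembled array read-only so the files can be copied out.
    commands.append(["mount", "-o", "ro", md_device, mount_point])
    return commands


# Print the plan for the two halves of one mirror (file names are made up).
for command in plan_raid1_recovery(["home_a.img", "home_b.img"]):
    print(shlex.join(command))
```

Everything is kept read-only here as a precaution. If the filesystem’s journal still needs replaying, a read-only mount on top of a read-only array may be refused; working read-write on a spare copy of the images is one way around that.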

I am now waiting for the hardware vendor to replace the parts, and hopefully the server will be online in a few days.

Based on this experience, and on new knowledge and technologies, our new server will be designed to be more reliable than the current one and, in case of another crash, easier to recover as well. Our idea is to use virtualisation and run our current server in a VM. This way, we can back up the entire server easily and also restore from backups more easily. Secondly, we can enforce security policies on both the host OS and the guest OS while monitoring all the network traffic flowing to the VM at the host level. The host will be responsible for providing an abstracted and redundant hardware layer to the VM, taking care of the RAID and other hardware-specific features.

Feel free to comment on our recovery experience and the new server design strategy as well :)

Category: Computer Science

3 Responses to “A Linux crash, burn and recover experience…”

  1. Yokin Hudson

    Reading your story, I can imagine how you felt when you lost such critical data on the server. I have also gone through such a drastic situation at university level, where we keep records of all the students and staff. But just like you, I found a solution in file recovery. Thank God that it solved all my trouble. It recovered all the lost data in the original format.
    It’s really nice to share this experience with you.

  2. S P T Krishnan

    Hi Yokin Hudson,

    Thanks for sharing the link to the Stellar Phoenix software. I also used the same software successfully several years ago to recover data from a Windows partition.

    cheers,
    –Kris

  3. Virtualisation and missing RAM « S. P. T. Krishnan’s take on things…

    [...] 10, 2009 by S P T Krishnan This is part 2 to my earlier post “A Linux crash, burn and recover experience“. I had mentioned in there that I plan to use a Virtual Machine (VM) strategy for easy backup [...]

