Date: July 18, 2013

Introduction

For several weeks, my system disk has been reporting errors. I am seeing things like:
smartd[396]: Device: /dev/sda [SAT], 1 Offline uncorrectable sectors
I have also seen a stronger message (which I cannot seem to find in my logs) telling me that the disk is failing.
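To dig messages like this back out of the logs, something along the following lines might help. It is only a sketch: the function name smartd_sector_warnings is made up for illustration, and the pattern covers just the "Offline uncorrectable" message shown above plus the similar "Currently unreadable (pending)" message that smartd also logs. The stronger failure message is not matched, since its exact wording is not known here.

```shell
# Sketch: print smartd lines that report bad sectors, reading a log on stdin.
# The function name is hypothetical; the two message patterns are the ones
# smartd uses for offline-uncorrectable and pending sectors.
smartd_sector_warnings() {
    grep -E 'smartd\[[0-9]+\]: Device: [^ ]+ .*(Offline uncorrectable|Currently unreadable \(pending\)) sectors'
}

# Typical use (as root):
#   smartd_sector_warnings < /var/log/messages
#   smartctl -H /dev/sda     # overall health self-assessment
#   smartctl -A /dev/sda     # attribute table, including Offline_Uncorrectable
```

The smartctl calls in the comments ask the drive directly, which is a useful cross-check on whatever the logs say.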

Naturally this makes me nervous. This is a 1.5T Western Digital "black" SATA drive. I purchased a 2.0T Seagate SATA drive to replace it. Note that the Seagate drive has a 2 year warranty, whereas the Western Digital has a 5 year warranty. We will see how warranty replacement goes with the Western Digital.

Copying the system

I use fdisk to create four primary partitions on the new disk: /dev/sdb1 for /boot, /dev/sdb2 for swap, /dev/sdb3 for the root filesystem, and /dev/sdb4 for /u1. I change the partition type of the swap partition to 0x82, then do:
mkfs -t ext4 /dev/sdb1
mkswap /dev/sdb2
mkfs -t ext4 /dev/sdb3
mkfs -t ext4 /dev/sdb4
Then I copy over my system via:
mkdir /x ; mount /dev/sdb1 /x
cd /boot; find ./ -xdev -print0 | cpio -pa0V /x
umount /x ; mount /dev/sdb3 /x
cd /; find ./ -xdev -print0 | cpio -pa0V /x
umount /x ; mount /dev/sdb4 /x
cd /u1; find ./ -xdev -print0 | cpio -pa0V /x
umount /x
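The three mount, copy, umount rounds above follow one pattern, so they could be wrapped in a small function. This is a sketch and not what I actually ran: copy_fs and run are hypothetical helper names, and the DRY_RUN switch is an addition that prints each step instead of executing it, which seems prudent when the target is a freshly formatted disk.

```shell
# Sketch: the mount / copy / umount sequence as one reusable function.
# copy_fs and run are hypothetical helper names, not standard tools.
# Set DRY_RUN=1 to print each step instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# copy_fs SRC_DIR DEVICE [MOUNTPOINT]
copy_fs() {
    src=$1
    dev=$2
    mnt=${3:-/x}
    run mkdir -p "$mnt" &&
    run mount "$dev" "$mnt" &&
    run sh -c 'cd "$1" && find ./ -xdev -print0 | cpio -pa0V "$2"' copy "$src" "$mnt" &&
    run umount "$mnt"
}

# The three copies from the text would then be (as root):
#   copy_fs /boot /dev/sdb1
#   copy_fs /     /dev/sdb3
#   copy_fs /u1   /dev/sdb4
```

The -xdev option to find is what keeps each copy confined to a single filesystem, so copying / does not drag /boot or /u1 along with it.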
Then some time later I do:
mount /dev/sdb4 /x
rsync -av /u1/ /x
After this I edit fstab and let the /u1 filesystem on the new drive get mounted as the active user area while I sort out how to get this disk set up to boot.

Getting the new disk ready to boot

There are really only two things that should need to be tweaked before I can move cables around and boot from this new disk, namely fstab and grub.

I have abandoned using UUID specification in the /etc/fstab file and just spell out /dev/sdb4 or whatever, which I find much more user friendly by a long shot. Since I decided not to use an extended partition when setting up my new disk, I must now change /dev/sda5 to /dev/sda4 in one place, which is easy.
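The fstab change is a one-line edit in any editor, but for the record it could also be scripted. The following is a sketch under assumptions: retarget_fstab is a hypothetical helper name, and the example path /x/etc/fstab assumes the new root partition is mounted on /x.

```shell
# Sketch: rewrite one device name in an fstab-style file, printing the result.
# retarget_fstab FILE OLD_DEV NEW_DEV   (hypothetical helper name)
# Only lines whose first field is exactly OLD_DEV are changed.
retarget_fstab() {
    awk -v old="$2" -v new="$3" '$1 == old { sub(/^[^ \t]+/, new) } { print }' "$1"
}

# Possible use, assuming the new root partition is mounted on /x:
#   retarget_fstab /x/etc/fstab /dev/sda5 /dev/sda4 > /tmp/fstab.new
```

Writing to a separate file and comparing it with the original before copying it into place avoids ending up with an unbootable fstab because of a typo.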

Grub

There are two aspects to getting grub set up. The first is getting grub itself into the MBR of the new disk. I do this by:
mount /dev/sdb3 /x
mount /dev/sdb1 /x/boot
grub2-install --root-directory=/x /dev/sdb
The second part is rewriting a bunch of UUID entries in the file /boot/grub2/grub.cfg. This file contains stern warnings not to edit it by hand, but it works for me.

Probably after I get up and running on the new disk, I should do:

grub2-mkconfig -o /boot/grub2/grub.cfg

To figure out what UUID values correspond to partitions of the old and new disk, you should look at:

ls -l /dev/disk/by-uuid/
The hand editing of grub.cfg is somewhat of a pain because you have to copy and paste UUIDs that you get by inspecting /dev/disk/by-uuid. I only change 3 entries. I find the boot entry for the current kernel, and ignore all the others. I also don't need to change the UUID in the label line. I do need to change the UUID for the /boot partition in two places, and the root in one.

I wrote a ruby script to automate this, for all the good it did me. It reads from /boot/grub2/grub.cfg and writes the modified file to stdout.
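For illustration, here is a rough shell equivalent of the idea. It is not the ruby script mentioned above, only a sketch: swap_uuid is a hypothetical helper name, and it differs from the hand edit in one important way, namely that it replaces every occurrence of the old UUID rather than just the three in the current kernel's entry.

```shell
# Sketch: replace one UUID with another in text read from stdin.
# swap_uuid OLD_UUID NEW_UUID   (hypothetical helper name)
# Unlike the selective edit described above, this replaces EVERY occurrence.
swap_uuid() {
    sed "s/$1/$2/g"
}

# Possible use (as root; blkid prints the UUID of a partition):
#   old_boot=$(blkid -s UUID -o value /dev/sda1)
#   new_boot=$(blkid -s UUID -o value /dev/sdb1)
#   swap_uuid "$old_boot" "$new_boot" < /boot/grub2/grub.cfg > /tmp/grub.cfg.new
```

Using blkid to fetch the UUIDs avoids the copy and paste from the /dev/disk/by-uuid listing. The same function can be chained in a pipeline a second time for the root partition.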

Things get weird

The problem is that when I try to boot the top grub entry, it starts, but after the line
[ OK ] Reached target Basic System
it just stops dead. I tried some of the other entries in the grub menu, and (lo and behold) my old Fedora 18 kernel boots up fine! I have no clue what is going on, but I figure this is a good chance to run the command:
grub2-mkconfig -o /boot/grub2/grub.cfg
This runs fine (and tells me about various kernels it is finding), but when I reboot, the grub menu is entirely different and only has 2 entries (despite all the messages about lots of different kernels). The top entry simply says "Fedora". The second entry says "Advanced options for Fedora", and (I never would have suspected without examining the grub.cfg file), if you select it, you get a menu with all the assorted new and old kernels.
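Rather than rebooting to see what menu grub2-mkconfig produced, the entry titles can be pulled straight out of grub.cfg. This is a sketch: list_grub_entries is a hypothetical helper name, and it assumes the titles are enclosed in single quotes, which is how the generated file writes them.

```shell
# Sketch: list the titles of all menuentry and submenu lines in a grub.cfg,
# including entries nested inside a submenu.
# list_grub_entries [FILE]   (hypothetical helper name)
list_grub_entries() {
    awk -F"'" '/^[ \t]*(menuentry|submenu) / { print $2 }' "${1:-/boot/grub2/grub.cfg}"
}

# Possible use:
#   list_grub_entries
#   list_grub_entries /x/boot/grub2/grub.cfg
```

The output is flat, so the kernels hidden under "Advanced options for Fedora" show up in the same list as the top-level entries.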

When I try to boot now with just the new drive cabled up, we stop at the point described above.

And now things get even weirder. I connect the cables back to my original disk and boot the system. I get the same truncated menu (as if it is booting from /dev/sdb), and I get the root filesystem from /dev/sdb3 (but /boot is mounted from /dev/sda1). I use fdisk to verify which disk is which, and /dev/sda is the old 1.5T disk and /dev/sdb is the new 2T disk, just as things should be. I then edit /etc/fstab so that /boot is mounted from /dev/sdb1, and reboot. Voila! I am running entirely on my new disk!
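With two nearly identical disks attached, it helps to ask the kernel directly which device ended up on which mount point. This is a sketch under assumptions: mount_source is a hypothetical helper name, and it reads /proc/mounts unless another file is given.

```shell
# Sketch: report which device is mounted on a given mount point.
# mount_source MOUNTPOINT [MOUNTS_FILE]   (hypothetical helper name)
# Reads /proc/mounts by default; if a mount point is listed more than once,
# the last entry wins, since later mounts cover earlier ones.
mount_source() {
    awk -v mp="$1" '$2 == mp { src = $1 } END { if (src != "") print src }' "${2:-/proc/mounts}"
}

# Possible use:
#   mount_source /
#   mount_source /boot
```

In the mixed state described above, this is the kind of check that reveals the root filesystem coming from one disk and /boot from the other.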

I now have lots of questions. Foremost among them is why the system is now booting from /dev/sdb instead of /dev/sda, if that is in fact what is going on.

But I can run the system on just my new disk (but I can only run an old F18 kernel on my F19 system). Maybe we can leverage that to get things right. Doing this doesn't yield any improvement:

grub2-install --root-directory=/x /dev/sdb

Things get fixed

Next we try (while running my old F18 kernel):
yum erase kernel-3.9.9
yum install kernel-3.9.9
And this does the trick. The system now boots with only the new drive in the system and boots the latest kernel without any human intervention. The only oddity is that the Grub menu now shows only one entry:
"Advanced options for Fedora"
Which is in fact a menu with many entries hidden inside it. Happily, the top entry is the latest kernel, and if I do nothing it gets booted by default. There are lots of unsolved mysteries, but we finally got where we wanted. Note though that if I hadn't left an old F18 kernel lying around in /boot, I would have faced some kind of trickier challenge. No, I don't understand what is going on.
Have any comments? Questions? Drop me a line!

Adventures in Computing / tom@mmto.org