Sunday, October 28, 2007

How to break things in interesting ways

Well, over the last 15 hours or so, I’ve broken stuff twice, thankfully inside a VM.

I’ve been trying to do some testing to make sure the pvmove command will work OK on my main system, so I thought I’d test it first within a VM. The first test went just fine. I thought I’d try another pvmove as well, so I did, and while it was running, I decided to put a little disk load on the system by copying the /lib/modules/kernel files. All was going sweet. Then I decided to remove those files again; however, the VM has the occasional issue with keyboard repeat, or something. Anyway, long story short, I ran rm -rf / mytestkernel modules. See the mistake? An extra space. It killed the system pretty quickly. Oh, and pvmove fails if you trash /dev and /proc while it’s running… not really surprising!
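For reference, a basic pvmove test goes something like this (the device names are just placeholders for whatever spare disks the VM happens to have, and "system" is my volume group name):

pvcreate /dev/sdc                 # prepare the new disk as a physical volume
vgextend system /dev/sdc          # add it to the volume group
pvmove /dev/sdb /dev/sdc          # migrate all extents off the old disk
vgreduce system /dev/sdb          # then drop the old disk from the group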

So, lucky it was a VM; unlucky that I didn’t have a snapshot. In the morning I built a new VM, ready to try the test again. On the new VM I updated everything and created an LVM volume group on a raid1 disk. Then I added a raid0 device and rebooted. Ouch… it did not reboot. I did get an interesting message though: “lvm locking type 1 failed”. After much playing around, I booted from a rescue CD and tried to mount up the filesystems.
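Roughly what that setup looks like, if you want to reproduce it (the md numbers and partitions are just example names, not exactly what I used):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
pvcreate /dev/md0
vgcreate system /dev/md0
# ... build the system on LVs in the "system" volume group ...
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdd1 /dev/sde1
pvcreate /dev/md1
vgextend system /dev/md1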

Anyone booting from a rescue CD should know the drill: bring up the raid, using mdadm --assemble --scan if needed, modprobe dm-mod dm-mirror dm-zero, vgs, vgchange -ay, mount /dev/system/root /mnt/root, and then:

chroot /mnt/root
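Spelled out as one sequence, it looks something like this (the volume group and mount point match my setup; adjust to taste):

mdadm --assemble --scan            # bring the raid arrays up
modprobe dm-mod
modprobe dm-mirror
modprobe dm-zero
vgs                                # check the volume groups are visible
vgchange -ay                       # activate them
mount /dev/system/root /mnt/root
chroot /mnt/root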

Now, an interesting problem I had, but found the solution to: how to run mkinitrd from inside a chroot. Every time I’d tried it before, I would mount /proc and run it, and it would use 100% CPU and go very slowly. I found out why: you need to mount /sys within the chroot as well. This can be done using mount -t sysfs /sys /sys.
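So inside the chroot, the preparation ends up looking something like this (assuming a Red Hat style mkinitrd; the image name and kernel version are placeholders for whatever is actually installed):

mount -t proc proc /proc
mount -t sysfs /sys /sys
mkinitrd -f /boot/initrd-2.6.18-8.el5.img 2.6.18-8.el5   # substitute your real kernel version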

Then mkinitrd will run. However, I am getting ahead of myself. The problem, I discovered, was that the raid0 module was not in my initrd, so when I booted, my raid0 device would not come up, which meant LVM did not have all of its physical volumes, so it would not start, and so nothing could operate properly.
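If you want to confirm that diagnosis, you can list what is actually inside the initrd; assuming it is the usual gzipped cpio archive, something like this works (the image name is a placeholder again):

zcat /boot/initrd-2.6.18-8.el5.img | cpio -it | grep raid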

The solution I used before finding this out was to just remove the newly added raid0 device, which did not have any extents allocated to it, from the volume group. When I did this, the system booted fine.
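That removal is a one-liner with vgreduce, though it pays to check first that nothing is allocated on the device (/dev/md1 is just my example name for the raid0 array):

pvdisplay /dev/md1          # Allocated PE should be 0
vgreduce system /dev/md1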

So, the end of the story was to run mkinitrd on the system after starting the raid0 device. This works because mkinitrd looks at the modules loaded in the running kernel and includes them if it finds they are needed.
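In other words, something like this, with the same placeholder caveats as above:

mdadm --assemble --scan     # start the raid0 array so the raid0 module gets loaded
lsmod | grep raid0          # check the module really is loaded
mkinitrd -f /boot/initrd-2.6.18-8.el5.img 2.6.18-8.el5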

Oh, and I have now made a snapshot, so I can go break some more stuff!