Sometimes it’s best NOT to follow directions…

My main Solaris server in Taipei, which runs Solaris 10 6/06, had a problem today. It hung, and when I rebooted it came up with this error:

WARNING – The following files in / differ from the boot archive:
cannot find: /etc/devices/mdi_ib_cache: No such file or directory
/kernel/drv/md.conf
The recommended action is to reboot and select “Solaris failsafe”
option from the boot menu. Then follow prompts to update the
boot archive.

I rebooted into failsafe mode and blindly followed the directions given:

/dev/dsk/c0d0s0 is under md control, skipping.
To manually recover the boot archive on a root mirror, mount the first
side (the one that the system boots from) and run:

bootadm update-archive -R <mount_point>

In summary, I did the following:

# mount /dev/dsk/c0d0s0 /mnt
# bootadm update-archive -R /mnt
Creating ram disk on /mnt
updating /mnt/platform/i86pc/boot_archive…this may take a minute
# reboot

WRONG! WRONG! WRONG!

This updated the boot archive on only one side of the mirror and left the other side untouched, so the two sides were now out of sync. The configuration, however, was still set to mount root from the mirrored device, which quickly gets you to this point on reboot:

NOTICE: /: unexpected free inode 9864, run fsck(1M)
The / file system (/dev/md/rdsk/d1) is being checked.

WARNING – Unable to repair the / filesystem. Run fsck
manually (fsck -F ufs /dev/md/rdsk/d1).

Yes, you really only need to update the boot archive on the first device in the mirror as that’s the one the system boots from. However, once it’s bootstrapped the kernel, it’s going to mount the full mirror, and the full mirror now has different contents on each side. Depending on what’s read from which side of the mirror, you’re likely to end up with some inconsistency detected.

The correct way to do things would be:

# mount /dev/dsk/c0d0s0 /mnt
# vi /mnt/etc/vfstab
(Change root mount device to the unmirrored device /dev/dsk/c0d0s0.)
# bootadm update-archive -R /mnt
Creating ram disk on /mnt
updating /mnt/platform/i86pc/boot_archive…this may take a minute
# reboot

You would then need to rebuild the mirrored root after you get the system back up.
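The vfstab edit in that procedure can also be scripted instead of done by hand in vi. What follows is only a minimal sketch, not something I ran during this incident. The function name unmirror_root is made up for illustration, and it assumes the root entry is the non-comment line whose mount point (third field) is /, with the block device in the first field and the raw device in the second.

```shell
# Hypothetical helper: point the root entry of a vfstab at a plain slice
# instead of the mirror metadevice. A copy of the file is kept as <vfstab>.orig.
# Usage (assumed): unmirror_root /mnt/etc/vfstab c0d0s0
unmirror_root() {
    vfstab=$1
    slice=$2
    cp "$vfstab" "$vfstab.orig" || return 1
    # The slice name is passed as an operand assignment (s=...) rather than
    # with -v, which older awk implementations may not accept.
    awk '
        $1 !~ /^#/ && $3 == "/" {
            $1 = "/dev/dsk/" s     # device to mount
            $2 = "/dev/rdsk/" s    # device to fsck
        }
        { print }
    ' s="$slice" "$vfstab.orig" > "$vfstab"
}
```

In failsafe mode this would be called as unmirror_root /mnt/etc/vfstab c0d0s0 before running bootadm. If /mnt/etc/system also carries a rootdev line for the metadevice (metaroot normally adds one), that presumably needs attention too; the sketch leaves it alone.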

However, that is not the end of the story. It turns out that all this boot archive rebuilding won’t fix this particular problem. This error message is normally generated when the files in the boot archive don’t match those on the actual root filesystem. It is your heads-up that the state during bootstrap differs from what’s on disk, and rebuilding the boot archive is supposed to bring the two back into agreement.

However, in this case the file /etc/devices/mdi_ib_cache is missing on the actual root filesystem. So the error message is actually wrong. If you rebuild the boot archive it’ll fail to add the file, because it doesn’t exist. And the next time you boot it’ll give you the same error again. The error is that a file is missing on the actual root filesystem, not that the boot archive doesn’t match the root filesystem.
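The distinction between "missing" and "different" is visible in the warning text itself, so it can be picked out mechanically. Here is a rough sketch under a few assumptions: the function name classify_archive_warning is invented, it expects the warning text on standard input in the format quoted at the top of this post, it treats any bare absolute path as a file that differs, and it needs a POSIX-style shell (ksh or bash) rather than a legacy Bourne shell.

```shell
# Hypothetical helper: read the boot-archive warning on stdin and label each
# listed file as "missing" (not on the root filesystem) or "changed".
classify_archive_warning() {
    while IFS= read -r line; do
        case $line in
            "cannot find: "*)
                # Drop the prefix, then everything from the next colon on
                # (the ": No such file or directory" part).
                path=${line#"cannot find: "}
                path=${path%%:*}
                echo "missing $path"
                ;;
            /*)
                # Assumption: a bare path means the file differs from the archive.
                echo "changed $line"
                ;;
        esac
    done
}
```

Fed the warning from this incident, it should report /etc/devices/mdi_ib_cache as missing and /kernel/drv/md.conf as changed. Only the second kind is something a boot archive rebuild can fix.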

And it turns out this file is completely unimportant. If it’s screwed up or missing, the system will replace it automatically and merrily go on its way. In other words, it’s absolutely no big deal if it’s missing on reboot.

The original error message I saw also had this advice:

To continue booting at your own risk, clear the service:
# svcadm clear system/boot-archive

This ‘at your own risk’ option actually turns out in this case to be the correct remedy for this problem.
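Put together, the remedy is two separate steps, one at the maintenance prompt and one after the system is back up. The sketch below is illustrative only: the run wrapper, the DRY_RUN variable and both function names are my own inventions, and by default nothing is executed, the commands are just echoed. The follow-up rebuild with a plain bootadm update-archive is the approach suggested in the comments below, not something the original error message recommends.

```shell
# Hypothetical wrapper: echo the command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Step 1, at the maintenance prompt: clear the service so the boot continues.
clear_boot_archive_service() {
    run svcadm clear system/boot-archive
}

# Step 2, once the system is up on the full mirror: bring the archive back
# in sync, then reboot at a convenient time.
resync_boot_archive() {
    run bootadm update-archive
}
```

Because step 2 runs against the mounted mirror rather than a single slice, both sides of the mirror receive the same writes.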

So to summarize:

1. The original error message misstates the problem as files being different instead of one being missing.
2. The recommended fix does not solve the problem.
3. The instructions for the recommended fix give specific advice for mirrored filesystems that will damage your filesystem and cost you a lot of time undoing the damage.
4. The missing file is actually completely unimportant.
5. The ‘at your own risk’ option is the correct way to solve the problem.

It looks like everything but point 3 is covered by BugID 6256649, but the public description is not useful. I didn’t find a bug report covering the incorrect instructions for rebuilding the boot archive on mirrored filesystems.

I also don’t know why the thing hung in the first place. Nothing in /var/adm/messages.

13 thoughts on “Sometimes it’s best NOT to follow directions…”

miltownkid says:

2006-09-18 at 08:08

I’m happy to say that I’m nerd enough to follow all of that

kaifu says:

2006-09-18 at 19:10

my head hurt from reading all that.

so, i’m just gonna ask if you could install the latest perl mods Image::Info and Image::ExifTool globally for makeslide.

jlick says:

2006-09-18 at 23:09

Oh boy. You do realize that upgrading those two modules probably entails upgrading a couple dozen other interdependent modules. Oh how I love perl dependency hell.

jlick says:

2006-09-19 at 21:00

OK, Image::Info and Image::ExifTool have been upgraded.

kaifu says:

2006-10-01 at 07:11

wow, this thing remembers my name. i’m so easily impressed these days.

thanks for the modules. hopefully you didn’t have to upgrade to perl 6 for that.

Chris Kern says:

2006-10-07 at 03:25

Thank you for posting this. There was some sort of power hiccup in my office last night and I arrived at work to discover that my Dell workstation’s Sol 10 06/06 reboot had hung with the diagnostic error message and bogus instructions you quote at the top of this post. Found your write-up with a Google search, and certainly saved myself some time and probably some trouble I would have experienced if I had followed the instructions on the screen.

jlick says:

2006-10-07 at 15:43

Glad to help. Part of the reason I posted it was because I only found a couple of vague references to this error when I searched. Hopefully this will save other people the pain I went through.

igor says:

2007-02-14 at 22:49

Many thanks. It helped! Greetings from Switzerland, Igor

Razvan says:

2007-04-27 at 01:31

Hi James!
After stumbling upon the same error and messages, I went through exactly the same discovery process as the one you described.
I also assumed that Solaris engineers knew better than I did what needs to be done, so I booted in failsafe mode and updated the boot archive after mounting the slice corresponding to the first device in the mirror.
I ended up also having to fsck the device after booting – but then somehow I got to a point where GRUB was panicking with a weird message: “panic: cannot open kernel/amd64/unix” – although the file was obviously there.

What eventually fixed the problem was to manually edit the /etc/vfstab on the first slice of the mirror after booting in failsafe mode, to mount the root from the slice itself and not the mirror device. But then I had to rebuild all the mirror devices that I had on the system, which took another reboot and a few hours to complete.

I’m guessing that you might avoid this ordeal if you mount the mirror device instead (i.e. /dev/dsk/md/???) after booting in failsafe mode, but the problem is that the failsafe mode boot archive doesn’t support the md devices (i.e. if you say metadb, it will complain about a missing device). So you would need to change the boot archive for failsafe mode.

Thanks for your post. I hope this will help people having the same problem – and I hope Sun will correct the problem in the next release.

jlick says:

2007-04-27 at 12:01

Yeah, an fsck can’t really fix the problem since the two sides of the mirror are out of sync, and fsck will only look at whichever blocks were randomly selected by the volume manager. The volume manager foolishly assumes that the two sides of the mirror will always be the same, which isn’t the case if you’ve written to only one side of the mirror. If they ever get zfs-root working that’ll probably avoid a lot of problems since the volume manager, file system and raid function are unified in one stack.

I still think the best solution is to clear the boot-archive service and see if the system boots. If it does, rebuild the boot archive with bootadm then and reboot. I’d be willing to bet that most of the time the boot archive being out of sync won’t be serious enough for the boot to fail, and that method should only need a few minutes to remedy if so.

CIT says:

2009-01-09 at 01:03

you were right…..don’t follow the instructions…
just clear the boot-archive service (# svcadm clear system/boot-archive)

that solved it for me and I had this error message instead:
warning the following files differ from the boot archive changed /kernel/drv/amd64/zfs

Sunny Chen says:

2009-07-07 at 08:01

Hello James,

Thank you very much for your posting. It just helped to solve our server boot-up issue today (actually my colleague found your blog and followed your suggestion to boot up the server).

Thank you very much.

jlick says:

2009-07-08 at 00:58

I’m surprised this is still a problem!
