Computer headaches – 黎建溥 / James Lick

Last night, just before I went to bed, Rick told me that he couldn’t access the net. My servers are hosted at his place, so he routes out through one of them. Unfortunately the terminal server flaked out in the middle of December: it would stay up for five minutes and then crash. Needless to say, it’s highly aggravating to have to log back in to the console every five minutes while debugging a system problem, and it makes remote administration challenging when something really goes wrong. I also have remote power management so I can power cycle a wedged server, but that doesn’t help with more serious problems that prevent the boot process from completing. And on top of all that, the server is running an old beta of Solaris 10 (build 69) that has a few too many bugs. I had wanted to upgrade to the more stable build 72, but the terminal server issue fouled that up. I power cycled the server and was able to ping it for a while, but it never fully came up. Since Rick was at work there wasn’t much more we could do at that point, so I headed to bed after making plans with Rick to debug it when he got home.

Now normally this is one of those situations that would have me up for a couple of hours running through what could have gone wrong, and possible ways to fix those scenarios, and things to try to get the terminal server working again, etc. But amazingly enough I went to sleep almost immediately after turning off the light!

So the next morning (my time) / evening (his time) Rick helps me get the terminal server up at least part way. We were able to get it to the point where it would average 30 minutes of uptime between crashes. That ain’t pretty, but it at least gives a decent amount of time to get in and twiddle with things.

So when I finally get it up and running and connect to the sick server, I reboot and get told there’s a problem with / and I need to run fsck -o f. Unfortunately it wedges before I can log in, and even when booting single user it wedges before the single user shell comes up. Normally that’s the point where you do a “boot net -s”, but it turns out the sick server is the only netinstall server set up on the network! (I need to make sure at least two minimal netinstall servers are available at any one time.) Fortunately I still had the Solaris 10 build 10 ISOs on disk, but it takes a while to get from ISO to netinstall server.

Once I did, I was able to get in, fsck repair everything, unmirror root, reboot, then remirror root and swap and get back to business. Pretty straightforward at that point. But it’s annoying that it couldn’t even get into single user without resorting to an external boot device. I suspect the disks might be what caused the filesystem corruption, because the server has occasionally had problems before. It has IBM disks, and the IBM disks made before the Hitachi buyout have some issues. They’ve never died outright, so it’s hard to point definitively to a problem. I probably should replace them with a more reliable brand considering the frequency of problems.

So at that point I decide it’s probably a good idea to get going on upgrading my two remaining build 69 servers to build 72 to get better stability. I figured that with the terminal server staying up a half hour at a time, I’d be able to get through starting the install, and the rest would be pretty much on auto-pilot, so it wouldn’t matter if the term server died. So I decide to upgrade the less critical one first, the one that is not the router/dhcp/netinstall server.

And oddly enough things went fairly smoothly and the terminal server managed not to crash at all during the upgrade. I’d like to think that the upgrade might have had something to do with it, but the terminal server was crashing before when it only had power and a connection to another Solaris 9 server, no network, no other terminal ports connected. So it’s just one of those annoying things where it doesn’t work and then it suddenly works and you have no idea what the hell caused it. As of now the terminal server has been up for 5 hours and 23 minutes which is a pretty amazing accomplishment considering it was hard pressed to do 5 minutes straight a month ago. So one upgrade down, will go for the other upgrade tomorrow.

One of the problems with upgrading from Solaris 10 build 69 to build 72 is that the dhcp-server service gets fudged up somehow and will give a dependency cycle error on boot up. The solution:

svccfg delete dhcp-server
svccfg import /var/svc/manifest/network/dhcp-server.xml
init 6

Should be just fine after that. You need to reboot, because that problem prevents other services in the multi-user-server milestone from starting.
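
For anyone wary of pasting those three commands straight into a root shell over a flaky console, here is a minimal sketch that wraps them in a dry-run mode. The DRY_RUN variable and the run and fix_dhcp_server functions are names made up for this example, not anything shipped with Solaris; only the svccfg and init commands and the manifest path come from the fix above.

```shell
#!/bin/sh
# Sketch only: re-import the dhcp-server manifest after the build 72 upgrade.
# DRY_RUN, run and fix_dhcp_server are illustrative names, not Solaris tools.
# By default nothing is executed; set DRY_RUN=0 to run the commands for real.
DRY_RUN="${DRY_RUN:-1}"
MANIFEST=/var/svc/manifest/network/dhcp-server.xml

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

fix_dhcp_server() {
    run svccfg delete dhcp-server      # drop the broken service definition
    run svccfg import "$MANIFEST"      # re-import it from the manifest
    run init 6                         # reboot so dependent services start
}

fix_dhcp_server
```

Run it once as-is to see what it would do, then again as root with DRY_RUN=0. After the reboot, svcs -x should show whether anything is still being held up.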

Hopefully when the final release of Solaris 10 comes out (soon I hope), these rough edges will be smoothed out.
