x2100 freeze
We have an x2100 that has started freezing very frequently (it stays up for a couple of hours or so). I tried pulling out a couple of memory sticks, but that made no difference. There are a few disk read errors in /var/adm/messages, but metastat and zpool report no errors. I have only a few days remaining on the standard one-year warranty. Any ideas on what would be the most likely cause of the freezes?
To give you a little history, the machine froze last Friday (2007-03-02) night, and I went to the data centre and power cycled the machine on Saturday. It froze again early this morning (2007-03-08 01:30 or thereabouts), and I power cycled the machine this morning. By the time I got back to work the machine had frozen up again. I went back to the data centre in the afternoon, tried one more power cycle and also took out a couple of memory sticks. Once again the machine had frozen up by the time I got back to work.
All thoughts appreciated.
Thanks
Rakesh
[1042 byte] By [rakeshva] at [2007-11-26 21:00:14]

# 1
I've seen this on a couple of our x2100s (we have 268 of them). The disk drive loses contact with the disk backplane, and the OS freezes because it no longer has a filesystem (we run Linux). On Linux, you'll see disk errors.
The temporary fix is to reseat the hard drive. The permanent fix is to have the disk backplane replaced under warranty. Sun engineering is working on a redesigned part to eliminate this issue.
# 2
We have 4 X2100s and have been seeing similar behaviour: machines with even moderate I/O appear to have disappearing hard disk drives. We changed the SATA I/O back panel on two of them, but that hasn't fixed the problem. Using the Sun Fire X2100 Diagnostic CD, we always get a FAILED drive read rate test on all 4 machines.
Currently working with Sun to figure this out...
# 3
We've replaced the disk backplanes on 5 of the systems, and there hasn't been a problem since.
The disk drives "disappeared" under 100% CPU load with associated high I/O rates. It almost seems like the drives work themselves away from the backplane connector just enough to lose contact.
Personally, I suspect that the tiny CPU/power supply fans which run at very high speeds under heavy load cause enough vibration to move the disks slightly.
The newer x2100M2 system has redesigned fans, and we haven't seen any problems on those (we have 6 so far).
Sun took the bad backplanes with them and sent them to Sun engineering for a diagnosis, but I haven't heard anything back from them yet.
# 4
Thanks for all the replies. After a support call to Sun, the engineer recommended that I disconnect the SATA cables from the backplane, power the system up and down, and then reconnect the cables and power it back up. So far it seems to be working. According to him, in most cases this clears up the issue.
The machine has been running since yesterday with no disk read errors/warnings yet. It is hard to say if the problem is solved, since the machine is not currently working as a production server. The temporary server I put in (an old trusty ultra60) is now backing up stuff back to the x2100, but other than that it is doing nothing. I plan on observing the machine for a week and then if stable put it back into production.
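For the week of observation, something like the rough sketch below could tally disk-related log lines per day, so a recurrence shows up without reading the whole log by eye. The function name count_disk_messages and the search keywords are made up for illustration, and the keywords are only a guess at what the warnings look like, so adjust them to whatever your own /var/adm/messages actually prints.

```python
import re
from collections import Counter

# Assumed keywords, not taken from any real log; edit to match your messages.
PATTERN = re.compile(r"read error|error for command", re.IGNORECASE)


def count_disk_messages(lines, pattern=PATTERN):
    """Count matching lines per day.

    Assumes syslog-style lines that start with a month and a day,
    e.g. 'Mar  8 01:30:12 host ...'. The key is 'Mar 8'.
    """
    per_day = Counter()
    for line in lines:
        if pattern.search(line):
            fields = line.split()
            if len(fields) >= 2:
                per_day[" ".join(fields[:2])] += 1
    return per_day
```

To use it, open the log (for example with open("/var/adm/messages", errors="replace")) and pass the file object to count_disk_messages, then print the per-day counts. A day with a nonzero count after the cable reseat would suggest the problem is not gone.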