PS0 fault -> power off. What happened?

Hi,

The server unexpectedly switched off yesterday morning.

The only log line I found, written by picld, said:

WARNING: Device PS0 failure detected by sensor PS0_FAULT_SENSOR.

The system is a SunFire 280R and has two power supplies. When I reached it, some 18 hours later, the general fault amber LED was lit and everything else was off. I turned the system on without errors, apart from some fsck fixes.

Everything in the prtpicl output had an OK status after boot.

What is responsible for shutting down the system that way?

Should I disable it (how?) or is there a chance that it did the Right Thing?

PS0 and PS1 are connected to two different UPSes, neither of which reported a glitch. The CPU temperature wasn't very good: 80 °C (176 °F) during the night and perhaps a bit higher by the time the system crashed. However, I found no logs about temperature either. The system only has one CPU, and it had been running undisturbed for the previous 13 months. Is there anything I can do to ensure this kind of crash won't happen again?

Please help

[1124 byte] By [vesely99a] at [2007-11-27 0:28:44]
# 1
What revision is your picld patch? (110460) I assume for Sol 8.
bmacdoa at 2007-7-11 22:30:24 > top of Java-index,Sun Hardware,Servers - General Discussion...
# 2
Yes, Sol8. I have 110460-30. My /usr/lib/picl/picld is dated 12 May 2004 and comes from 117005-01.
vesely99a at 2007-7-11 22:30:24
# 3

I still have no clue about what happened.

I understand picld is a monitoring daemon, and it wrote a warning. Hence I don't think it shut the server off.

Was it the hardware or some software very close to the iron?

The documentation about the general fault is very vague... What does it take to get reasonable behavior?

vesely99a at 2007-7-11 22:30:24
# 4
Version 1 of the PSUs for E220 and E420 servers was not redundant. This was PN 300-1449-01. I don't know if the 280R had the same part.
wsandersa at 2007-7-11 22:30:24
# 5

I'm not sure about the PN, but it has two power supplies in the front panel. They are both running OK. I don't recall if I had to explicitly ask for redundant power supplies, but that feature is very convenient; otherwise one would have to turn the system off to service the UPS...

I've found a paragraph in the owner's manual that I didn't pay much attention to until now. It says: "An individual power supply will shut down itself at an internal temperature of approximately 90 °C (194 °F), depending on the ambient temperature, system loading, and the availability of a redundant power supply."

What does that mean? I found no PS temperature sensor on the I2C monitoring. However, those fuzzy dependencies seem to imply some sort of software control. Don't they?

vesely99a at 2007-7-11 22:30:24
# 6

It happened again today!

This time I was here, in the same room. The noise dropped suddenly, but not by much, so I didn't pay attention. After some 20 seconds the noise almost vanished, and I realized that the SunFire 280R had switched off. There are a few machines in the room. The 280R is by far the noisiest, with its three fans. Does it make sense to hypothesize that the first noise drop I heard was one of the fans stopping?

The room temperature was a little bit high, 29 °C (84 °F), but I was comfortable in a T-shirt and shorts. All the other equipment continued working fine.

The amber fault LEDs were lit on the external panel and on both power supplies. The interior of the machine was reasonably clean (I hoovered it two months ago). Again, the single log line I got after booting it is:

Jun 20 13:16:35 picld[102]: [ID 961923 daemon.error] WARNING: Device PS1 failure detected by sensor PS1_FAULT_SENSOR

AFAIK, power supplies make no perceptible noise, do they?

I lowered the temperature in the room. Does that make the server any more reliable?
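
Since the two incidents logged different supplies (PS0 the first time, PS1 the second), it may help to pull every picld failure warning out of the system log and compare them. The following is only a minimal Python sketch: the pattern is modelled on the two lines quoted in this thread, the function name `find_psu_faults` is made up for illustration, and other picld message formats are not covered.

```python
import re

# Pattern modelled on the picld lines quoted in this thread (assumption:
# other picld messages may be worded differently and will not match).
PICLD_FAULT = re.compile(
    r"WARNING: Device (?P<device>\S+) failure detected by sensor (?P<sensor>\S+?)\.?$"
)

def find_psu_faults(lines):
    """Return a (device, sensor) pair for each picld failure warning found."""
    faults = []
    for line in lines:
        match = PICLD_FAULT.search(line.strip())
        if match:
            faults.append((match.group("device"), match.group("sensor")))
    return faults

# The two warnings quoted in this thread, used as sample input.
sample = [
    "Jun 20 13:16:35 picld[102]: [ID 961923 daemon.error] WARNING: Device PS1 "
    "failure detected by sensor PS1_FAULT_SENSOR",
    "WARNING: Device PS0 failure detected by sensor PS0_FAULT_SENSOR.",
]

for device, sensor in find_psu_faults(sample):
    print(device, sensor)
```

The same function could be fed the lines of the system log file instead of the sample list; where that file lives on a given installation is left as an assumption.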

vesely99a at 2007-7-11 22:30:24
# 7

Hmm.. Looks like you are really close to the shutdown thresholds for both the power supplies and the CPU with regard to temperature. What does /usr/platform/sun4u/sbin/prtdiag -v show for temperatures? You can also use the RSC card if you've configured it. If so, connect to the RSC card and run the showenvironment command. It will give you information about temperatures and fault lights.

According to the documentation (which you seem to have found), the server can operate on one power supply if the other has a failure of some kind. A couple of scenarios that might be happening:

1. CPU temperature threshold reached causing the shutdown.

2. Power supply threshold reached on both supplies causing the power off.

3. One power supply is already bad and when the "good" power supply reaches a temperature threshold and shuts down, the server powers off since there is no good power supply running.

According to the 280R documentation, power supplies have a shutdown threshold of 90 degrees C and the CPUs start putting out warnings at 75 degrees C. With your ambient temperature being about 84 degrees F, that could very well heat the internals of the box up enough to cause a thermal shutdown... especially if the airflow isn't good.

Docs referenced: http://sunsolve.sun.com/data/806/806-4806/pdf/806-4806-10.pdf
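
To put the numbers from this thread side by side, here is a small Python sketch that compares the reported readings against the thresholds quoted above. It is an illustration only: the constants are the approximate figures cited from the documentation, not values read from the hardware, and the helper names are made up.

```python
# Thresholds as quoted in this thread from the 280R documentation.
# They are approximate and are NOT read from the hardware.
PSU_SHUTDOWN_C = 90.0   # power supply internal shutdown temperature
CPU_WARNING_C = 75.0    # CPU temperature at which warnings start

def f_to_c(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

def headroom(reading_c, threshold_c):
    """Degrees C left before the threshold; negative means it is exceeded."""
    return threshold_c - reading_c

cpu_c = 80.0             # CPU reading reported earlier in the thread
ambient_c = f_to_c(84)   # room temperature reported earlier in the thread

print(f"ambient: {ambient_c:.1f} C")
print(f"CPU headroom to warning level: {headroom(cpu_c, CPU_WARNING_C):+.1f} C")
# No power supply temperature reading was reported in the thread, so
# PSU_SHUTDOWN_C is only defined here for use once such a reading exists.
```

With the reported CPU reading of 80 degrees C the headroom comes out negative, meaning the CPU was already 5 degrees above the warning level before any extra load or airflow problem is taken into account.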


bryancrossa at 2007-7-11 22:30:24