PS0 fault -> power off. What happened?
Hi,
The server unexpectedly switched off yesterday morning.
The only log line I found, written by picld, said
WARNING: Device PS0 failure detected by sensor PS0_FAULT_SENSOR.
The system is a SunFire 280R and has two power supplies. When I reached
it, some 18 hours later, the general fault amber LED was lit and everything
else was off. The system powered back on without errors, apart from fsck fixes.
Everything on prtpicl had an ok status after boot.
What is responsible for shutting down the system that way?
Should I disable it (how?) or is there a chance that it did the Right Thing?
PS0 and PS1 are connected to two different UPSes, neither of which
reported a glitch. CPU temperature wasn't very good: 80 °C (176 °F) during the
night and perhaps a bit higher by the time the system crashed. However, I
found no logs about temperature either. The system only has one CPU, and
it had been running undisturbed for the previous 13 months. Is there
anything I can do to ensure this kind of crash won't happen again?
Please help
[1124 byte] By [vesely99a] at [2007-11-27 0:28:44]

# 3
I still have no clue about what happened.
I understand picld is a monitoring daemon, and it wrote a warning. Hence I don't think it shut the server off.
Was it the hardware or some software very close to the iron?
The documentation about the general fault is very vague... What does it take to get reasonable behavior?
# 5
I'm not sure about the PN, but it has two PS's in the front panel.
They are both running OK. I don't recall if I had to explicitly ask
for redundant power supply, but that feature is very convenient,
otherwise one would have to turn the system off to service the UPS...
I've found a paragraph in the owner's manual that I didn't pay
much attention to until now. It says: "An individual power supply
will shut down itself at an internal temperature of approximately
90 °C (194 °F), depending on the ambient temperature, system
loading, and the availability of a redundant power supply."
What does that mean? I found no PS temperature sensor on
the I2C monitoring. However, those fuzzy dependencies seem
to imply some sort of software control. Don't they?
# 6
It happened again today!
This time I was here, in the same room. The noise dropped suddenly, though not by much, so I paid no attention. After some 20 seconds, the noise almost vanished and I realized that the SunFire 280R had switched off. There are a few machines in the room. The 280R is by far the noisiest with its three fans. Does it make sense to hypothesize that the first noise drop I heard was one of the fans stopping?
The room temperature was a little high, 29 °C (84 °F), but I was comfortable in a T-shirt and shorts. All the other equipment continued working fine.
The amber fault LEDs were on, both on the external panel and on both power supplies. The interior of the machine was reasonably clean (I vacuumed it two months ago). Again, the single log line I got after booting it is:
Jun 20 13:16:35 picld[102]: [ID 961923 daemon.error] WARNING: Device PS1 failure detected by sensor PS1_FAULT_SENSOR
AFAIK, power supplies make no perceptible noise, do they?
I lowered the temperature in the room. Will that make the server any more reliable?
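In case it helps anyone chasing the same thing, here is a rough Python sketch for pulling these picld warnings out of a log so the dates and the failing device can be compared across incidents. The pattern is modelled only on the line quoted above; other picld messages may well be formatted differently, and the function name `find_ps_faults` is made up for the example.

```python
import re

# Pattern modelled on the single picld line quoted in this thread.
# Assumption: other picld warnings follow the same layout.
PICLD_FAULT = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+[\d:]{8})\s+picld\[\d+\]:.*"
    r"WARNING: Device (?P<device>\S+) failure detected by sensor "
    r"(?P<sensor>\S+?)\.?$"
)


def find_ps_faults(lines):
    """Return (timestamp, device, sensor) for every matching picld warning."""
    faults = []
    for line in lines:
        m = PICLD_FAULT.match(line.strip())
        if m:
            faults.append(
                (m.group("timestamp"), m.group("device"), m.group("sensor"))
            )
    return faults


sample = [
    "Jun 20 13:16:35 picld[102]: [ID 961923 daemon.error] WARNING: "
    "Device PS1 failure detected by sensor PS1_FAULT_SENSOR",
]
print(find_ps_faults(sample))
```

On a real system the lines would presumably come from the syslog file (typically /var/adm/messages on Solaris) rather than from a hard-coded sample.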
# 7
Hmm.. Looks like you are really close to the shutdown thresholds for both the power supplies and the CPU with regard to temperature. What does /usr/platform/sun4u/sbin/prtdiag -v show for temperatures? You can also use the RSC card if you've configured it. If so, connect to the RSC card and run the showenvironment command. It will give you information about temperatures and fault lights.
According to the documentation (which you seem to have found), you can operate on one power supply in the event the other has a failure of some kind. A couple of scenarios that might be happening:
1. CPU temperature threshold reached causing the shutdown.
2. Power supply threshold reached on both supplies causing the power off.
3. One power supply is already bad and when the "good" power supply reaches a temperature threshold and shuts down, the server powers off since there is no good power supply running.
According to the 280R documentation, power supplies have a threshold of 90 degrees C and the CPUs start putting warnings out at 75 degrees C. With your ambient temperature being about 84 degrees F, that could very well heat the internals of the box up enough to cause thermal shutdown...especially if the airflow isn't good.
Docs referenced: http://sunsolve.sun.com/data/806/806-4806/pdf/806-4806-10.pdf
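To put numbers on that, here is a small Python sketch that compares the readings mentioned in this thread with the two thresholds quoted above. It only does the arithmetic; the constant and function names are made up for the example, and the 80 °C figure is the CPU reading from the first post.

```python
# Thresholds as quoted in this thread from the 280R documentation.
CPU_WARNING_C = 75.0    # CPUs start putting out warnings
PSU_SHUTDOWN_C = 90.0   # approximate power supply shutdown temperature


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def margin(reading_c, threshold_c):
    """Headroom in degrees C; a negative value means past the threshold."""
    return threshold_c - reading_c


cpu_reading_c = 80.0  # CPU temperature reported in the first post
print(
    f"CPU: {cpu_reading_c:.0f} C = {c_to_f(cpu_reading_c):.0f} F, "
    f"margin to warning: {margin(cpu_reading_c, CPU_WARNING_C):+.0f} C"
)
```

By this arithmetic the reported CPU reading is already 5 °C past the warning threshold, which is why the prtdiag or showenvironment output is worth checking.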
Message was edited by:
bryancross