E450 CPU temps don't make sense

OK, I'm stumped. I've tried everything I can think of, up to and including replacing CPUs, and I can't figure out what's going on.

I have an Ultra E450 with 4 UltraSPARC II CPUs... there is no O/S installed - haven't even been able to get there...

Here's the VERY repeatable problem:

1. A single CPU in first slot (B2) shows 40C if running by itself.

2. Add a 2nd CPU in 2nd slot (A2) and the 2nd CPU shows 80C

3. Add 3rd and 4th CPUs and all but CPU2 (slot B1) show 80C. CPU2 goes to 40C?

All the panels are installed properly and there is significant cool airflow out of the back of the CPU side...

If I go against the grain and install CPU0 and CPU3 - .env-cpus reports both at 80C... same with every possible combination (yes I've tried them all!).

Any one of the CPU modules I have reports 40C if it's the only CPU installed in slot B2.

This all started after doing a normal boot from CDROM to install Solaris 10 - got to where the installer asks for Disk 2 and the install just stalled... I looked at the console in the booted system and it was spewing read errors off of 2 of the 4 18G disks I have in the machine - I did a shutdown and now I can't even boot CDROM (says Short Read - file does not appear to be executable blah blah blah).

Trying to boot from disk0 (where I know the installer started) runs right into an overheat warning and shutdown...

probe-pci from OpenBoot causes immediate power-off and all LEDs go on.

Should I try the 3-story drop-test? If it wasn't so heavy I'd probably have done it already.

Any ideas would be helpful!

Please remember I can't get to anything but OpenBoot at the moment, so O/S commands won't do me any good...

Thanks!

-DM

[1797 byte] By [] at [2007-11-25 22:46:55]
# 1

As I've mentioned in your other post from today:

http://supportforum.sun.com/hardware/index.php?t=msg&th=5375

CPU modules must be installed in a particular order, and each one needs a functional DC-DC voltage regulator module.

CPU slots: J0401, then J0201, then J0301, and finally J0101

DC-DC: DC-B2, then DC-A2, then DC-B1, and DC-A1 last

Don't worry about reported cpu temperatures while your OBP is at 3.7.x

Investigate after the flash-update and when you can run the box for more than just a few minutes so that everything equalizes.

Perhaps there's a quirky fan assembly or some ribbon cable is changing airflows when the components are inside and you have the chassis closed up.

B.T.W. OBP 3.7.x only shipped on the 501-2996 board. Every other E450 board came just enough later that they had 3.12.x or 3.16.x or newer. Those details can be seen in the Spectrum version of the Sun System Handbook (SSH), not the free version.

Handbook...Components...Open Boot Prom...E450

at 2007-7-5 17:02:16 > top of Java-index,Sun Hardware,Servers - General Discussion...
# 2

Bill,

Thanks for taking the time to respond... still having the same problem.

All the CPUs and their corresponding DC modules were installed with the power off - the different slot tests I did were simply to determine if I had a CPU with a tendency to overheat (not the case).

So I started from scratch and now all 4 CPUs are installed (with the power off I assume the insertion order makes no difference), but all were installed in the order you suggested - just in case some magic chip on the motherboard (or God) was watching me...

The problem remains:

1. Can't boot to disk (overheat causes hardware power off)

2. Can't boot to CDROM - short read error...

I reburned the CDROM at a slower speed to see if that made a difference to the 12X drive on the E450.... it doesn't... still get the "short read" error. Not to mention I used the original disk to get the first phase of install done before all this stuff started happening...

Another thing I found - in obdiag, running the selftest against the Env-monitor complains more than my ex-wife. All sorts of errors (ambient - cpu temp delta out of range, etc.), mostly "Timeout on BUS CONTROLLER PIN bit".

All the fans are peachy and all the cables seem well seated...

Is the Env monitor an ancillary device on the MB that I can reseat or otherwise check? Maybe it's toast? I have tried to boot with env-monitor disabled - doesn't make a difference.

obdiag returns the following after a "what 1"

/pci@1f,4000/ebus@1/SUNW,envctrl@14,600000

name: SUNW,envctrl

alias: no property

device_type: no property

compatible: no property

model : no property

fru : motherboard, power_supply, fan, cpu

status : no property

Thanks

-DM

# 3

... it might be the motherboard itself that is questionable, to the point where it could be accused of being "flaky".

The function of the environmental monitoring and reporting circuitry is essentially hard-coded and you cannot modify it or turn it off.

:(

The CPU installation order is only important if there are fewer than four CPUs in the system.

Try to get the computer running with only one CPU. It might permit a simple Solaris install to a single disk.

Maybe you can get it running well enough to get the OBP patched, and if the behaviour changes for the better, then you could consider reinstalling the OS in whatever configuration you'd originally hoped.

Your CPU modules do have the same part number, I hope? It's not as critical an issue as in E420R systems, but try to avoid mixing CPU part numbers, even at the same CPU speed. (You definitely cannot mix CPU speeds in an E450.)

--

Lastly, there might be some extremely inexpensive ways to get a replacement systemboard if you must (read: eBay, and the like).

# 4

Bill,

Thanks again - this system was running Debian in a different life, then I inherited it as a development platform... So I know that not too long ago it was operational.

I can't believe the poor guy who had this before me paid over $70K for it in '98... (I just paid shipping :) )

I'm gonna do a "once-over" again on all the connections and seat everything again... if it don't come to life I'm going to ship it to my mother as a planter for her garden....

Thanks,

-DM

# 5

Ok here's another stumper...

I tried like you said to install a single CPU and get it to boot...

I installed 1 CPU in J0401 (bottom-most slot - indicated by all the labels as 1st CPU slot) and it's reporting it as CPU3

Not to mention it's reporting 80C (always 80 or 40 never anything in between)... and it bombed on "boot disk" with an overheat shutdown.

I just don't get it... I think this is why I like my Supermicro Quad Xeon EM64T board... no thermal issues whatsoever.

-DM

# 6

Ignore the reference to "cpu3" as it's just a distraction. You had the module in the proper position.

Yup, I'm stumped.

Hmmm... There is an assembly of three CPU fans, and there is an assembly of disk drive fans, and you've got a system with OBP 3.7.x (which places it at a certain vintage).

During the first couple of months that the E450 shipped, there was a power supply blower fan ABOVE the power supplies that could be reached by taking off the right side panel. You may not have had any reason to remove that chassis panel as yet.

That blower item was discontinued and just not used at all, very soon after the E450 began shipping. Perhaps you've got one of the dozen systems that has the fan and it no longer works, but it's still plugged into its socket and passing along a signal that something is broken.

Do you have login access to Sunsolve, so as to have deeper access into the Sun System Handbook? That would be the only way to get to a diagram I'd like to reference:

Handbook...Systems ...E450 ...

System-Views-and-Components ...Exploded View ...item #5 of that picture.

# 7
Looks like the 450s had an issue with the seating of the fan/interlock cable assembly at connector J0307, on the lower right of the Power Distribution board. Reseat it, and possibly fabricate a new one. What are the CPU part numbers?
# 8

The CPU Part numbers all match at 4849-03 REV 56, and all the Serial Numbers are within a couple hundred digits of each other (but the same where it counts - 4849A03XXX).

This is getting beyond weird tho...

A single CPU installed (in the B1 "First CPU" slot) shows 80C on .env-cpus - but take that same CPU and slap it into A1 (the "3rd CPU" slot) and the sucker runs at 40C to 42C.

With a second CPU installed in slot A2 (the official 2nd CPU slot), the temp on the CPU in B1 is 40C and the temp in the recently installed A1 is 80C. Someone must be playing with me, I think....

So against all Sun wisdom (as opposed to conventional wisdom), and utilizing a few dollars' worth of my engineering degree, I constructed a fan duct out of a 2 liter Coke bottle, some anti-static pouch material, and a couple of strategically located strips of tape.... I directed my contraption (reminiscent of Apollo 13's ad-hoc CO2 scrubber adapter) with all the force of the lower fan at the B1 slot CPU.... and guess what? Nada, zero, zilch, nothing... the sucker still chimes in at 80C and there's about 1500 CFM coming through the fins.

I'm about to try a single CPU in slot 3 (A1 for those keeping track)... I don't care if the system thinks the CPU should be in another slot or not - I was able to install Debian Linux on it last night using this method - but it's Solaris I need for our developers to play with....

I was able to get up to the Solaris install dialog with CPUs installed in the 2 topmost slots (A1 and A2); both CPUs were chugging along at 40C... then it eventually overheated and shut down (because the fans were at low RPM).

I could sure use some dynamite right now.... or a nice big trebuchet....

-DM

# 9

BE VERY CAREFUL HERE! You might have a meltdown.

You can adjust the env monitoring at OBP. If you are sure that these temperature readings are bogus, I would try this.

With one cpu installed:

ok> setenv env-monitor disabled

ok> setenv critical-temperature 85

(the default is 79)

Boot cdrom and get your OS installed... Upgrade OBP. Re-enable env-monitor.
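To see why raising the threshold matters for the readings in this thread, here is a rough sketch in Python comparing a reported temperature against the critical value. The default of 79 and the raised value of 85 are the figures quoted in this post; the comparison rule (shutdown when the reading reaches or exceeds the threshold) is an assumption, not documented firmware behaviour, and the function name is made up.

```python
DEFAULT_CRITICAL_C = 79  # default critical temperature quoted in this post


def trips_shutdown(reported_c, critical_c=DEFAULT_CRITICAL_C):
    """Return True if a reported CPU temperature would trigger a shutdown.

    Assumption: the monitor shuts the system down when the reading
    reaches or exceeds the critical threshold.
    """
    return reported_c >= critical_c


if __name__ == "__main__":
    # The 80C and 40C values are the two readings reported in this thread.
    for reading in (40, 80):
        print(reading, trips_shutdown(reading), trips_shutdown(reading, 85))
```

Under that assumption a bogus 80C reading trips the default threshold but not a threshold of 85, which would explain the instant overheat shutdowns. It does nothing to protect a CPU that really is hot, hence the warning above.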

# 10

Our actual room temperature is 18 degrees Celsius, but our server shows 80 degrees Fahrenheit (27 degrees C).

The warnings started two days ago. As far as I can tell, the temperature is perfect at 18 degrees C.

I would like to know what the temperature requirements are for the E450 server. Can I test the server at 80 degrees F by overriding some settings?
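For anyone comparing the two figures, the conversion is quick to check. A minimal Python sketch (the function names are made up for illustration):

```python
def fahrenheit_to_celsius(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


def celsius_to_fahrenheit(deg_c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return deg_c * 9 / 5 + 32


if __name__ == "__main__":
    # Reported reading versus measured room temperature from this post.
    print(round(fahrenheit_to_celsius(80), 1))
    print(round(celsius_to_fahrenheit(18), 1))
```

80 F works out to about 26.7 C, so the server is reporting roughly 9 degrees C above the measured 18 C room temperature.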

HiteshRasiklalShah at 2007-7-5 17:02:16