SF v490/S10 Server reboots unexpectedly

We have 4 SunFire v490 servers running Oracle 10g RAC across 2 sites: site01 and site02. The servers in site02 reboot unexpectedly every now and then without leaving any log info in /var/adm/messages or on the console. I have been pulling my hair out trying to find the problem, but I am stuck. Some people told me that Oracle processes will reboot the server if they decide to, but there is no trace of such a decision in the Oracle process logs. I suspect a temperature problem on the CPUs/servers, but I have nothing to base my theory on, because there is nothing in the logs. We have some custom processes that poll the CPU temperatures (using prtdiag -v) every 5 min, but they have never raised an alarm. It looks like the server crashes (no crash dump file, although crash dumps are configured) and comes back to life by itself.
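One way to double-check what was (or was not) logged before each reset is to find every boot banner in /var/adm/messages and print the line recorded just before it. This is only a minimal sketch: it assumes the kernel boot banner contains the string "SunOS Release", and the helper name last_lines_before_boot is made up for illustration.

```shell
# Hypothetical helper: for every boot banner in a syslog-style file, print
# the line logged immediately before it, followed by the banner itself.
# This shows the last thing syslog recorded before each boot.
last_lines_before_boot() {
    awk '
        /SunOS Release/ {
            if (prev != "") print "before: " prev
            print "boot:   " $0
        }
        { prev = $0 }
    ' "$1"
}

# Example:
# last_lines_before_boot /var/adm/messages
```

Comparing the timestamps of each "before" and "boot" pair gives a rough idea of when the server went down and how long it was away.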

Is there a way to increase the debug level of the logs for a crash or reboot, for example something that can be set at the OBP level to print on the console any events that lead up to a server reboot?
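Before raising any debug level, it may be worth confirming where savecore is expected to write and whether anything has ever landed there. A minimal sketch, assuming the usual Solaris 10 dumpadm output with a "Savecore directory:" line; savecore_dir is a hypothetical helper name.

```shell
# Hypothetical helper: read dumpadm output on stdin and print the
# configured savecore directory.
savecore_dir() {
    sed -n 's/^ *Savecore directory: *//p'
}

# Example (run as root on the affected server):
# dir=`dumpadm | savecore_dir`
# ls -l "$dir"      # look for unix.N / vmcore.N pairs
# savecore -L       # live dump to prove the dump path works
#                   # (needs a dedicated dump device, not swap)
```

If the directory is empty after a reset, the kernel probably never got as far as writing a dump, which points away from an ordinary panic.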

The systems are running Solaris 10, Release 5.10 Version Generic_118822-25 64-bit. Below is the output from prtdiag:

System Configuration: Sun Microsystems sun4u Sun Fire V490

System clock frequency: 150 MHz

Memory size: 8192 Megabytes

========================= CPUs ===============================================

            Run   E$    CPU     CPU
Brd  CPU    MHz   MB    Impl.   Mask
---  -----  ----  ----  ------  ----
 A   0, 16  1500  32.0  US-IV+  2.2
 A   2, 18  1500  32.0  US-IV+  2.2

========================= Memory Configuration ===============================

          Logical  Logical  Logical
     MC   Bank     Bank     Bank       DIMM   Interleave  Interleaved
Brd  ID   num      size     Status     Size   Factor      with
---  ---  -------  -------  ---------  -----  ----------  -----------
 A   0    0        1024MB   no_status  512MB  8-way       0
 A   0    1        1024MB   no_status  512MB  8-way       0
 A   0    2        1024MB   no_status  512MB  8-way       0
 A   0    3        1024MB   no_status  512MB  8-way       0
 A   2    0        1024MB   no_status  512MB  8-way       0
 A   2    1        1024MB   no_status  512MB  8-way       0
 A   2    2        1024MB   no_status  512MB  8-way       0
 A   2    3        1024MB   no_status  512MB  8-way       0

========================= IO Cards =========================

                        Bus   Max
IO    Port  Bus         Freq  Bus   Dev,
Type  ID    Side  Slot  MHz   Freq  Func  State  Name                               Model
----  ----  ----  ----  ----  ----  ----  -----  ---------------------------------  -----------------------
PCI   8     B     4     33    33    4,0   ok     pci-pci8086,537c.7/network (netw+  PCI-BRIDGE
PCI   8     B     4     33    33    0,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     4     33    33    1,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     4     33    33    2,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     4     33    33    3,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     5     33    33    5,0   ok     pci-pci8086,537c.7/network (netw+  PCI-BRIDGE
PCI   8     B     5     33    33    0,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     5     33    33    1,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     5     33    33    2,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     5     33    33    3,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     A     0     66    66    1,0   ok     SUNW,qlc-pci1077,2312.1077.10a.2+
PCI   8     A     0     66    66    1,1   ok     SUNW,qlc-pci1077,2312.1077.10a.2+
PCI   8     A     1     66    66    2,0   ok     SUNW,qlc-pci1077,2312.1077.10a.2+
PCI   8     A     1     66    66    2,1   ok     SUNW,qlc-pci1077,2312.1077.10a.2+

Any help is very much appreciated,

Adrian

[3224 byte] By [Newbay] at [2007-11-26 8:47:49]
# 1

Download and install Sun's Explorer data gathering tool

http://docs.sun.com/app/docs/coll/1554.1

Run it, then open a service case with Sun and open a service case with Oracle.

Tell each of them that you have the output file from Explorer, and expect to send it to them for analysis.

By the way, a verbose prtdiag would tell you some temperatures at the instant the command is run.

# prtdiag -v

However, if there are no temperature warnings in /var/adm/messages, then I would not expect that you have any temperature issues.
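Since prtdiag -v only reports the readings at the moment it runs, keeping each snapshot on disk gives a history to look at after a reset. A minimal sketch, assuming a hypothetical log path and helper name (log_snapshot); backticks are used instead of $( ) so that it also works in the older Bourne shell.

```shell
# Hypothetical helper: append a timestamped snapshot of a command's
# output to a log file.
# Usage: log_snapshot LOGFILE COMMAND [ARGS...]
log_snapshot() {
    logfile=$1
    shift
    {
        echo "=== `date '+%Y-%m-%d %H:%M:%S'` ==="
        "$@" 2>&1
    } >> "$logfile"
    # Ask the OS to flush buffers so the last reading has a better
    # chance of surviving a sudden reset.
    sync
}

# Example (hypothetical path), run from cron every 5 minutes:
# log_snapshot /var/tmp/prtdiag-history.log prtdiag -v
```

After an unexpected reboot, the last few entries in the file show what the sensors reported shortly before the server went down.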

rukbat at 2007-7-6 22:35:28
# 2

Hmm, when you say that there is nothing on the console, how do you determine that? Do you have something which logs the output on the console?

If not, it would probably be a good idea to attach something which logs the output from the serial console, or, if you have configured the RSC, you could connect to it and run "consolehistory" to display any errors it might have captured.

Also, ensure that you have the latest OBP and that it is doing a full diagnostic test when it resets; that way you might find certain hardware errors.
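The current diagnostic settings can be read from Solaris with eeprom before changing anything. This is a minimal sketch: the variable list is an assumption based on the names used in Sun's OpenBoot diagnostics documentation (diag-trigger and verbosity in particular may not exist on older OBP releases), and show_diag_settings is a hypothetical helper.

```shell
# Hypothetical helper: print the OBP variables that control diagnostics.
# EEPROM may be overridden for a dry run; it defaults to the eeprom command.
show_diag_settings() {
    for var in 'diag-switch?' 'diag-level' 'diag-trigger' 'verbosity' 'auto-boot-on-error?'
    do
        ${EEPROM:-eeprom} "$var"
    done
}

# Example (as root):
# show_diag_settings
# eeprom 'diag-switch?=true'    # enable diagnostics
# eeprom 'diag-level=max'       # most thorough test level
```

Full diagnostics make every reset take noticeably longer, so it is worth noting the original values and restoring them once the fault is found.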

The latest Solaris patch cluster might also be a good idea.

7/M.

mAbrante at 2007-7-6 22:35:28