SF v490/S10 Server reboots unexpectedly
We have four Sun Fire V490 servers running Oracle 10g RAC across two sites, site01 and site02. The servers in site02 reboot unexpectedly every now and then without leaving anything in /var/adm/messages or on the console. I have been pulling my hair out trying to find the cause, but I am stuck. Some people have told me that Oracle processes will reboot a server if they decide to, but there is no trace of such a decision in the Oracle process logs. I suspect a temperature problem on the CPUs or in the servers, but I have nothing to back that theory up, because the logs are empty. We have custom processes that poll the CPU temperatures (using prtdiag -v) every 5 minutes, but they have never raised an alarm. It looks like the server crashes (no crash dump file is written, although crash dumps are configured) and then comes back up by itself.
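For reference, a poller of this kind can be sketched roughly as follows. This is a minimal illustration, not our actual script: it assumes a Python 3 interpreter is available on the host (stock Solaris 10 does not ship one), that prtdiag is on the PATH, and the log path and function names are made up for the example. The only idea it shows is appending each timestamped snapshot to a file and forcing it to disk, so that the last reading before an abrupt reset is not lost in a buffer.

```python
import os
import subprocess
import time


def snapshot(cmd, log_path):
    """Run cmd once and append its output, timestamped, to log_path."""
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
        rc, output = proc.returncode, proc.stdout
    except OSError as exc:
        # Command missing or not executable: record that instead of dying.
        rc, output = -1, "could not run %r: %s" % (cmd, exc)

    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write("=== %s rc=%d ===\n%s\n" % (stamp, rc, output))
        log.flush()
        os.fsync(log.fileno())  # push the record to disk before returning
    return rc


def poll(cmd, log_path, interval_s, rounds=None):
    """Call snapshot() every interval_s seconds; rounds=None means forever."""
    done = 0
    while rounds is None or done < rounds:
        snapshot(cmd, log_path)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_s)
```

With those assumptions, the 5-minute poll would be started as poll(["prtdiag", "-v"], "/var/tmp/prtdiag-poll.log", 300), where /var/tmp/prtdiag-poll.log is only a placeholder path.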
Is there a way to increase the debug level of the logging for a crash or reboot, for example something that can be set at the OBP level so that any events leading up to a server reboot are printed on the console?
The systems are running Solaris 10, Release 5.10 Version Generic_118822-25 64-bit. Below is the output from prtdiag:
System Configuration: Sun Microsystems sun4u Sun Fire V490
System clock frequency: 150 MHz
Memory size: 8192 Megabytes
========================= CPUs ===============================================
            Run   E$    CPU      CPU
Brd  CPU    MHz   MB    Impl.    Mask
---  -----  ----  ----  -------  ----
 A   0, 16  1500  32.0  US-IV+   2.2
 A   2, 18  1500  32.0  US-IV+   2.2
========================= Memory Configuration ===============================
         Logical  Logical  Logical
     MC  Bank     Bank     Bank       DIMM   Interleave  Interleaved
Brd  ID  num      size     Status     Size   Factor      with
---  --  -------  -------  ---------  -----  ----------  -----------
 A   0   0        1024MB   no_status  512MB  8-way       0
 A   0   1        1024MB   no_status  512MB  8-way       0
 A   0   2        1024MB   no_status  512MB  8-way       0
 A   0   3        1024MB   no_status  512MB  8-way       0
 A   2   0        1024MB   no_status  512MB  8-way       0
 A   2   1        1024MB   no_status  512MB  8-way       0
 A   2   2        1024MB   no_status  512MB  8-way       0
 A   2   3        1024MB   no_status  512MB  8-way       0
========================= IO Cards =========================
                        Bus   Max
IO    Port  Bus         Freq  Bus   Dev,
Type  ID    Side  Slot  MHz   Freq  Func  State  Name                               Model
----  ----  ----  ----  ----  ----  ----  -----  ---------------------------------  -----------------------
PCI   8     B     4     33    33    4,0   ok     pci-pci8086,537c.7/network (netw+  PCI-BRIDGE
PCI   8     B     4     33    33    0,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     4     33    33    1,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     4     33    33    2,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     4     33    33    3,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     5     33    33    5,0   ok     pci-pci8086,537c.7/network (netw+  PCI-BRIDGE
PCI   8     B     5     33    33    0,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     5     33    33    1,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     5     33    33    2,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     B     5     33    33    3,0   ok     network-pci100b,35.30              SUNW,pci-x-qge/pci-bri+
PCI   8     A     0     66    66    1,0   ok     SUNW,qlc-pci1077,2312.1077.10a.2+
PCI   8     A     0     66    66    1,1   ok     SUNW,qlc-pci1077,2312.1077.10a.2+
PCI   8     A     1     66    66    2,0   ok     SUNW,qlc-pci1077,2312.1077.10a.2+
PCI   8     A     1     66    66    2,1   ok     SUNW,qlc-pci1077,2312.1077.10a.2+
Any help is very much appreciated,
Adrian

