Sun Enterprise E4500/E5500 Crash

Can someone shed some light on what caused the system to panic?

This is a database server running Oracle 8i.

I have gone over the Oracle Alert and Trace logs and found it was 4:49:40 pm when Oracle registered an error in trying to write out to disk. No other Oracle errors were recorded leading up to that time, and none were reported from last night's monitoring log.

At 4:49 pm Oracle's Database Writer process terminated abnormally; the cause is what we are hoping to learn. When Oracle's background Process Monitor detected the major error in the Database Writer, it terminated the Oracle instance, killing all of the Oracle background processes loaded into memory. This is a necessary step to protect database integrity and allow the database to recover at the next restart. Oracle managed this before the server actually went down, but only just: the trace file and alert log both end abruptly at this point. Oracle did not produce a core file, which it might have done had something gone badly wrong within Oracle itself.

I also noticed that the system did not create a core file when it came back up, so I have no core information of any kind to send to Sun.

Thanks

Steve

- LOG FILES:

System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise E4500/E5500

System clock frequency: 100 MHz

Memory size: 8192Mb

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      400     8.0   US-II   10.0
 0     1     1      400     8.0   US-II   10.0
 2     4     0      400     8.0   US-II   10.0
 2     5     1      400     8.0   US-II   10.0
 4     8     0      400     8.0   US-II   10.0
 4     9     1      400     8.0   US-II   10.0
 6    12     0      400     8.0   US-II   10.0
 6    13     1      400     8.0   US-II   10.0

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 827727 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU13 Data access at TL=0, errID 0x00074c3c.601593d1

Jan 7 16:49:38 dbprod02 AFSR 0x00000000.00200000<UE> AFAR 0x00000001.c90127f8

Jan 7 16:49:38 dbprod02 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x100832760

Jan 7 16:49:38 dbprod02 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03

Jan 7 16:49:38 dbprod02 UDBL Syndrome 0x3 Memory Module Board 6 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 851579 kern.warning] WARNING: [AFT1] errID 0x00074c3c.601593d1 Syndrome 0x3 indicates that this may not be a memory module problem

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 304836 kern.info] [AFT2] errID 0x00074c3c.601593d1 PA=0x00000001.c90127f8

Jan 7 16:49:38 dbprod02 E$tag 0x00000000.0c403920 E$State: Shared E$parity 0x06

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x00000000.00000020

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x00000000.00000000 *Bad* PSYND=0x00ff

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 953641 kern.warning] WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU0 (caused Data access error on CPU13), errID 0x00074c3c.601593d1

Jan 7 16:49:38 dbprod02 AFSR 0x00000000.01000010<CP> AFAR 0x00000001.c90127f8

Jan 7 16:49:38 dbprod02 AFSR.PSYND 0x0010(Score 95) AFSR.ETS 0x00

Jan 7 16:49:38 dbprod02 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 304836 kern.info] [AFT2] errID 0x00074c3c.601593d1 PA=0x00000001.c90127f8

Jan 7 16:49:38 dbprod02 E$tag 0x00000000.0c403920 E$State: Shared E$parity 0x06

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x00000000.00000020

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000000.00000000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x00000000.00000000 *Bad* PSYND=0x0010

Jan 7 16:49:38 dbprod02 unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000001.c9012000

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 929370 kern.info] [AFT3] errID 0x00074c3c.601593d1 Above Error is in User Mode

Jan 7 16:49:38 dbprod02 and is fatal: will reboot

Jan 7 16:49:38 dbprod02 unix: [ID 855177 kern.warning] WARNING: [AFT1] initiating reboot due to above error in pid 913 (oracle)

[6903 byte] By [steved] at [2007-11-25 22:39:48]
# 1

Steve,

Your ecache copyout uncorrectable error on CPU 0 could have been an isolated event.

Jan 7 16:49:38 dbprod02 SUNW,UltraSPARC-II: [ID 953641 kern.warning] WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU0 (caused Data access error on CPU13), errID 0x00074c3c.601593d1

Jan 7 16:49:38 dbprod02 AFSR 0x00000000.01000010<CP> AFAR 0x00000001.c90127f8

Jan 7 16:49:38 dbprod02 AFSR.PSYND 0x0010(Score 95) AFSR.ETS 0x00

As you added to your posting in the Developer Forums (http://forum.sun.com/thread.jspa?threadID=28356&tstart=0), you thought you had found a data write error to an external disk drive at about the same moment in time.

That may not necessarily be the actual cause, but is a valid example of a scenario that can lock up a system to the extent that it will panic the box.

Since there were no application cores and no kernel cores, you'll need to monitor the system for the foreseeable future (maybe as much as two weeks).

That will include putting a dedicated console capture in place.

Replace no hardware at this time.
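While monitoring, it can help to tally the AFT events per errID so that any pattern across CPUs or boards becomes visible. The following is only a minimal Python sketch, not a Sun tool: it assumes the messages have been saved unwrapped (one syslog entry per line, as in a file such as /var/adm/messages) and that the wording matches the [AFT1] lines posted above. The function name `summarize_aft` and the dictionary keys are hypothetical names chosen for illustration.

```python
import re

# Patterns are based only on the [AFT1] wording seen in the posted logs;
# other kernel releases may word these messages differently (assumption).
UE_RE = re.compile(
    r"\[AFT1\] Uncorrectable Memory Error on CPU(\d+) .*errID (0x[0-9a-f]+\.[0-9a-f]+)",
    re.IGNORECASE,
)
CP_RE = re.compile(
    r"\[AFT1\] AFAR was derived from UE report, CP event on CPU(\d+) "
    r".*errID (0x[0-9a-f]+\.[0-9a-f]+)",
    re.IGNORECASE,
)
BOARD_RE = re.compile(r"Memory Module Board (\d+)")


def summarize_aft(lines):
    """Group AFT1 events by errID.

    Returns a dict mapping errID to a dict that may contain:
      reporting_cpu - CPU that took the data access error
      copyout_cpu   - CPU whose ecache copyout (CP) was flagged
      board         - board named on the 'Memory Module Board' line
    """
    events = {}
    current = None  # errID of the most recent UE report
    for line in lines:
        m = UE_RE.search(line)
        if m:
            current = m.group(2)
            events.setdefault(current, {})["reporting_cpu"] = int(m.group(1))
            continue
        m = CP_RE.search(line)
        if m:
            events.setdefault(m.group(2), {})["copyout_cpu"] = int(m.group(1))
            continue
        m = BOARD_RE.search(line)
        if m and current is not None:
            # The board line carries no errID, so attach it to the last UE seen.
            events[current]["board"] = int(m.group(1))
    return events
```

Calling `summarize_aft(open(path))` on a saved log returns one entry per errID, which makes it easy to compare the reporting CPU, the copyout CPU, and the board from one event to the next.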

Bill at 2007-7-5 14:12:55 > top of Java-index,Sun Hardware,Other Sun Hardware...
# 2
Thank you very much for your reply Bill. I will keep monitoring the system. Thanks Again! Steve
steved at 2007-7-5 14:12:55
# 3

DB Server Crashed Again - Different Errors this time.

Neither the system nor Oracle created a core file.

Any help would be appreciated!

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 302694 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU4 Data access at TL=0, errID 0x0008305d.5bfcd9c3

Feb 3 09:06:29 dbprod02 AFSR 0x00000000.00200000<UE> AFAR 0x00000001.864b2b38

Feb 3 09:06:29 dbprod02 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10160b66c

Feb 3 09:06:29 dbprod02 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03

Feb 3 09:06:29 dbprod02 UDBL Syndrome 0x3 Memory Module Board 0 J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 594311 kern.warning] WARNING: [AFT1] errID 0x0008305d.5bfcd9c3 Syndrome 0x3 indicates that this may not be a memory module problem

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 711131 kern.info] [AFT2] errID 0x0008305d.5bfcd9c3 PA=0x00000001.864b2b38

Feb 3 09:06:29 dbprod02 E$tag 0x00000000.0e4030c9 E$State: Shared E$parity 0x07

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x393a3539.018002c1

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x02014113.32303036

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x2d30312d.30362032

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x323a3235.3a303801

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x802c000b.04c3051f

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x1304c305.015c04c3

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x051c2301.80133230

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x3035ad31.322d3138 *Bad* PSYND=0x00ff

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 763598 kern.warning] WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU5 (caused Data access error on CPU4), errID 0x0008305d.5bfcd9c3

Feb 3 09:06:29 dbprod02 AFSR 0x00000000.01000020<CP> AFAR 0x00000001.864b2b38

Feb 3 09:06:29 dbprod02 AFSR.PSYND 0x0020(Score 95) AFSR.ETS 0x00

Feb 3 09:06:29 dbprod02 UDBH 0x0098 UDBH.ESYND 0x98 UDBL 0x0000 UDBL.ESYND 0x00

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 711131 kern.info] [AFT2] errID 0x0008305d.5bfcd9c3 PA=0x00000001.864b2b38

Feb 3 09:06:29 dbprod02 E$tag 0x00000000.0e4030c9 E$State: Shared E$parity 0x07

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x393a3539.018002c1

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x02014113.32303036

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x2d30312d.30362032

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x323a3235.3a303801

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x802c000b.04c3051f

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x1304c305.015c04c3

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x051c2301.80133230

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x3035ad31.322d3138 *Bad* PSYND=0x0020

Feb 3 09:06:29 dbprod02 unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000001.864b2000

Feb 3 09:06:29 dbprod02 SUNW,UltraSPARC-II: [ID 953114 kern.info] [AFT3] errID 0x0008305d.5bfcd9c3 Above Error is in User Mode

Feb 3 09:06:38 dbprod02 genunix: [ID 672855 kern.notice] syncing file systems...

Feb 3 09:06:39 dbprod02 genunix: [ID 904073 kern.notice] done

steved at 2007-7-5 14:12:55
# 4

Yup, you're right. It's a completely different event.

This time you had an ecache copyout error on CPU 5.

"UE CP" == Uncorrectable Error, Copyout

Last time it was on CPU 0.

The only similarity is the outcome: a reset of the chassis to protect data integrity.

I can only suggest you open a service case with Sun, get through to Kernel Support and have them advise you on how to investigate this.

In my years on the Hardware Support team at Sun, that's all I've ever been able to do for my customers: forward them to the engineers in Kernel.
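For comparing the two events side by side, the AFSR and AFAR values can be decoded mechanically. The Python sketch below is an illustration only: the UE and CP bit masks are inferred from the posted logs (0x00200000 is printed as `<UE>` and 0x01000000 as `<CP>`), the PSYND mask is inferred from the AFSR.PSYND values matching the low 16 bits, and the ETS field position and the 8 KB page size are assumptions, the latter consistent with the "Scheduling clearing of error on page" lines. All function and constant names are hypothetical.

```python
# Masks inferred from the log excerpts in this thread; treat them as
# assumptions rather than an authoritative register definition.
AFSR_UE = 0x00200000          # logged as <UE>
AFSR_CP = 0x01000000          # logged as <CP>
AFSR_ETS_MASK = 0x000F0000    # assumed position of the ETS field
AFSR_PSYND_MASK = 0x0000FFFF  # low 16 bits match the logged AFSR.PSYND
PAGE_SIZE = 8192              # assumed base page size (8 KB)


def parse_split_hex(text):
    """Convert the '0xHHHHHHHH.LLLLLLLL' notation used in the logs to an int."""
    return int(text.replace(".", ""), 16)


def decode_afsr(afsr_text):
    """Return the UE/CP flags, ETS field, and parity syndrome from an AFSR value."""
    afsr = parse_split_hex(afsr_text)
    flags = [name for name, bit in (("UE", AFSR_UE), ("CP", AFSR_CP)) if afsr & bit]
    return {
        "flags": flags,
        "ets": (afsr & AFSR_ETS_MASK) >> 16,
        "psynd": afsr & AFSR_PSYND_MASK,
    }


def page_base(afar_text):
    """Round an AFAR physical address down to the start of its page."""
    return parse_split_hex(afar_text) & ~(PAGE_SIZE - 1)
```

For example, `decode_afsr("0x00000000.01000020")` reports the CP flag with a parity syndrome of 0x20 for the February event, and `hex(page_base("0x00000001.864b2b38"))` gives the same page address that the kernel scheduled for clearing.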

Bill at 2007-7-5 14:12:55