E450 crash - uncorrectable memory error

Hi All,

We have an E450 server that crashed yesterday with the following messages:

///////////////

WARNING: [AFT1] Uncorrectable Memory Error on CPU3 Data access at TL=0, errID 0x000405b6.816c859e

AFSR 0x00000000.80200000 AFAR 0x00000000.7649f528

AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x100e02c0

UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203 UDBL.ESYND 0x03

UDBL Syndrome 0x3 Memory Module 190x

WARNING: [AFT1] errID 0x000405b6.816c859e Syndrome 0x3 indicates that this may not be a memory module problem

WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU2 (caused Data access error on CPU3), errID 0x000405b6.816c859e

AFSR 0x00000000.01000080 AFAR 0x00000000.7649f528

AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00

UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00

panic[cpu3]/thread=3000f348e00: [AFT1] errID 0x000405b6.816c859e UE Error(s)

See previous message(s) for details

000002a10207ee40 SUNW,UltraSPARC-II:cpu_aflt_log+568 (2a10207eefe, 1, 101558c0, 2a10207f088, 2a10207ef4b, 101558e8)

%l0-3: 0000000000000000 0000000000000003 000002a10207f150 0000000000000010

%l4-7: 0000030010091e40 0000030010091dc0 000002a10207faec 000003001f5c4828

000002a10207f090 SUNW,UltraSPARC-II:cpu_async_error+868 (1046a630, 2a10207f150, 80200000, 0, 650180080200000, 2a10207f310)

%l0-3: 00000000104750d8 0000000000000032 0000000000000203 0000000000000000

%l4-7: 000000007649f500 0000000000400000 0000000000400000 0000000000000001

000002a10207f260 unix:prom_rtt+0 (300207d0400, 3000f348e00, 20, 0, 0, 0)

%l0-3: 0000000000000007 0000000000001400 0000000000001606 000000001014ce08

%l4-7: 0000000010434738 0000000000000000 0000000000000000 000002a10207f310

000002a10207f3b0 genunix:kmem_cache_alloc+3c0 (300207d0400, 30004e90500, 2a10207f730, 1, 1, 30008b11ce0)

%l0-3: 000003000007ad40 0000000000000040 0000000000000000 0000000000000000

%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

000002a10207f520 tcp:tcp_wrw+2c (2a10207f730, 30004e90500, 300207d0400, 0, 0, ffbef85c)

%l0-3: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

%l4-7: 0000000000000001 0000000000000000 0000000000000000 00000000fecbc008

000002a10207f5d0 genunix:rwnext+23c (300207d0468, 300207d0528, 0, 300207d0400, 2a10207f730, 7840d7c0)

%l0-3: 000003001f5c4908 00000300207d04e0 0000030004e90500 000002a10207fa00

%l4-7: 000000000000006c 0000000000000000 000002a10207f868 0000000000000000

000002a10207f680 genunix:strput+38c (0, 2a10207fa00, 3001f5c4908, 8, 0, 0)

%l0-3: 000002a10207f930 0000000000000000 00000000ffbef930 0000000000000000

%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

000002a10207f870 genunix:strwrite+200 (850, 2a10207f930, 300012e7508, 1000000, 30020765bc8, 2a10207fa00)

%l0-3: 0000030010091dc0 0000000000000b68 000003001f5c4908 0000000000000083

%l4-7: 0000000000000001 0000030010091e40 0000000000000000 0000000000000000

000002a10207f940 genunix:write+204 (7d330, 40, 83, 3001f942048, 6, 40)

%l0-3: 00000000783e87ac 0000000000000040 0000030020765bc8 0000000000000000

%l4-7: 0000030010091e40 0000030010091dc0 000002a10207faec 000003001f5c4828

000002a10207fa40 genunix:write32+30 (6, 42078, 40, 0, 0, 0)

%l0-3: 0000030011832ab8 0000000000000006 00000000ffbef804 0000000000100083

%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

syncing file systems... 1 done

dumping to /dev/md/dsk/d20, offset 859373568

100% done: 136251 pages dumped, compression ratio 3.06, dump succeeded

rebooting...

Resetting ...

Software Power ON

CPU0 has assumed the role of Boot CPU

@(#) Sun Ultra 450 3.16 Version 2 created 2000/01/11 15:42

Online: CPU0 Ultra-II (v10.0) 4:1 4096KB 2-2 ECache MCap 7

Online: CPU1 Ultra-II (v10.0) 4:1 4096KB 2-2 ECache MCap 7

Online: CPU2 Ultra-II (v10.0) 4:1 4096KB 2-2 ECache MCap 7

Online: CPU3 Ultra-II (v10.0) 4:1 4096KB 2-2 ECache MCap 7

Motherboard DTAG SRAMs support up to 8192KB of ECache per CPU Module

Setting system ECache size to 4096KB

Clearing DTAGS...Done

Auxio Level = 0000.0000.0000.0004

Clearing E-Cache Tags...Done

Clearing I/D TLBs...Done

Probing Memory...Done

HiMem base = 0000.0000.0000.0000size = 0000.0001.0000.0000

Clearing Memory...Done

MMUs ON

Copying ROM to RAM...Done

RAM CRC = 0000.0000.d28e.364f; ROM CRC = 0000.0000.d28e.364f

Decompressing into Memory...0000.0000.0004.47d0 (274KB)...Done

Size = 0000.0000.0008.3930 (527KB)

Starting Forth kernel at 0000.0000.f005.8c5c

//////////////////////////////////////////

At first this looked like a memory failure to me, but on a second read I noticed "[AFT1] errID 0x000405b6.816c859e Syndrome 0x3 indicates that this may not be a memory module problem".

What could cause this error? Could anybody help me? Has CPU3 failed?

Thanks a lot for any answers.

Joseph

[5294 byte] By [Jocia] at [2007-11-27 0:14:14]
# 1

Indeed, if you see "UDBL Syndrome 0x3", it's never a DIMM issue.

Your error message told you what actually happened:

"UE report, CP event on CPU2 (caused Data access error on CPU3), .........(Score 95)"

which is its way of reporting a 95% confidence level in where the error originated.

(temporarily corrupt data bits in an ecache copyout transaction from CPU2)
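The register values in the log can be decoded by hand to see the same thing. A minimal sketch follows, assuming the UltraSPARC-II AFSR bit layout shown in the table (recalled from the processor manual rather than stated in this thread, so verify it before relying on it). The helper names `parse_reg` and `decode_afsr` are hypothetical.

```python
# Assumed AFSR flag positions for UltraSPARC-II (verify against the CPU manual).
AFSR_FLAGS = {
    32: "ME", 31: "PRIV", 30: "ISAP", 29: "ETP", 28: "IVUE",
    27: "TO", 26: "BERR", 25: "LDP", 24: "CP", 23: "WP",
    22: "EDP", 21: "UE", 20: "CE",
}


def parse_reg(text):
    """Convert the log's '0xHHHHHHHH.LLLLLLLL' notation into an integer."""
    return int(text.replace("0x", "").replace(".", ""), 16)


def decode_afsr(text):
    """Split an AFSR value into flag names, ETS (bits 19:16) and P_SYND (bits 15:0)."""
    value = parse_reg(text)
    flags = [
        name
        for bit, name in sorted(AFSR_FLAGS.items(), reverse=True)
        if (value >> bit) & 1
    ]
    return {"flags": flags, "ets": (value >> 16) & 0xF, "p_synd": value & 0xFFFF}


# AFSR reported for CPU3, then the one reported for CPU2, both taken from the log.
print(decode_afsr("0x00000000.80200000"))
print(decode_afsr("0x00000000.01000080"))
```

Under that assumed layout, the first value has the UE bit set (the uncorrectable error seen by CPU3) and the second has the CP bit set with P_SYND 0x0080, which agrees with the "CP event on CPU2" and "AFSR.PSYND 0x0080" lines in the log.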

Your OBP is at 3.16, which is a prehistoric version.

Your Solaris kernel patch level is likely just as old.

You're missing out on the many error-trapping and self-healing improvements in newer versions.

If you can boot the system and this is the only time such an event has occurred, then just ignore the event. Proceed with patching everything, firmware and operating environment.

If the system will not boot, then remove CPU2.

... then patch everything, firmware and operating environment.
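To check whether this really is the only such event, one option is to count the distinct errID values in the system log. This is only a rough sketch: it assumes all messages for one event share an errID, as they do in the log above, and that the log lines are not wrapped. The function name `distinct_aft_events` is hypothetical, and the log's location (commonly /var/adm/messages on Solaris) is an assumption, not something stated in this thread.

```python
import re

# Matches any [AFTn] message carrying an errID like 0x000405b6.816c859e.
ERRID = re.compile(
    r"\[AFT\d\].*?errID\s+(0x[0-9a-f]{8}\.[0-9a-f]{8})", re.IGNORECASE
)


def distinct_aft_events(lines):
    """Return the set of distinct errIDs found in [AFT] log lines."""
    found = set()
    for line in lines:
        match = ERRID.search(line)
        if match:
            found.add(match.group(1).lower())
    return found


# Sample lines taken from the crash log in this thread (unwrapped).
sample = [
    "WARNING: [AFT1] Uncorrectable Memory Error on CPU3 Data access at TL=0, "
    "errID 0x000405b6.816c859e",
    "WARNING: [AFT1] errID 0x000405b6.816c859e Syndrome 0x3 indicates that "
    "this may not be a memory module problem",
    "panic[cpu3]/thread=3000f348e00: [AFT1] errID 0x000405b6.816c859e UE Error(s)",
]

events = distinct_aft_events(sample)
print(len(events), "distinct AFT event(s):", sorted(events))
```

More than one distinct errID over time would suggest a recurring fault rather than a one-off event.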

rukbata at 2007-7-11 21:59:14