E450 crash - uncorrectable memory error
Hi All,
We have an E450 server that crashed yesterday with the following messages:
///////////////
WARNING: [AFT1] Uncorrectable Memory Error on CPU3 Data access at TL=0, errID 0x000405b6.816c859e
AFSR 0x00000000.80200000 AFAR 0x00000000.7649f528
AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x100e02c0
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203 UDBL.ESYND 0x03
UDBL Syndrome 0x3 Memory Module 190x
WARNING: [AFT1] errID 0x000405b6.816c859e Syndrome 0x3 indicates that this may not be a memory module problem
WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU2 (caused Data access error on CPU3), errID 0x000405b6.816c859e
AFSR 0x00000000.01000080 AFAR 0x00000000.7649f528
AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
panic[cpu3]/thread=3000f348e00: [AFT1] errID 0x000405b6.816c859e UE Error(s)
See previous message(s) for details
000002a10207ee40 SUNW,UltraSPARC-II:cpu_aflt_log+568 (2a10207eefe, 1, 101558c0, 2a10207f088, 2a10207ef4b, 101558e8)
%l0-3: 0000000000000000 0000000000000003 000002a10207f150 0000000000000010
%l4-7: 0000030010091e40 0000030010091dc0 000002a10207faec 000003001f5c4828
000002a10207f090 SUNW,UltraSPARC-II:cpu_async_error+868 (1046a630, 2a10207f150, 80200000, 0, 650180080200000, 2a10207f310)
%l0-3: 00000000104750d8 0000000000000032 0000000000000203 0000000000000000
%l4-7: 000000007649f500 0000000000400000 0000000000400000 0000000000000001
000002a10207f260 unix:prom_rtt+0 (300207d0400, 3000f348e00, 20, 0, 0, 0)
%l0-3: 0000000000000007 0000000000001400 0000000000001606 000000001014ce08
%l4-7: 0000000010434738 0000000000000000 0000000000000000 000002a10207f310
000002a10207f3b0 genunix:kmem_cache_alloc+3c0 (300207d0400, 30004e90500, 2a10207f730, 1, 1, 30008b11ce0)
%l0-3: 000003000007ad40 0000000000000040 0000000000000000 0000000000000000
%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
000002a10207f520 tcp:tcp_wrw+2c (2a10207f730, 30004e90500, 300207d0400, 0, 0, ffbef85c)
%l0-3: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
%l4-7: 0000000000000001 0000000000000000 0000000000000000 00000000fecbc008
000002a10207f5d0 genunix:rwnext+23c (300207d0468, 300207d0528, 0, 300207d0400, 2a10207f730, 7840d7c0)
%l0-3: 000003001f5c4908 00000300207d04e0 0000030004e90500 000002a10207fa00
%l4-7: 000000000000006c 0000000000000000 000002a10207f868 0000000000000000
000002a10207f680 genunix:strput+38c (0, 2a10207fa00, 3001f5c4908, 8, 0, 0)
%l0-3: 000002a10207f930 0000000000000000 00000000ffbef930 0000000000000000
%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
000002a10207f870 genunix:strwrite+200 (850, 2a10207f930, 300012e7508, 1000000, 30020765bc8, 2a10207fa00)
%l0-3: 0000030010091dc0 0000000000000b68 000003001f5c4908 0000000000000083
%l4-7: 0000000000000001 0000030010091e40 0000000000000000 0000000000000000
000002a10207f940 genunix:write+204 (7d330, 40, 83, 3001f942048, 6, 40)
%l0-3: 00000000783e87ac 0000000000000040 0000030020765bc8 0000000000000000
%l4-7: 0000030010091e40 0000030010091dc0 000002a10207faec 000003001f5c4828
000002a10207fa40 genunix:write32+30 (6, 42078, 40, 0, 0, 0)
%l0-3: 0000030011832ab8 0000000000000006 00000000ffbef804 0000000000100083
%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
syncing file systems... 1 done
dumping to /dev/md/dsk/d20, offset 859373568
100% done: 136251 pages dumped, compression ratio 3.06, dump succeeded
rebooting...
Resetting ...
Software Power ON
CPU0 has assumed the role of Boot CPU
@(#) Sun Ultra 450 3.16 Version 2 created 2000/01/11 15:42
Online: CPU0 Ultra-II (v10.0) 4:1 4096KB 2-2 ECache MCap 7
Online: CPU1 Ultra-II (v10.0) 4:1 4096KB 2-2 ECache MCap 7
Online: CPU2 Ultra-II (v10.0) 4:1 4096KB 2-2 ECache MCap 7
Online: CPU3 Ultra-II (v10.0) 4:1 4096KB 2-2 ECache MCap 7
Motherboard DTAG SRAMs support up to 8192KB of ECache per CPU Module
Setting system ECache size to 4096KB
Clearing DTAGS...Done
Auxio Level = 0000.0000.0000.0004
Clearing E-Cache Tags...Done
Clearing I/D TLBs...Done
Probing Memory...Done
HiMem base = 0000.0000.0000.0000size = 0000.0001.0000.0000
Clearing Memory...Done
MMUs ON
Copying ROM to RAM...Done
RAM CRC = 0000.0000.d28e.364f; ROM CRC = 0000.0000.d28e.364f
Decompressing into Memory...0000.0000.0004.47d0 (274KB)...Done
Size = 0000.0000.0008.3930 (527KB)
Starting Forth kernel at 0000.0000.f005.8c5c
//////////////////////////////////////////
At first this looked to me like a memory failure, but on a second read I noticed the message "[AFT1] errID 0x000405b6.816c859e Syndrome 0x3 indicates that this may not be a memory module problem".
What could cause this error? Has CPU3 failed? Could anybody help me?
Thanks a lot for any answers,
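In case it helps anyone compare the two error reports in the log, here is a minimal Python sketch that pulls the AFSR/AFAR pairs out of the console text and checks whether they point at the same address. The regular expression and the function name `parse_afsr_afar` are my own assumptions for illustration, not part of any Sun tool, and the sketch only extracts the raw values without interpreting individual AFSR bits.

```python
import re

# Matches pairs such as "AFSR 0x00000000.80200000 AFAR 0x00000000.7649f528".
# The pattern is an assumption based on the two lines seen in this log.
PAIR = re.compile(
    r"AFSR 0x([0-9a-f]{8})\.([0-9a-f]{8}) AFAR 0x([0-9a-f]{8})\.([0-9a-f]{8})"
)

def parse_afsr_afar(log_text):
    """Return a list of (afsr, afar) integer pairs found in the log text."""
    pairs = []
    for m in PAIR.finditer(log_text):
        # Each register is printed as two 32-bit halves separated by a dot.
        afsr = int(m.group(1) + m.group(2), 16)
        afar = int(m.group(3) + m.group(4), 16)
        pairs.append((afsr, afar))
    return pairs

# The two AFSR/AFAR lines copied from the panic messages above.
sample = """\
AFSR 0x00000000.80200000 AFAR 0x00000000.7649f528
AFSR 0x00000000.01000080 AFAR 0x00000000.7649f528
"""

reports = parse_afsr_afar(sample)
# True when every report names the same fault address.
same_address = len({afar for _, afar in reports}) == 1
print([(hex(afsr), hex(afar)) for afsr, afar in reports], same_address)
```

Run against the two lines from this crash, both reports give the same AFAR (0x7649f528), so the CPU3 and CPU2 messages appear to describe the same address.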
Joseph

