Server Crashed today!!! Need Help

I have an E4500 that crashed today, and when it came back up it wanted to run fsck on /opt.

I read the messages file, but I am not sure what it is telling me.

Here are the error logs.

Oct 13 16:36:03 myserver SUNW,UltraSPARC-II: [ID 632919 kern.warning] WARNING: [AFT1] WP event on CPU8, errID 0x0045d2a9.9c0cc094

Oct 13 16:36:03 myserver AFSR 0x00000000.00800008<WP> AFAR 0x0000007f.5dd9def0

Oct 13 16:36:03 myserver AFSR.PSYND 0x0008(Score 95) AFSR.ETS 0x00 Fault_PC 0xfb50e9cc

Oct 13 16:36:03 myserver UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 815649 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU5 Data access at TL=0, errID 0x0045d2ab.fb9dfacf

Oct 13 16:36:13 myserver AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.fe8a9d58

Oct 13 16:36:13 myserver AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10025328

Oct 13 16:36:13 myserver UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03

Oct 13 16:36:13 myserver UDBL Syndrome 0x3 Memory Module Board 2 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 529255 kern.warning] WARNING: [AFT1] errID 0x0045d2ab.fb9dfacf Syndrome 0x3 indicates that this may not be a memory module problem

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 790240 kern.info] [AFT2] errID 0x0045d2ab.fb9dfacf PA=0x00000000.fe8a9d58

Oct 13 16:36:13 myserver E$tag 0x00000000.1ac01fd1 E$State: Exclusive E$parity 0x0d

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x00000000.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x18): 0x00000000.10000000 *Bad* PSYND=0x00ff

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000001.f7000670

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000001.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x38): 0x00000001.f7027ff8

Oct 13 16:36:13 myserver unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000000.fe8a8000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 265390 kern.info] [AFT3] errID 0x0045d2ab.fb9dfacf Above Error detected by protected Kernel code

Oct 13 16:36:13 myserver that will try to clear error from system

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 916683 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU5 Data access at TL=0, errID 0x0045d2ab.fddc38ed

Oct 13 16:36:13 myserver AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.fe8a9d58

Oct 13 16:36:13 myserver AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10025328

Oct 13 16:36:13 myserver UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03

Oct 13 16:36:13 myserver UDBL Syndrome 0x3 Memory Module Board 2 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 374472 kern.warning] WARNING: [AFT1] errID 0x0045d2ab.fddc38ed Syndrome 0x3 indicates that this may not be a memory module problem

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 732702 kern.info] [AFT2] errID 0x0045d2ab.fddc38ed PA=0x00000000.fe8a9d58

Oct 13 16:36:13 myserver E$tag 0x00000000.1ac01fd1 E$State: Exclusive E$parity 0x0d

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x00000000.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x18): 0x00000000.10000000 *Bad* PSYND=0x00ff

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000001.f7000670

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000001.00000000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x38): 0x00000001.f7027ff8

Oct 13 16:36:13 myserver unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000000.fe8a8000

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 341280 kern.info] [AFT3] errID 0x0045d2ab.fddc38ed Above Error detected by protected Kernel code

Oct 13 16:36:13 myserver that will try to clear error from system

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 621182 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU4 Data access at TL=0, errID 0x0045d2b1.ba4942ce

Oct 13 16:36:38 myserver AFSR 0x00000000.00200000<UE> AFAR 0x00000000.fe8a9d58

Oct 13 16:36:38 myserver AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xfb800004

Oct 13 16:36:38 myserver UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03

Oct 13 16:36:38 myserver UDBL Syndrome 0x3 Memory Module Board 2 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 366302 kern.warning] WARNING: [AFT1] errID 0x0045d2b1.ba4942ce Syndrome 0x3 indicates that this may not be a memory module problem

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 356728 kern.info] [AFT2] errID 0x0045d2b1.ba4942ce PA=0x00000000.fe8a9d58

Oct 13 16:36:38 myserver E$tag 0x00000000.0bc01fd1 E$State: Modified E$parity 0x05

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x00000000.00000000

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x18): 0x00000000.10000000 *Bad* PSYND=0x00ff

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000001.f7000670

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000001.00000000

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x38): 0x00000001.f7027ff8

Oct 13 16:36:38 myserver unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000000.fe8a8000

Oct 13 16:36:38 myserver SUNW,UltraSPARC-II: [ID 507105 kern.info] [AFT3] errID 0x0045d2b1.ba4942ce Above Error is in User Mode

Oct 13 16:36:38 myserver and is fatal: will reboot

Oct 13 16:36:38 myserver unix: [ID 855177 kern.warning] WARNING: [AFT1] initiating reboot due to above error in pid 28250 (java)

[7638 byte] By [100mbs] at [2007-11-26 10:47:40]
# 1

These forums are NOT Sun technical support. They are hosted on the Internet so that end-users can discuss general topics and share their experiences.

CPU #8 in your E4500 had a problem with stale data in its ecache. It had difficulty comparing the ECC checksum validity as it wrote the data to ecache (ecache write-parity, a.k.a. E$ WP).

Solaris had problems when it tried to clear that stale data, for whatever reason.

So, when software methods were inadequate to keep the system running without corrupting the data passing through it, Solaris did exactly what it is designed to do in such circumstances: it rebooted the system to clear everything in a brute-force fashion.
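If it helps to read the AFSR lines, here is a minimal Python sketch that maps the raw AFSR value to the flag names shown in angle brackets. The bit positions are inferred only from the three AFSR values in the posted log (WP, PRIV, UE); this is not a complete register layout, and the names `AFSR_FLAGS`, `parse_afsr`, and `decode_afsr` are made up for illustration.

```python
import re

# Assumption: bit positions inferred from the posted log only, e.g.
# 0x00000000.00800008<WP>, 0x00000000.80200000<PRIV,UE>, 0x00000000.00200000<UE>.
AFSR_FLAGS = {
    "PRIV": 1 << 31,
    "WP": 1 << 23,
    "UE": 1 << 21,
}


def parse_afsr(text):
    """Turn a value printed as 0xHHHHHHHH.LLLLLLLL into one integer."""
    m = re.fullmatch(r"0x([0-9a-fA-F]{8})\.([0-9a-fA-F]{8})", text)
    if m is None:
        raise ValueError(f"unexpected AFSR format: {text!r}")
    return int(m.group(1) + m.group(2), 16)


def decode_afsr(text):
    """Return the names of the known flags that are set in the value."""
    value = parse_afsr(text)
    return [name for name, bit in AFSR_FLAGS.items() if value & bit]


for raw in ("0x00000000.00800008", "0x00000000.80200000", "0x00000000.00200000"):
    print(raw, decode_afsr(raw))
```

Run against the three values from the log, this reproduces the flags the kernel already printed, which is only a sanity check on the reading of the log and not a diagnosis.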

I suggest you use your service contract and open a support case with Sun.

Have your case forwarded to the Kernel Support Team.

System crash events are not supported by a Hardware team.

A Kernel Engineer will analyze the system's core files and give proper advice.

You do have core files saved in /var/crash/<hostname>, I hope?

If there are no core files, there is no way to get any answers.
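A quick way to check is to look for matching dump files in that directory. The sketch below is only an illustration: it assumes the traditional savecore naming of `unix.N` plus `vmcore.N`, assumes the directory is `/var/crash/<hostname>` (your dump configuration may point elsewhere), and `find_crash_dumps` is a hypothetical helper name.

```python
import re
import socket
from pathlib import Path


def find_crash_dumps(crash_dir):
    """Return the sorted dump numbers N that have both unix.N and vmcore.N."""
    directory = Path(crash_dir)
    if not directory.is_dir():
        return []
    seen = {}
    for entry in directory.iterdir():
        m = re.fullmatch(r"(unix|vmcore)\.(\d+)", entry.name)
        if m:
            seen.setdefault(int(m.group(2)), set()).add(m.group(1))
    # A dump is only useful for analysis when both files of the pair exist.
    return sorted(n for n, kinds in seen.items() if kinds == {"unix", "vmcore"})


if __name__ == "__main__":
    # Assumption: the crash directory is named after the local hostname.
    crash_dir = Path("/var/crash") / socket.gethostname()
    dumps = find_crash_dumps(crash_dir)
    if dumps:
        print(f"complete dumps in {crash_dir}: {dumps}")
    else:
        print(f"no complete unix.N/vmcore.N pairs found in {crash_dir}")
```

An empty result does not prove that no dump was taken; it may simply have been saved somewhere else.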

=======

Edit:

You posted the same question to another website's forum:

http://www.tek-tips.com/viewthread.cfm?qid=1289869&page=1

The first two responses over there were inaccurate.

Your log excerpts specifically say ...

Oct 13 16:36:03 myserver AFSR.PSYND 0x0008(Score 95) AFSR.ETS 0x00 Fault_PC 0xfb50e9cc

for CPU #8, which specifies a 95% probability of where the event took place, and the logs also show:

Oct 13 16:36:13 myserver SUNW,UltraSPARC-II: [ID 529255 kern.warning] WARNING: [AFT1] errID 0x0045d2ab.fb9dfacf Syndrome 0x3 indicates that this may not be a memory module problem

... and it wasn't a RAM problem; this was an ecache event.
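As a small cross-check of the addresses in those log lines, here is a hedged Python sketch of the arithmetic. It assumes an 8 KB base page and a 64-byte ecache line (eight 8-byte E$Data words, offsets 0x00 through 0x38, as printed in the log); the constant and function names are made up for illustration.

```python
PAGE_SIZE = 8192   # assumption: 8 KB base page
ECACHE_LINE = 64   # eight 8-byte E$Data words, offsets 0x00-0x38 in the log
WORD_SIZE = 8      # each E$Data entry in the log is one 8-byte word


def page_base(pa):
    """Physical address rounded down to the start of its page."""
    return pa & ~(PAGE_SIZE - 1)


def line_word_offset(pa):
    """Offset of the 8-byte word containing pa within its ecache line."""
    return pa & (ECACHE_LINE - 1) & ~(WORD_SIZE - 1)


pa = 0x00000000FE8A9D58  # PA from the [AFT2] lines
print(hex(page_base(pa)))         # 0xfe8a8000
print(hex(line_word_offset(pa)))  # 0x18
```

The page base matches the "Scheduling clearing of error on page 0x00000000.fe8a8000" notice, and the word offset matches the E$Data (0x18) entry marked *Bad*. That only shows the log lines are consistent with each other; it does not establish the root cause by itself.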

Again, go log a support case with Sun.

rukbat at 2007-7-7 2:59:55