system panic

Hello,

The server rebooted itself three days ago, and I found these error messages in the log.

Could anyone please help me find the issue?

Server: 420R, running Solaris 8. The latest patches were installed in July 2005.

$ uname -a

SunOS 5.8 Generic_108528-29 sun4u sparc SUNW,Ultra-80

Error messages

########################################

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 714798 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x00050c47.1b5ccd9e

Nov 18 09:47:03 AFSR 0x00000001<ME>.80200000<PRIV,UE> AFAR 0x00000000.e6ce5b40

Nov 18 09:47:03 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x101511a4

Nov 18 09:47:03 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x02a0<UE> UDBL.ESYND 0xa0

Nov 18 09:47:03 UDBL Syndrome 0xa0 Memory Module U1402 U0402 U1401 U0401

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 838268 kern.info] [AFT2] errID 0x00050c47.1b5ccd9e PA=0x00000000.e6ce5b40

Nov 18 09:47:03 E$tag 0x00000000.18c01cd9 E$State: Exclusive E$parity 0x0c

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0xffffffff.ffffffff

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0xffffffff.ffffffff

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0xffffffff.ffffffff

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000000.00000000

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x38): 0xffffffff.ffffffff

Nov 18 09:47:03 unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000000.e6ce4000

Nov 18 09:47:03 SUNW,UltraSPARC-II: [ID 781342 kern.info] [AFT3] errID 0x00050c47.1b5ccd9e Above Error is due to Kernel access

Nov 18 09:47:03 to User space and is fatal: will reboot

Nov 18 09:47:03 unix: [ID 855177 kern.warning] WARNING: [AFT1] initiating reboot due to above error in pid 18608 (tar)

Nov 18 09:47:05 pseudo: [ID 129642 kern.info] pseudo-device: tod0

Nov 18 09:47:05 genunix: [ID 936769 kern.info] tod0 is /pseudo/tod@0

Nov 18 09:47:05 syslogd: going down on signal 15

Nov 18 09:47:05 xntpd[274]: [ID 866926 daemon.notice] xntpd exiting on signal 15

Nov 18 09:47:07 unix: [ID 221039 kern.notice] NOTICE: Previously reported error on page 0x00000000.e6ce4000 cleared

Nov 18 09:47:37 genunix: [ID 672855 kern.notice] syncing file systems...

Nov 18 09:47:37 genunix: [ID 904073 kern.notice] done

Nov 18 09:49:34 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 Version Generic_108528-29 64-bit

##########################################

The server rebooted while I was running tar. I can't figure out what the issue is. Is it a CPU problem or a memory problem? Since the server came back up, I haven't seen any error messages, and it has been running fine for three days.

But I am afraid this might happen again.

Do you think there is a serious hardware issue?

Appreciate your help.

Thanks,

[3808 byte] By [Vani] at [2007-11-25 22:48:00]
# 1
One or more of these four DIMMs had a problem: U1402, U0402, U1401, U0401. If none of your messages files show Corrected Errors for any of these four DIMMs, then you can only monitor for future events or get that bank replaced.
jds2n at 2007-7-5 17:03:23
# 2

There are a few things that you can do:

1. Check the messages files, or the server's history if you keep one, for past events similar to this. If there are any and they involve the same CPU, etc., you will have to log a support call with Sun.

2. Make sure the server's OBP firmware is up to date.

3. Make sure the OS kernel and other patches are up to date.

4. Make sure that savecore is enabled so that crash dumps are captured, if you are lucky enough to get any.

5. If you have a console that you can connect to the server to log output, it would be worth doing so for a day or so, just in case the server goes down again.

6. If you can afford the downtime, you can run the server through diagnostics and SunVTS or similar stress testing.

7. There is also a cediag tool, freely downloadable, that can help with memory error decoding should you get more of these errors in the future. It is worth a look if you have not seen it before.

But apart from that, unless number 1 above shows more than one event on this server on the same CPU, etc., I would monitor the server for a while as best practice.
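
For the check in number 1, a rough way to tally past events is sketched below in Python. It is only an illustration: the function name `summarize_memory_errors` is made up for this example, and the two patterns are modelled purely on the log excerpt posted above, so other kernel or patch levels may word these messages differently. It does not replace cediag or a Sun support call.

```python
import re
from collections import Counter

# Patterns modelled on the excerpt in this thread (assumption: other
# kernel revisions may format these messages differently).
EVENT_RE = re.compile(r"\[AFT\d\]\s+(\w+) Memory Error on CPU(\d+)")
MODULE_RE = re.compile(r"Memory Module((?:\s+U\d+)+)")


def summarize_memory_errors(lines):
    """Count memory error events per (kind, CPU) and per DIMM label.

    `lines` is any iterable of log lines, for example an open
    messages file. Returns two Counter objects.
    """
    events = Counter()
    dimms = Counter()
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            # e.g. ("Uncorrectable", 2) for the event posted above
            events[(event.group(1), int(event.group(2)))] += 1
        module = MODULE_RE.search(line)
        if module:
            # A single line can name a whole bank of DIMMs
            dimms.update(module.group(1).split())
    return events, dimms
```

You could run it on a copy of each messages file, for example `summarize_memory_errors(open("messages.0"))`, where the file name is just a placeholder. If the same CPU or the same DIMM labels turn up more than once across files, that is the repeat pattern worth raising with Sun.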

stumoor at 2007-7-5 17:03:23