Ultra AXmp - Watchdog Reset

Hi, everyone.

I've got a confounding problem that I am utterly helpless in troubleshooting. I'm not really sure where to begin, so I'll just try to spit everything out and hope something sticks.

The project I work on fields pairs of Ultra AXmp servers with dual UltraSPARC IIi 440 CPUs, running Solaris 8. On our first pair, we started to notice "Watchdog Resets" dropping "Server A" down to the "ok" prompt and hanging, at what seemed to be random and widely spaced intervals. "Server B" continued running flawlessly. I did some research and decided (read: guessed) that a bad RAM chip was to blame. The machines were loaded with 8 chips each, and I replaced all of them with the chips from a surplus Ultra that I knew was behaving properly. Lo and behold, a few days later, Server A was Watchdog'd again. So much for my diagnosis.

Because the problem was so intermittent, we were able to ignore it for a while. We've recently stood up our second equipment set, and after replacing Server A wholesale, I installed Solaris 8, patted myself on the back and walked away. It was a job well done until, that's right, the Watchdog bit again.

I've taken down a LOT of diagnostic information, but I don't know hide nor hair about any of it. The first Watchdog on Server A #2 left this on the screen:

<div class="pre"><pre>

panic

Watchdog Reset

Data Access Error

{ 1 } ok

</pre></div>

Not very helpful, I know. So I looked into the messages, and found some lines like this:

<div class="pre"><pre>

Uncorrectable Memory Error on CPU1

Data Access at TL=0

Fault_PC

UDBL Syndrome 0x3 Memory Module

Syndrome 0x3 indicates that this may not be a memory module problem

</pre></div>

I have absolutely no idea what that means. "There is a memory module problem - or is there?"

I have way more information than just that, but like I said, I'm helpless. If it wasn't already obvious, I'm not a pro at this, but I'm trying...

Thanks for any help or direction or links or anything at all.

-Tim Nolan

[2495 byte] By [tnolan] at [2007-11-25 22:37:54]
# 1
Oh, and if this is in the wrong forum, then I'd gladly trade a sincere apology for directions toward the correct one. Thanks, -Tim Nolan
tnolan at 2007-7-5 14:07:09 > top of Java-index,Sun Hardware,Other Sun Hardware...
# 2
Please post the entire /var/adm/messages output for the event: from the last couple of known "good" messages up until it starts rebooting.
jds2n at 2007-7-5 14:07:09
# 3

Okay, here are some snippets from <div class="pre"><pre>messages</pre></div>. In this first section, something bad happens, the kernel panics, and manages to reboot itself. This isn't the <i>worst</i> thing in the world, but it's definitely not something awesome that I want to keep happening:

<div class="pre"><pre>

Feb 6 18:45:33 SRV0003G ntpdate[531]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 6 18:50:37 SRV0003G last message repeated 1 time

Feb 6 18:55:41 SRV0003G ntpdate[531]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 6 19:00:45 SRV0003G last message repeated 1 time

Feb 6 19:05:49 SRV0003G ntpdate[531]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 6 19:10:53 SRV0003G last message repeated 1 time

Feb 6 19:15:57 SRV0003G ntpdate[531]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 6 19:21:01 SRV0003G last message repeated 1 time

Feb 6 19:26:05 SRV0003G ntpdate[531]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 6 19:31:37 SRV0003G unix: [ID 350512 kern.notice] panic: failed to stop cpu2

Feb 6 19:31:38 SRV0003G unix: [ID 836849 kern.notice]

Feb 6 19:31:38 SRV0003G ^Mpanic[cpu1]/thread=2a10007dd20:

Feb 6 19:31:38 SRV0003G unix: [ID 862289 kern.notice] send mondo timeout (target 0x2) [728812 NACK 0 BUSY]

Feb 6 19:31:38 SRV0003G unix: [ID 100000 kern.notice]

Feb 6 19:31:38 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d640 SUNW,UltraSPARC-II:send_one_mondo+b8 (1041b2e0, 1, 1041ad20, 2, 1fd8d7244179, b1eec)

Feb 6 19:31:38 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 00001fd8d7243fe5 0000000000000000 00001fd8d7244215 0000000010009cc8

Feb 6 19:31:38 SRV0003G%l4-7: 00000000fdd09c78 000002a1005ddaf0 0000000000000000 000002a10001f9c0

Feb 6 19:31:38 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d6f0 SUNW,UltraSPARC-II:send_mondo_set+20 (4, 1, 4, 2, 10072824, 60)

Feb 6 19:31:38 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010041c50 0000000000000000 0000000000000000 000002a1000abd20

Feb 6 19:31:38 SRV0003G%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

Feb 6 19:31:38 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d7a0 unix:xt_all+d0 (4, 2, 1041d4bc, 4, 0, 80b)

Feb 6 19:31:38 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010142ae4 0000000000000001 0000000000000001 000000001041b6b8

Feb 6 19:31:38 SRV0003G%l4-7: 000000001041b2f8 0000000000000016 000000001041bab8 000002a10007d7b0

Feb 6 19:31:38 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d850 SUNW,UltraSPARC-II:do_scrub_ecache_line+28 (0, 2a10007dd20, 20, 10, 0, 30000161de0)

Feb 6 19:31:38 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000009900001605 0000000000000016 0000000000000009 0000000010009cc8

Feb 6 19:31:38 SRV0003G%l4-7: 000000001041b2f8 0000000000000016 0000000000000001 000002a10007d7b0

Feb 6 19:31:38 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d900 genunix:callout_execute+90 (bffffffff7e8d350, 1, 30000200038, 772232, 300001ff038, 1)

Feb 6 19:31:39 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010141edc 8000000000000000 000002a10006bd20 00000300002001c8

Feb 6 19:31:39 SRV0003G%l4-7: 0000000000772232 00000300001ff000 000003000157d000 0000000000000000

Feb 6 19:31:39 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d9b0 genunix:softint+70 (104421e0, 30000063928, 104421f0, 104421d8, 10072824, 300001ff000)

Feb 6 19:31:39 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 00000000104434a0 0000000000772228 0000000000000000 00000300001617cc

Feb 6 19:31:39 SRV0003G%l4-7: 00000300000638c8 0000030000a3fea8 0000000000000000 0000030000a3fed0

Feb 6 19:31:39 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007da60 unix:softlevel1+8 (0, 800, 1041b2f8, 2a10007dd20, 10000, 1000e32c)

Feb 6 19:31:39 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000000000000 0000000000000001 0000000000000000 00000300001617c8

Feb 6 19:31:39 SRV0003G%l4-7: 00000300000638c8 0000030000a3fea8 0000000000000000 0000030000a3fed0

Feb 6 19:31:39 SRV0003G unix: [ID 100000 kern.notice]

Feb 6 19:31:39 SRV0003G genunix: [ID 672855 kern.notice] syncing file systems...

Feb 6 19:31:39 SRV0003G genunix: [ID 904073 kern.notice] done

Feb 6 19:31:40 SRV0003G genunix: [ID 353387 kern.notice] dumping to /dev/dsk/c0t5d0s1, offset 1678245888

Feb 6 19:31:50 SRV0003G genunix: [ID 409368 kern.notice] ^M100% done: 16714 pages dumped, compression ratio 6.21,

Feb 6 19:31:50 SRV0003G genunix: [ID 851671 kern.notice] dump succeeded

Feb 6 19:33:26 SRV0003G genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 Version Generic_108528-13 64-bit

Feb 6 19:33:26 SRV0003G genunix: [ID 913631 kern.notice] Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved.

</pre></div>

This one is a little different, as you can see by the <div class="pre"><pre>Bad Trap</pre></div> message. This is something that showed up on the Watchdog screen a little later, so I think this is important:

<div class="pre"><pre>

Feb 7 04:39:07 SRV0003G ntpdate[531]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 04:44:11 SRV0003G last message repeated 1 time

Feb 7 04:49:15 SRV0003G ntpdate[531]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 04:59:23 SRV0003G last message repeated 2 times

Feb 7 05:04:27 SRV0003G ntpdate[531]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 05:09:31 SRV0003G last message repeated 1 time

Feb 7 05:14:35 SRV0003G ntpdate[531]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 05:15:39 SRV0003G unix: [ID 836849 kern.notice]

Feb 7 05:15:39 SRV0003G ^Mpanic[cpu2]/thread=2a10006bd20:

Feb 7 05:15:39 SRV0003G unix: [ID 340138 kern.notice] BAD TRAP: type=31 rp=2a10006b8f0 addr=1000008 mmu_fsr=0 occurred in module "genunix" due to an illegal access to a user address

Feb 7 05:15:39 SRV0003G unix: [ID 100000 kern.notice]

Feb 7 05:15:39 SRV0003G unix: [ID 839527 kern.notice] sched:

Feb 7 05:15:39 SRV0003G unix: [ID 520581 kern.notice] trap type = 0x31

Feb 7 05:15:39 SRV0003G unix: [ID 381800 kern.notice] addr=0x1000008

Feb 7 05:15:39 SRV0003G unix: [ID 101969 kern.notice] pid=0, pc=0x100728d8, sp=0x2a10006b191, tstate=0x44f0001605, context=0x0

Feb 7 05:15:39 SRV0003G unix: [ID 743441 kern.notice] g1-g7: 10439000, 60, 3fffffffff622ed9, 0, 0, 0, 2a10006bd20

Feb 7 05:15:39 SRV0003G unix: [ID 100000 kern.notice]

Feb 7 05:15:39 SRV0003G genunix: [ID 723222 kern.notice] 000002a10006b510 unix:die+80 (31, 1000008, 10414f98, 0, 2a10006b8f0, d0726008)

Feb 7 05:15:39 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000000000008 0000000000000041 0000000010424918 0000000000400000

Feb 7 05:15:39 SRV0003G%l4-7: 00000000000001f4 0000000000000041 000000000016a640 0000000000000000

Feb 7 05:15:40 SRV0003G genunix: [ID 723222 kern.notice] 000002a10006b5f0 unix:trap+8b8 (1000000, 1, 5, 0, 2a10006b8f0, 0)

Feb 7 05:15:40 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000000000001 00000000780b25a0 0000000010423a00 0000000000000000

Feb 7 05:15:40 SRV0003G%l4-7: 0000000000000031 0000000000000000 0000000000010000 0000000000000000

Feb 7 05:15:40 SRV0003G genunix: [ID 723222 kern.notice] 000002a10006b730 unix:sfmmu_tsb_miss+640 (104286e0, 0, 30000059f88, 0, 30000059f88, 19)

Feb 7 05:15:40 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000001000000 0000000000000004 0000000000000009 000003100003f180

Feb 7 05:15:40 SRV0003G%l4-7: 0000000001000000 0000000000000000 0000000000000000 0000000001000003

Feb 7 05:15:40 SRV0003G genunix: [ID 723222 kern.notice] 000002a10006b840 unix:prom_rtt+0 (0, 1000000, 19, 0, 16, 0)

Feb 7 05:15:40 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000000000006 0000000000001400 00000044f0001605 0000000010018e3c

Feb 7 05:15:40 SRV0003G%l4-7: 000000004004667f 0000000000000000 0000000000000000 000002a10006b8f0

Feb 7 05:15:40 SRV0003G genunix: [ID 723222 kern.notice] 000002a10006b990 genunix:callout_execute+98 (bfffffffff622f39, 1, 300001f7038, 354e47, 300001f6038, 0)

Feb 7 05:15:40 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 000000001011fd34 8000000000000000 0000000000000009 00000300001f7270

Feb 7 05:15:40 SRV0003G%l4-7: 0000000000354e47 00000300001f6000 0000030003796978 000002a10006b9a0

Feb 7 05:15:40 SRV0003G genunix: [ID 723222 kern.notice] 000002a10006ba40 genunix:taskq_thread+18c (30000a47e08, 0, 10423a00, 10000, 30000a47e3a, 30000a47e60)

Feb 7 05:15:40 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010072824 0000030000a47e38 0000030000a47e30 0000030000a47e08

Feb 7 05:15:40 SRV0003G%l4-7: 0000030000a47e28 0000030000a45ec8 000000001041bc00 000000001041a800

Feb 7 05:15:40 SRV0003G unix: [ID 100000 kern.notice]

Feb 7 05:15:40 SRV0003G genunix: [ID 672855 kern.notice] syncing file systems...

Feb 7 05:15:40 SRV0003G genunix: [ID 904073 kern.notice] done

Feb 7 05:15:41 SRV0003G genunix: [ID 353387 kern.notice] dumping to /dev/dsk/c0t5d0s1, offset 1678245888

Feb 7 05:15:51 SRV0003G genunix: [ID 409368 kern.notice] ^M100% done: 16714 pages dumped, compression ratio 6.26,

Feb 7 05:15:51 SRV0003G genunix: [ID 851671 kern.notice] dump succeeded

Feb 7 05:17:26 SRV0003G genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 Version Generic_108528-13 64-bit

Feb 7 05:17:26 SRV0003G genunix: [ID 913631 kern.notice] Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved.

</pre></div>

Here's another one:

<div class="pre"><pre>

Feb 7 10:55:37 SRV0003G ntpdate[534]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 11:00:41 SRV0003G last message repeated 1 time

Feb 7 11:05:45 SRV0003G ntpdate[534]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 11:10:49 SRV0003G last message repeated 1 time

Feb 7 11:15:53 SRV0003G ntpdate[534]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 11:20:57 SRV0003G last message repeated 1 time

Feb 7 11:26:01 SRV0003G ntpdate[534]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 11:31:05 SRV0003G last message repeated 1 time

Feb 7 11:32:25 SRV0003G unix: [ID 926934 kern.warning] WARNING: invalid vector intr: number 0x80b, pil 0x0

Feb 7 11:32:30 SRV0003G unix: [ID 350512 kern.notice] panic: failed to stop cpu2

Feb 7 11:32:30 SRV0003G unix: [ID 836849 kern.notice]

Feb 7 11:32:30 SRV0003G ^Mpanic[cpu1]/thread=2a10007dd20:

Feb 7 11:32:30 SRV0003G unix: [ID 862289 kern.notice] send mondo timeout (target 0x2) [728678 NACK 0 BUSY]

Feb 7 11:32:30 SRV0003G unix: [ID 100000 kern.notice]

Feb 7 11:32:30 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d640 SUNW,UltraSPARC-II:send_one_mondo+b8 (1041b2e0, 1, 1041ad20, 2, 934682adb9d, b1e66)

Feb 7 11:32:30 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 00000934682ada97 0000000000000000 00000934682adcc4 000002a10009dd20

Feb 7 11:32:30 SRV0003G%l4-7: 0000000000000000 0000000000000000 0000000000000000 000002a10001f9c0

Feb 7 11:32:30 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d6f0 SUNW,UltraSPARC-II:send_mondo_set+20 (4, 1, 4, 2, 10072824, 60)

Feb 7 11:32:30 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010041c50 0000000000000000 0000000000000000 000002a1000abd20

Feb 7 11:32:30 SRV0003G%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

Feb 7 11:32:30 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d7a0 unix:xt_all+d0 (4, 2, 1041d4bc, 4, 0, 80b)

Feb 7 11:32:30 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010142ae4 0000000000000001 0000000000000001 000000001041b6d8

Feb 7 11:32:30 SRV0003G%l4-7: 000000001041b2f8 0000000000000016 000000001041bab8 0000000000000540

Feb 7 11:32:31 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d850 SUNW,UltraSPARC-II:do_scrub_ecache_line+28 (0, 2a10007dd20, 20, 10, 0, 30000161de0)

Feb 7 11:32:31 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010151e5c 0000000000000002 0000000000000075 0000000000000000

Feb 7 11:32:31 SRV0003G%l4-7: 000000001041b2f8 0000000000000016 000000001041bab8 000002a10007d7b0

Feb 7 11:32:31 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d900 genunix:callout_execute+90 (bffffffffdaaef10, 1, 30000200038, 2254f3, 300001ff038, 1)

Feb 7 11:32:31 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010141edc 8000000000000000 000002a10015dd20 00000300002007d0

Feb 7 11:32:31 SRV0003G%l4-7: 00000000002254f3 00000300001ff000 00000300044514f0 000000001045bfe8

Feb 7 11:32:31 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d9b0 genunix:softint+70 (104421e0, 30000063928, 104421f0, 104421d8, 10072824, 300001ff000)

Feb 7 11:32:31 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010443488 00000000002254e9 0000000000000000 00000300001617c0

Feb 7 11:32:31 SRV0003G%l4-7: 00000300000638c8 0000030000a3fea8 0000000000000000 0000030000a3fed0

Feb 7 11:32:31 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007da60 unix:softlevel1+8 (0, 800, 1041b2f8, 2a10007dd20, 10000, 1000e32c)

Feb 7 11:32:31 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000000000000 0000000000000001 0000030004598cd0 0000000000000041

Feb 7 11:32:31 SRV0003G%l4-7: 000003000459c9e8 000003000459c968 0000000000000000 0000000000000000

Feb 7 11:32:31 SRV0003G unix: [ID 100000 kern.notice]

Feb 7 11:32:31 SRV0003G genunix: [ID 672855 kern.notice] syncing file systems...

Feb 7 11:32:32 SRV0003G genunix: [ID 904073 kern.notice] done

Feb 7 11:32:33 SRV0003G genunix: [ID 353387 kern.notice] dumping to /dev/dsk/c0t5d0s1, offset 1678245888

Feb 7 11:32:43 SRV0003G genunix: [ID 409368 kern.notice] ^M100% done: 16703 pages dumped, compression ratio 6.32,

Feb 7 11:32:43 SRV0003G genunix: [ID 851671 kern.notice] dump succeeded

Feb 7 11:34:18 SRV0003G genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 Version Generic_108528-13 64-bit

Feb 7 11:34:18 SRV0003G genunix: [ID 913631 kern.notice] Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved.

</pre></div>

Invalid vector? Mondo timeout? That's bad and weird and everything, but it just gets worse:

<div class="pre"><pre>

Feb 7 12:23:40 SRV0003G ntpdate[533]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 12:33:48 SRV0003G last message repeated 2 times

Feb 7 12:38:52 SRV0003G ntpdate[533]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 12:43:56 SRV0003G last message repeated 1 time

Feb 7 12:49:00 SRV0003G ntpdate[533]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 12:54:04 SRV0003G last message repeated 1 time

Feb 7 12:59:08 SRV0003G ntpdate[533]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 581683 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Data access at TL=0, errID 0x0000050d.87244823

Feb 7 13:05:21 SRV0003GAFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.ae568948

Feb 7 13:05:21 SRV0003GAFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x100729d8

Feb 7 13:05:21 SRV0003GUDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03

Feb 7 13:05:21 SRV0003GUDBL Syndrome 0x3 Memory Module U701-U704,U801-U804

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 120076 kern.warning] WARNING: [AFT1] errID 0x0000050d.87244823 Syndrome 0x3 indicates that this may not be a memory module problem

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 769514 kern.info] [AFT2] errID 0x0000050d.87244823 PA=0x00000000.ae568948

Feb 7 13:05:21 SRV0003GE$tag 0x00000000.0a4015ca E$State: Shared E$parity 0x05

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00085640

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x08): 0x00000000.1111fd34 *Bad* PSYND=0x00ff

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x000002a1.00065d20

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000011.00000040

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000300.017a95e0

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x38): 0x00000300.0186acf0

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 902851 kern.warning] WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU2 (caused Data access error on CPU1), errID 0x0000050d.87244823

Feb 7 13:05:21 SRV0003GAFSR 0x00000000.01000008<CP> AFAR 0x00000000.ae568948

Feb 7 13:05:21 SRV0003GAFSR.PSYND 0x0008(Score 95) AFSR.ETS 0x00

Feb 7 13:05:21 SRV0003GUDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 769514 kern.info] [AFT2] errID 0x0000050d.87244823 PA=0x00000000.ae568948

Feb 7 13:05:21 SRV0003GE$tag 0x00000000.1b4015ca E$State: Owner E$parity 0x0d

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00085640

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x08): 0x00000000.1111fd34 *Bad* PSYND=0x0008

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x000002a1.00065d20

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000011.00000040

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000300.017a95e0

Feb 7 13:05:21 SRV0003G SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x38): 0x00000300.0186acf0

Feb 7 13:05:21 SRV0003G unix: [ID 836849 kern.notice]

Feb 7 13:05:21 SRV0003G ^Mpanic[cpu1]/thread=2a10007dd20:

Feb 7 13:05:21 SRV0003G unix: [ID 713646 kern.notice] [AFT1] errID 0x0000050d.87244823 UE Error(s)

Feb 7 13:05:21 SRV0003GSee previous message(s) for details

Feb 7 13:05:21 SRV0003G unix: [ID 100000 kern.notice]

Feb 7 13:05:21 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d200 SUNW,UltraSPARC-II:cpu_aflt_log+4e0 (2a10007d2be, 1, 10147b60, 2a10007d448, 2a10007d30b, 10147b88)

Feb 7 13:05:21 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000000000000 000002a10007d510 0000000000000003 0000000000000010

Feb 7 13:05:21 SRV0003G%l4-7: 0000030004566ae0 0000030004566c48 0000000000000000 000002a10001f9c0

Feb 7 13:05:21 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d450 SUNW,UltraSPARC-II:cpu_async_error+868 (104597b0, 2a10007d510, 80200000, 0, 650180080200000, 2a10007d6d0)

Feb 7 13:05:22 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 000000001040dae4 0000000000000032 0000000000000203 0000000000000000

Feb 7 13:05:22 SRV0003G%l4-7: 00000000ae568940 0000000000400000 0000000000400000 0000000000000001

Feb 7 13:05:22 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d620 unix:prom_rtt+0 (85640, 300001f7038, 8000000000000000, 85640, 30004524918, 10141edc)

Feb 7 13:05:22 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000000000000 0000000000001400 0000000000001607 000000001013f814

Feb 7 13:05:22 SRV0003G%l4-7: 000000001041b2f8 0000000000000016 000000000000000a 000002a10007d6d0

Feb 7 13:05:22 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d770 genunix:callout_schedule_1+4 (300001f6000, 300001f6000, 20, 10072824, 0, 30000161de0)

Feb 7 13:05:22 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000000000008 0000000000000002 0000000000000001 000000001041b6c8

Feb 7 13:05:22 SRV0003G%l4-7: 000000001041b2f8 0000000000000016 000000001041bab8 0000000000000000

Feb 7 13:05:22 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d820 genunix:callout_schedule+54 (10439194, 1, 10439110, 8, 2, 30000161de0)

Feb 7 13:05:22 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000010141edc 8000000000000000 000002a10000fd20 0000030000200230

Feb 7 13:05:22 SRV0003G%l4-7: 0000000000085640 00000300001ff000 0000000000000000 000002a100457ba0

Feb 7 13:05:22 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d8d0 genunix:clock+474 (1045a400, 1041b2f8, 1042dc00, 24374214417, 0, 0)

Feb 7 13:05:22 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000000000000000 0000000000000001 000002a10000fd20 0000000000000000

Feb 7 13:05:22 SRV0003G%l4-7: 0000000010459c00 000000003b9aca00 000000001041bab8 0000030000a3fed0

Feb 7 13:05:22 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007d9a0 genunix:cyclic_softint+a4 (1041b2f8, 30000063928, 1, 3, 300001617c0, 10073a3c)

Feb 7 13:05:23 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 0000030000063930 0000000000085635 0000000000000000 00000300001617c4

Feb 7 13:05:23 SRV0003G%l4-7: 00000300000638c8 0000030000a3fea8 0000000010031fdc 0000030000a3fed0

Feb 7 13:05:23 SRV0003G genunix: [ID 723222 kern.notice] 000002a10007da60 unix:cbe_level10+8 (0, 803, 1041b2f8, 2a10007dd20, 10060, 1000b34c)

Feb 7 13:05:23 SRV0003G genunix: [ID 179002 kern.notice]%l0-3: 00000044f0001602 0000000000000001 0000000000000001 0000000010009cc8

Feb 7 13:05:23 SRV0003G%l4-7: 000000000019f5e0 0000000000000000 0000000000000000 000002a10001f910

Feb 7 13:05:23 SRV0003G unix: [ID 100000 kern.notice]

Feb 7 13:05:23 SRV0003G genunix: [ID 672855 kern.notice] syncing file systems...

Feb 7 13:05:23 SRV0003G genunix: [ID 904073 kern.notice] done

Feb 7 13:05:24 SRV0003G genunix: [ID 353387 kern.notice] dumping to /dev/dsk/c0t5d0s1, offset 1678245888

Feb 7 13:05:46 SRV0003G genunix: [ID 409368 kern.notice] ^M100% done: 16727 pages dumped, compression ratio 6.31,

Feb 7 13:05:46 SRV0003G genunix: [ID 851671 kern.notice] dump succeeded

Feb 7 13:07:22 SRV0003G genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 Version Generic_108528-13 64-bit

Feb 7 13:07:22 SRV0003G genunix: [ID 913631 kern.notice] Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved.

</pre></div>
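
For anyone trying to read the AFSR values in the block above, here is a minimal Python sketch that splits them into flag names and the PSYND field. The bit positions are assumptions inferred only from the two AFSR lines in this log (`0x80200000<PRIV,UE>` and `0x01000008<CP>` with `AFSR.PSYND 0x0008`), so this is not a complete register map, and the function names are made up for illustration.

```python
# Bit positions inferred ONLY from the two AFSR values and their decoded
# flag names in the log above; this is not a full UltraSPARC-II register map.
AFSR_FLAGS = {31: "PRIV", 24: "CP", 21: "UE"}
PSYND_MASK = 0xFFFF  # low 16 bits, matching "AFSR.PSYND 0x0008" in the log


def parse_afsr(text):
    """Turn a logged value such as '0x00000000.80200000' into an int."""
    return int(text.replace(".", ""), 16)


def decode_afsr(value):
    """Return (list of flag names that are set, PSYND field)."""
    flags = [
        name
        for bit, name in sorted(AFSR_FLAGS.items(), reverse=True)
        if (value >> bit) & 1
    ]
    return flags, value & PSYND_MASK


if __name__ == "__main__":
    for logged in ("0x00000000.80200000", "0x00000000.01000008"):
        flags, psynd = decode_afsr(parse_afsr(logged))
        print(logged, flags, hex(psynd))
```

The first value decodes to the error seen by CPU1 (PRIV, UE) and the second to the CP event the log attributes to CPU2, which agrees with the flag names the kernel printed.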

And here's another one:

<div class="pre"><pre>

Feb 8 16:00:09 SRV0003G ntpdate[533]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 8 16:05:13 SRV0003G last message repeated 1 time

Feb 8 16:10:17 SRV0003G ntpdate[533]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 8 16:20:25 SRV0003G last message repeated 2 times

Feb 8 16:25:29 SRV0003G ntpdate[533]: [ID 398266 daemon.notice] waiting 300 seconds before trying again

Feb 8 16:28:45 SRV0003G unix: [ID 836849 kern.notice]

Feb 8 16:28:45 SRV0003G ^Mpanic[cpu2]/thread=300015f1620:

Feb 8 16:28:45 SRV0003G unix: [ID 340138 kern.notice] BAD TRAP: type=31 rp=2a1000eb5c0 addr=1000198 mmu_fsr=0 occurred in module "genunix" due to an illegal access to a user address

Feb 8 16:28:45 SRV0003G unix: [ID 100000 kern.notice]

Feb 8 16:28:45 SRV0003G unix: [ID 839527 kern.notice] fsflush:

Feb 8 16:28:45 SRV0003G unix: [ID 836849 kern.notice]

Feb 8 16:28:45 SRV0003G ^Mpanic[cpu2]/thread=300015f1620:

Feb 8 16:28:45 SRV0003G unix: [ID 836849 kern.notice]

Feb 8 16:28:45 SRV0003G ^Mpanic[cpu2]/thread=300015f1620:

Feb 8 16:28:45 SRV0003G unix: [ID 836849 kern.notice]

Feb 8 16:28:45 SRV0003G ^Mpanic[cpu2]/thread=300015f1620:

Feb 8 16:28:45 SRV0003G unix: [ID 799565 kern.notice] BAD TRAP: type=31 rp=10420b20 addr=f10298ec mmu_fsr=0

Feb 8 16:28:45 SRV0003G unix: [ID 100000 kern.notice]

Feb 8 16:28:45 SRV0003G genunix: [ID 672855 kern.notice] syncing file systems...

Feb 8 16:28:45 SRV0003G genunix: [ID 904073 kern.notice] done

Feb 8 16:28:46 SRV0003G genunix: [ID 353387 kern.notice] dumping to /dev/dsk/c0t5d0s1, offset 1678245888

Feb 8 16:28:49 SRV0003G genunix: [ID 596671 kern.warning] WARNING: invalid segment 1041f5d0 in address space 10423910

Feb 8 16:28:52 SRV0003G genunix: [ID 409368 kern.notice] ^M100% done: 6786 pages dumped, compression ratio 6.83,

Feb 8 16:28:52 SRV0003G genunix: [ID 851671 kern.notice] dump succeeded

Feb 8 16:30:26 SRV0003G genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 Version Generic_108528-13 64-bit

Feb 8 16:30:26 SRV0003G genunix: [ID 913631 kern.notice] Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved.

</pre></div>

There are more, but this is already a flood of information that I'm sure no one wants to parse. I cannot find any messages from the last crash, when <div class="pre"><pre>dump</pre></div> failed and the machine refused to reboot, but I did copy everything down by hand and can post that as well.

Thanks so very much to anyone who's bothering to read these posts!

-Tim Nolan

tnolan at 2007-7-5 14:07:09
# 4

Here is what was on the screen after the fatal Watchdog:

<div class="pre"><pre>

panic [cpu2]/thread = 300018944480: BAD TRAP: type=31 rp=1041e1b0 addr=26765742025 mmu_fsr=0

dump aborted: please record the above information

panic [cpu2]/thread = 300018944480: BAD TRAP: type=31 rp=1041da70 addr=100000000 mmu_fsr=0

dump aborted: please record the above information

Watchdog Reset

Externally Initiated Reset

{ 2 } ok

</pre></div>

I know that this is odd because it says {2} at the "ok" instead of {1}. From what I can tell, the number in the squigglies is sort of a depth indicator, so this prompt would be inside the main "ok." So why did this drop down to a new "ok" prompt instead of killing everything and kicking back out to the original?

tnolan at 2007-7-5 14:07:09
# 5

If it were me, I would seriously consider getting CPU2 out of there.

"SunOS Release 5.8 Version Generic_108528-13" indicates you're about 40 revisions behind, which is about current with 3 years ago. If it were just a patch issue, then CPU1 should have had a similar share of "events," which it hasn't.

Feb 6 19:31:37

panic: failed to stop cpu2

Feb 7 11:32:30

panic: failed to stop cpu2

Feb 7 13:05:21

CP event on CPU2 (caused Data access error on CPU1)

(Score 95)
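
The same tally can be made mechanically. Below is a rough Python sketch, not anything from Sun, that counts which CPU reported each panic and which CPU the messages point at. The three patterns are taken from lines quoted in this thread; the function name and the sample list are only for illustration, and in practice you would feed it the lines of /var/adm/messages.

```python
import re
from collections import Counter

# Patterns copied from message formats that appear in this thread.
PANIC_RE = re.compile(r"panic\[cpu(\d+)\]")         # CPU that reported the panic
STOP_RE = re.compile(r"failed to stop cpu(\d+)")    # CPU that would not stop
CP_RE = re.compile(r"CP event on CPU(\d+)")         # CPU blamed for a CP event


def tally(lines):
    """Return (panics reported per CPU, times each CPU was implicated)."""
    reporting = Counter()
    implicated = Counter()
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            reporting[int(match.group(1))] += 1
        for pattern in (STOP_RE, CP_RE):
            match = pattern.search(line)
            if match:
                implicated[int(match.group(1))] += 1
    return reporting, implicated


# A few lines quoted from the logs above, used here as sample input.
SAMPLE = [
    "Feb 6 19:31:37 SRV0003G unix: [ID 350512 kern.notice] panic: failed to stop cpu2",
    "Feb 6 19:31:38 SRV0003G ^Mpanic[cpu1]/thread=2a10007dd20:",
    "Feb 7 05:15:39 SRV0003G ^Mpanic[cpu2]/thread=2a10006bd20:",
    "WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU2 (caused Data access error on CPU1)",
]

if __name__ == "__main__":
    reporting, implicated = tally(SAMPLE)
    print("reported by:", dict(reporting))
    print("implicated:", dict(implicated))
```

Note that the CPU printing the panic is not always the one at fault: in the "failed to stop cpu2" cases, CPU1 panics because CPU2 did not respond.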

jds2n at 2007-7-5 14:07:09
# 6
It is not a "depth" indicator. It indicates which cpu is currently the "boot" cpu in a multi processor system. 2 more panics from cpu2....
jds2n at 2007-7-5 14:07:09
# 7

I would love to blame the faults on hardware; however, these machines are pretty locked down right now, and making non-trivial hardware changes will require cutting through a swath of red tape. Do you know if there's any way to put these machines into a frenzied state such that a panic is more likely? I've been playing the Waiting Game for a few days now with no more events, after a few days with several, so it could be hard to prove that removing CPU2 actually solved the problem rather than just happened to occur during a "healthy" stretch.

Oh, I've also heard that a lot of the UltraSPARC IIi chips were/are bogus. Is that just a rumor, or is there a chance I've gotten some (i.e., more than one) bums?
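
On the "healthy stretch" worry, a back-of-the-envelope estimate is possible from the five panic timestamps posted earlier. The Python sketch below assumes panics arrive at a constant average rate (a Poisson model), which is a simplification and not something the logs establish; the year is also an assumption, since syslog lines carry none. It computes how long a quiet run would have to last before luck alone becomes an unlikely explanation.

```python
import math
from datetime import datetime

ASSUMED_YEAR = 2007  # hypothetical: the syslog timestamps do not include a year

# Panic times copied from the /var/adm/messages excerpts posted above.
PANIC_TIMES = [
    "Feb 6 19:31:37",
    "Feb 7 05:15:39",
    "Feb 7 11:32:30",
    "Feb 7 13:05:21",
    "Feb 8 16:28:45",
]


def panic_rate_per_day(stamps, year=ASSUMED_YEAR):
    """Average panics per day between the first and last timestamp."""
    times = sorted(
        datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        for stamp in stamps
    )
    span_days = (times[-1] - times[0]).total_seconds() / 86400.0
    return (len(times) - 1) / span_days


def quiet_days_needed(rate_per_day, max_chance=0.05):
    """Days without a panic so that P(no panic by luck) = exp(-rate * t) <= max_chance."""
    return -math.log(max_chance) / rate_per_day


rate = panic_rate_per_day(PANIC_TIMES)
quiet = quiet_days_needed(rate)

if __name__ == "__main__":
    print(f"{rate:.2f} panics/day, {quiet:.2f} quiet days needed")
```

Since the panics here clearly come in bursts followed by quiet days, the constant-rate assumption is optimistic, and a real test would want a considerably longer quiet run than this figure.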

tnolan at 2007-7-5 14:07:09
# 8

"Frenzied state"? SunVTS, perhaps. I don't know if there is a version for the Ultra AXmp. It will bring a production system to its knees, so be careful. While you are in a healthy stretch, you may want to get a core file analyzed by opening a ticket with Sun. There should be a few in /var/crash/SRV0003G.

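Before opening a ticket, it helps to know which dumps are complete. A small Python sketch follows, assuming the usual savecore naming of `unix.N` and `vmcore.N` for each saved dump; the function name is made up, and it can be run on any machine that has a copy of the crash directory.

```python
import re
from pathlib import Path

# savecore normally writes a unix.N / vmcore.N pair for dump number N.
DUMP_RE = re.compile(r"^(unix|vmcore)\.(\d+)$")


def list_dump_pairs(crash_dir):
    """Return [(dump number, sorted list of the files present)], ordered by number."""
    found = {}
    for entry in Path(crash_dir).iterdir():
        match = DUMP_RE.match(entry.name)
        if match:
            found.setdefault(int(match.group(2)), set()).add(match.group(1))
    return [(number, sorted(kinds)) for number, kinds in sorted(found.items())]


if __name__ == "__main__":
    import sys

    # Pass the crash directory as the first argument, e.g. /var/crash/SRV0003G.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for number, kinds in list_dump_pairs(target):
        status = "complete" if kinds == ["unix", "vmcore"] else "incomplete"
        print(number, kinds, status)
```

A dump number that shows only one of the two files is probably not worth sending in.
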
If these panics are on different systems then what is the same? The disk? The rack the cleaners plug the floor buffer into? Time for some sleuthing.

"I've also heard that a lot of the UltraSPARC IIi chips were/are bogus. Is that just a rumor, or is there a chance I've gotten some"...

I can't / won't speak to this as it is not my place to do so. It seems to me there would be a Sun Alert if there was a known issue. How long have these systems worked without problems?

jds2n at 2007-7-5 14:07:09
# 9
Hello Tim, the SPARCengine AXmp uses UltraSPARC II or UltraSPARC I (like the Ultra 2, Ultra 30, E450, ...). The UltraSPARC IIi/IIe are completely different. Michael
maal at 2007-7-5 14:07:09
# 10
Michael, You're absolutely right, they are UltraSPARC II's, not IIi's. My mistake. Thanks, -Tim Nolan
tnolan at 2007-7-5 14:07:09
# 11

Since all of the machines have removable hard drives, I touched /reconfigure and swapped the drives between Servers A and B. "A" has been behaving as the primary DNS and the primary CORBA Name Server. The DNS is set up master-master, and the CORBA Name Servers are set up as a redundant pair (calls made to one are repeated to the other).

So far, there have been no problems, despite regular (semi-heavy) use. Traffic and loads should also increase to heavy in the next few days.

I'm not quite sure what to hope for. If "B" crashes from the faults that nailed "A," there are a couple of possibilities, right?

<ol type="1">

<li>Kernel needs patching (likely)

<li>Both A and B CPU2's are bad (unlikely)

<li>Software can't handle primary duties (hope not!)

</ol>

If "A" still fails, then we definitely have a hardware fault, and I'll yank CPU2.

If nothing fails, then what? Does that suggest that CPU2 on A is bad, but good enough to handle the slightly lighter secondary load?

Thanks,

Tim Nolan

tnolan at 2007-7-5 14:07:09
# 12

Hmm, fairly easy...

1. Patch the OS, as it's way low (recommended cluster from SunSolve)

2. Update the OBP firmware (download from SunSolve)

2a. Test the box

3. Remove CPU2 and run the box. If all is OK, remove CPU1, refit CPU2 into the CPU1 position, and run the box. If it fails, then replace the CPU

4. If OK, put the original CPU1 in the CPU position and run the box. If it fails, replace the CPU

If it's still failing, then you may have a main board issue, but it looks like the CPU is going faulty.

theengineerwhogotaway at 2007-7-5 14:07:09