memory problem on my e3500
Hi all,
I have a problem with this E3500 server: it has rebooted several times without printing anything in messages.
Now I have found something. I don't think CPU19 is involved (Score 05 and a syndrome not equal to 0x3); I suspect two memory slots on board 7, or the DIMMs in them. Advanced POST reported nothing.
Now the question is: how can I find the physical location of the bad DIMMs from a line such as (Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x10): 0x696cf36f.6e74726f *Bad* PSYND=0xff00)? Is it possible to translate the hex code and find the J3*** number? Is there a table or a document where I can find the answer? And why is the Oracle pid involved in this case? Maybe only because that process happened to be the one that hit the parity mismatch?
Thank you in advance
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 949434 kern.warning] WARNING:
[AFT1] Uncorrectable Memory Error on CPU19 Data access at TL=0, err
ID 0x0000e56e.7c3643da
Nov 13 05:32:57 rheaAFSR 0x00000001<ME>.00300000<UE,CE> AFAR
0x00000000.8b212380
Nov 13 05:32:57 rheaAFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC
0xffffffff7d000970
Nov 13 05:32:57 rheaUDBH 0x029c<UE> UDBH.ESYND 0x9c UDBL 0x0333<UE,CE>
UDBL.ESYND 0x33
Nov 13 05:32:57 rheaUDBH Syndrome 0x9c Memory Module Board 7 J3101
J3201 J3301 J3401 J3501 J3601 J3701 J3801
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 549381 kern.info] [AFT2] errID
0x0000e56e.7c3643da PA=0x00000000.8b212380
Nov 13 05:32:57 rheaE$tag 0x00000000.1cc01164 E$State: Exclusive
E$parity 0x0e
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x00): 0x060337ff.01800180
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2]
E$Data
(0x08): 0xffff3100.1c746578 *Bad* PSYND=0x00ff
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2]
E$Data
(0x10): 0x696cf36f.6e74726f *Bad* PSYND=0xff00
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x18): 0x6c736e63.31407669
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x20): 0x7267696c.696f2e69
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x28): 0x74ff0180.01800180
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x30): 0x02c10201.80013001
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x38): 0x30018009.42393935
Nov 13 05:32:57 rhea unix: [ID 321153 kern.notice] NOTICE: Scheduling
clearing of error on page 0x00000000.8b212000
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 512463 kern.info] [AFT3] errID
0x0000e56e.7c3643da Above Error is in User Mode
Nov 13 05:32:57 rheaand is fatal: will reboot
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 820260 kern.warning] WARNING:
[AFT1] Uncorrectable Memory Error on CPU19 Data access at TL=0, err
ID 0x0000e56e.7c3643da
Nov 13 05:32:57 rheaAFSR 0x00000001<ME>.00300000<UE,CE> AFAR
0x00000000.8b212380
Nov 13 05:32:57 rheaAFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC
0xffffffff7d000970
Nov 13 05:32:57 rheaUDBH 0x029c<UE> UDBH.ESYND 0x9c UDBL 0x0333<UE,CE>
UDBL.ESYND 0x33
Nov 13 05:32:57 rheaUDBL Syndrome 0x33 Memory Module Board 7 J3101
J3201 J3301 J3401 J3501 J3601 J3701 J3801
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 549381 kern.info] [AFT2] errID
0x0000e56e.7c3643da PA=0x00000000.8b212380
Nov 13 05:32:57 rheaE$tag 0x00000000.1cc01164 E$State: Exclusive
E$parity 0x0e
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x00): 0x060337ff.01800180
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2]
E$Data
(0x08): 0xffff3100.1c746578 *Bad* PSYND=0x00ff
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2]
E$Data
(0x10): 0x696cf36f.6e74726f *Bad* PSYND=0xff00
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x18): 0x6c736e63.31407669
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x20): 0x7267696c.696f2e69
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x28): 0x74ff0180.01800180
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x30): 0x02c10201.80013001
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2]
E$Data
(0x38): 0x30018009.42393935
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 512463 kern.info] [AFT3] errID
0x0000e56e.7c3643da Above Error is in User Mode
Nov 13 05:32:57 rheaand is fatal: will reboot
Nov 13 05:32:57 rhea unix: [ID 855177 kern.warning] WARNING: [AFT1]
initiating reboot due to above error in pid 19609 (oracle)
By simone79 at 2007-11-25 22:47:54

# 1
Have you run the system through extended POST?
# 2
Yes, as I wrote in my message, I ran extended POST but it reported nothing. Any ideas? Can you help me translate the hex code? Thank you for your attention, and forgive my bad English.
# 3
Simone,
I tend to agree with you that it's not a cpu issue.
CPU19 just happened to be the lump of hardware that was capable enough to initiate the entries in MESSAGES.
CPU19 is on board #9, but the eight DIMMs in bank #1 of board #7 were called out.
Your excerpt from that log shows that the system hit an uncorrectable ECC error (UE), with the multiple-errors flag (ME) also set; the earlier correctable errors (CE) eventually became uncorrectable.
Your Solaris kernel attempted to retire the memory page out of active use, but that address space was in use by Oracle and could not be modified.
The Solaris kernel had no alternative but to do what it is designed to do:
protect against possible data corruption and panic the system so that it reboots.
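To make the flag annotations in the log easier to follow, here is a minimal Python sketch that decodes the AFSR and UDBH/UDBL values. The bit positions are not taken from any Sun document quoted in this thread; they are inferred from the two log excerpts themselves (for example, 0x00000001.00300000 is printed as <ME> and <UE,CE>), so treat them as assumptions, and only the four flags that appear in these logs are covered. The 8 KB page size is likewise inferred from the "clearing of error on page 0x00000000.8b212000" line, and all function names are illustrative.

```python
# Sketch only: bit positions are inferred from the log lines in this thread,
# not copied from the UltraSPARC-II manual. Only flags seen in the logs are listed.
AFSR_FLAGS = {
    32: "ME",    # multiple errors
    31: "PRIV",  # error taken in privileged (kernel) mode
    21: "UE",    # uncorrectable ECC error
    20: "CE",    # correctable ECC error
}

def parse_reg(text):
    """Turn the log's '0x00000001.00300000' notation into one integer."""
    high, low = text.lower().replace("0x", "").split(".")
    return (int(high, 16) << 32) | int(low, 16)

def decode_afsr(text):
    """Return the names of the known flags set in an AFSR value, high bit first."""
    value = parse_reg(text)
    return [name for bit, name in sorted(AFSR_FLAGS.items(), reverse=True)
            if value & (1 << bit)]

def decode_udb(value):
    """Split a UDBH/UDBL value into (flags, 8-bit ECC syndrome)."""
    flags = [name for mask, name in ((0x200, "UE"), (0x100, "CE")) if value & mask]
    return flags, value & 0xFF

def page_base(text, page_size=8192):
    """Round a physical address down to its page boundary (8 KB assumed)."""
    return parse_reg(text) & ~(page_size - 1)

if __name__ == "__main__":
    print(decode_afsr("0x00000001.00300000"))
    print(decode_udb(0x029C))
    print(decode_udb(0x0333))
    print(hex(page_base("0x00000000.8b212380")))
```

Fed the values from the first post, this reproduces what the kernel already printed next to each register, which is a quick sanity check when the wrapped forum lines are hard to read.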
--
Extended POST would need to be run at least three times in a row, with a power-down in between each, for a proper examination.
Also, consider the following: when was the last time that anyone needed to replace or install any RAM on board #7?
Perhaps the DIMMs just need to be removed and reseated in the exact same slots, which would burnish any connections that have become oxidized over time.
Consider doing it to all of the DIMMs on that board #7.
If time permits, do it to every memory module in the system.
All these suggestions are time-consuming. Sometimes the demands of a production environment do not permit such a system review. In that case, replace the eight DIMMs and test the old ones in a development system later.
Bill at 2007-7-5 17:03:17 >

# 4
... and I am thinking that if you go back through messages.0, messages.1, and messages.2 you might see some evidence of which board #7 DIMMs were having correctable errors.
That could have been going on continuously for hours, days, or weeks.
I have a suspicion and can make a guess, but it would only be a guess at this time. (J3201 and/or J3301)
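A minimal Python sketch of that search follows. It assumes, based only on the excerpt in this thread, that each error report contains the text "Memory Module Board <n>" followed by one or more J slot labels; the function names are illustrative, and the sample line is the UDBH report from the first post with the forum's line wrap undone.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt in this thread:
#   "... Memory Module Board 7 J3101 J3201 ..."
MODULE_RE = re.compile(r"Memory Module Board (\d+)((?:\s+J\d+)+)")

def tally_dimms(lines, board):
    """Count how often each J-numbered slot on `board` is named in `lines`."""
    counts = Counter()
    for line in lines:
        match = MODULE_RE.search(line)
        if match and int(match.group(1)) == board:
            counts.update(match.group(2).split())
    return counts

def tally_files(paths, board):
    """Run tally_dimms over several log files and add up the counts."""
    total = Counter()
    for path in paths:
        with open(path, errors="replace") as handle:
            total.update(tally_dimms(handle, board))
    return total

if __name__ == "__main__":
    sample = [
        "Nov 13 05:32:57 rhea UDBH Syndrome 0x9c Memory Module Board 7 "
        "J3101 J3201 J3301 J3401 J3501 J3601 J3701 J3801",
    ]
    for slot, hits in tally_dimms(sample, board=7).most_common():
        print(slot, hits)
```

On real logs, `tally_files` would be pointed at copies of messages, messages.0, messages.1 and so on. A slot that is named far more often than its neighbours would be the first one to look at, though an even count across the whole bank tells you nothing new.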
Bill at 2007-7-5 17:03:17 >

# 5
I'm glad you agree with me that it's a memory problem.
I have only two DIMMs in stock; which slots should I change first?
Bear in mind that it's a production environment, so I won't get many attempts. How can I find them from the hex code? Do you have a table or a document?
# 6
Nope, no document, just experience and feedback from customers I've helped over the years.
If the old log files do not give you any hints, then...
In the first note that opened this discussion thread, you have some information from your MESSAGES file.
When written to the forum, long lines get wrapped and it is a bit difficult to see, but when you go back to the computer and look, you might notice that there are eight E$Data lines following the E$tag line:
Nov 13 05:32:57 rhea E$tag 0x00000000.1cc01164 E$State: Exclusive E$parity 0x0e
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x060337ff.01800180
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x08): 0xffff3100.1c746578 *Bad* PSYND=0x00ff
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x10): 0x696cf36f.6e74726f *Bad* PSYND=0xff00
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x6c736e63.31407669
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x7267696c.696f2e69
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x74ff0180.01800180
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x02c10201.80013001
Nov 13 05:32:57 rhea SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x38): 0x30018009.42393935
In your event, please notice that it is the second and the third E$Data lines that carry the *Bad* PSYND= flag.
That's why I guessed on the second and third DIMM of the bank.
Multi-bit errors often involve more than one memory module. One is at fault and the other is a victim of the event, but you can never figure out which is which unless you try things such as Extended POST.
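The reading described in this post can be sketched in Python: pull the offset out of each E$Data line, note which ones carry *Bad*, and convert the offset to a 1-based position in the list. Mapping position n to the n-th slot label of the bank (J3101, J3201, ...) is the guess made in this post, not a documented rule, and the 8-byte step between offsets is simply read off the log. The function names are illustrative.

```python
import re

# Matches e.g. "E$Data (0x08): 0xffff3100.1c746578 *Bad* PSYND=0x00ff"
EDATA_RE = re.compile(
    r"E\$Data \(0x([0-9a-fA-F]+)\): (0x[0-9a-fA-F]+\.[0-9a-fA-F]+)(.*)"
)

# Slot labels in the order the log prints them for board 7.
BANK_SLOTS = ["J3101", "J3201", "J3301", "J3401",
              "J3501", "J3601", "J3701", "J3801"]

def bad_positions(lines):
    """Return 1-based positions of E$Data words flagged *Bad* (8 bytes per word)."""
    positions = []
    for line in lines:
        match = EDATA_RE.search(line)
        if match and "*Bad*" in match.group(3):
            positions.append(int(match.group(1), 16) // 8 + 1)
    return positions

def guessed_slots(lines):
    """Apply the guess from this post: the n-th flagged word suggests the n-th slot."""
    return [BANK_SLOTS[p - 1] for p in bad_positions(lines)]

if __name__ == "__main__":
    log = [
        "[AFT2] E$Data (0x00): 0x060337ff.01800180",
        "[AFT2] E$Data (0x08): 0xffff3100.1c746578 *Bad* PSYND=0x00ff",
        "[AFT2] E$Data (0x10): 0x696cf36f.6e74726f *Bad* PSYND=0xff00",
        "[AFT2] E$Data (0x18): 0x6c736e63.31407669",
    ]
    print(bad_positions(log))
    print(guessed_slots(log))
```

Because the position-to-slot step is only a rule of thumb, the result is best used to decide which DIMMs to swap first, not as proof of which one is faulty.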
Bill at 2007-7-5 17:03:17 >

# 7
Thanks for taking over, Bill. I went to look for a suitable document and ended up looking through the SE xx00 problem-solving manual.
simone79, sorry, I did see your reference to extended POST and I should have asked "have you run the POST 3+ times?". The InfoDoc from the spectrum handbook has some interesting descriptions of DIMM replacement policy. I do remember reading somewhere about the hex code conversions, but I think it was an internal-only document.
Here is an interesting link that might be helpful:
http://sunsolve.sun.com/search/document.do?assetkey=1-9-82264-1&searchclause=82264
# 8
I'm wondering if Board 7 Bank 1 is 2048 MB, i.e. eight 256 MB DIMMs. 256 MB DIMMs are notorious for not leaving any clues behind. You could always move half of the DIMMs into another bank of the same size that is not giving you trouble. Whether the problem moves to the new bank or stays with the old one, you have narrowed it down to 4 DIMMs. Repeat as necessary.
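The halving procedure can be sketched as a small Python simulation. The `fault_follows` callback stands in for the manual step of moving half of the suspect DIMMs into a known-good bank and watching where the errors show up; it and the function name are hypothetical, and the sketch assumes a single faulty DIMM, which a multi-bit error does not guarantee.

```python
def narrow_down(suspects, fault_follows):
    """Halve the suspect list until one DIMM remains.

    suspects      -- slot labels of the DIMMs in the failing bank
    fault_follows -- hypothetical callback: given the DIMMs that were moved to
                     a known-good bank, return True if the errors moved with them
    Returns (remaining_suspects, number_of_swap_rounds).
    """
    suspects = list(suspects)
    rounds = 0
    while len(suspects) > 1:
        half = len(suspects) // 2
        moved, stayed = suspects[:half], suspects[half:]
        # Keep whichever half the errors stayed with.
        suspects = moved if fault_follows(moved) else stayed
        rounds += 1
    return suspects, rounds

if __name__ == "__main__":
    bank = ["J3101", "J3201", "J3301", "J3401",
            "J3501", "J3601", "J3701", "J3801"]
    # Pretend J3301 is the bad DIMM, purely to exercise the loop.
    print(narrow_down(bank, lambda moved: "J3301" in moved))
```

With eight DIMMs the loop needs three rounds of swapping (8 to 4 to 2 to 1), and each round on the real machine means a power-down and enough run time for the error to reappear.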
jds2n at 2007-7-5 17:03:17 >

# 9
Thank you, everyone. I found an unused board to replace board 7 completely, and I moved the old board into a test server.
That server has completed 3 POST runs and I am now running SunVTS and Linpack on it, but nothing has happened.
I think it's possible that everything was caused by a badly done CPU replacement some weeks ago. Maybe the board was not seated close enough to the backplane and was forced a little; I hope the slots have not been damaged by that.
# 10
Now a friend of mine has a similar problem. He is very far from my city, so I can't look at the server, and I only have this message, which appeared at boot:
Rebooting with command: boot
Boot device: diskbrd File and args:
SunOS Release 5.8 Version Generic_117350-14 64-bit
Copyright 1983-2003 Sun Microsystems, Inc. All rights reserved.
WARNING: [AFT1] Uncorrectable Memory Error on CPU1 at TL=0, errID 0x00000028.9184e3e9
AFSR 0x00000001<ME>.80300000<PRIV,UE,CE> AFAR 0x00000000.00003cc0
AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1014f10c
UDBH 0x0333<UE,CE> UDBH.ESYND 0x33 UDBL 0x034d<UE,CE> UDBL.ESYND 0x4d
UDBH Syndrome 0x33 Memory Module Board 2 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800
WARNING: [AFT1] Uncorrectable Memory Error on CPU1 at TL=0, errID 0x00000028.9184e3e9
AFSR 0x00000001<ME>.80300000<PRIV,UE,CE> AFAR 0x00000000.00003cc0
AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1014f10c
UDBH 0x0333<UE,CE> UDBH.ESYND 0x33 UDBL 0x034d<UE,CE> UDBL.ESYND 0x4d
UDBL Syndrome 0x4d Memory Module Board 2 J3100 J3200 J3300 J3400 J3500 J3600 J3700 J3800
panic[cpu1]/thread=2a1001ddd20: [AFT1] errID 0x00000028.9184e3e9 UE Error(s)
See previous message(s) for details
000002a1001dd3a0 SUNW,UltraSPARC-II:cpu_aflt_log+568 (2a1001dd45e, 1, 10155300, 2a1001dd5e8, 2a1001dd4ab, 10155328)
%l0-3: 00000300003a6a90 0000000000000003 000002a1001dd6b0 0000000000000010
%l4-7: 0000030001d8c290 0000000000000000 000002a75029c000 000002a100176fd0
000002a1001dd5f0 SUNW,UltraSPARC-II:cpu_async_error+868 (1046b370, 2a1001dd6b0, 180300000, 0, c7a6e6780300000, 2a1001dd870)
%l0-3: 0000000010475e90 0000000000000063 000000000000034d 0000000000000333
%l4-7: 0000000000003cc0 0000000000800000 0000000000800000 0000000000000001
000002a1001dd7c0 unix:prom_rtt+0 (f0803cc0, 3cc0, 800000, 0, 16, 14)
%l0-3: 0000000000000006 0000000000001400 0000004400001605 000000001014c848
%l4-7: 000002a75029c000 0000000000000000 0000000000000009 000002a1001dd870
000002a1001dd910 SUNW,UltraSPARC-II:scrub_ecache_line+2b4 (f0803cc0, c, 1046b370, 300002015d8, 30001dcdf40, 83)
%l0-3: 0000030001c49518 0000000000000003 0000000000000070 0000000000000000
%l4-7: 0000000000000000 0000000000800000 0000000000003cc0 0000000000000004
000002a1001dda60 SUNW,UltraSPARC-II:scrub_ecache_line_intr+30 (30001dcdf40, 1, 1, 2a1001ddd20, 102e0, 1014f27c)
%l0-3: 0000000000000001 0000000000000001 0000031001e7e8a0 000003000020df88
%l4-7: 0000029fffd82000 0000031005127540 0000031001e7e8f8 0000000000000000
syncing file systems... done
skipping system dump - no dump device configured
rebooting...
Resetting...
Software Power ON
He pulled out board 2 and the server then started correctly; nothing was recorded in messages.*. He has no spare parts. What do you think? A memory problem again?
# 11
Yes
Bill at 2007-7-5 17:03:17 >
