memory errors - where to go next

Hi there, I came in this morning to the following error:

Sep 23 00:04:53 my.machine.com unix: [ID 752700 kern.warning] WARNING: [AFT0] Sticky Softerror encountered on Memory Module J0202

Does anybody know the next step I should take to investigate this issue?

Any help on this would be greatly appreciated.

cheers


[365 byte] By [hcclnoodles] at [2007-11-26 10:21:10]
# 1

Was that the only instance of such a log entry?

... if "Yes", then ignore it.

If that or similar log entries are occurring hourly or more frequently,

then consider replacing that one memory module in slot J0202.
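
To answer the "how often?" question without eyeballing the log, something like the following could tally the entries. It is only a rough sketch: the pattern is modelled on the single log line quoted at the top of this thread, and the names AFT_LINE and count_aft0 are made up for illustration. Other Solaris releases or platforms may word the message differently.

```python
import re
from collections import Counter

# Pattern based on the one log line quoted in this thread (an assumption:
# other releases may format the message differently).
AFT_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2}\s+\S+\s+unix:.*"
    r"\[AFT0\]\s+(?P<kind>.+?)\s+encountered on Memory Module\s+(?P<module>\S+)"
)


def count_aft0(lines):
    """Count [AFT0] entries per (memory module, 'Mon DD HH') pair."""
    counts = Counter()
    for line in lines:
        match = AFT_LINE.search(line)
        if match:
            counts[(match.group("module"), match.group("stamp"))] += 1
    return counts
```

Feed it the lines of your syslog file (normally /var/adm/messages on Solaris), for example count_aft0(open("/var/adm/messages", errors="replace")). One entry for J0202 in the whole file is the "ignore it" case; the same module showing up hour after hour is the "consider replacing" case.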

Solaris can sometimes be too verbose in its information, although

I would rather have it that way than have it tell me nothing.

By the way, [AFT0] and [AFT2] entries are informational only, not critical.

The serious ones are the [AFT1] entries.

There is a Sun Infodoc that describes many sorts of such errors,

and can help to understand each of them as they may occur.

You will need to log in to Sunsolve to get access to it, though.

Spectrum Infodoc 70361

"Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages"

rukbat at 2007-7-7 2:20:24 > top of Java-index,General,Talk to the Sysop...
# 2
... and go get that CEDIAG diagnostic utility, as you were advised on the other forum web site: http://www.tek-tips.com/viewthread.cfm?qid=1282104&page=1
rukbat at 2007-7-7 2:20:24
# 3
Is there an alternative to cediag for x86 boxes?

PS: thanks, I have installed cediag on my SPARC boxes.

My main concern is that the error is sticky, and I was under the impression that this means a permanent / unrecoverable hardware error?
hcclnoodles at 2007-7-7 2:20:24
# 4

If your log entry was the only time you've had such an event, then ignore it.

If you have many errors per day on the exact same DIMM, then open a support case with Sun.

Get a copy of that Infodoc.

I realize it is a long one and somewhat detailed, but it discusses

virtually every conceivable type of error on DIMMs and in ecache.

An ECC checksum examination can return three possible results.

<> no error and everything proceeds on its merry way.

<> an intermittent error, which produces a log entry for that result, and everything proceeds.

<> a sticky error, a log entry, and everything proceeds as the hardware and software are designed.

Such a process is always happening whenever data is held in a register.

RAM and ecache are dynamic constructs, and require continual refreshing.

Column-Address-Strobe (CAS) and Row-Address-Strobe (RAS) signals

are continually maintaining the data.

Every data-refresh includes a validity check, and the ECC design

that Sun uses includes a checksum in the validation.
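
For anyone who wants to see what "a checksum in the validation" can look like, here is a toy single-error-correcting Hamming(7,4) code. To be clear, this is a textbook illustration I am substituting in, not the ECC scheme Sun hardware actually uses (that is not described in this thread); encode and syndrome are names chosen just for this sketch.

```python
def syndrome(code):
    """XOR together the 1-based positions of all set bits.

    A valid Hamming codeword gives 0; a single flipped bit gives the
    position of that bit.
    """
    result = 0
    for position, bit in enumerate(code, start=1):
        if bit:
            result ^= position
    return result


def encode(data):
    """Place 4 data bits at positions 3, 5, 6, 7 and fill in the
    check bits at positions 1, 2, 4 so the syndrome becomes 0."""
    code = [0] * 7
    for position, bit in zip((3, 5, 6, 7), data):
        code[position - 1] = bit
    partial = syndrome(code)
    for check_position in (1, 2, 4):
        if partial & check_position:
            code[check_position - 1] = 1
    return code
```

A clean word gives a syndrome of 0. Flip any one bit and the syndrome points at the position that changed, which is what lets a single-bit error be corrected and logged rather than crashing anything.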

If the data bits are identical, all is happy. The data refresh was accurate.

If the data bits produce a different checksum, the check is run a second time.

If the second check is also 'wrong', but wrong in a different way than the first, then that's an intermittent error.

If the checksum isn't what is expected and the second examination

shows an identical offset from what was expected, then that's a sticky error.
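
The decision described above can be written down in a few lines. This is a sketch of the explanation in this post, not of the Solaris source: the function name classify is made up, and modelling the "offset" as the XOR of the expected and observed checksum is my own assumption.

```python
def classify(expected: int, first_read: int, second_read: int) -> str:
    """Classify an ECC result from two successive checks."""
    # "Offset" is modelled as expected XOR observed (an assumption).
    first_offset = expected ^ first_read
    if first_offset == 0:
        return "no error"
    second_offset = expected ^ second_read
    if second_offset == first_offset:
        # Wrong twice, in exactly the same way.
        return "sticky"
    # Wrong in a different way on the second look. A clean second look is
    # also counted here, which the text above does not spell out (assumption).
    return "intermittent"
```

For example, classify(0b1010, 0b1000, 0b1000) comes back "sticky" because both reads are off by the same bit, while classify(0b1010, 0b1000, 0b1110) comes back "intermittent".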

A sticky error is not better or worse than an intermittent error.

It's just a different sort of error.

Only become concerned if your system cannot clear the errors.

... as is discussed in that Infodoc.

rukbat at 2007-7-7 2:20:24