memory errors - where to go next
Hi there, I came in this morning to the following error:
Sep 23 00:04:53 my.machine.com unix: [ID 752700 kern.warning] WARNING: [AFT0] Sticky Softerror encountered on Memory Module J0202
Does anybody know the next step I should take to investigate this issue?
Any help on this would be greatly appreciated.
cheers
# 1
Was that the only instance of such a log entry?
... if "Yes", then ignore it.
If that or similar log entries are occurring hourly or more frequently,
then consider replacing that one memory module in slot J0202.
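To judge how often a given module is being flagged, you can tally the matching entries in your syslog. Here is a minimal sketch in Python. The regular expression is modelled only on the single log line quoted in the question, so other AFT message variants may not match it. The function names are made up for illustration, and the path in the usage note is just the usual Solaris syslog location, so adjust it for your system.

```python
import re
from collections import Counter

# Pattern based on the one log line quoted in the question (an assumption);
# other AFT message layouts may need a different expression.
AFT_LINE = re.compile(
    r"\[(AFT\d)\]\s+(.*?)\s+encountered on Memory Module (\S+)"
)


def count_aft_events(lines):
    """Count matching log lines, keyed by (AFT tag, memory module)."""
    counts = Counter()
    for line in lines:
        match = AFT_LINE.search(line)
        if match:
            tag, _description, module = match.groups()
            counts[(tag, module)] += 1
    return counts


def count_aft_events_in_file(path):
    """Hypothetical helper: run the same tally over a log file on disk."""
    with open(path, errors="replace") as handle:
        return count_aft_events(handle)
```

For example, count_aft_events_in_file("/var/adm/messages").most_common() would list the busiest tag and module pairs first. A single hit for J0202 is nothing to worry about; a steadily growing count on the same module is the pattern to watch for.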
Solaris can sometimes be too verbose in its information, although
I would rather have it that way than have it tell me nothing.
By the way, [AFT0] and [AFT2] entries are informational only, not critical.
The serious ones are the [AFT1] entries.
There is a Sun Infodoc that describes many sorts of such errors,
and can help to understand each of them as they may occur.
You will need to log in to Sunsolve to get access to it, though.
Spectrum Infodoc 70361
"Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages"
# 4
If your log entry was the only time you've had such an event, then ignore it.
If you have many errors per day on the exact same DIMM, then open a support case with Sun.
Get a copy of that Infodoc.
I realize it is a long one and somewhat detailed, but it discusses
virtually every conceivable type of error on DIMMs and in ecache.
An ECC checksum examination can return three possible results.
<> no error, and everything proceeds on its merry way.
<> an intermittent error, which produces a log entry for that result, and everything proceeds.
<> a sticky error, a log entry, and everything proceeds as the hardware and software are designed.
Such a process is always happening whenever data is held in a register.
RAM and ecache are dynamic constructs, and require continual refreshing.
Column-Address-Strobe (CAS) and Row-Address-Strobe (RAS) signals
are continually maintaining the data.
Every data-refresh includes a validity check, and the ECC design
that Sun uses includes a checksum in the validation.
If the data bits are identical, all is happy. The data refresh was accurate.
If the data bits give a different checksum, there's a second examination.
If the checksum has changed again and is still 'wrong', then that's an intermittent error.
If the checksum isn't what is expected and the second examination
shows an identical offset from what was expected, then that's a sticky error.
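The decision just described can be sketched in a few lines of Python. This is only an illustration of the logic as explained in this post, not Sun's actual implementation. The function name and the use of XOR to stand in for the "offset" from the expected checksum are assumptions. The post does not say what happens if the second examination comes back clean, so this sketch treats that case as intermittent too, because the offset did not repeat.

```python
def classify_ecc_result(expected, first_read, second_read):
    """Classify a check as 'no error', 'intermittent' or 'sticky'.

    All three arguments are integer checksums. The offset is modelled
    as expected XOR observed, which is an assumption made for this sketch.
    """
    first_offset = expected ^ first_read
    if first_offset == 0:
        # Checksum matched on the first look: nothing to report.
        return "no error"

    # The first look was wrong, so compare it with the second examination.
    second_offset = expected ^ second_read
    if second_offset == first_offset:
        # Same deviation both times.
        return "sticky"

    # The deviation changed (or vanished) between examinations.
    return "intermittent"
```

For example, classify_ecc_result(0b1010, 0b1000, 0b1000) reports a sticky error because both examinations are off by the same bit, while classify_ecc_result(0b1010, 0b1000, 0b0010) reports an intermittent one.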
A sticky error is not better or worse than an intermittent error.
It's just a different sort of error.
Only become concerned if your system cannot clear the errors.
... as is discussed in that Infodoc.