Ultra 20 crashes (WARNING: MCE 0: error code ...)
Hi experts, I guess this is a tough one.
On my Ultra 20 with the Opteron 180, I got the messages below just before the system shut down.
At the time I was running two instances of ROOT (root.cern.ch), so both cores were working.
Any idea what is going on? Would dtrace be of any help here?
I'm running
SunOS icet8 5.10 Generic_118844-26 i86pc i386 i86pc
Cheers & thanks,
Peter.
[code]
May 6 20:07:44 icet8 unix: [ID 820216 kern.warning] WARNING: MCE: Bank 0: error code 0x833:addr = 0x28aaf0c0, model errcode = 0x0
May 6 20:07:44 icet8 unix: [ID 470315 kern.warning] WARNING: MCE: Bank 2: error code 0x863, mserrcode = 0x0
May 6 20:07:44 icet8 unix: [ID 836849 kern.notice]
May 6 20:07:44 icet8 ^Mpanic[cpu1]/thread=ffffffff83448a40:
May 6 20:07:44 icet8 genunix: [ID 683410 kern.notice] BAD TRAP: type=10012 (#- trap) rp=fffffe80006d6f20 addr=0
May 6 20:07:44 icet8 unix: [ID 100000 kern.notice]
May 6 20:07:44 icet8 unix: [ID 839527 kern.notice] root.exe:
May 6 20:07:44 icet8 unix: [ID 753105 kern.notice] #mc Machine check
May 6 20:07:44 icet8 unix: [ID 243837 kern.notice] pid=10151, pc=0xfe75ac97, sp=0x802fee0, eflags=0x202
May 6 20:07:44 icet8 unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse>
May 6 20:07:44 icet8 unix: [ID 354241 kern.notice] cr2: 9ce9000 cr3: 7f04e000 cr8: c
May 6 20:07:44 icet8 unix: [ID 592667 kern.notice]rdi: 9128c98 rsi:1c rdx: 16b0b4e0
May 6 20:07:44 icet8 unix: [ID 592667 kern.notice]rcx: 9c5a33d r8: 100 r9:1
May 6 20:07:44 icet8 unix: [ID 592667 kern.notice]rax: 98a22c0 rbx: fecedc78 rbp: 802fee4
May 6 20:07:44 icet8 unix: [ID 592667 kern.notice]r10: ffffffff81409780 r11:0 r12:15
May 6 20:07:44 icet8 unix: [ID 592667 kern.notice]r13:3 r14: ffffffff8156d710 r15: fffffe80006d6f20
May 6 20:07:44 icet8 unix: [ID 592667 kern.notice]fsb: ffffffff80000000 gsb: fee52000 ds:43
May 6 20:07:44 icet8 unix: [ID 592667 kern.notice]es:43 fs:0 gs: 1c3
May 6 20:07:44 icet8 unix: [ID 592667 kern.notice]trp:12 err:0 rip: fe75ac97
May 6 20:07:44 icet8 unix: [ID 592667 kern.notice]cs:3b rfl: 202 rsp: 802fee0
May 6 20:07:44 icet8 unix: [ID 266532 kern.notice]ss:43
May 6 20:07:44 icet8 unix: [ID 100000 kern.notice]
May 6 20:07:44 icet8 genunix: [ID 655072 kern.notice] fffffe80006d6e30 unix:real_mode_end+4831 (ffffffff840500d8, 0, fffffe)
May 6 20:07:44 icet8 genunix: [ID 655072 kern.notice] fffffe80006d6f10 unix:trap+913 ()
May 6 20:07:44 icet8 unix: [ID 100000 kern.notice]
May 6 20:07:44 icet8 genunix: [ID 672855 kern.notice] syncing file systems...
May 6 20:07:44 icet8 genunix: [ID 733762 kern.notice] 7
May 6 20:07:45 icet8 genunix: [ID 733762 kern.notice] 2
May 6 20:07:47 icet8 genunix: [ID 904073 kern.notice] done
May 6 20:07:48 icet8 genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c1d0s1, offset 419495936, content: kernel
May 6 20:07:52 icet8 genunix: [ID 409368 kern.notice] ^M100% done: 53565 pages dumped, compression ratio 2.85,
May 6 20:07:52 icet8 genunix: [ID 851671 kern.notice] dump succeeded
[/code]
[3310 byte] By [niessepe] at [2007-11-26 7:04:10]

# 3
Hello iCreate,
short story:
I didn't upgrade; I ordered it with the 180 built in.
long story:
On the Sun website, the large configuration of the Ultra 20 was listed with the 180. When I proceeded to ordering, the website showed the 152.
I phoned them about this and they said "Hm, strange. What do you want to buy?". What I needed was something with 2 processors, so I could run some development and production at the same time. I had also considered the small version of the Ultra 40, which, after upgrading to 2 Opteron 246 and 2 GB, costs around the same as the Ultra 20 large.
However, some colleagues bought an x4100 and an x2100 (2x246 and 1x175, respectively), and it turned out that for my application the x2100 was faster.
Since the lady on the phone assured me that I can get a 180 in the Ultra 20, I opted for that.
I'm with an educational institution and got a 5% discount. (My colleague needed a new workstation too, and since the HP Alphas were ~$60000, he went for the Ultra 40. He configured 580 GB of disk space and paid only $300 more than me. But this might have worked only because our computing centre buys some Sun stuff regularly.)
The order was placed in mid-February. I got two mails (beginning of March and April) saying "that the part number will become available mid March and mid April". I felt a bit like the guy in "the sun don't shine on me" (http://joyeur.com/2006/03/20/the-sun-doesnt-shine-on-me), but after a while, some whining on the phone, and asking to speak to a manager, things started moving at the end of April. I guess this is the place to thank J. Nelson for making things happen.
The reason for the delay wasn't entirely clear to me. One person told me that the 180 is a new component and that Operational Health and Safety issues had to be worked on (component compliance). OK, this is now speculation, but the release notes mentioned that some 64-bit kernels (the SuSE enterprise one) would panic when 4 GB of main memory is installed.
When the machine was finally delivered two weeks ago, the memory was faulty (but I did not realise it until last weekend, when I ran some jobs on both processors and the temperature went up). Thanks to the included diagnostic software (PcCheck), the problem was found pretty quickly. I phoned, received a process number and got called back within one hour. Mr Mitchell enquired about what was going on and I gave him the diagnostic output of the memory tests. He asked whether I would be able to install the memory myself or have someone do it for me, and I opted for someone coming around (in case it's not the memory). Indeed, the man showed up the next day, installed the DIMMs and ran the check again, now without failure.
I'm running some stuff on the box right now, so keep your fingers crossed.
So, if you can get an educational discount, also take a look at the Ultra 40. It's beautifully built (even nicer than the 20), and although it runs a bit slower, you have the option of going up to 16 GB.
The downside is the noise level. Since the thing is intended to run at 100% load at least 50% of the time, it's a bit of an issue. I've put it below my desk, so it's muffled a bit.
Hope this helps,
Cheers, Peter.
PS: After keeping the machine at a load of around 2.0 for 5 days continuously, it appears that replacing the memory has fixed the problem.
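PPS: For anyone who finds this thread with a similar panic, here is a minimal sketch of how the two MCE error codes from my original post (0x833 and 0x863) could be decoded. It assumes that the value Solaris prints as "error code" is the raw 16-bit machine-check error code, and that it follows the bus/interconnect layout 0000 1PPT RRRR IILL given in the AMD and Intel processor manuals. I have not verified the first assumption against the Solaris source, and the function and table names are my own, so treat the result as a hint rather than a diagnosis.

```python
# Sketch only: decode an x86 machine-check "bus error" code.
# Assumed bit layout (from the vendor manuals): 0000 1PPT RRRR IILL
# All names below are illustrative, not part of any Solaris tool.

PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]          # PP, bits 10:9
REQUEST = ["GEN", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]              # RRRR, bits 7:4
MEM_IO = ["MEM", "RESERVED", "IO", "GEN"]             # II, bits 3:2
LEVEL = ["L0", "L1", "L2", "LG"]                      # LL, bits 1:0


def decode_bus_error(code):
    """Return the decoded fields, or None if code is not a bus error."""
    # Bits 15:11 must be 00001 for the bus/interconnect format.
    if (code & 0xF800) != 0x0800:
        return None
    rrrr = (code >> 4) & 0xF
    return {
        "participation": PARTICIPATION[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 0x1),
        "request": REQUEST[rrrr] if rrrr < len(REQUEST) else "UNKNOWN",
        "mem_or_io": MEM_IO[(code >> 2) & 0x3],
        "level": LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    # The two codes reported in the panic log (Bank 0 and Bank 2).
    for bank, code in [(0, 0x833), (2, 0x863)]:
        print(f"Bank {bank}: 0x{code:x} -> {decode_bus_error(code)}")
```

Under those assumptions, both codes decode as bus errors on a memory access (a data read for Bank 0, a prefetch for Bank 2). That at least does not contradict the faulty memory that PcCheck found, but it does not point at a particular DIMM on its own.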