Zombie CPU load
Hello,
We're running an X4100 with 2 CPUs on Solaris 10. We see a constant CPU load (usr) of over 50%.
If we look at ps, prstat, vmstat, iostat or netstat, we cannot see any process that is responsible for this CPU load. How do I find out what is responsible for it?
Is this a known issue on an X4100?
Regards,
WJ
[379 byte] By [WJ-KPa] at [2007-11-26 17:06:42]

# 1
Sometimes it's easier to see with "top". Have you tried that?
# 2
Too bad top is not part of Solaris 10 (as far as I know). prstat is the Solaris version of top. We've tried that. Our biggest process takes 0.1% of CPU (which seems correct).
WJ
# 3
Hi,
Have you looked at using prstat -m or DTrace? This might help:
http://www.brendangregg.com/DTrace/lostcpu.html
HvRa at 2007-7-8 23:34:28 >

# 4
Hi HvR,
Thanks for the pointer to the website. This gives a lot of valuable info!
However, the puzzle isn't solved yet.
When we run "mpstat 1" the output looks like:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0    0  345  202  85    0    0    0   0    21   0   0  0 100
  1    0   0    0    8    7   0    0    0    0   0     0   0 100  0   0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0    0  341  200  81    0    0    0   0    21   0   0  0 100
  1    0   0    0    7    6   0    0    0    0   0     0   0 100  0   0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0    0  353  206  98    0    0    0   0    29   0   0  0 100
  1    0   0    0    9    8   0    0    0    0   0     0   0 100  0   0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0    0  346  202  88    0    0    0   0    21   0   0  0 100
  1    0   0    0    8    7   0    0    0    0   0     0   0 100  0   0
So one CPU spends 100% of its time in sys while no system calls are processed?
Is this a hardware problem?
The problem is reproducible. After I reboot my system everything is OK for about 45 minutes. Then CPU1 takes off and stays at 100%.
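The pattern described in this post (sys pegged while syscl stays at 0) can be picked out of mpstat output mechanically. The Python below is only a minimal sketch: it assumes whitespace-separated mpstat rows with the column names shown above, the function names and the 90% threshold are illustrative assumptions, and the sample rows are one reading of the first sample in this post.

```python
def parse_mpstat(text):
    """Turn mpstat-style text into a list of dicts keyed by column name."""
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            # mpstat repeats the header for every sampling interval
            header = fields
            continue
        if header and len(fields) == len(header):
            rows.append(dict(zip(header, map(int, fields))))
    return rows


def suspicious_cpus(rows, sys_threshold=90):
    """CPUs that report high sys time but no system calls at all."""
    return [r["CPU"] for r in rows
            if r["sys"] >= sys_threshold and r["syscl"] == 0]


# Sample rows as read from the first mpstat sample in this post (assumption).
SAMPLE = """\
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0    0  345  202  85    0    0    0   0    21   0   0  0 100
  1    0   0    0    8    7   0    0    0    0   0     0   0 100  0   0
"""

print(suspicious_cpus(parse_mpstat(SAMPLE)))  # [1]
```

High sys with zero syscl suggests the time is being spent inside the kernel on something other than servicing system calls, which is why the later replies look at interrupts and drivers.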
# 5
Try running the command "intrstat" and look at the %tim column. Unless you use USB (uhci), unload the module.
HvRa at 2007-7-8 23:34:28 >

# 6
When I run intrstat while both CPUs are almost idle I get stats like this:
device   |  cpu0 %tim   cpu1 %tim
---------+-----------------------
e1000g#1 |     0  0.0      1  0.0
e1000g#0 |     0  0.0      1  0.0
e1000g#2 |     0  0.0  19641  0.0
e1000g#3 |     0  0.0   1345  0.0
mpt#0    |     0  0.0      0  0.0
ohci#1   |     0  0.0      0  0.0
ohci#0   |     0  0.0      0  0.0

device   |  cpu0 %tim   cpu1 %tim
---------+-----------------------
e1000g#1 |     0  0.0      1  0.0
e1000g#0 |     0  0.0      1  0.0
e1000g#2 |     0  0.0  39813  0.0
e1000g#3 |     0  0.0   1526  0.0
mpt#0    |     0  0.0      0  0.0
ohci#1   |     0  0.0      0  0.0
ohci#0   |     0  0.0      0  0.0
Why are there no interrupts on CPU0?
I only have e1000g0 connected, so why the interrupts from e1000g2 and e1000g3?
Is this a faulty ethernet card?
# 7
Although you might have a bad driver or hardware flogging your system, as others have suggested, look for processes or threads that are being created at a rapid rate and immediately exiting. These are hard to see with ps and top. To some extent you can spot this if the process IDs jump by more than a small amount between successive starts of a "known good" process.
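A rough way to put a number on that PID-gap check is sketched below in Python. It assumes PIDs are handed out sequentially and wrap around at a maximum; the default of 30000 and the function names are assumptions for illustration, and the wraparound handling is approximate.

```python
def pid_gap(earlier_pid, later_pid, pid_max=30000):
    """PIDs handed out between two observations, allowing for one wraparound.

    pid_max=30000 is an assumed default; check your system's actual limit.
    """
    return (later_pid - earlier_pid) % pid_max


def forks_per_second(earlier_pid, later_pid, seconds, pid_max=30000):
    """Approximate process creation rate between the two observations."""
    return pid_gap(earlier_pid, later_pid, pid_max) / seconds


# A quiet system: 15 new PIDs in 10 seconds.
print(forks_per_second(1200, 1215, 10))   # 1.5

# PIDs wrapped around in between: 600 new PIDs in 10 seconds.
print(forks_per_second(29990, 590, 10))   # 60.0
```

Start the same harmless command twice, a known number of seconds apart, and feed the two PIDs in. A rate far above what the visible processes explain points at short-lived processes that never show up in ps or prstat.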
That's not an astounding number of interrupts for a gig-E interface and is probably legit. I'd look to see if a patch is available, though.
The load is 50% since that's the average of 0 and 100%.
# 8
Yes, that is nothing to worry about. I see the same thing on my HP hosts with the bcme NICs. I also have just one connected and see it on both, I guess because they are on the same PCI bus?
At least we know your CPU is not being eaten by interrupts.
I am not sure, but I think interrupts are only delivered to a physical CPU, and this might indicate a dual-core CPU instead of dual CPUs? (I am not savvy on these parts.)
Did you assign any resource pools? "pooladm" will print them.
What is your kernel rev? "uname -v"
Long shot: check an old notice from AMD
http://www.amd.com/us-en/0,,3715_13965,00.html?redir=CORPR01
which maps back to a Sun Alert
http://sunsolve.sun.com/pub-cgi/show.pl?target=sunalert_patches#Sun_Fire_X4100_Server
--
4. Relief/Workaround
Because this issue could manifest itself in various ways, it is advisable to contact your Sun Services representative to determine the proper course of action. Sun is working closely with AMD to quickly identify and contact customers who have products containing AMD Opteron models x52 and x54 parts that could be implicated by AMD's Production Notice to properly identify any parts that may be affected, and to replace them at no charge and in the most expeditious way through Sun's world wide service organization.
If you are affected you might get new CPUs; who knows after that.
"Elevated CPU temperatures"? Oh no, better check it. (lol, I know, I just like looking at the output.)
"/usr/sfw/bin/ipmitool sdr list all"
Like I said long shot and I probably should get shot for posting this.
Sorry, but it is very strange that the CPU is fully pegged all the time. I have never seen this before without the culprit showing up in prstat.
If you do find a fix/solution please post it.
Thank you.
HvRa at 2007-7-8 23:34:28 >

# 9
I've done some more testing:
- My CPUs are model 248 revision "E", so that should be OK.
- I've swapped the disks to another X4100 (same HW config) and the behavior is the same, so now I think it is not a faulty CPU or memory.
- I really can't find out which process is locking my CPU. When I do "prstat -P" on my locked CPU, all processes show as sleeping.
- Once the CPU is locked, no matter how many processes I kill, it remains at 100% sys.
- "intrstat" gives strange results from which I cannot draw any conclusions.
Any further ideas ?
# 10
Please post your kernel rev, also the output from pooladm
HvRa at 2007-7-8 23:34:28 >

# 11
The output of "uname -v" is "Generic_118855-14".
pooladm is not installed on the machine.
# 12
I was never happy with a release before 118855-15 (I am not just saying this because you are one level lower; I am serious, I had SAN issues).
I still have a few on -15 and they are solid.
I would recommend updating your kernel rev. If you go for 118855-36, there are a few things, according to others, that you need to watch out for with the GRUB boot loader.
I had no issues whatsoever with the update, but I updated from -15 (not sure if it matters; I even rolled the patch out since one of my zones got skipped due to a dependency issue, fixed it and re-applied it).
I only had a problem with showrev -p after the update, but a patch was released for that.
You have to look at it this way: you have to update at some point (well, I do), from a security perspective and also from a functional perspective.
HvRa at 2007-7-8 23:34:28 >

# 13
Hmm...
We already have about 30 X4100s in this configuration installed at our customers, so we don't really want to do a major system patch.
Some further testing revealed that the problem is probably caused by one of our own processes. If we don't start that process, the problem does not occur. The strange thing is that the suspect program keeps running correctly after the CPU is locked and that killing the process does not free the CPU.
# 14
I think we've nailed the problem now.
It appears that the problem occurs when we access an ethernet interface that has no cable connected to it. If we do this for a long time (about 45 minutes), eventually one of the processors locks up at 100% sys. The process accessing the interface keeps running, however.
If we run the same process on a connected ethernet interface everything is OK.
Is this a known issue on an X4100?
# 15
top is available at sunfreeware.com.
It's one of the first things I always go get after setting up a new system. I don't like prstat... Doesn't give as much info.
# 16
This appears to be a known bug (6405012). Is there a fix available for it ?
WJ-KPa at 2007-7-21 16:57:26 >

# 17
Hi,
Not sure about your app/process, but did you look at network aggregation?
I am not sure if there is a fix, but if there is not (I know this is not a clean way), maybe you should consider crimping a gig-E loopback jack and leaving it connected to the interface in question. You do not have to assign an IP etc.; it will just show the interface as up.
Could you perhaps check the interface status with dladm show-dev and skip the ones that are not in an up state in your process?
http://www.kozio.com/downloads/kozio_tn_GigabitEthernetLoopback.php
http://docs.sun.com/source/819-6456/AppendixA.html#37279
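The dladm check suggested above could look roughly like the Python sketch below. It assumes that each line of "dladm show-dev" output starts with the device name and contains a "link: up" or "link: down" pair; that layout, the sample text and the function names are assumptions for illustration, so verify them against the real output on your release.

```python
import subprocess


def read_show_dev():
    """Run `dladm show-dev` and return its text (not called in this example)."""
    result = subprocess.run(["dladm", "show-dev"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_show_dev(text):
    """Map device name -> link state from show-dev style text."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if "link:" in fields:
            idx = fields.index("link:")
            # the state is the token right after "link:"
            if 0 < idx < len(fields) - 1:
                states[fields[0]] = fields[idx + 1]
    return states


def usable_interfaces(text):
    """Devices whose link is reported as up; the rest should be skipped."""
    return [dev for dev, state in parse_show_dev(text).items()
            if state == "up"]


# Assumed sample output, for illustration only.
SAMPLE = """\
e1000g0         link: up        speed: 1000  Mbps       duplex: full
e1000g1         link: down      speed: 0     Mbps       duplex: half
"""

print(usable_interfaces(SAMPLE))  # ['e1000g0']
```

The process would then only open the devices returned by usable_interfaces(read_show_dev()) and re-check periodically, so it never sits polling an interface with no cable attached.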
HvRa at 2007-7-21 16:57:26 >
