Solaris 10 (amd64) and spontaneously high load average problem
[ I may have posted this to the wrong forum "solaris 10 features", so I am reposting here, sorry. ]
I have a "server" running Solaris 10. It's an AMD64 server with 1 GB of RAM and a pair of 160 GB SATA disks on an Asus NForce4 board. The system uses SVM to mirror root, and ZFS for home directories, which are shared out via NFS and Samba to a closed network of 5 machines. That's pretty much all the machine does.
The problem is that I recently ran "smpatch update" on the machine to update to a more recent version of code than I was running (security patches, etc were needed, and this way seemed easy to do). After rebooting, the system is fine for up to 24 hours, sometimes only 12. After that, the load average climbs to the 4.0 range and sits there. vmstat shows the system with no running I/O and the CPU ~95% busy. prstat shows no single task using anything more than 0.1% of the CPU.
I have next to no experience with combing the kernel for information on what it's doing "under the hood", but I did see that when the system was "bad", I had 13000-ish "nthreads" in two places (under cpu). When the system is freshly booted, that number is closer to 1800, but it seems to climb slowly over time. For instance, 30 minutes after booting back up and letting the system sit there doing nothing at all, it's up to 2100. I don't know if that's normal for a kernel thread count, but with no userspace jobs using the CPU, that's the only place I can look.
Any suggestions as to how to go about diagnosing and fixing this?
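As a rough sanity check on those numbers, here is a minimal sketch that extrapolates the thread count. It assumes the growth is linear, which is only a guess from two data points; the function names are made up for illustration.

```python
def growth_rate(start_count, later_count, minutes):
    """Average growth in threads per minute between two samples."""
    return (later_count - start_count) / minutes


def minutes_to_reach(start_count, target_count, rate_per_minute):
    """Minutes needed to grow from start_count to target_count at a constant rate."""
    if rate_per_minute <= 0:
        raise ValueError("rate must be positive to ever reach the target")
    return (target_count - start_count) / rate_per_minute


# Figures from the post: about 1800 at boot, 2100 after 30 idle minutes,
# and roughly 13000 once the system has gone "bad".
rate = growth_rate(1800, 2100, 30)
eta = minutes_to_reach(1800, 13000, rate)
print(f"{rate:.1f} threads/min, about {eta / 60:.1f} hours to reach 13000")
```

If the growth really is steady, that puts the 13000 mark somewhere inside the 12 to 24 hour window in which the load average goes up, so the two symptoms may well be related.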
[1573 byte] By [mzh46609a] at [2007-11-27 6:39:04]

# 1
> The problem is that I recently ran "smpatch update"
> on the machine to update to a more recent version of
> code than I was running (security patches, etc were
> needed, and this way seemed easy to do). After
> rebooting, the system is fine for up to 24 hours,
> sometimes only 12. After that, the load average
> climbs to the 4.0 range and sits there. vmstat shows
> the system with no running I/O and the CPU ~95% busy.
> prstat shows no single task using anything more than
> 0.1% of the CPU.
Assuming that the only thing that you or anyone else has done to this machine is to patch it, then you might want to consider backing out the patches one at a time until the issue resolves itself.
After you figure out which patch caused the problem, submit a bug report on sunsolve.com.
Alternatively, search SunSolve for this condition. You might also want to see if newer versions of the patches you added are available.
alan
# 2
While I understand backing out each patch to see what it did, and it's not quite a monumental effort, I was really more interested in whether I could determine what all these kernel threads were doing. If I can't get a handle on that, I'll probably go the route of backing out patches.
Another option that I've been kicking around is just disabling all the services on it and re-enabling them one by one to see which one starts the kernel thread count growth (if that's even the problem, it just seems that an idle system shouldn't have that growth, which is why I am looking at it).
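The one-by-one approach could be sketched roughly like this. Everything here is hypothetical: the `enable` and `is_leaking` callables are placeholders (on the real box they would wrap something like `svcadm enable` and a before/after comparison of the thread count), and the simulated culprit is picked arbitrarily just so the example runs.

```python
def find_first_bad(services, enable, is_leaking):
    """Enable services in order; return the first one after which the leak shows up."""
    for name in services:
        enable(name)
        if is_leaking():
            return name
    return None  # nothing in the list triggered the growth


# Simulated run only: no real services are touched.
enabled = []
SIMULATED_CULPRIT = "samba"  # arbitrary choice for the demo, not a diagnosis


def fake_enable(name):
    enabled.append(name)


def fake_is_leaking():
    # Pretend the thread count starts growing once the culprit is running.
    return SIMULATED_CULPRIT in enabled


culprit = find_first_bad(["nfs", "samba", "fetchmail"], fake_enable, fake_is_leaking)
print(culprit)
```

The catch in practice is that each step needs enough idle time for the thread count to show a trend, so with a growth of a few hundred threads per half hour a full pass would take a while.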
Nobody else has access to the machine aside from my roommate, who only uses one of the Samba shares, and I am the only one with root access. I never even installed anything else on it aside from fetchmail and pine. It's basically just stock+patches.
As of yesterday, it reports "No patches required." when I run "smpatch analyze". I had 118855-33 as the kernel patch on it before this started, and that same kernel revision when it initially started (after doing some patches). I installed 118855-36 to see if that would cure it, then I installed 125101-08. So, it would seem that I have the latest kernel on it.
# 4
Well, this is odd. There are only two processes that have accumulated any time in 12 hours of uptime: sched and fsflush. sched has accumulated 0:29 and fsflush 0:14. I've not done much of anything on the server this run, so I am not sure why the scheduler would have accumulated that amount of time. Looking at a v480 with 6 days uptime shows 0:06 time accumulated on sched, and it's a cluster node. I'd think the scheduler works harder there.
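To put the two sched figures on the same scale, here is a small sketch that converts them to CPU seconds per hour of uptime. It assumes the times are in minutes:seconds form, which is my reading of the output rather than something I have confirmed.

```python
def cpu_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '0:29' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def seconds_per_hour(mmss, uptime_hours):
    """Accumulated CPU seconds per hour of uptime."""
    return cpu_seconds(mmss) / uptime_hours


server = seconds_per_hour("0:29", 12)    # this server, 12 hours up
v480 = seconds_per_hour("0:06", 6 * 24)  # the v480, 6 days up
ratio = server / v480
print(f"server: {server:.2f} s/h, v480: {v480:.3f} s/h, ratio: {ratio:.0f}x")
```

On those assumptions, sched on this box is accumulating time dozens of times faster than on the v480, which fits the idea that the time is going to kernel threads rather than to anything in userspace.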
Anyone have any insight ?