Strange CPU usage problem

We've got a new V890 running Solaris 10 that's been fine for a month. This morning the load average rose from its normal 2-5 range to around 120 in about 20 minutes. CPU utilization was 95%+ in the sys category, 2-3% user, and 0% I/O. I checked the processes with prstat, but nothing was using more than 1-2%, and those were user processes too. Any ideas what might have caused this? What should I look for if it happens again?

[457 byte] By [ifinzena] at [2007-11-27 11:28:03]
# 1

Do you have any ps output from that timeframe? It really would have helped. Depending on whether the machine has normalized again, it could be something routine, like an application sending out a mass mailing or a web server being hit hard by everyone in your company at once; if the box has an external presence, it could be anything. If you don't have the process tables from when it happened, look at the output of last to see whether anything strange was going on at that time, and check the console and /var/adm/messages in case something was logged. - jeff
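
Since the process table from the spike is what's missing, one option is to leave a small watcher running that records ps, prstat and vmstat output whenever the load average climbs. The following is only a sketch under assumptions that are not from this thread: a Python 3 interpreter is available on the box, the threshold of 20 and the /var/tmp/loadsnap prefix are arbitrary placeholders, and the function names are made up for illustration.

```python
import os
import subprocess
import time

# Commands to record. These are the tools mentioned in this thread;
# the exact flags are an assumption and may need adjusting.
SNAPSHOT_COMMANDS = [
    ["ps", "-ef"],
    ["prstat", "-mL", "-n", "50", "1", "1"],
    ["vmstat", "1", "5"],
]


def should_capture(load_1min, threshold):
    """Return True when the 1-minute load average exceeds the threshold."""
    return load_1min > threshold


def snapshot_name(prefix, when):
    """Build a timestamped file name from a time.struct_time."""
    return "%s-%s.txt" % (prefix, time.strftime("%Y%m%d-%H%M%S", when))


def capture_snapshot(path, commands=SNAPSHOT_COMMANDS):
    """Run each command and append its output to one snapshot file."""
    with open(path, "w") as out:
        for cmd in commands:
            out.write("### %s\n" % " ".join(cmd))
            try:
                result = subprocess.run(
                    cmd,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    universal_newlines=True,
                    timeout=60,
                )
                out.write(result.stdout)
            except (OSError, subprocess.TimeoutExpired) as exc:
                # A missing or hung tool should not stop the other captures.
                out.write("failed: %s\n" % exc)


def watch(threshold=20.0, interval=60, prefix="/var/tmp/loadsnap"):
    """Poll the load average forever and snapshot while it is high."""
    while True:
        load_1min = os.getloadavg()[0]
        if should_capture(load_1min, threshold):
            capture_snapshot(snapshot_name(prefix, time.localtime()))
        time.sleep(interval)
```

Calling watch() from a background job would leave one file per high-load minute to look through afterwards. Keep in mind that on a box with a load of 120 the captures themselves may be slow to start.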

jeffrey.sa at 2007-7-29 16:19:43 > top of Java-index,General,Talk to the Sysop...
# 2

High sys time, low user time (and you'll never see any I/O time).

It's possible that so many processes were spawned simultaneously that the scheduler was overwhelmed for a while.

So there was probably nothing wrong with the machine, just that it was very busy for a while. What's the application? That's where I'd be looking.

--

Darren

Darren_Dunhama at 2007-7-29 16:19:43
# 3

It happened again this morning, only an hour and a half earlier than yesterday. I checked the processes with prstat and nothing was consuming more than 3%. The number of users on the system wasn't unusual either. The box runs Oracle 10.2.0.3 and the CRM app Glovia. Using prstat -mL, I found that all of the processes were spending most of their time in LAT, but I can't figure out what they're waiting for. The app vendor suggested a couple of parameter changes that we're going to try tonight. At this point it looks like something in the app, but I can't narrow down exactly what.
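
For reference, the prstat man page describes the LAT column of the -m output as the percentage of time spent waiting for CPU, so threads with high LAT are sitting on the run queue rather than waiting on a lock or on I/O. To see how widespread that is, a saved prstat -mL capture can be summarised. The sketch below assumes the usual -m header (PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID); the function names are hypothetical.

```python
def parse_prstat_microstates(text):
    """Parse `prstat -mL` output into a list of dicts keyed by column name.

    Column names are taken from the header line itself, so the parser
    does not hard-code their positions.
    """
    rows = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "PID":
            # Header line: remember the column names.
            columns = fields
            continue
        if columns is None or not fields[0].isdigit():
            # Skip the "Total: ..." trailer and anything before a header.
            continue
        # Limit the split so the last column keeps any embedded spaces.
        fields = line.strip().split(None, len(columns) - 1)
        if len(fields) != len(columns):
            continue
        rows.append(dict(zip(columns, fields)))
    return rows


def mean_percent(rows, column):
    """Average a percentage column (for example "LAT") across all rows."""
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values) if values else 0.0
```

With a capture in a string called text, mean_percent(parse_prstat_microstates(text), "LAT") gives the average share of time the listed threads spent waiting for a CPU.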

ifinzena at 2007-7-29 16:19:43
# 4

When you say that none of the processes was consuming more than 3%, that's entirely possible. If you flood the machine with a lot of processes all at once, and each process's priority is fairly similar, this is probably the expected scenario. When this happened today, did you capture any ps -ef output? If so, how many processes were out there with pcpu > 0? Also, did you capture vmstat output from about the same time? It could be that you've got a very single-threaded app that's trying to run many executions all at once. Again, so long as they clean up and the system normalizes, this all sounds as expected: perhaps tunable, but probably quite normal.
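
Counting the processes with pcpu > 0 is easy to script once the output is saved. Note that plain ps -ef does not print a pcpu column, so this sketch assumes the capture was taken with ps -eo pid,pcpu,comm instead; the function name and the min_pcpu parameter are made up for illustration.

```python
def busy_processes(ps_text, min_pcpu=0.0):
    """Return (pid, pcpu, command) for every process above min_pcpu.

    Expects the output of `ps -eo pid,pcpu,comm`. The header line and any
    malformed lines are skipped because their fields do not parse as numbers.
    """
    busy = []
    for line in ps_text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue
        pid_text, pcpu_text, command = parts
        try:
            pid = int(pid_text)
            pcpu = float(pcpu_text)
        except ValueError:
            continue
        if pcpu > min_pcpu:
            busy.append((pid, pcpu, command))
    return busy
```

The length of the returned list is the number being asked for. A few hundred entries, each at a fraction of a percent, would fit the picture of many similar-priority processes flooding the machine at once.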

jeffrey.sa at 2007-7-29 16:19:43