Monitoring hardware with SNMP

I am currently the Systems Admin for 30 Solaris 10 systems (comprising 67 Solaris zones and domains).

My support team colleagues have installed Nagios (an open-source, SNMP-based monitoring package) to alert us should any issues occur.

This works fantastically for our database tables, applications, and even elements such as memory usage, CPU usage, filesystem usage, processes running etc.

However, what this seems to lack is the ability to detect physical hardware failures.

For example, over the past month I have discovered two hardware failures *by pure chance* (walking past the systems and seeing error lights, or probing a system while training the other Sys Admin in my team), neither of which Nagios picked up.

These were 1) A hard disk failure - all filesystems on this hard disk were RAID 1 metadevices, so from a filesystem / Nagios POV there was no error.

2) A PSU failure - there is a redundant PSU in this system, so we did not lose the system, and from a functionality POV we did not have any issues.

Each of these errors shows up in the messages files, and also in the hardware probes Sys Admins run, such as prtdiag, or cfgadm for the disk, but Nagios failed to detect them.

So - my question is whether it is possible to use SNMP to query the hardware status on a Solaris 10 system?

I know SunMC is a monitoring suite which uses SNMP tables to query usage and hardware, so I know it should be possible - I am just unable to figure out which tables to query, etc.

Help, please!

[1613 byte] By [Dougiesica] at [2007-11-27 10:08:37]
# 1
What we're doing is writing a custom Nagios module to scan /var/adm/messages and watch for problems. The hard bit is tuning it so it alerts for real problems and ignores the dross.
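
A rough sketch of what such a module might look like in Python, not the actual module described above. It follows the usual Nagios plugin convention of exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN with a one-line status on stdout. The regular expressions in ALERT_PATTERNS and IGNORE_PATTERNS are hypothetical placeholders, and so are the names classify and main; the real message text depends on your hardware and drivers, so build the lists from your own messages files.

```python
import re

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical patterns: replace with strings seen in your own messages files.
ALERT_PATTERNS = [
    (re.compile(r"needs maintenance", re.I), WARNING),
    (re.compile(r"power supply.*fail", re.I), CRITICAL),
]

# Hypothetical noise filter: lines matching these are never reported.
IGNORE_PATTERNS = [
    re.compile(r"sendmail\["),
]


def classify(lines):
    """Return (worst_status, matching_lines) for an iterable of log lines."""
    worst = OK
    hits = []
    for line in lines:
        if any(p.search(line) for p in IGNORE_PATTERNS):
            continue  # known dross, skip it
        for pattern, severity in ALERT_PATTERNS:
            if pattern.search(line):
                hits.append(line.rstrip())
                worst = max(worst, severity)
                break  # first matching pattern decides this line
    return worst, hits


def main(path="/var/adm/messages"):
    """Scan the log file, print a Nagios status line, return the exit code."""
    try:
        with open(path, errors="replace") as fh:
            status, hits = classify(fh)
    except OSError as exc:
        print(f"UNKNOWN - cannot read {path}: {exc}")
        return UNKNOWN
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    print(f"{label} - {len(hits)} matching line(s) in {path}")
    for hit in hits[-5:]:
        print(hit)  # show the most recent matches as extra detail
    return status
```

To run it as a plugin, finish the script with `import sys; sys.exit(main())`. Note that this sketch rereads the whole file on every check, so an old error keeps alerting until the log rotates; a real module would probably remember how far it had already read.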
robert.cohena at 2007-7-13 0:45:16