watcher and msprobe in cluster environment
Sun Java(tm) System Messaging Server 6.2-6.01 (built Apr 3 2006)
libimta.so 6.2-6.01 (built 11:20:35, Apr 3 2006)
On Solaris 10 Cluster 3.1
We recently did some failover testing in our messaging environment, and using scswitch works beautifully, just as it should. However, when we tried to mimic a process dying (using kill), something weird happened. Will issued kill (no -9) on the mshttpd processes (we run 2), and sure enough, on the next msprobe run, mshttpd was restarted.
[28/Sep/2006:05:15:00 -0400] imsched[20787]: General Debug: starting /global/jeshome/SUNWmsgsr/lib/msprobe
[28/Sep/2006:05:15:00 -0400] msprobe[22027]: General Error: function=getserverresponse|port=80|error=failed to connect
[28/Sep/2006:05:15:00 -0400] msprobe[22027]: General Critical: HTTP server is not responding
[28/Sep/2006:05:15:00 -0400] msprobe[22027]: General Critical: mshttpd restart requested
. . .
[28/Sep/2006:05:15:00 -0400] stored[20781]: Store Warning: stored: Sun Java(tm) System Messaging Server STORE 6.2-6.01 (built Apr 3 2006) shutting down
[28/Sep/2006:05:15:01 -0400] stored[22030]: Store Warning: stored: Sun Java(tm) System Messaging Server STORE 6.2-6.01 (built Apr 3 2006) starting up
We decided to kill mshttpd again, and after 30 minutes nothing had been restarted. msprobe probed as scheduled (every 10 minutes, the default), but it never started mshttpd up again.
[28/Sep/2006:05:25:00 -0400] imsched[20787]: General Debug: starting /global/jeshome/SUNWmsgsr/lib/msprobe
[28/Sep/2006:05:25:00 -0400] msprobe[22849]: General Error: function=getserverresponse|port=80|error=failed to connect
[28/Sep/2006:05:25:00 -0400] msprobe[22849]: General Critical: HTTP server is not responding
[28/Sep/2006:05:25:00 -0400] msprobe[22849]: General Critical: restart: mshttpd process is not running
[28/Sep/2006:05:35:00 -0400] imsched[20787]: General Debug: starting /global/jeshome/SUNWmsgsr/lib/msprobe
[28/Sep/2006:05:35:00 -0400] msprobe[23678]: General Error: function=getserverresponse|port=80|error=failed to connect
[28/Sep/2006:05:35:00 -0400] msprobe[23678]: General Critical: HTTP server is not responding
[28/Sep/2006:05:35:00 -0400] msprobe[23678]: General Critical: restart: mshttpd process is not running
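In case it helps anyone comparing the two cases, the difference shows up in the last Critical line of each msprobe run: "mshttpd restart requested" the first time, "restart: mshttpd process is not running" afterwards. Below is a rough Python sketch for pulling that out of a log. It assumes the line layout seen in the excerpts above; the function name classify_probes and the outcome labels are my own, not anything from Messaging Server.

```python
import re

# Assumed layout, taken from the excerpts above:
# [timestamp] process[pid]: message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<proc>\w+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def classify_probes(lines):
    """Return (timestamp, outcome) for each msprobe line that mentions a restart."""
    results = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("proc") != "msprobe":
            continue  # skip imsched, stored, and unparseable lines
        msg = m.group("msg")
        if "restart requested" in msg:
            results.append((m.group("ts"), "restart requested"))
        elif "process is not running" in msg:
            results.append((m.group("ts"), "not restarted"))
    return results


# Lines copied from the log excerpts in this post.
sample = [
    "[28/Sep/2006:05:15:00 -0400] msprobe[22027]: General Critical: HTTP server is not responding",
    "[28/Sep/2006:05:15:00 -0400] msprobe[22027]: General Critical: mshttpd restart requested",
    "[28/Sep/2006:05:25:00 -0400] imsched[20787]: General Debug: starting /global/jeshome/SUNWmsgsr/lib/msprobe",
    "[28/Sep/2006:05:25:00 -0400] msprobe[22849]: General Critical: restart: mshttpd process is not running",
    "[28/Sep/2006:05:35:00 -0400] msprobe[23678]: General Critical: restart: mshttpd process is not running",
]

for ts, outcome in classify_probes(sample):
    print(ts, outcome)
```

This only labels what msprobe logged; it says nothing about why the second and third probes did not request a restart.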
We ended up stopping all services and restarting them with scswitch, and things came up fine.
I'm wondering whether the watcher/msprobe thought we wanted mshttpd down and so didn't restart it. Or is there something else going on here? Here are some of our other config settings:
# ./configutil |grep probe
local.schedule.msprobe = "5,15,25,35,45,55 * * * * /global/jeshome/SUNWmsgsr/lib/msprobe"
# ./configutil |grep watcher
local.watcher.enable = yes
# ./configutil |grep autorestart
local.autorestart = yes
# ./configutil |grep readtimeout
service.readtimeout = 30
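As a sanity check on the probe interval, here is a small Python sketch that works out the gaps between runs from the local.schedule.msprobe value above. It assumes the first field is a crontab-style, comma-separated list of minutes (which is how it looks to me); probe_gaps is just a name I made up.

```python
def probe_gaps(minute_field):
    """Return the gaps in minutes between consecutive scheduled runs within an hour."""
    minutes = sorted(int(m) for m in minute_field.split(","))
    # Gaps between neighbouring entries in the same hour.
    gaps = [b - a for a, b in zip(minutes, minutes[1:])]
    # Gap from the last run of one hour to the first run of the next.
    gaps.append(60 - minutes[-1] + minutes[0])
    return gaps


# Value copied from the configutil output above.
entry = "5,15,25,35,45,55 * * * * /global/jeshome/SUNWmsgsr/lib/msprobe"
print(probe_gaps(entry.split()[0]))
```

With our setting, every gap comes out to 10 minutes, which matches the 05:15, 05:25, and 05:35 probes in the logs.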

