watcher and msprobe in cluster environment

Sun Java(tm) System Messaging Server 6.2-6.01 (built Apr 3 2006)

libimta.so 6.2-6.01 (built 11:20:35, Apr 3 2006)

On Solaris 10 Cluster 3.1

We recently did some failover testing in our messaging environment, and using scswitch works beautifully, just as it should. However, when we tried to mimic a process dying (using kill), something weird happened. Will issued kill (no -9) on the mshttpd processes (we run 2), and sure enough, when msprobe probed, mshttpd was restarted.

[28/Sep/2006:05:15:00 -0400] imsched[20787]: General Debug: starting /global/jeshome/SUNWmsgsr/lib/msprobe

[28/Sep/2006:05:15:00 -0400] msprobe[22027]: General Error: function=getserverresponse|port=80|error=failed to connect

[28/Sep/2006:05:15:00 -0400] msprobe[22027]: General Critical: HTTP server is not responding

[28/Sep/2006:05:15:00 -0400] msprobe[22027]: General Critical: mshttpd restart requested

. . .

[28/Sep/2006:05:15:00 -0400] stored[20781]: Store Warning: stored: Sun Java(tm) System Messaging Server STORE 6.2-6.01 (built Apr 3 2006) shutting down

[28/Sep/2006:05:15:01 -0400] stored[22030]: Store Warning: stored: Sun Java(tm) System Messaging Server STORE 6.2-6.01 (built Apr 3 2006) starting up

We decided to kill mshttpd again, and after 30 minutes nothing had been restarted. msprobe probed as scheduled (every 10 minutes, the default), but it never started mshttpd again.

[28/Sep/2006:05:25:00 -0400] imsched[20787]: General Debug: starting /global/jeshome/SUNWmsgsr/lib/msprobe

[28/Sep/2006:05:25:00 -0400] msprobe[22849]: General Error: function=getserverresponse|port=80|error=failed to connect

[28/Sep/2006:05:25:00 -0400] msprobe[22849]: General Critical: HTTP server is not responding

[28/Sep/2006:05:25:00 -0400] msprobe[22849]: General Critical: restart: mshttpd process is not running

[28/Sep/2006:05:35:00 -0400] imsched[20787]: General Debug: starting /global/jeshome/SUNWmsgsr/lib/msprobe

[28/Sep/2006:05:35:00 -0400] msprobe[23678]: General Error: function=getserverresponse|port=80|error=failed to connect

[28/Sep/2006:05:35:00 -0400] msprobe[23678]: General Critical: HTTP server is not responding

[28/Sep/2006:05:35:00 -0400] msprobe[23678]: General Critical: restart: mshttpd process is not running
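The two kinds of probe run above end with different messages: "mshttpd restart requested" the first time, and "restart: mshttpd process is not running" afterwards. A rough sketch for counting each outcome in a log file follows. The function name and the log-file argument are hypothetical, and the message strings are simply copied from the excerpts above.

```shell
# Sketch: count how msprobe ended each run.
# $1 = path to a log file containing msprobe lines (hypothetical argument).
msprobe_summary() {
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    requested=$(grep -c 'mshttpd restart requested' "$log" || true)
    not_running=$(grep -c 'restart: mshttpd process is not running' "$log" || true)
    echo "restart requested: $requested"
    echo "reported not running: $not_running"
}
```

Run against the lines quoted in this post, this would show one requested restart followed by two runs that only reported the process as not running.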

We ended up stopping all services and restarting them with scswitch, and everything came up fine.

I'm wondering whether watcher/msprobe thought we wanted mshttpd down and so didn't restart it. Or is there something else going on here? Here are some other config settings:

# ./configutil |grep probe

local.schedule.msprobe = "5,15,25,35,45,55 * * * * /global/jeshome/SUNWmsgsr/lib/msprobe"

# ./configutil |grep watcher

local.watcher.enable = yes

# ./configutil |grep autorestart

local.autorestart = yes

# ./configutil |grep readtimeout

service.readtimeout = 30
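
As a sanity check on the probe interval, the first field of local.schedule.msprobe can be expanded to show the gap between runs. This is a sketch that assumes the value follows crontab-style field order (minutes first), which matches the 05:15, 05:25 and 05:35 timestamps in the log; the function name is hypothetical.

```shell
# Sketch: print the gap in minutes between consecutive entries
# of a comma-separated minute list (assumed crontab-style first field).
minute_gaps() {
    # $1 = minute list, e.g. "5,15,25,35,45,55"
    printf '%s\n' "$1" | tr ',' '\n' |
        awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

minute_gaps "5,15,25,35,45,55"
```

Every gap comes out as 10, so the probe is firing every 10 minutes as expected and the schedule itself does not look like the problem.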

[2935 byte] By [JHU_JES] at [2007-11-26 10:26:31]
# 1

Check your configutil settings for

local.autorestart

Unless this is set to "yes", "1", or "on", watcher will not restart your services.
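
If you want to script that check, something like the following would do as a minimal sketch. Both function names are hypothetical, the accepted values are just the three listed above, and the comparison is assumed to be case-sensitive. It parses a "name = value" line in the form configutil prints it.

```shell
# Extract the value from a "name = value" line as printed by configutil,
# dropping surrounding double quotes if present.
config_value() {
    printf '%s\n' "$1" | sed -e 's/^[^=]*=[ ]*//' -e 's/^"//' -e 's/"$//'
}

# Return 0 only for the values listed above as enabling autorestart.
is_enabled() {
    case "$1" in
        yes|1|on) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example with the line from the original post:
line='local.autorestart = yes'
if is_enabled "$(config_value "$line")"; then
    echo "autorestart is enabled"
else
    echo "autorestart is NOT enabled"
fi
```

On a live system the line could come from `./configutil | grep autorestart`, as in the original post.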

It's possible you have hit some bug I don't know about in watcher or the restart process. If your configuration is set correctly, please contact Support.

jay_plesset at 2007-7-7 2:30:12