Resource tunables

An IPMP failure triggers the SharedAddress to fail over, but the failover is taking about 4 minutes. Which tunable can be adjusted to fix this?

Probably in the following fashion,

scrgadm -c -j cluster -y tunable="something"

or via the scsetup menu.

I'm trawling through the man page for r_properties, so if I find something I'll post the solution back here. Otherwise, if someone knows, please be my guest.

The defaults are as follows:

Resource_dependencies          <NULL>
Resource_dependencies_weak     <NULL>
Resource_dependencies_restart  <NULL>
PRENET_START_TIMEOUT           300
MONITOR_CHECK_TIMEOUT          300
MONITOR_STOP_TIMEOUT           300
MONITOR_START_TIMEOUT          300
BOOT_TIMEOUT                   300
FINI_TIMEOUT                   300
INIT_TIMEOUT                   300
UPDATE_TIMEOUT                 300
VALIDATE_TIMEOUT               300
STOP_TIMEOUT                   300
START_TIMEOUT                  500
Failover_mode                  SOFT

By diggles at 2007-11-26 10:48:43
# 1

What time period are you measuring? Is this from the point at which the IPMP group fails until the application is ready again? If so, what period is taking the longest? Normally, it is application start up that takes the time. IPMP failures usually trigger a give-over in about 10 seconds. So the rest of the time is usually the service stop, the switch-over and the restart.

What service are we talking about here?

Tim

TimRead at 2007-7-7 3:01:12 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 2
A switchover takes 10 seconds, but a failover (unplugging both public interfaces on the master node to cause an IPMP failure) takes 4m12s on average to transfer control to the other node.
diggles at 2007-7-7 3:01:12
# 3

It would be helpful to have more details on the service first of all. What is the service you are running? Next it would help to have some timings for each stage of the give-over. What is taking longest?

I've seen issues like this before where the name service hasn't been configured correctly in nsswitch.conf and it takes the service, e.g. NFS, a long time to shut down.

Tim

TimRead at 2007-7-7 3:01:12
# 4

When IPMP is offline, the SharedAddress should also go offline immediately and fail over. This does not happen, so it does not even get to the stage of starting or stopping services; it wedges itself at the IPMP failure detection. I don't think stop/start times on services are relevant to the troubleshooting at this stage. I'll give you a detailed sequence of events if it helps, though.

diggles at 2007-7-7 3:01:12
# 5
That sounds odd. Messages from /var/adm/messages from around the point of failure would be helpful. I assume the system is fairly fully patched too?

Some info on the exact service configuration would also help.

Thanks,

Tim

TimRead at 2007-7-7 3:01:12
# 6

1. STATUS PRIOR TO FAULT

###################################################

Resource: clusternet  nodetwo  Online  Online - SharedAddress online.
Resource: httpd       nodetwo  Online  Online
Resource: squid       nodetwo  Online  Online
Resource: named       nodetwo  Online  Online

2. IMMEDIATELY AFTER UNPLUGGING PUBLIC ETHERNETS

###################################################

Saturday October 21 16:01:48 EST 2006

Resource: clusternet  nodetwo  Online but not monitored  Degraded - IPMP Failure.

3. AFTER TIMEOUT PERIOD

###################################################

Saturday October 21 16:05:18 EST 2006

Resource: clusternet  nodeone  Online  Online - SharedAddress online.
Resource: httpd       nodeone  Online  Online
Resource: squid       nodeone  Online  Online
Resource: named       nodeone  Online  Online

Looks like httpd is the culprit... but I don't yet know why.....

nodetwo /var/adm# fgrep "Oct 21 16" messages|fgrep httpd

Oct 21 16:01:46 nodetwo Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <gds_monitor_stop> for resource <httpd>, resource group <cluster-services>, timeout <300> seconds

Oct 21 16:01:46 nodetwo Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <gds_monitor_stop> for resource <httpd>, resource group <cluster-services>, timeout <300> seconds

Oct 21 16:01:47 nodetwo SC[SUNW.gds:5,cluster-services,httpd,gds_monitor_stop]: [ID 227820 daemon.info] Attempting to stop the data service running under process monitor facility.

Oct 21 16:01:47 nodetwo SC[SUNW.gds:5,cluster-services,httpd,gds_monitor_stop]: [ID 675776 daemon.info] Stopped the fault monitor.

Oct 21 16:01:47 nodetwo Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <gds_monitor_stop> completed successfully for resource <httpd>, resource group <cluster-services>, time used: 0% of timeout <300 seconds>

Oct 21 16:01:47 nodetwo Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <gds_monitor_stop> completed successfully for resource <httpd>, resource group <cluster-services>, time used: 0% of timeout <300 seconds>

Oct 21 16:01:47 nodetwo Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <gds_svc_stop> for resource <httpd>, resource group <cluster-services>, timeout <300> seconds

Oct 21 16:01:47 nodetwo Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <gds_svc_stop> for resource <httpd>, resource group <cluster-services>, timeout <300> seconds

Oct 21 16:01:47 nodetwo SC[SUNW.gds:5,cluster-services,httpd,gds_svc_stop]: [ID 721263 daemon.info] Extension property <stop_signal> has a value of <15>

Oct 21 16:05:07 nodetwo SC[SUNW.gds:5,cluster-services,httpd,gds_svc_stop]: [ID 227820 daemon.info] Attempting to stop the data service running under process monitor facility.

Oct 21 16:05:07 nodetwo SC[SUNW.gds:5,cluster-services,httpd,gds_svc_stop]: [ID 401400 daemon.info] Successfully stopped the application

Oct 21 16:05:07 nodetwo Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <gds_svc_stop> completed successfully for resource <httpd>, resource group <cluster-services>, time used: 66% of timeout <300 seconds>

Oct 21 16:05:07 nodetwo Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <gds_svc_stop> completed successfully for resource <httpd>, resource group <cluster-services>, time used: 66% of timeout <300 seconds>
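
To see which method eats the time, the "launching method" and "completed successfully" pairs in the log can be matched up and subtracted. The following is only a rough sketch in Python, not anything shipped with Sun Cluster: the function name method_durations, the regular expressions, and the year argument (syslog timestamps carry no year, so 2006 is taken from the status output) are all assumptions made for illustration. The sample lines are copied from the excerpt above, with the duplicated rgmd entries left out.

```python
import re
from datetime import datetime

# Sample lines copied from the /var/adm/messages excerpt (duplicates omitted).
LOG = """\
Oct 21 16:01:46 nodetwo Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <gds_monitor_stop> for resource <httpd>, resource group <cluster-services>, timeout <300> seconds
Oct 21 16:01:47 nodetwo Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <gds_monitor_stop> completed successfully for resource <httpd>, resource group <cluster-services>, time used: 0% of timeout <300 seconds>
Oct 21 16:01:47 nodetwo Cluster.RGM.rgmd: [ID 707948 daemon.notice] launching method <gds_svc_stop> for resource <httpd>, resource group <cluster-services>, timeout <300> seconds
Oct 21 16:05:07 nodetwo Cluster.RGM.rgmd: [ID 736390 daemon.notice] method <gds_svc_stop> completed successfully for resource <httpd>, resource group <cluster-services>, time used: 66% of timeout <300 seconds>
"""

STAMP = r"^(\w{3} +\d+ \d\d:\d\d:\d\d) "
LAUNCH = re.compile(STAMP + r".*launching method <(\w+)> for resource <([\w-]+)>")
DONE = re.compile(STAMP + r".*method <(\w+)> completed successfully for resource <([\w-]+)>")


def parse_stamp(stamp, year):
    # Syslog timestamps have no year, so one is supplied by the caller.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def method_durations(text, year=2006):
    """Return {(resource, method): seconds} for each launch/complete pair."""
    started = {}
    durations = {}
    for line in text.splitlines():
        m = LAUNCH.match(line)
        if m:
            # Keep the first launch time; repeated log lines are ignored.
            started.setdefault((m.group(3), m.group(2)), parse_stamp(m.group(1), year))
            continue
        m = DONE.match(line)
        if m:
            key = (m.group(3), m.group(2))
            if key in started and key not in durations:
                elapsed = parse_stamp(m.group(1), year) - started[key]
                durations[key] = int(elapsed.total_seconds())
    return durations


if __name__ == "__main__":
    for (resource, method), seconds in method_durations(LOG).items():
        print(f"{resource:8} {method:20} {seconds:4d} s")
```

On these lines gds_svc_stop for httpd accounts for 200 seconds, which agrees with the "66% of timeout <300 seconds>" that rgmd reports, while gds_monitor_stop takes about a second.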

diggles at 2007-7-7 3:01:12
# 7

The plot thickens...

Oct 21 17:53:37 nodeone SC[SUNW.gds:5,cluster-services,httpd,gds_svc_stop]: The stop command </etc/init.d/httpd stop> failed to stop the application. Will now use SIGKILL to stop the application.

I'll have to fix that stop script.

Thanks for the help Tim!

diggles at 2007-7-7 3:01:12
# 8
I'm not sure what I've done apart from suggest you look at the logs! Good luck.

Tim

TimRead at 2007-7-7 3:01:12
# 9
Sometimes the most obvious suggestions are the most helpful :-)
diggles at 2007-7-7 3:01:12