GDS creation problems
Hello,
we'd like to write a set of wrapper scripts that will monitor and failover a simple webserver (lighttpd) with SC 3.2.
After registration of the service with the following command:
clresource create -g "my-rg" -t SUNW.gds \
-p Start_command="/usr/local/scripts/cluster_start_webgui.sh" \
-p Stop_command="/usr/local/scripts/cluster_stop_webgui.sh" \
-p Probe_command="/usr/local/scripts/cluster_probe_webgui.sh" \
-p Validate_command="/usr/local/scripts/cluster_validate_webgui.sh" \
-p Start_timeout=120 \
-p Stop_timeout=120 \
-p Probe_timeout=120 \
-p Network_resources_used=my-logical-name \
-p Scalable=false \
-p Failover_enabled=true \
-p Child_mon_level=0 \
-p Stop_signal=9 \
my-gds-resource
We received the following warning:
clresource: cluster_host1 -
Current setting of Retry_interval = 370,
might prevent failover on repeated probe failures.
It is recommended that Retry_interval be greater than or equal to
[(Thorough_probe_interval + Probe_timeout) * 2 * Retry_count].
Current values are (Thorough_probe_interval = 60,Retry_count = 2,Probe_timeout = 120).
This appeared safe enough to ignore for the time being.
Whether or not it is related to that warning, once we try to bring the new resource's resource group online, the start script and probe script are run alternately on one node, then on the other (the resource group is "Pending online" during this time), and then the resource group fails.
We have verified that our scripts work correctly for start, stop, probe and validate. Replacing the scripts with stubs that simply do
echo $0 >> /tmp/cluster-debug ; exit 0
did not solve the issue.
We've seen conflicting statements in the documentation regarding probe behavior - should the probe verify proper application execution and then return 0 if everything is OK, or should it run constantly as a daemon and return only when the application fails?
(see SUNW.gds manpage near the Probe_command property and Sun Cluster Data Services Developer's Guide, page 100, "How the Probe Program Works")
Of course, any help or ideas regarding the constant failovers or proper probe behavior would be greatly appreciated.
TIA,
-- leon
By napobo3a at 2007-11-26 15:56:29

# 1
Unfortunately you did not cite the relevant errors from /var/adm/messages. However, I have a guess at what your problem is:
> clresource create -g "my-rg" -t SUNW.gds \
> -p Start_command="/usr/local/scripts/cluster_start_webgui.sh" \
> -p Stop_command="/usr/local/scripts/cluster_stop_webgui.sh" \
> -p Probe_command="/usr/local/scripts/cluster_probe_webgui.sh" \
> -p Validate_command="/usr/local/scripts/cluster_validate_webgui.sh" \
> -p Start_timeout=120 \
> -p Stop_timeout=120 \
> -p Probe_timeout=120 \
> -p Network_resources_used=my-logical-name \
> -p Scalable=false \
> -p Failover_enabled=true \
> -p Child_mon_level=0 \
Why are you setting Child_mon_level=0? This means that PMF monitors only the processes started directly by "/usr/local/scripts/cluster_start_webgui.sh". If those processes disappear, PMF no longer has any process registered under its tag, so it signals that the resource has failed and triggers a restart.
> -p Stop_signal=9 \
> my-gds-resource
>
> We received the following warning:
>
> clresource: cluster_host1 -
> Current setting of Retry_interval = 370,
> might prevent failover on repeated probe failures.
> It is recommended that Retry_interval be greater than
> or equal to
> [(Thorough_probe_interval + Probe_timeout) * 2 *
> Retry_count].
> Current values are (Thorough_probe_interval =
> 60,Retry_count = 2,Probe_timeout = 120).
>
> Which appeared safe enough to ignore for the time
> being.
You can ignore it; it has nothing to do with the problem you then see, but in production you should follow the advice.
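To put a number on the warning: plugging the values it reports into the formula it quotes gives the recommended lower bound. This is only a sketch of the arithmetic; the shell variable names are mine, the values are the ones from the warning.

```shell
#!/bin/sh
# Values reported in the clresource warning
thorough_probe_interval=60
probe_timeout=120
retry_count=2

# Recommended lower bound from the warning:
# (Thorough_probe_interval + Probe_timeout) * 2 * Retry_count
min_retry_interval=$(( (thorough_probe_interval + probe_timeout) * 2 * retry_count ))

echo "Recommended Retry_interval >= ${min_retry_interval}"
```

With these values the bound works out to 720, well above the current 370, so for production a setting such as -p Retry_interval=720 in the create command would satisfy the recommendation.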
> With or without relation to the previous warning, it
> appears that once we try to bring the new resource's
> Resource Group to the online state, the probe script
> and start script are being run alternatingly on one
> node, then on the other (the resource group is
> "Pending online" during this time), and then the
> resource group fails.
I guess you will see a message from PMF basically saying "Failed to stay up". If so, your Child_mon_level setup is the problem.
> We have verified that our scripts work correctly for
> start, stop, probe and validate. Replacing the
> scripts with simple stubs that simply do
> echo $0 >> /tmp/cluster-debug ; exit 0
> did not solve the issue.
Sure, because the cited commands will not leave a process for PMF either. All of them will finish immediately.
> We've seen conflicting statements in the
> documentation regarding probe behavior - should the
> probe verify proper application execution, then
> return 0 if everything is OK, or should it run
> constantly as a daemon and return only the
> application fails?
It should run a single check and always exit with a correct return code. It should not run as a daemon; GDS already takes care of invoking it periodically for you.
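A minimal sketch of such a one-shot probe, assuming (as I read the SUNW.gds manpage) that exit status 0 means healthy and 100 means complete failure. The function name probe_once and the pgrep check in the comment are only illustrations, not something GDS prescribes.

```shell
#!/bin/sh
# Hypothetical helper: run ONE health check and translate its result
# into a probe return code. It never loops or daemonizes; the
# periodic re-invocation is left to GDS.
probe_once() {
    # "$@" is the health-check command, e.g. pgrep -x lighttpd
    if "$@" >/dev/null 2>&1; then
        return 0      # check passed: application looks healthy
    fi
    return 100        # check failed: report complete failure (assumed value)
}

# In a real probe script the last line would be something like:
#   probe_once pgrep -x lighttpd; exit $?
```

A check that only looks for the process is the crudest option; fetching a page through the logical hostname would tell you more, but the shape of the script stays the same: check once, exit with the result.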
> (see SUNW.gds manpage near the Probe_command property
> and Sun Cluster Data Services Developer's Guide, page
> 100, "How the Probe Program Works")
>
> Of course, any help or ideas regarding the constant
> failovers or proper probe behavior would be greatly
> appreciated.
Read the SUNW.gds manpage and the pmfadm manpage in order to understand the Child_mon_level behaviour. Then check whether your Start_command leaves processes behind; if so, set Child_mon_level accordingly. The default (-1) would have monitored any level of children.
If your application does not leave any process behind after Start_command, you need to disable the PMF action script via pmfadm -s within your Start_command.
Greets
Thorsten