Why is the retry_count property of a resource not honored before failover?
SunCluster 3.1
I set the following properties on a resource:
start_timeout--60
retry_count--5
retry_interval--300
The intended behavior is: start the resource, wait up to 60 seconds each time, and retry if it fails, so that the total time before failover is 300 seconds.
But the actual behavior is:
it retries twice and then fails over. The log says:
"because this resource group has already failed to start on this node 2 or more times in the past 3600 seconds"
What does that mean?
I checked, and it seems that a property of the resource type controls the failover.
If I want to set an overall retry_count and retry_interval for failover, where should I set them?
By shock_ua at 2007-11-27 7:55:28

# 1
Hi,
You are seeing the message because of the Pingpong_interval property. It stops a resource group from continuously failing over back and forth between nodes. Basically, if a resource has failed on a node and been switched away, it won't be allowed to switch back to that node until the Pingpong_interval has expired (default: 1 hour).
If you are testing, you can temporarily reduce the property with:
scrgadm -c -g <resource group name> -y Pingpong_interval=60
Once testing is finished, set it back to the default:
scrgadm -c -g <resource group name> -y Pingpong_interval=3600
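To make the rule easier to picture, here is a rough Python sketch of the check that the log message describes. It is only a model, not Sun Cluster code: the function name may_start_on_node is made up, and the threshold of 2 failed starts is an assumption read off the quoted log message ("2 or more times in the past 3600 seconds").

```python
def may_start_on_node(start_failures, now, pingpong_interval=3600, threshold=2):
    """Return True if the resource group may be started on this node.

    start_failures: timestamps (in seconds) of earlier failed starts on this node.
    threshold: assumed from the quoted log message, not from documentation.
    """
    # Only failed starts inside the Pingpong_interval window count.
    recent = [t for t in start_failures if now - t < pingpong_interval]
    return len(recent) < threshold


# Two failed starts, 120 s and 60 s before "now".
failures = [880, 940]
print(may_start_on_node(failures, now=1000))                        # default window
print(may_start_on_node(failures, now=1000, pingpong_interval=60))  # window shortened for testing
```

With the default window both failures still count, so the node is skipped. With the window shortened to 60 seconds they have both aged out, so the node is tried again.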
# 2
I know about Pingpong_interval, but why only twice? Where can I set the number of retries?
I have also tried setting Pingpong_interval to 60, but that causes continuous retries, even after reaching the retry_interval (300) that I set on the resource.
Which property controls the total retry interval?
Can I configure a retry count for Pingpong_interval?
Thanks
# 3
Do you know why your resource is failing? Does it ever successfully go online?
I think if your resource does not start and go online then the retry count / interval properties are not used.
The way to test your resource is first to make sure it goes online. Then induce some failure e.g. killing a process. You should see the resource restart. Repeating this will cause failover only after retry_count failures.
If the resource does not come online then fix it before continuing.
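As a rough illustration of the restart-then-failover behavior described above, here is a small Python model. It is not Sun Cluster code: the function name monitor_action is hypothetical, and the exact counting rule (restart while fewer than retry_count restarts fall inside retry_interval) is my assumption about how the properties interact. The values 5 and 300 are the ones from the original question.

```python
def monitor_action(restart_times, now, retry_count=5, retry_interval=300):
    """Decide what a fault monitor does when the probe reports a failure.

    restart_times: timestamps (in seconds) of restarts already performed.
    Assumed rule: restart while fewer than retry_count restarts happened
    within the last retry_interval seconds, otherwise give up and fail over.
    """
    recent = [t for t in restart_times if now - t < retry_interval]
    return "restart" if len(recent) < retry_count else "failover"


# The resource is online; a failure is induced every 50 seconds.
restarts = []
actions = []
for now in range(0, 300, 50):  # failures at t = 0, 50, 100, 150, 200, 250
    action = monitor_action(restarts, now)
    actions.append(action)
    if action == "restart":
        restarts.append(now)

print(actions)
# ['restart', 'restart', 'restart', 'restart', 'restart', 'failover']
```

Note that this model only applies once the resource has come online. A start that never succeeds does not go through this path at all.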
# 4
The start failure may be caused by a lack of resources or by waiting for another process, but the designed behavior is:
1. Retry 5 times, with a start_timeout of 60 each time, for a total retry_interval of 300.
But the problem is that the cluster only lets the resource retry twice.
I don't understand what "restart" means here. When a resource times out while starting, Sun Cluster restarts it, but Sun Cluster also counts that as one failure.
Do you mean we should increase the start_timeout and do the "restart" inside the application? Our design relies on Sun Cluster to restart the failed application.
# 5
The current sequence is:
1. The resource is started by the cluster
2. The prober finds it is not OK
3. Start timeout
4. Restart
...
5. After two retries: "Not attempting to start resource group , because this resource group has already failed to start on this node 2 or more times in the past 3600 seconds"
6. Failover
The status changes as follows:
offline --> starting unknown --> stopping --> starting unknown --> stopping --> offline
# 6
So your application never successfully starts. What is the application and what Sun Cluster agent are you using?
You should disable the resource in the cluster and then check that it starts manually on BOTH nodes. If it is a 'custom agent', then also check the probe command / script.
Once the application is stable, you should re-enable the resource in the cluster.
# 7
This is our own application, and we wrote the prober for it. We are testing it, and we want it to retry at least 5 times before it fails over to the other node.
But Sun Cluster retried only twice before failing over.
So the problem is not that we cannot start the application; we want Sun Cluster to retry more times before it fails over.
# 8
Before a resource can *re*start, it needs to start successfully in the first place.
The restart then happens when the fault monitor detects a failure.
So if I understand your case, the resource never starts successfully. Is that correct?
That means it tries to start on the first node, fails, tries on the second node, fails, and then the Pingpong_interval kicks in, since there is no point in trying to start on a node where it has failed very recently.
The Retry_count is only counted after the resource has started.
So make sure your resource can start successfully, then introduce faults such as killing it, so that the probe detects a failure. Then you should see the configured parameters apply.
Greets
Thorsten
# 9
Very nice piece of information. Thank you. It helped me understand one of the issues I worked on some time back.