Sun Cluster 3.2 problems
We have set up a two-node cluster on two v240s running SC 3.2 and Solaris 10 11/06 with a Recommended Patch set from earlier this May.
The cluster was created with two interconnects using crossover cables on bge2 and bge3, one 60GB LUN from an IBM SAN set up as a quorum device and for data, bge0 on our public LAN, and one resource group running Oracle 10gR2. There are no relevant errors in the logs or in the output of 'sccheck'.
I set this up, got the resource group running, and started testing failover. We did manual switches a few times without problems. Then we tried to simulate something more random and rebooted the primary cluster node. The resource group seemed to fail over cleanly; however, when I tried to switch it back to the first node (which had come up again), we got the error "the resource group is undergoing a reconfiguration, please try again later." I tried to do a quiesce, but that just sat there and seemed to do nothing; I ended up killing the process. The individual resources began to show up as failed or in various error states. Eventually I was able to get those cleared and everything marked as offline, but when I try to online the group or switch it, I get the same "the resource group is undergoing a reconfiguration, please try again later." error, even after a complete shutdown and boot of the entire cluster.
I also mounted the disk and started the database by hand and all seemed well, so the issue seems to be cluster related, but I cannot find any more detailed errors or more information from Google or docs.sun.com. Does anyone have any pointers on where to look on the cluster, or any other suggestions?
A second problem that concerns me, although it isn't strictly as important, is that 'cluster shutdown' does not work despite the configuration and the physical setup seeming to be OK. I issued the command and it broadcast a message on both nodes that it would shut down in 60 seconds, but did nothing else. Any ideas?
I put most of this post together yesterday, so I have an update. After 6+ hours the resource group seems to have become available and working again. I have no actual time for when that occurred, so it could have been closer to 24 hours, and, of course, no information on why. I don't think HA is functioning if it takes that long to become available after a fault. On this particular issue, does anyone know where I can find more detailed information about what may have occurred over the last 24 hours, and when and why the group became functional again?
By AlanRa at 2007-11-27 5:26:49

# 1
> We have setup a two-node cluster on two v240s running
> SC 3.2 and Solaris 10 11/06 with a Recommended Patch
> set from earlier this May.
OK, that's a good start.
> The cluster was created, we have two interconnects
> using crossover cables on bge2 and bge3, one 60GB LUN
> from an IBM SAN setup as a quorum device and for
> data, bge0 is on our public LAN, and one resource
> group running Oracle 10gR2 and there are no relevant
> errors in the logs or from the output of 'sccheck'.
I assume this is HA-Oracle and not Oracle 10g R2 RAC?
If so, do you have a failover file system or a global file system? In either case, have you configured an HAStoragePlus resource?
> I set this up and got the resource group running and
> started testing failover. We did manual switches a
> few times without problems. Then we tried to
> simulate something more random and rebooted the
> primary cluster node. The resource group seemed to
> failover cleanly; however, when I tried to switch it
> back to the first node (which had come up again), we
> got the error "the resource group is undergoing a
You might have hit the pingpong interval. There is a resource group property called Pingpong_interval, which is usually set to 3600 seconds. It is there to stop a resource group bouncing back and forth between nodes. Effectively, it says that if the group tried (and failed) to start on a node within the last pingpong interval, then it can't try there again until the interval has passed. Check the manual for the exact description, as I'm paraphrasing somewhat.
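As a rough sketch of how you could check this, assuming the standard SC 3.2 clresourcegroup command with its show and set subcommands: RG_NAME is a placeholder for your resource group name, and the two helper function names are made up for illustration.

```shell
# Hypothetical helpers; only clresourcegroup itself is a Sun Cluster 3.2 command.

# Print the current Pingpong_interval (in seconds) of a resource group.
show_pingpong() {
    clresourcegroup show -p Pingpong_interval "$1"
}

# Set Pingpong_interval to a new value in seconds,
# for example a shorter one while you are testing failover.
set_pingpong() {
    clresourcegroup set -p Pingpong_interval="$2" "$1"
}

# Usage (as root on a cluster node):
#   show_pingpong RG_NAME
#   set_pingpong RG_NAME 600
```

If the value shown is the usual 3600, a node that failed to start the group should become eligible again after about an hour, so this alone would not explain a much longer outage.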
> reconfiguration, please try again later." I tried
> to do a quiesce but that just sat there and seemed
> to do nothing; ended up killing the process. The
> individual resources began to show up as failed or
> in various error states. Eventually I was able to
> get those cleared and everything marked as offline,
> but
> when I try to online the group or switch, I get the
> same "the resource group is
> undergoing a reconfiguration, please try again
> later." Even after doing a
> complete shutdown and boot up of the entire cluster.
Not sure what is going on here. If you can reproduce the scenario, then it would help to have the /var/adm/messages logs from both nodes around that time, and also to enable debugging on the Oracle resource. Check the manual for details on how to do this.
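A minimal sketch for pulling the relevant entries out of the log, assuming the resource group manager logs under a syslog tag containing "Cluster.RGM" (that is what I would expect on SC 3.2, but check your own file). The function name rg_events is made up, and RG_NAME is a placeholder.

```shell
# rg_events LOGFILE RG_NAME
# Print the resource group manager log lines that mention the given resource group.
# The "Cluster.RGM" tag is an assumption about how the entries are labelled.
# Note that RG_NAME is matched as a grep pattern, not as a literal string.
rg_events() {
    grep 'Cluster\.RGM' "$1" | grep "$2"
}

# Usage (run on each node, since each node keeps its own log):
#   rg_events /var/adm/messages RG_NAME
```

The timestamps on the matching lines should show when the group changed state on each node.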
> I also mounted the disk and started the database by
> hand and it all seemed well so the issue seems to be
> cluster related, but I can not find any more
> detailed errors or more information from google or
> docs.sun.com. Anyone else have any pointers on
> where to look on the cluster or any other
> suggestions?
>
>
> A second problem that concerns me, although it isn't
> strictly as important, is
> that 'cluster shutdown' does not work despite the
> configuration and the physical setup seeming to be
> ok. I issued the command and it broadcast a message
> on both nodes that it would shut down in 60 seconds,
> but did nothing else. Any ideas?
Again, it's difficult to know without more details. Could your shell be sitting in a directory on a file system that the shutdown is trying to unmount? That's happened to me.
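For what it's worth, the form I would normally try is sketched below, assuming the documented -g (grace period in seconds) and -y (no confirmation prompt) options of 'cluster shutdown'. The wrapper function name is made up. It changes to / first so the shell is not holding a mount point busy.

```shell
# Hypothetical wrapper; 'cluster shutdown' is the Sun Cluster 3.2 command.
# Move to / so the current directory is not on a file system that has to be
# unmounted, then shut down all nodes with no grace period and no prompt.
shutdown_cluster_now() {
    cd / && cluster shutdown -g0 -y
}

# Usage (as root, from one node only):
#   shutdown_cluster_now
```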
> I put most of this post together yesterday, so I have
> an update. After 6+ hours the resource group seems
> to have become available and working again. No
> actual time on when that occurred, so it could have
> been closer to 24 hours, and, of course, no
> information on why. I don't think HA is functioning
> if it takes that long to become available after a
> fault. On this particular issue, does anyone know
> where I can find more detailed information about what
> may have occurred over the last 24 hours and when and
> why the group became functional again?
/var/adm/messages?
Tim
# 2
Thanks, Tim, for responding.
I've had a look through the messages file and can't seem to find any indicative messages around my actions as described in the problem; however, I do have a copy of the 'reconfiguration' message I was getting when the resource group was stuck in limbo for the long period:
root@mary # clrg switch -v -n mary epass-db-1
clrg: (C667636) epass-db-1: resource group is undergoing a reconfiguration, try again later
At that point, the group had been failed over to the node Flora in our reboot test.
I tried to switch the group several times over the rest of the day, tried rebooting and switching, etc., and nothing seemed to work. When I came in the following day it was working again.
To answer some of your other questions: yes, this is HA-Oracle and not RAC, and I am using an HAStoragePlus resource.
Is there a 'force'-type option for the shutdown? It seems a little problematic if it could be choking on something like sitting in the wrong directory, especially without any messages to that effect.
AlanRa at 2007-7-12 14:47:57
