Sun Cluster 3.2 problems

We have set up a two-node cluster on two v240s running SC 3.2 and Solaris 10 11/06 with a Recommended Patch set from earlier this May.

The cluster was created. We have two interconnects using crossover cables on bge2 and bge3, one 60GB LUN from an IBM SAN set up as a quorum device and for data, bge0 on our public LAN, and one resource group running Oracle 10gR2. There are no relevant errors in the logs or in the output of 'sccheck'. I set this up, got the resource group running, and started testing failover. We did manual switches a few times without problems. Then we tried to simulate something more random and rebooted the primary cluster node. The resource group seemed to fail over cleanly; however, when I tried to switch it back to the first node (which had come up again), we got the error "the resource group is undergoing a reconfiguration, please try again later." I tried to do a quiesce, but that just sat there and seemed to do nothing; I ended up killing the process. The individual resources began to show up as failed or in various error states. Eventually I was able to get those cleared and everything marked as offline, but when I try to online the group or switch it, I get the same "the resource group is undergoing a reconfiguration, please try again later." error, even after a complete shutdown and boot of the entire cluster.

I also mounted the disk and started the database by hand, and all seemed well, so the issue seems to be cluster related. However, I cannot find any more detailed errors or more information from Google or docs.sun.com. Does anyone have any pointers on where to look on the cluster, or any other suggestions?

A second problem that concerns me, although it isn't strictly as important, is that 'cluster shutdown' does not work despite the configuration and the physical setup seeming to be ok. I issued the command and it broadcast a message on both nodes that it would shut down in 60 seconds, but did nothing else. Any ideas?

I put most of this post together yesterday, so I have an update. After 6+ hours the resource group seems to have become available and working again. I don't know exactly when that occurred, so it could have been closer to 24 hours, and, of course, I have no information on why. I don't think HA is functioning if it takes that long to become available after a fault. On this particular issue, does anyone know where I can find more detailed information about what may have occurred over the last 24 hours, and when and why the group became functional again?

[2583 byte] By [AlanRa] at [2007-11-27 5:26:49]
# 1

> We have set up a two-node cluster on two v240s running
> SC 3.2 and Solaris 10 11/06 with a Recommended Patch
> set from earlier this May.

OK, that's a good start.

> The cluster was created, we have two interconnects
> using crossover cables on bge2 and bge3, one 60GB LUN
> from an IBM SAN setup as a quorum device and for
> data, bge0 is on our public LAN, and one resource
> group running Oracle 10gR2 and there are no relevant
> errors in the logs or from the output of 'sccheck'.

I assume this is HA-Oracle and not Oracle 10g R2 RAC?

If so, do you have a failover file system or a global file system? In either case, have you configured an HAStoragePlus resource?

> I set this up and got the resource group running and
> started testing failover. We did manual switches a
> few times without problems. Then we tried to
> simulate something more random and rebooted the
> primary cluster node. The resource group seemed to
> failover cleanly; however, when I tried to switch it
> back to the first node (which had come up again), we
> got the error "the resource group is undergoing a

You might have hit the pingpong interval. There is a resource group property called Pingpong_interval, which is usually set to 3600 seconds. It is there to stop a resource group bouncing back and forth between nodes. Effectively, it says that if the group already tried to start on this node within the last pingpong interval, then it can't try there again yet. Check the manual for the exact description, as I'm paraphrasing somewhat.
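To make the paraphrased rule concrete, here is a minimal sketch of the arithmetic only. It assumes the simplified reading given above (a node is off limits until the interval has elapsed since the last start attempt there), which may not match the RGM's exact behaviour. The function name and the timestamps are made up for illustration; the `clresourcegroup show` line in the comment is the command form as I understand it from the Sun Cluster 3.2 man pages, so verify it before relying on it.

```shell
#!/bin/sh
# Sketch only: seconds left before a node becomes eligible again, under the
# simplified rule "no retry on a node within Pingpong_interval seconds of
# the last start attempt there". Not the RGM's actual algorithm.
#
# On a cluster node the configured value could be read with something like
# (assumed command form, check the man page):
#   clresourcegroup show -p Pingpong_interval epass-db-1
#
# Usage: pingpong_remaining LAST_ATTEMPT NOW [INTERVAL]   (epoch seconds)
pingpong_remaining() {
    last_attempt=$1
    now=$2
    interval=${3:-3600}    # 3600 is the usual value mentioned above
    remaining=$(( last_attempt + interval - now ))
    if [ "$remaining" -lt 0 ]; then
        remaining=0        # interval already elapsed
    fi
    echo "$remaining"
}

# Hypothetical timestamps: last attempt at t=1000, current time t=2800.
pingpong_remaining 1000 2800    # prints 1800
```

If the result is non-zero, waiting that long (or lowering the property) would be the first thing to try under this reading; a group that stays stuck for many hours, as described in this thread, is longer than a 3600-second interval alone would explain.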

> reconfiguration, please try again later." I tried
> to do a quiesce but that just sat there and seemed
> to do nothing; ended up killing the process. The
> individual resources began to show up as failed or
> in various error states. Eventually I was able to
> get those cleared and everything marked as offline,
> but
> when I try to online the group or switch, I get the
> same "the resource group is
> undergoing a reconfiguration, please try again
> later." Even after doing a
> complete shutdown and boot up of the entire cluster.

Not sure what is going on here. If you can reproduce the scenario, it would help to have the /var/adm/messages logs from around that time and also to enable debugging on the Oracle resource. Check the manual for details on how to do this.
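As a rough illustration of the debugging step, the sketch below only builds the command line and prints it; it does not touch the cluster. It assumes the property is the HA-Oracle `Debug_level` extension property and that it is set with `clresource set -p`, which is my understanding of the Sun Cluster 3.2 CLI rather than something confirmed in this thread. The function name and the resource name `oracle-server-rs` are hypothetical.

```shell
#!/bin/sh
# Sketch: build (but do not run) the command that would raise the debug
# level on an HA-Oracle resource. Assumes the extension property is named
# Debug_level and is settable with "clresource set -p"; check the HA-Oracle
# manual before running the printed command.
#
# Usage: debug_level_cmd RESOURCE [LEVEL]
debug_level_cmd() {
    resource=$1
    level=${2:-9}
    echo "clresource set -p Debug_level=$level $resource"
}

# "oracle-server-rs" is a hypothetical resource name, not one from this thread.
debug_level_cmd oracle-server-rs
```

Printing the command first makes it easy to review it against the manual, including any log directory the manual says must exist, before running it as root on a cluster node.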

> I also mounted the disk and started the database by
> hand and it all seemed well so the issue seems to be
> cluster related, but I can not find any more
> detailed errors or more information from google or
> docs.sun.com. Anyone else have any pointers on
> where to look on the cluster or any other
> suggestions?
>
> A second problem that concerns me, although it isn't
> strictly as important, is
> that 'cluster shutdown' does not work despite the
> configuration and the physical setup seeming to be
> ok. I issued the command and it broadcast a message
> on both nodes that it would shut down in 60 seconds,
> but did nothing else. Any ideas?

Again, it's difficult to know without more details. Perhaps your shell is sitting in a directory under a mount point that the shutdown is trying to unmount? That's happened to me.

> I put most of this post together yesterday, so I have
> an update. After 6+ hours the resource group seems
> to have become available and working again. No
> actual time on when that occurred, so it could have
> been closer to 24 hours, and, of course, no
> information on why. I don't think HA is functioning
> if it takes that long to become available after a
> fault. On this particular issue, does anyone know
> where I can find more detailed information about what
> may have occurred over the last 24 hours and when and
> why the group became functional again?

/var/adm/messages?
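One way to narrow that file down is to keep only the lines carrying a `Cluster.*` syslog tag (the console output later in this thread shows `Cluster.Framework`), optionally limited to a timestamp prefix. This is a small sketch; the function name is made up, and the tag pattern is an assumption based on that one example rather than a complete list of cluster message tags.

```shell
#!/bin/sh
# Sketch: print lines from a syslog-style file that carry a Cluster.* tag,
# optionally restricted to lines starting with a timestamp prefix.
# The tag pattern is a guess based on "Cluster.Framework"; adjust as needed.
#
# Usage: cluster_msgs FILE [TIMESTAMP_PREFIX]
cluster_msgs() {
    file=$1
    when=${2:-}    # e.g. 'May 28' or 'May 28 11:1'; empty matches every line
    grep "^$when" "$file" | grep ' Cluster\.[A-Za-z_.]*: '
}

# Example, to be run on a cluster node:
#   cluster_msgs /var/adm/messages 'May 28 11:1'
```

Rotated copies of the messages file can be passed in the same way if the events of interest are older than the current file.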

Tim

Tim.Reada at 2007-7-12 14:47:57 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 2

Thanks, Tim, for responding.

I've had a look through the messages file and can't find any relevant messages from around the time of the actions I described; however, I do have a copy of the 'reconfiguration' message I was getting while the resource group was stuck in limbo for that long period:

root@mary # clrg switch -v -n mary epass-db-1

clrg: (C667636) epass-db-1: resource group is undergoing a reconfiguration, try again later

At that point, the group had been failed over to the node Flora in our reboot test.

I tried to switch the group several times over the rest of the day, tried rebooting and switching, etc., and nothing seemed to work. When I came in the following day it was working again.

To answer some of your other questions: yes, this is HA-Oracle and not RAC, and I am using an HAStoragePlus resource.

Is there a 'force' type option for shutdown? It seems a little problematic if it could be choking on something like sitting in the wrong directory, especially without any messages to that effect.

AlanRa at 2007-7-12 14:47:57
# 3

Tried 'cluster shutdown' again, and this is what showed on the console. The systems did not shut down.

mary console login: May 28 11:18:46 mary Cluster.Framework: stderr: showmount:

May 28 11:18:46 mary Cluster.Framework: stderr: mary: RPC: Program not registered

May 28 11:19:16 mary Cluster.Framework: stderr: showmount:

May 28 11:19:16 mary Cluster.Framework: stderr: mary: RPC: Program not registered

AlanRa at 2007-7-12 14:47:57
# 4

I've solved one issue that might have fixed the 'cluster shutdown' problem. Although I had unique mount point names for the global devices filesystem across the cluster, I had used the same SVM mirror names. I finally saw the 2nd catch in the instructions and renamed my mirrors uniquely. Now, my global devices filesystem is mounting correctly and 'cluster shutdown' works.

I still would like to know why the failover stuck in 'reconfiguration' for so long. It would be nice if there were more logging information available.

AlanRa at 2007-7-12 14:47:57
# 5
You can get more debugging information by changing the debug_level property of the HA-Oracle resources. Debug level 9 should get you lots of info. See the manual on how to set this up, and make sure you create the required directory too.

Tim

Tim.Reada at 2007-7-12 14:47:57
# 6

Thanks for the replies, Tim. Is there any way to increase the logging on the cluster itself?

Anyway, as usual, I think I've solved all of my problems mostly on my own. Seemingly independent of these cluster issues, we had a multipathing issue with our SAN disk and a call opened with Sun on that. It took them a week to tell us to switch from the IBM drivers/software to the Sun product. In the course of reconfiguring SC to use the new device file name for the SAN disk, I found and fixed some cluster configuration issues.

The global devices issue was a real PITA, but I think fixing it helped with a lot of my problems. Several reconfiguration reboots and sacrifices-to-the-gods later, I had SC recognizing the new device names and everything seems to be running well (so far). It even survived some catastrophic testing.

AlanRa at 2007-7-12 14:47:57
# 7
I gather that there are a few internal debug settings, but they are mainly for support people to use. I've seen them mentioned in various emails, but have never used them myself. Maybe one of my colleagues will pipe up here...

Tim

Tim.Reada at 2007-7-12 14:47:57