Cluster Down - Lost Quorum Device

I have a two-node cluster with an attached disk for quorum. The systems went down, and the device that held the quorum lost all of its data. Now both systems sit at the boot screen saying they cannot get quorum, so they won't boot. I tried the amnesia recovery, but the system I tried it on kept rebooting itself when it reached the same spot in the boot process. If anyone has any ideas I could try, I would greatly appreciate it. Thanks.

[459 byte] By [tfeldmanna] at [2007-11-27 8:45:11]
# 1

Quorum information is written on private cylinders and is not visible to the Solaris OS, so the quorum should work even if the device holding it has no data on it.

If you boot both nodes without the quorum device, it should still work, since a minimum of 2 votes is needed for the cluster to function.

Are you able to boot the nodes in non-cluster mode (boot -sx)?

Regards

mtalhaa at 2007-7-12 20:46:14 > top of Java-index,Solaris Operating System,Solaris Essentials - General Technical Questions...
# 2

The LUN that I had quorum on got used in a ZFS pool by someone else. The systems won't boot because they are both voting for themselves, and there is nothing else to give them their quorum. The only way that I can get them to boot is as a non-cluster node (boot -x). However, I cannot configure anything because the node is not part of the cluster. I attempted the amnesia recovery (changing the /etc/cluster/ccr/infrastructure file), and now that node attempts to boot and then keeps rebooting itself.

tfeldmanna at 2007-7-12 20:46:14
# 3

If the LUN is inaccessible to your cluster nodes they will not be able to register with the quorum device.

The result is no cluster.

You have to get the quorum disk back from the ZFS pool and into a LUN that your cluster nodes can see, then boot the last node to leave the cluster first to prevent amnesia fencing.

I don't believe that modifying the /etc/cluster files is recommended as it can lead to cluster corruption.

nate_keegana at 2007-7-12 20:46:14
# 4

What happened to you is a typical disaster, so it is hard to recover from.

Actually, I think the cluster should boot if both nodes are OK and can talk to each other over the interconnect. They would have a majority of votes. I am surprised to hear that this does not work. Could there be some critical information left on the quorum device that prevents this?
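A minimal sketch of the vote arithmetic behind this, assuming the usual two-node layout where each node and the quorum device carry one vote each. The function names and node names are hypothetical, and the sketch only illustrates the majority rule; it says nothing about why these particular nodes still wait for quorum.

```python
# Illustrative majority-vote arithmetic for a two-node cluster.
# Assumption: each node has 1 vote and the quorum device has 1 vote.

def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority of the total."""
    return total_votes // 2 + 1


def has_quorum(present_votes: int, total_votes: int) -> bool:
    """True when the votes currently present form a majority."""
    return present_votes >= votes_needed(total_votes)


node_votes = {"node1": 1, "node2": 1}   # hypothetical node names
quorum_device_votes = 1
total = sum(node_votes.values()) + quorum_device_votes

# Both nodes up, quorum device missing: 2 of 3 votes.
both_nodes = has_quorum(sum(node_votes.values()), total)

# One node alone, quorum device missing: 1 of 3 votes.
one_node = has_quorum(node_votes["node1"], total)

print(f"total={total} needed={votes_needed(total)}")
print(f"both nodes without device: {both_nodes}")
print(f"single node without device: {one_node}")
```

By this arithmetic, two nodes that can see each other hold 2 of 3 votes, while a single node on its own holds only 1 and cannot proceed without the device.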

This case is not the typical amnesia case, so the procedure that you tried might not be the right one to recover. I think there is a document on SunSolve that describes this procedure, but I am not sure who has access to it. In your case another procedure might help, but I have no time to test anything.

My proposal is to reinstall Sun Cluster. If you have scripted your RG setup, you could restore your data services very quickly, and the initial cluster setup is also very fast.

And then you should ask your buddies who stole your disk to buy you a drink.

BTW: Why did they think they could use that disk? It seems to be a good practice to have the quorum device on a "used" data disk. This has 2 advantages:

- nobody could think this is an unused disk and

- you get some kind of additional monitoring via the volume manager or application using that disk.

Regards

Hartmut

HartmutSa at 2007-7-12 20:46:14
# 5

The systems are able to see the LUN again but cannot tell that it is the quorum device originally created for them. I have tried booting both systems, and they can talk to each other; however, they get the "waiting for quorum" message at boot and just sit there. I didn't have anything on that LUN at the time, so they thought it was unused. I agree that the easiest step is probably to do the reinstall. I didn't have that many RGs, so it won't be an issue. However, if anyone can come up with a recovery method other than a reinstall, I would love to have that knowledge. Thanks.

tfeldmanna at 2007-7-12 20:46:14
# 6

It is possible to put the cluster back into install mode by hacking the CCR. If this is never going to be a production cluster then that's probably OK to do.

In /etc/cluster/ccr/infrastructure, you can enable installmode, change the vote count of the second node to 0, and remove the cluster.quorum_devices entries. You'd then need to run ccradm -i infrastructure -o (I think) and repeat the procedure on all nodes (while booted -x). This may allow you to get the cluster back up.
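To make the three edits concrete, here is a rough sketch that applies them to the text of an infrastructure-style file. It is untested against a real cluster. Apart from the cluster.quorum_devices prefix named above, the key names (cluster.properties.installmode, cluster.nodes.N.properties.quorum_vote), the tab-separated layout, and the sample values are all assumptions for illustration, so check them against your own file first. The function works on a string and does not touch /etc/cluster/ccr/infrastructure or run ccradm.

```python
# Sketch only: key names and file layout are ASSUMED, not verified.

def to_install_mode(ccr_text: str, second_node_id: str = "2") -> str:
    """Return a copy of the text with install mode enabled, the second
    node's vote set to 0, and all quorum device entries dropped."""
    vote_key = f"cluster.nodes.{second_node_id}.properties.quorum_vote"
    out = []
    for line in ccr_text.splitlines():
        parts = line.split(None, 1)
        key = parts[0] if parts else ""
        if key.startswith("cluster.quorum_devices."):
            continue                      # remove quorum device entries
        if key == "cluster.properties.installmode":
            line = f"{key}\tenabled"      # enable install mode
        elif key == vote_key:
            line = f"{key}\t0"            # second node gets 0 votes
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical sample content, not taken from a real cluster.
sample = (
    "cluster.properties.installmode\tdisabled\n"
    "cluster.nodes.1.properties.quorum_vote\t1\n"
    "cluster.nodes.2.properties.quorum_vote\t1\n"
    "cluster.quorum_devices.1.name\td4\n"
)

result = to_install_mode(sample)
print(result, end="")
```

Working on a copy of the file and comparing the result with the original before putting anything back would keep the change easy to undo.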

NOTE: Disclaimer - I do not recommend this for any production cluster and I do not even guarantee that the above process is correct or complete as I don't have time to test it.

If you need an official DR procedure, please talk to your local Sun Support staff.

Regards,

Tim

Tim.Reada at 2007-7-12 20:46:14
# 7
Thanks, Tim. These are just test boxes, so I am not worried about messing them up. Considering that it otherwise takes a fresh OS install to fix it, this could be a good first attempt. I will give it a try and post my results. Thanks again.
tfeldmanna at 2007-7-12 20:46:14