HADB database becomes corrupted

Hello,

I am running Enterprise App Server 8.1 in a 2-node cluster with a 2-node HADB database. If both machines reboot for some reason, the HADB database becomes corrupted, and I have to clear it and run the configure cluster asadmin command again to get it back online. Is there a way to stop this from happening?

[334 byte] By [markbrewster] at [2007-11-26 6:59:40]
# 1
Hi, In the case of a double failure, the only solution, as you say, is to clear the existing data and configure again. Regards, Anitha
anithagopi at 2007-7-6 15:38:02 > top of Java-index,Application & Integration Servers,Application Servers...
# 2
What do people normally do in a production environment to handle this?
markbrewster at 2007-7-6 15:38:02
# 3

HADB is designed to sustain the failure of one node (or one node in a mirrored pair), but does not survive a double node failure. To ensure (almost) continuous availability, the computers used for HADB must have independent failure modes, and should have independent battery-backed power supplies. If you need to shut down both HADB computers, the HADB database should be stopped first.
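
As a minimal sketch of that last point, a planned shutdown could stop the whole database (not individual nodes) before either machine goes down. The database name hadb and the HADBM and DB_NAME variables are assumptions for illustration, and hadbm may prompt for the admin password or a confirmation depending on how it is set up. This only covers planned shutdowns; it cannot protect against both machines panicking at once.

```shell
#!/bin/sh
# Sketch: stop the entire HADB database before a planned shutdown of both hosts.
# Assumptions (not from the thread): hadbm is on PATH and the database is
# named "hadb". Override HADBM or DB_NAME in the environment if yours differ.
HADBM="${HADBM:-hadbm}"
DB_NAME="${DB_NAME:-hadb}"

stop_hadb() {
    # Stop the whole database; this is different from "hadbm stopnode",
    # which stops only one node.
    "$HADBM" stop "$DB_NAME" || return 1
    # Print the database state so the operator can confirm it is stopped.
    "$HADBM" status "$DB_NAME"
}
```

Call stop_hadb from whatever procedure takes both machines down, and only shut down the hosts once it has returned successfully.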

Roy

roylyseng at 2007-7-6 15:38:02
# 4

I'm not sure what you mean by independent failure modes, but I think it means the machines don't depend on each other, so if one fails it doesn't take the other down. Unfortunately, we're using Sun Cluster to cluster the machines, and once in a while it panics and reboots both machines simultaneously without warning, and then we have a corrupted HADB database. Would it corrupt the database if we had, say, an rc3.d script that did a hadbm stopnode on each machine as it was going down? It wouldn't stop the database, but it would stop each node.

markbrewster at 2007-7-6 15:38:02
# 5
Sorry, but no. A stopnode is actually very similar to an uncontrolled stop. In both cases, when the node is restarted it will do a recovery or a repair from its mirror node.
roylyseng at 2007-7-6 15:38:02
# 6
Panic of one node shouldn't bring down the other unless operational quorum was lost. Was the HADB database able to come online on the second node before the problem? The node could have rebooted if the failure mode is hard.
MadhanKB at 2007-7-6 15:38:02
# 7

Yeah, the quorum was lost, which is what made them both panic. We found out after the fact that we aren't supposed to share a quorum device with another Sun Cluster cluster. I think that is what got us into these states where we'd lose quorum and the nodes would panic because they didn't know what was going on. That would in turn corrupt the HADB database. Both nodes were online prior to the servers panicking. We do have the failure mode set to hard; I don't really know what all the options are or how to change them. Maybe there is a better setting that wouldn't force the whole cluster to reboot.

markbrewster at 2007-7-6 15:38:02
# 8

You can check the Sun Cluster admin guide for the details. You can modify the property with scrgadm -cj <res-name> -y Failover_mode=SOFT (Failover_mode is a standard resource property, so it is set with -y rather than -x).

Please note, however, that this would not prevent the cluster from restarting nodes if operational quorum is lost. It is not advisable to change quorum-related settings without consulting Sun engineers.
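
Before changing anything, it may help to look at what the resource is currently set to. The sketch below assumes scrgadm -pvv -j <res-name> prints the resource's properties, as on Sun Cluster 3.x; the SCRGADM variable and the function name are hypothetical, and the exact output format may differ between releases, so the grep is deliberately loose.

```shell
#!/bin/sh
# Sketch: show the current Failover_mode of one Sun Cluster resource.
# Assumption: "scrgadm -pvv -j <res-name>" lists the resource's properties.
SCRGADM="${SCRGADM:-scrgadm}"

show_failover_mode() {
    res_name="$1"
    if [ -z "$res_name" ]; then
        echo "usage: show_failover_mode <res-name>" >&2
        return 2
    fi
    # Print only the lines that mention Failover_mode (case-insensitive).
    "$SCRGADM" -pvv -j "$res_name" | grep -i 'failover_mode'
}
```

This only reads the setting; it does not change how the cluster reacts to a loss of quorum.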

MadhanKB at 2007-7-6 15:38:02
# 9
Just to clarify my ambiguous statements: 1.) The node reboot did not occur due to Failover_mode. 2.) The command was just an FYI and *not* a solution to the reboot issue. 3.) As stated, please consult Sun for advice.
MadhanKB at 2007-7-6 15:38:02