HADB database becomes corrupted
Hello,
I am running Enterprise App Server 8.1 in a 2 node cluster with a 2 node HADB database. If both machines reboot for some reason, the HADB database becomes corrupted, and I have to clear it and run the configure cluster asadmin command again to get it back online. Is there a way to stop this from happening?
# 3
HADB is designed to sustain the failure of one node (or one node in a mirrored pair), but does not survive a double node failure. To ensure (almost) continuous availability, the computers used for HADB must have independent failure modes, and should have independent battery-backed power supplies. If you need to shut down both HADB computers, the HADB database should be stopped first.
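For a planned shutdown of both machines, the order could look roughly like the sketch below. It assumes hadbm is on the PATH and that the database uses the name "hadb"; substitute your own name if it differs. The function name and the DRY_RUN switch are only illustrative, and if an admin password is set on your installation, hadbm will need it supplied as well.

```shell
#!/bin/sh
# Sketch only: stop the whole HADB database before taking both hosts down.
# Assumptions: hadbm is on the PATH; the database name "hadb" may differ on your system.
HADBM="${HADBM:-hadbm}"
DBNAME="${DBNAME:-hadb}"

# Hypothetical helper name, not part of HADB itself.
stop_hadb_before_shutdown() {
    # "hadbm stop" stops the entire database in a controlled way,
    # unlike "hadbm stopnode", which stops only a single node.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "$HADBM stop $DBNAME"
    else
        "$HADBM" stop "$DBNAME"
    fi
}
```

Call stop_hadb_before_shutdown once, from one machine, before either host is shut down. After both machines are back up, the database would be brought online again with hadbm start.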
Roy
# 4
I'm not sure what you mean by independent failure modes, but I think it means the machines don't depend on each other, so if one fails it doesn't take the other down. Unfortunately, we're using Sun Cluster to cluster the machines, and once in a while it panics and reboots both machines simultaneously without warning, leaving us with a corrupted HADB database. Would it corrupt the database if we had, say, an rc3.d script that did a hadbm stopnode on each machine as it was going down? It wouldn't stop the database, but it would stop each node.
# 5
Sorry, but no. A stopnode is actually very similar to an uncontrolled stop. In both cases, when the node is restarted it will do a recovery or a repair from its mirror node.
# 7
Yes, the quorum was lost, which is what made them both panic. We found out after the fact that we aren't supposed to share a quorum device with another Sun Cluster cluster. I think that is what got us into these states where we'd lose quorum and the nodes would panic because they didn't know what was going on. That would in turn corrupt the HADB database. Both nodes were online prior to the servers panicking. We do have the failure mode set to hard; I don't really know what all the options are or how to change them. Maybe there is a better setting that wouldn't force the whole cluster to reboot.
# 8
You can check the Sun Cluster admin guide for the details. Failover_mode is a standard resource property, so you can modify it with scrgadm -c -j <res-name> -y Failover_mode=SOFT.
Please note, however, that this would not prevent the cluster from restarting nodes if operational quorum is lost. It is not advisable to change quorum-related settings without consulting Sun engineers.