HADB corruption + recovery

We've hit a situation where a dual-node HADB installation has become corrupted and refuses to be cleared or deleted, even after the servers have been rebooted.

After rebooting both instances, the ma agent processes came back online. The problem is that the database appears corrupted and the usual procedure to reinitialize (clear) it isn't working. Both nodes complain that the clear can't be performed because the HADB is about to undergo recovery:

$ ./hadbm status test

hadbm:Error 22012: The management agent at host localhost is not ready to execute the operation, since it is about to do repository recovery. Please make sure that a majority of the management agents in the domain are running, and retry the operation later.

$ ./hadbm clear test

Please enter the password for the database system user:*********

Please retype the password for database system user:*********

WARNING: The --dbpassword option is deprecated since it is insecure. Using this option can compromise your password. Please use either the command prompt or the --dbpasswordfile option.

hadbm:Error 22012: The management agent at host localhost is not ready to execute the operation, since it is about to do repository recovery. Please make sure that a majority of the management agents in the domain are running, and retry the operation later.

Trouble is, the recovery -never- happens. It seems eternally stuck in this state, and I'm not sure what to do next. Do I need to stop the ma processes and then manually blow away the devices, history and configuration files? That seems like a pretty poor solution, or else a bug in HADB that should be fixed.
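Since the error text says to "retry the operation later", a bounded retry at least makes a permanently stuck state obvious instead of retrying by hand indefinitely. This is only a sketch: the function name `retry`, the attempt count, and the delay are made up for illustration and are not part of HADB.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or ATTEMPTS is exhausted.
# Hypothetical helper, not an HADB tool.
retry() {
    local attempts=$1 delay=$2 i
    shift 2
    for (( i = 1; i <= attempts; i++ )); do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        # Wait before the next attempt, but not after the last one.
        (( i < attempts )) && sleep "$delay"
    done
    return 1
}

# Example call (commented out so the sketch runs anywhere):
# retry 5 30 ./hadbm status test || echo "still failing, check the ma logfile"
```

If the command still fails after the last attempt, waiting longer is unlikely to help and the management agent's own logfile is the next place to look.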

For reference, both nodes are running HADB 4.4.2-20 on Solaris 10 (SPARC).

[1942 byte] By [tourtecha] at [2007-11-26 22:52:48]
# 1
Hi, I encountered the same problem on Solaris 10 x86, directly after creating the database. It seemed to have something to do with resources: after switching to a smaller log buffer size, I was able to run the HADB nodes and use them for session failover.
robert@javixa at 2007-7-10 12:15:41
# 2

Problem found. The poorly worded error message turned out to be a red herring.

The recovery never happened because somebody had changed the host's primary network interface. After the reboot, the management agent on one host no longer bound to the correct interface, even though the interface it should have used was still present.

It took a while to find, but eventually I discovered from the ma's logfile that it was listening on the wrong IP address.

It would have been helpful if the hadbm error message had been more accurate and said that the management agent could not be contacted, rather than that repository recovery was about to happen.

As soon as I changed the management agent to bind to the correct subnet, the repositories became available and I could issue an "hadbm clear" command.
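For anyone chasing the same symptom, the check that mattered here (is the address the agent listens on inside the subnet it is supposed to use?) can be sketched in bash. The function names `ip_to_int` and `in_subnet` and the example addresses are hypothetical. The listening address has to be copied by hand from the ma logfile, since this sketch makes no assumption about the log format.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if IPv4 address $1 lies inside CIDR block $2 (e.g. 192.0.2.0/24).
in_subnet() {
    local addr=$1 net=${2%/*} bits=${2#*/}
    # Build the netmask from the prefix length.
    local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    (( ( $(ip_to_int "$addr") & mask ) == ( $(ip_to_int "$net") & mask ) ))
}

# Placeholder values (documentation address ranges), not from a real system.
listen_addr=192.0.2.17      # address the agent reports in its logfile
expected_net=192.0.2.0/24   # subnet the agent is supposed to use

if in_subnet "$listen_addr" "$expected_net"; then
    echo "agent address is in the expected subnet"
else
    echo "agent address is NOT in the expected subnet"
fi
```

Running the same comparison on each node after a reboot or network change would have pointed at the mis-bound agent much sooner than the hadbm error did.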

D'oh!

tourtecha at 2007-7-10 12:15:41