RMI ConnectIOException after NIC port goes down
The system I've built consists of 5 servers running on physically separate machines. Each server has a number of RMI objects registered in the RMI registry. Each machine can also run an app that acts as a client to one of the servers. This has worked well for quite some time.
But recently one of the routers in our network lost power. About 5 minutes later the router was powered on again. The servers were still running. To see if all was well, I used one of the clients to connect to the servers in sequence. Four of the five servers connected and I was able to access the RMI server objects from my client. For the 5th server, however, the client got a java.rmi.ConnectIOException with the message "error during JRMP connection establishment"; the nested exception is a SocketException. This happens when I do a Naming.lookup() of the object in the server's RMI registry. Running the client on the same machine as the server works fine, though...
We've reproduced this problem both by powering down the router and by pulling the network cables from the servers. It is not always the same server that shows the problem, and in some cases more than one server shows this behavior.
The server does not throw an exception of any kind that would give me a hint about what is going on.
Has anyone seen such behavior before? Is there a way to detect the NIC port going down? Can RMI registries survive a NIC port going down? Any other ideas about what might fail (and possibly a suggestion for what can be done to prevent or resolve this)?
Thanks,
Vincent Hartsteen
By hartsteena at 2007-10-3 6:29:03

This is the main part of the stack trace:
java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
    java.net.SocketException: Connection reset
    at sun.rmi.transport.tcp.TCPChannel.createConnection(Unknown Source)
    at sun.rmi.transport.tcp.TCPChannel.newConnection(Unknown Source)
    at sun.rmi.server.UnicastRef.newCall(Unknown Source)
    at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
    at java.rmi.Naming.lookup(Unknown Source)
    ... (some lines deleted) ...
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at java.io.DataInputStream.readByte(Unknown Source)
I've deleted some lines to make the stack-trace a bit shorter. The Naming.lookup() is called from my client-code.
Thanks,
Vincent
Thanks for your input.
So far I have retried manually by closing the client and starting it again. I can do that over a long period of time without success. I'll build a retry loop around the Naming.lookup() call anyway. Thanks for the suggestion.
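A minimal sketch of what such a retry loop could look like. The class name LookupRetry, the method names, and the attempt count and delay are all made up for illustration, not taken from my actual client. It retries only on RemoteException (which includes ConnectIOException); NotBoundException and MalformedURLException are passed straight through, since retrying would not obviously help there.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the RMI API.
final class LookupRetry {

    // Runs the action and retries it when it fails with a RemoteException.
    // Any other exception (e.g. NotBoundException) propagates immediately.
    static <T> T withRetry(Callable<T> action, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RemoteException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (RemoteException e) {
                last = e; // remember the failure, wait, then try again
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last; // every attempt failed with a RemoteException
    }

    // Wraps Naming.lookup() in the retry loop above.
    static Remote lookupWithRetry(String url, int maxAttempts, long delayMillis)
            throws Exception {
        return withRetry(() -> Naming.lookup(url), maxAttempts, delayMillis);
    }
}
```

With this, the client call would become something like LookupRetry.lookupWithRetry("rmi://server5/MyService", 5, 2000), where the URL is a placeholder. Given that manual retries over a long period have not helped, I do not expect this alone to fix the problem, but it rules out a transient failure.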
Occasionally we also see a NotBoundException for the object that we're looking up after the router is powered down (the same scenario as described in my original post). Does that imply that the RMI registry on the server crashed? The only measure we've taken so far is to restart our server, but we would prefer something less drastic: letting the server detect this situation and recover from it autonomously. Step one is to let the server detect that its registry crashed. Is periodically calling LocateRegistry.getRegistry() the way to do that, or is there a better way?
Vincent
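For reference, a rough sketch of the kind of periodic check being asked about. One thing to note: LocateRegistry.getRegistry() only constructs a local stub and does not contact the registry, so on its own it cannot detect a dead registry; a remote call such as list() or lookup() is needed. The class name RegistryProbe, the method name isBound, and the host, port and binding name passed to it are hypothetical.

```java
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.util.Arrays;

// Hypothetical helper, not part of the RMI API.
final class RegistryProbe {

    // Returns true if a registry answers on host:port and has `name` bound.
    // Returns false if the registry is unreachable or the name is missing.
    static boolean isBound(String host, int port, String name) {
        try {
            // Builds a stub only; no network traffic happens here.
            Registry registry = LocateRegistry.getRegistry(host, port);
            // list() is a real remote call, so it fails if the registry is gone.
            return Arrays.asList(registry.list()).contains(name);
        } catch (RemoteException e) {
            return false;
        }
    }
}
```

The server could call this from a timer thread and rebind its objects (or recreate the registry) when it returns false. Probing via "localhost" will not show the problem if only the external interface is affected, which matches the observation that a client on the same machine works fine, so the probe probably has to use the address that remote clients use.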