RMI ConnectIOException after NIC port goes down

The system I've built consists of 5 servers running on physically different machines. Each server has a number of RMI objects registered in the RMI registry. Each machine can also run an app that acts as a client to one of the servers. This has worked well for quite some time.

But recently one of the routers in our network lost power. About 5 minutes later the router was powered on again. The servers were still running. To see if all was well I used one of the clients to connect to the servers in sequence. Four of the five servers connected and I was able to access the RMI server objects from my client. However, for the 5th server the client got a java.rmi.ConnectIOException with a message stating "error during JRMP connection establishment", and the nested exception is a SocketException. This happens when I do a Naming.lookup() of the object in the server's RMI registry. Running the client on the same machine as the server works fine though...

We've reproduced this problem both by powering down the router and by pulling the network cables from the servers. The thing is that it is not always the same server that has the problem, and in some cases more than one server shows this behavior.

The server does not throw or log an exception of any kind that would give me a hint about what is going on.

Anyone seen such behavior before? Is there a way to detect the NIC port going down? Can RMI registries survive a NIC port going down? Any other ideas what might fail (and possibly a suggestion what can be done to prevent/resolve this)?

Thanks,

Vincent Hartsteen

[1595 byte] By [hartsteena] at [2007-10-3 6:29:03]
# 1
The randomness is due to RMI client connection pooling. IIRC ConnectIOException arises when trying to reuse a pooled connection. The Registry can survive an outage but existing pooled connections may not.
ejpa at 2007-7-15 1:15:49 > top of Java-index,Core,Core APIs...
# 2
I might be missing your point but the clients were not running at the time of the outage... I started the clients afterwards to check if the servers were still accessible. The clients do the Naming.lookup() to get a fresh reference.
hartsteena at 2007-7-15 1:15:49
# 3
Oops, missed that. What was the nested SocketException?
ejpa at 2007-7-15 1:15:49
# 4

This is the main part of the stack trace:

java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
    java.net.SocketException: Connection reset
    at sun.rmi.transport.tcp.TCPChannel.createConnection(Unknown Source)
    at sun.rmi.transport.tcp.TCPChannel.newConnection(Unknown Source)
    at sun.rmi.server.UnicastRef.newCall(Unknown Source)
    at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
    at java.rmi.Naming.lookup(Unknown Source)
    ... (some lines deleted) ...
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at java.io.DataInputStream.readByte(Unknown Source)

I've deleted some lines to make the stack trace a bit shorter. The Naming.lookup() is called from my client code.

Thanks,

Vincent

hartsteena at 2007-7-15 1:15:49
# 5

In strict theory you have to be prepared to repeat any RMI call a reasonable number of times. What is reasonable may vary from zero upwards depending on what the exception was of course, and whether the method is idempotent or not. This is a case where a single retry wouldn't hurt on any RemoteException. If you actually get NotBoundException you would probably do something more drastic.
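
A retry along those lines could look roughly like the sketch below. It is only an illustration, not tested against the setup in this thread: the class name RmiRetry, the method names, and the attempt count and delay are all made up for the example. It retries on RemoteException only, so a NotBoundException from Naming.lookup() is passed straight to the caller.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.concurrent.Callable;

// Hypothetical helper: none of these names come from the RMI API itself.
final class RmiRetry {

    /**
     * Runs the call, retrying only when it fails with a RemoteException.
     * Any other exception (e.g. NotBoundException) propagates immediately.
     * maxAttempts and delayMillis are illustrative tuning knobs.
     */
    static <T> T withRetry(Callable<T> call, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RemoteException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (RemoteException e) {
                last = e; // remember the failure in case we run out of attempts
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // brief pause before trying again
                }
            }
        }
        throw last; // every attempt failed with a RemoteException
    }

    /** Example use: one retry (two attempts) around Naming.lookup(). */
    static Remote lookup(String url) throws Exception {
        return withRetry(() -> Naming.lookup(url), 2, 1000L);
    }
}
```

Only idempotent calls should be wrapped like this; a lookup qualifies, but a remote method with side effects may not.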

ejpa at 2007-7-15 1:15:49
# 6

Thanks for your input.

The retrying was done manually by closing the client and starting it all over. I can do that over a long period of time without success. I'll build in a retry loop at the point where I do the Naming.lookup() though. Thanks for the suggestion.

Occasionally we also see a NotBoundException for the object that we're looking up after the router is powered down (the same scenario as described in my original posting). Does that imply that the RMI registry on the server crashed? The only measure we've taken so far is to restart our server, but we would prefer something less drastic: letting the server detect this situation and recover from it autonomously. Step one is to let the server detect that its registry has crashed. Is periodically calling LocateRegistry.getRegistry() the way to do that, or is there a better way?

Vincent

hartsteena at 2007-7-15 1:15:49
# 7
Periodically using LocateRegistry.getRegistry() does nothing actually: it doesn't imply any I/O of its own. It will succeed even if the hostname you supply doesn't exist AFAIK. Registry.list() is the quickest thing you can do to the Registry to see if it's there.
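
A minimal sketch of such a check, assuming the server (or a watchdog) polls its own registry. The class RegistryProbe, its Status enum, and the method name are invented for the example; only LocateRegistry.getRegistry() and Registry.list() are real RMI calls. It tells apart "registry not answering" from "registry answering but the name is gone", which is roughly the difference between the ConnectIOException and the NotBoundException cases.

```java
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.util.Arrays;

// Hypothetical helper: the names here are illustrative, not part of RMI.
final class RegistryProbe {

    enum Status { UNREACHABLE, NOT_BOUND, BOUND }

    /** Asks the registry at host:port for its bindings and looks for name. */
    static Status check(String host, int port, String name) {
        try {
            // getRegistry() only builds a local stub; no network I/O happens here.
            Registry registry = LocateRegistry.getRegistry(host, port);
            // list() is a real remote call, so it fails if the registry is gone.
            String[] names = registry.list();
            return Arrays.asList(names).contains(name)
                    ? Status.BOUND
                    : Status.NOT_BOUND;
        } catch (RemoteException e) {
            return Status.UNREACHABLE;
        }
    }
}
```

On NOT_BOUND the server could rebind its objects; on UNREACHABLE it would have to recreate the registry first. Whether that is enough to recover from the outage described above is untested.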
ejpa at 2007-7-15 1:15:49