Bizarre RMI behavior
hi.
i have a very bizarre anomaly occurring and i am at a loss on where to even begin looking.
i have a client-server application that uses rmi for the network communication. the server is an app that listens for inbound tcp connections and then looks for a registered client that can service the request. clients obtain a remote reference to the server that they use to register a reference to themselves. they also use the remote server ref to put themselves in various availability states.
the server uses the remote client references to signal appropriate clients of some pending work.
the server uses a background 'ping' thread that periodically invokes a lightweight method on each client reference to ping each remote client. if a 'ping' doesn't return within 2 seconds, the remote reference is purged and the client must re-register with the server.
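to make the ping mechanism concrete, here is a minimal sketch of a ping with a deadline. this is not my actual code: the class name `PingSketch`, the method `timedPing` and the sleeping stand-ins for the remote call are all made up for illustration, and it assumes the remote call is wrapped in a `Callable` and run on an executor so the caller can give up after the timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: names and structure are assumptions, not the real server code.
public class PingSketch {

    /**
     * Runs one ping with a deadline.
     * Returns the round-trip time in milliseconds, or -1 if the ping
     * timed out, threw an exception, or the caller was interrupted.
     */
    static long timedPing(ExecutorService pool, Callable<Void> ping, long timeoutMillis) {
        long start = System.nanoTime();
        Future<Void> result = pool.submit(ping);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (TimeoutException e) {
            // Stop waiting. cancel(true) only interrupts the worker thread;
            // a call blocked in socket I/O may not react to the interrupt.
            result.cancel(true);
            return -1;
        } catch (ExecutionException e) {
            // The ping itself failed (for a remote call, e.g. a RemoteException).
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // One thread per in-flight ping, so a stuck client does not delay the others.
        ExecutorService pool = Executors.newCachedThreadPool();

        // Stand-ins for the remote call: one quick client, one unresponsive client.
        Callable<Void> fast = () -> { Thread.sleep(50); return null; };
        Callable<Void> slow = () -> { Thread.sleep(3000); return null; };

        System.out.println("fast client: " + timedPing(pool, fast, 2000) + " ms");
        System.out.println("slow client: " + timedPing(pool, slow, 2000) + " ms");

        pool.shutdownNow();
    }
}
```

logging the measured time for every ping (not just the failures) would at least show whether round-trip times creep up before a batch of timeouts or jump all at once.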
everything works perfectly for the most part, except that occasionally the 'ping' will time out. some clients reside in the same building as the server, while others reside in a different state halfway across the country. on average, the round-trip time for a 'ping' is around 120ms for remote clients and less than 50ms for clients local to the server.
what i usually find is that the timeouts occur in batches - usually all the remote clients will time out but not the local ones, suggesting a network issue. however, occasionally a group of remotes will time out, all the locals will be ok, the next remote will be ok but the last remote will time out... other times a local client will time out...
i might add that this server is running on linux and the clients are all running on windows xp. i have another client-server setup that is identical except that its clients also run on linux. at first, my app exhibited the exact same timeout issues on the linux-linux system. i searched the java bug database and discovered the issue with a reverse dns lookup being done when an InetAddress is created. i then configured all of my linux clients to not use dns and the problem went away; i have yet to experience a single timeout on that system since.
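one rough way to check whether reverse lookups are slow from a given machine is to time one directly. the sketch below is only an illustration under assumptions: the class name `ReverseLookupTimer` and the default address are made up, and it measures `InetAddress.getCanonicalHostName()` on an address built from an ip literal, which may not be the exact code path that rmi takes internally.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic sketch; not part of the application described above.
public class ReverseLookupTimer {

    /** Returns how long a reverse lookup of the given IP literal takes, in milliseconds. */
    static long reverseLookupMillis(String ipLiteral) throws UnknownHostException {
        // For an IP literal, getByName only validates the format; no name service is consulted.
        InetAddress address = InetAddress.getByName(ipLiteral);

        long start = System.nanoTime();
        // This call performs the reverse lookup. If the lookup fails,
        // it returns the textual IP address instead of throwing.
        address.getCanonicalHostName();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Pass the address of the server or of a client as the first argument.
        String ip = args.length > 0 ? args[0] : "127.0.0.1";
        System.out.println("reverse lookup of " + ip + " took "
                + reverseLookupMillis(ip) + " ms");
    }
}
```

running this on a windows client against the server's address (and on the server against a client's address) around the time of a timeout would show whether name resolution alone can eat a noticeable part of the 2 second budget.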
i'm blaming it on the network but my network engineer is blaming it on my app. he's run various diagnostics on the network and everything checks out ok (ICMP ping times are fine during the timeouts). maybe i should just blame it on windows!
i've tried turning on various rmi logging but there's so much junk in the log i don't know what to look for.
any help would be greatly appreciated!

