Bizarre RMI behavior

hi.

i have a very bizarre anomaly occurring and i am at a loss on where to even begin looking.

i have a client-server application that uses rmi for the network communication. the server is an app that listens for inbound tcp connections and then looks for a registered client that can service the request. clients obtain a remote reference to the server that they use to register a reference to themselves. they also use the remote server ref to put themselves in various availability states.

the server uses the remote client references to signal appropriate clients of some pending work.

the server uses a background 'ping' thread that periodically invokes a lightweight method on each client reference to ping each remote client. if this 'ping' doesn't return in a timely fashion (2sec) the remote reference is purged and the client must re-register with the server.
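a minimal sketch of the kind of timed 'ping' described above, not the poster's actual code. it assumes the remote call is wrapped in a task and bounded with Future.get; the Pingable interface and the pingWithin helper are hypothetical names invented for illustration (in a real rmi app the interface would extend java.rmi.Remote).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PingWithTimeout {

    /** Hypothetical stand-in for the remote client reference. */
    public interface Pingable {
        void ping() throws Exception;
    }

    /**
     * Runs client.ping() on the pool and waits at most timeoutMillis.
     * Returns true if the call completed in time, false if it timed out or failed.
     */
    public static boolean pingWithin(ExecutorService pool, Pingable client, long timeoutMillis) {
        Future<?> result = pool.submit(() -> {
            client.ping();
            return null;
        });
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the stuck call; the caller would purge the reference here
            return false;
        } catch (ExecutionException e) {
            return false; // the ping itself threw (e.g. a remote failure)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        Pingable fast = () -> { };
        Pingable slow = () -> Thread.sleep(3000);
        System.out.println("fast client ok: " + pingWithin(pool, fast, 2000));
        System.out.println("slow client ok: " + pingWithin(pool, slow, 2000));
        pool.shutdownNow();
    }
}
```

note that the measured time covers everything inside the call, including any name lookups the jvm performs, not just the network round trip.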

everything works perfectly for the most part, except that occasionally the 'ping' will time out. some clients reside in the same bldg as the server, while others reside in a different state halfway across the country. on average, the round-trip time for a 'ping' is around 120ms for remote clients and less than 50ms for clients local to the server.

what i usually find is that the timeouts occur in batches - usually all the remote clients will time out but not the locals, suggesting a network issue. however, occasionally a group of remotes will time out, all the locals will be ok, the next remote will be ok but the last remote will time out... other times a local client will time out...

i might add that this server is running on linux and the clients are all running on windows xp. i have another client-server setup that is identical except that its clients also run on linux. at first, my app exhibited the exact same timeout issues on the linux-linux system. i searched the java bug database and found the issue where a reverse dns lookup is done when an InetAddress is created. i then configured all of my linux clients to not use dns and the problem went away; i have yet to experience a single timeout since.
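one way to check whether reverse lookups are slow from a given machine is to time one directly from java. this is only a sketch: the class and method names are made up for illustration, and it relies on InetAddress.getByName doing no lookup for an ip literal while getCanonicalHostName performs the reverse lookup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ReverseLookupTimer {

    /** Returns the elapsed milliseconds for a reverse lookup of the given IP literal. */
    public static long timeReverseLookupMillis(String ipLiteral) throws UnknownHostException {
        // For a literal address only the format is checked; no name service call happens here.
        InetAddress address = InetAddress.getByName(ipLiteral);
        long start = System.nanoTime();
        // Asks the name service for the host name; returns the textual IP if the lookup fails.
        address.getCanonicalHostName();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Pass the client or server IPs as arguments; defaults to loopback.
        String[] targets = args.length > 0 ? args : new String[] {"127.0.0.1"};
        for (String ip : targets) {
            System.out.println(ip + " reverse lookup took " + timeReverseLookupMillis(ip) + " ms");
        }
    }
}
```

the jvm caches lookup results, so run it in a fresh jvm for each measurement; a result in the seconds range would be enough on its own to blow a 2 second ping timeout.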

i'm blaming it on the network but my network engineer is blaming it on my app. he's run various diags on the network and everything checks out ok (ICMP ping times are fine during the timeout). maybe i should just blame it on windows!

i've tried turning on various rmi logging but there's so much junk in the log i don't know what to look for.

any help would be greatly appreciated!

[2568 byte] By [ten6dsixa] at [2007-11-26 19:14:30]
# 1

The DNS configuration change is all the proof you need that reverse DNS is the culprit. This is probably not a network issue per se in the terms your network engineer is concerned with but rather a DNS configuration issue, which is a netadmin responsibility, or a DNS server software issue, which is the vendor's.

I'd just add that 2 seconds is pretty short for a timeout across a state boundary. Only you know your application's requirements, but I'd consider raising this to at least 5 seconds if you can. Basically any network timeout should be at least twice the total expected service time, including normal latencies in both directions.
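As a rough illustration of that rule of thumb, the sketch below derives a lower bound for the timeout from observed round-trip times. The class name, the method name and the choice of the worst observed sample as the "expected service time" are assumptions made for the example; the 120 ms and 50 ms figures come from the original post, while 95 ms is invented.

```java
public class TimeoutFloor {

    /** Returns twice the largest observed round-trip time, in milliseconds. */
    public static long suggestedFloorMillis(long[] roundTripSamplesMillis) {
        long worst = 0;
        for (long sample : roundTripSamplesMillis) {
            worst = Math.max(worst, sample);
        }
        return 2 * worst;
    }

    public static void main(String[] args) {
        long[] samples = {50, 120, 95}; // ms
        System.out.println("timeout should be at least " + suggestedFloorMillis(samples) + " ms");
    }
}
```

This only gives a floor; it says nothing about occasional stalls such as slow name lookups, which is why a more generous value may still be sensible.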

ejpa at 2007-7-9 21:15:31 > top of Java-index,Core,Core APIs...
# 2

thanks ejp for the reply.

it turns out that on the windows platform i need dns - i do not on the linux side. i have added the internal ips that the jvm may be trying to reverse lookup to the dns server - it made no difference. i then added all the ips to the etc/hosts file of the client - again no help. the server is configured to not use dns. i'm not entirely convinced it is dns - i was at first but if it were then these changes should have solved it - no?

i agree with you that 2 seconds may be a tad on the short side for a timeout across state boundaries - originally this was to all be in the same bldg so 2s was an eternity.

-ten6dsix

ten6dsixa at 2007-7-9 21:15:31
# 3
> i'm not entirely convinced it is dns - i was at first but if
> it were then these changes should have solved it - no?

Only if you did them correctly ;-) Check with nslookup and ping, and take note of the times required.
ejpa at 2007-7-9 21:15:31
# 4
ten6dsix - I've written a small free tool that times and logs the "raw" name service calls (gethostbyname and gethostbyaddr). Maybe it can help you to measure the time your app spends in name lookups. http://www.genady.net/dns/

Genady
genadya at 2007-7-9 21:15:31