Method invocation READ TIME OUT after many previous succesful invocations

Hi, I have problem with following error:

error during JRMP connection establishment; nested exception is:

java.net.SocketTimeoutException: Read timed out

java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:

java.net.SocketTimeoutException: Read timed out

at sun.rmi.transport.tcp.TCPChannel.createConnection(Unknown Source)

at sun.rmi.transport.tcp.TCPChannel.newConnection(Unknown Source)

at sun.rmi.server.UnicastRef.invoke(Unknown Source)

at cz.i2s.heri.Heri_Stub.getPages(Heri_Stub.java:85)

at cz.i2s.task.discovery.SourceDiscovery.work(SourceDiscovery.java:678)

at cz.i2s.task.AbstractTask.run(AbstractTask.java:169)

at cz.i2s.daemon.RunningTask$1.run(RunningTask.java:49)

at java.lang.Thread.run(Unknown Source)

Caused by: java.net.SocketTimeoutException: Read timed out

at java.net.SocketInputStream.socketRead0(Native Method)

at java.net.SocketInputStream.read(Unknown Source)

at java.io.BufferedInputStream.fill(Unknown Source)

at java.io.BufferedInputStream.read(Unknown Source)

at java.io.DataInputStream.readByte(Unknown Source)

... 8 more

It occurs unpredictably after a long run (days) of our application. We call on the server some "get" method in the cycle and this fails. Then on application ending a clean up method is invoked on the server and the invocation is fine again although the previous error. So I tried to retry the problematic "get" method call if this error occurred but it fails too and again, following finalization connect the server without the problem.

According the log, the server side is not contacted and there is no synchronization which could prolonged tre response.

I was searching the archives and I have found only this http://forum.java.sun.com/thread.jspa?threadID=718211&messageID=4148111

relevant but there is no helping response.

Does anybody know what can cause strange behavior like this or have some hint what to do?

[2067 byte] By [_sherpa_a] at [2007-11-26 19:56:49]
# 1
I sure hope you are not cleaning anything up during finalization... Yea, that would be unpredictable.Is your client trying to hold open a 'connection' to the server? or do you establish a new one when you want to communicate?
_dnoyeBa at 2007-7-9 22:51:11 > top of Java-index,Core,Core APIs...
# 2
We are not using any finalize blocks so this is not that case.And I'm not sure how it's with connection reusing - we are using the simple RMI pattern only - we are not trying to modify somehow the default RMI connection handling.
_sherpa_a at 2007-7-9 22:51:11 > top of Java-index,Core,Core APIs...
# 3
there is no RMI pattern.When are you obtaining the reference to the server? just once during lifetime of the client? or every time before you are ready to use it?
_dnoyeBa at 2007-7-9 22:51:11 > top of Java-index,Core,Core APIs...
# 4

The reference is obtained only once at the beginning. To be more precise, the code looks like this:

server = (MyServer) Naming.lookup("rmi://" + host + "/" + SOME_NAME);

...

try {

while (true) {

Data data;

try {

data = server.getData();//once after longer time, this fails with described exception

} catch (RemoteException e) {

Thread.sleep(1000*5);

data = server.getData(); //then, this fails too with the same ex.

}

if (data == null && !server.isRunning()) {

break;

}

}

} catch (RemoteException e) {

//

}

Info info = server.finish(); //but this is fine! The connection to server is successful and I did nothing else before this call

_sherpa_a at 2007-7-9 22:51:11 > top of Java-index,Core,Core APIs...
# 5

If it were me, unless there is a reason, I would not hold the server reference that long. I would simply re-get it each time I am ready to use it. Store it for perhaps 5 minutes max, and dump it if it takes longer than that.

its possible the connection has been broken during your wait for one reason or another. The connection should not break, but it certainly can. Try turning on some of the debugging features and see if you can tell when the connection is dropped.

Note: The server will drop the connection if there is no activity on it for 10 minutes. But this is not your responsibility to renew. The RMI architecture automatically renews the lease every 5 minutes by default, so it should not drop. but if something happens and the client can not, then you will get this error.

you can increase the property java.rmi.dgc.leaseValue to something beyond 10 minutes. But that is a workaround and like I said i would just recreate the connection.

on the server set the property sun.rmi.dgc.logLevel = VERBOSE

and see if you can tell when the server drops the connection. or maybe sun.rmi.transport.logLevel =verbose

as i said my basic strategy would be

...

try {

while (true) {

server = (MyServer) Naming.lookup("rmi://" + host + "/" + SOME_NAME);

Data data;

try {

data = server.getData();//once after longer time, this fails with described exception

} catch (RemoteException e) {

Thread.sleep(1000*5);

data = server.getData(); //then, this fails too with the same ex.

}

if (data == null && !server.isRunning()) {

break;

}

}

} catch (RemoteException e) {

//

}

Info info = server.finish(); //but this is fine! The connection to server is successful and I did nothing else before this call

its a network program so you should expect the connection to come up broken every now and then. try not to end the program when this happens but create a retry strategy for yourself.

_dnoyeBa at 2007-7-9 22:51:11 > top of Java-index,Core,Core APIs...
# 6

Thank you for the help. I will try your suggestions. I have already started to experimenting with the logs (but with transport log only - I didn't think about the lease property) and now I'm waiting for the error to occurs.

The strangest thing for me is why the server.getData() call fails two times (this was my naive attempt for retry strategy) and the immediate server.finish call is fine. I was expecting that the first failure will cause the connection renewal or something like this. And if not the 'finish' call should fail too... But it's more complicated somehow. You are right - I will try to re-get the server reference.

And one note - maybe it could be relevant - in my case, the host is localhost most of the time.

One more - thanks.

_sherpa_a at 2007-7-9 22:51:11 > top of Java-index,Core,Core APIs...
# 7

well a better retry strategy would be to reconnect on remote exception. thats what I use. but I also never fail. i just keep trying. but the user inspires my attempts so they know whats going on.

server = (MyServer) Naming.lookup("rmi://" + host + "/" + SOME_NAME);

try {

while (true) {

Data data;

try {

data = server.getData();//once after longer time, this fails with described exception

} catch (RemoteException e) {

Thread.sleep(1000*5);

server = (MyServer) Naming.lookup("rmi://" + host + "/" + SOME_NAME);

data = server.getData(); //then, this fails too with the same ex.

}

if (data == null && !server.isRunning()) {

break;

}

}

} catch (RemoteException e) {

//

}

Info info = server.finish(); //but this is fine! The connection to server is successful and I did nothing else before this call

i dont know why finish would work. Can you tell me the exact type of RemoteException being thrown? Also how do you know finish is working? It does not seem to throw remote exception.

_dnoyeBa at 2007-7-9 22:51:11 > top of Java-index,Core,Core APIs...
# 8

There's a confusion here between:

(a) RMI's DGC lease on the remote stub, which is formed when the client receives the stub and is renewed every 5 minutes: see http://java.sun.com/j2se/1.5.0/docs/guide/rmi/faq.html#leases and http://java.sun.com/j2se/1.5.0/docs/guide/rmi/faq.html#leases2

and

(b) the TCP connection instantaneously in use between any given client thread and the server, which is formed when the first RMI call to that host:port goes through from any stub; which can be re-used across different stubs and threads once it becomes idle; and which is closed after 15 seconds of idleness at the client. See http://java.sun.com/j2se/1.5.0/docs/guide/rmi/faq.html#connection

If the OP's error had anything to do with DGC and leases, the symptom would be a NoSuchObjectException, not a connection failure. The OP's problem is concerned with (b). I don't have an explanation or a workaround for it but it is seen from time to time.

Maybe increasing sun.rmi.transport.connectionTimeout at the client from the default of 15000 (15 seconds) to somewhere beyond the normal interval between these calls would mitigate it.

ejpa at 2007-7-9 22:51:11 > top of Java-index,Core,Core APIs...
# 9

Yes I was just considering that. And Noticed it was a ConnectIOException being thrown. But I don't think increasing the connection timeout will help as his failure occurs after 'days' of inactivity.

Anyway, certainly its better not to maintain a DGC lease for days. If he changes that perhaps a different exception will be thrown that sheds light on what is really happening.

Since this is days it could be a momentary DNS failure or the host IP could be changing or something like that.

_dnoyeBa at 2007-7-9 22:51:12 > top of Java-index,Core,Core APIs...
# 10

I don't think it could be caused by network failure because as I said, these errors happened while both client and server were on the same host and communicate locally.

As you were asking for exact type of RemoteException - unfortunately I don't know it now. The stack trace is all I have now. Secondly, I'm think that finish is the sane case because it's also method retrieving data from server (and throwing remoteException) and I can see them in the log.

Finally, some other observation: This error occurs when the server is running more intensively then client and the server allocate a lot of memory. The program goes like client call getData metod on server periodically in interval from 1 to 10 minutes, there are no longer pauses.

_sherpa_a at 2007-7-9 22:51:12 > top of Java-index,Core,Core APIs...
# 11

You can definitely have DNS failure on a single computer if you are accessing it by name and that name is not in the local DNS file. Are you on windows or Linux?

Your exception is ConnectIOException. Its in the stack trace. I just did not notice it at first. That is a type of RemoteException.

Are you doing the TCP level logging as he suggested now? And you can drop the logging of the DGC since that is not the issue here.

I think definitely set the sun.rmi.transport.connectionTimeout on the CLIENT higher. Also you may look at setting sun.rmi.transport.tcp.handshakeTimeout higher on the CLIENT as well.

Either way, its more of an indicaiton that the server is probably blocking. Are you doign anything on the server that can either block, or more likely, race and eat up all the CPU for a while? Also are you using a parallel GC on the server?

I suspect the server is becoming busy for some reason and is not responding to the client in a timely fashion. If that is the case, the it makes sense that finish always completes because you only hit finish if you experience errors in your while loop anyway, and those errors would be temporary.

_dnoyeBa at 2007-7-9 22:51:12 > top of Java-index,Core,Core APIs...
# 12

> But I don't think

> increasing the connection timeout will help as his

> failure occurs after 'days' of inactivity.

Well he didn't say anything about 'inactivity', just 'days'. But I agree, it's not a connection timeout issue, it's a handshake timeout issue, so your sun.rmi.transport.handshakeTimeout suggestion is a good one.

> Anyway, certainly its better not to maintain a DGC

> lease for days. If he changes that perhaps a

> different exception will be thrown that sheds light

> on what is really happening.

DGC has nothing to do with it.

> Since this is days it could be a momentary DNS

> failure or the host IP could be changing or something

> like that.

It is a read timeout doing the initial RMI handshake. So the TCP connection has already been established, so the problem has nothing to do with IP addresses or DNS. Note that by the stack trace it is a new connection, not one being reused.

I think your suggestion about something in the server hogging the CPU at the server is closest to the mark here. There's nothing else I can see that would block the server's ability to participate in the handshake, except maybe a temporary network outage, e.g. a router reboot somewhere.

ejpa at 2007-7-9 22:51:12 > top of Java-index,Core,Core APIs...
# 13

> Well he didn't say anything about 'inactivity', just

> 'days'. But I agree, it's not a connection timeout

> issue, it's a handshake timeout issue, so your

> sun.rmi.transport.handshakeTimeout suggestion is a

> good one.

Your right. I got a first impression and its hard to shake it. But in the previous post he says no more than 10 minutes of inactivity.

>

> > Anyway, certainly its better not to maintain a DGC

> > lease for days. If he changes that perhaps a

> > different exception will be thrown that sheds

> light

> > on what is really happening.

>

> DGC has nothing to do with it.

>

OK. Im thinking now that maybe he can log the server GC. Since this happens after days there is a chance this is the GC running hogging the CPU. _sherpa_ can you tell us your GC settings?

> It is a read timeout doing the initial RMI handshake.

> So the TCP connection has already been established,

> so the problem has nothing to do with IP addresses or

> DNS. Note that by the stack trace it is a new

> connection, not one being reused.

This is beyond my understanding of RMI. You get a RMI handshake before each communication? What in the trace tells you its a new connection?

Thanks.

>

> I think your suggestion about something in the server

> hogging the CPU at the server is closest to the mark

> here. There's nothing else I can see that would block

> the server's ability to participate in the handshake,

> except maybe a temporary network outage, e.g. a

> router reboot somewhere.

Also, it could be some other process on the server computer that runs routinely. Is it a windows or Linux box?

_dnoyeBa at 2007-7-9 22:51:12 > top of Java-index,Core,Core APIs...
# 14
sun.rmi.transport.tcp.TCPChannel.createConnection(Unknown Source) shows that it's still forming a new logical TCP connection.The RMI handshake is documented in the RMI Specification.
ejpa at 2007-7-9 22:51:12 > top of Java-index,Core,Core APIs...
# 15

Sorry for delayed response, but I was off...

It's running on Linux.

The GC setting is not changed. I'm using the default RMI properties.

The host name was specified as localhost, so if I understand it right, ale communication is running on socket.

I've change the sun.rmi.transport.tcp.handshakeTimeout to 10 min (from 1min) and the problematic task is still running, now for more than 5 days (it's more than ever before). So the reason for problem is not answered but this could be the workaround for it.

If the error occurs again, I will post the log (now it's set to verbose).

_sherpa_a at 2007-7-21 17:46:40 > top of Java-index,Core,Core APIs...
# 16

Glad to hear its working for you. Hopefully, ejp can chime in on the repurcussions of having a high handshakeTimeout. Which JRE are you using? Are you using the JVM in server or client mode? Is this a server class machine? (what is the CPU and RAM)

Since this is linux, I know by default you can have several processes that run early in the morning that do things like update the 'locate' database. Sometimes you may have a tool the runs to index your files for searching within them. At this time your JVM can be slowed if those other processes do not play nice. Maybe you can bump up the priority of your JVM process.

Anyway, if the problem is as we suspect, then your solution should solve it. And it is an acceptable solution as well. But still need to hear the downside of the handshake setting.

_dnoyeBa at 2007-7-21 17:46:40 > top of Java-index,Core,Core APIs...
# 17

Well the downside is that if the other end's RMI really is down or misbehaving you will take 10 minutes to find out instead of 10 seconds or whatever the default is. I would try tuning this timeout a bit. 10 minutes for a read is a very long time - I suggest 60 seconds, and get it down to 30 if possible.

ejpa at 2007-7-21 17:46:40 > top of Java-index,Core,Core APIs...
# 18

It's server class machine (only AMD 64bit opteron and 1gb RAM, but dedicated only for this task). We use J2SE 5.0 and the VM is in server mode as it's the only one for AMD64. There should be no exhausting processes starting in the morning etc. From the character of server process it can temporary be high on cpu.

Now, i realized that it has crashed again with this error (after six days). There is nothing in the log although the sun.rmi.transport.tcp.logLevel=verbose. But I found possible problem. As the server and client were running both on the same machine and the -xmx property was not setup properly (it exceeded avaible memory on the machine - I thought that there are 2gb of mem, not only the one - thanks for remind), the problem could be connected when system starts to swap. I hope this is the case.

Note: for some type of application setting handshake timeout to more than default 1 minute seems to be helpful.

_sherpa_a at 2007-7-21 17:46:40 > top of Java-index,Core,Core APIs...
# 19

Thats a good catch. That could certainly be your problem. On that same note, if you set the Xms too high I think you can see a similar problem. I'm not very sure if the 5.0 GC is smarter now.

Anyway, so long as you are just testing its fine to leave your code this way. Once you solve this problem I would change your code so that you have a retry strategy. I personally prefer not to hold the remote reference as live unless there is some logic/stateful reason why you do. If you don't hold the rem reference as live, then this problem can not happen.

Anyway, if you do hold it as live, instead of crashing, just send an entry to a log about the error, and proceed to re-get the remote reference. As a network aware program you do not want to crash when the network goes down, just start making a note in the log of the event.

_dnoyeBa at 2007-7-21 17:46:40 > top of Java-index,Core,Core APIs...
# 20

We have almost same thing, but this exception is thrown during lookup operation

java.rmi.ConnectIOException: error during JRMP connection establ

ishment; nested exception is:

java.net.SocketTimeoutException: Read timed out

java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:

java.net.SocketTimeoutException: Read timed out

at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:273)

at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:171)

at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:306)

at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)

at java.rmi.Naming.lookup(Naming.java:84)

in our case, there is overloaded linux server with load average >20. We are thinking of implementing custom socket factory for RMI and use setSoTimeout for read/write operations.

Could somebody please forecast, will we solve this problem in this way?

b0fh_uaa at 2007-7-21 17:46:40 > top of Java-index,Core,Core APIs...
# 21

Probably not, and it's doing it the hard way. Have a look at the java.rmi and sun.rmi system properties in the links at the bottom of http://java.sun.com/j2se/1.5.0/docs/guide/rmi/index.html. There are lots of properties you can set to control connect and read timeouts at both ends. Easier to tune that way too.

ejpa at 2007-7-21 17:46:40 > top of Java-index,Core,Core APIs...
# 22

Okay, as I learned the read timed out is not the case, because default value is 2 hours.

From another side, I've learned from sources the default connection timeout is not set to Socket.connect(timeout) method. Instead it uses AsyncConnector to connect to remote peer.

In the documentation you pointed me I didn't find anything about how to set connection timeout for socket connections.

Am I miss something?

b0fh_uaa at 2007-7-21 17:46:40 > top of Java-index,Core,Core APIs...
# 23

> Okay, as I learned the read timed out is not the

> case, because default value is 2 hours.

That's sun.rmi.transport.tcp.readTimeout and it applies where the 'incoming connections' are, i.e. at the server, not the client. For the client read timeout there is sun.rmi.transport.tcp.responseTimeout, also in milliseconds, default infinity. I don't think it's documented.

> In the documentation you pointed me I didn't find

> anything about how to set connection timeout for

> socket connections.

Contrary to what the documentation says, sun.rmi.transport.proxy.connectTimeout is used to time out all connections. There is also sun.rmi.transport.tcp.handshakeTimeout.

ejpa at 2007-7-21 17:46:40 > top of Java-index,Core,Core APIs...