Problem on Linux with terminated client and Selector.select()

RE: Fedora Core 4

java version "1.5.0_06"

Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)

Java HotSpot(TM) Server VM (build 1.5.0_06-b05, mixed mode)

All,

I'm running into a problem on Linux that I cannot recreate on Windows or Mac OS X. We have a server that accepts client connections (primarily from desktop apps) using NIO. The connections are made once and left open for as long as the client continues to run (i.e. this is not a web-server-like scenario where connections are made and torn down on each request). The problem occurs only on Linux, and only when the client app is killed (terminated from Task Manager or killed from the command line with kill -KILL). I'm sure the first question from most will be why this would be happening regularly. My answer is that it is not, but it can happen, and when we tested for it we found problems. Specifically, once a client has connected, we register a selection key with read interest on the channel. We then have a thread that sits in a while loop and does the following:

int readyChannelsCount = this.selector().select();
if (readyChannelsCount > 0) {
    Iterator selectedKeys = this.selector().selectedKeys().iterator();
    // ... process the channels that are ready
}

The issue on Linux is that, if we kill a client that is not currently in the middle of a read or a write, no IOException is thrown on the server, and select() keeps returning with zero ready channels. On Windows and Mac OS X, the select returns a ready channel for the killed client, and the selection key says that it is ready for a read. When the server tries to read from the channel, an IOException is thrown, and we shut down the connection properly. This is exactly what I want. On Linux, however, readyChannelsCount is always zero and select() returns immediately every time we call it, so we waste a ton of CPU cycles processing "phantom" selects. What I've noticed on Linux is that all of the selection keys that correspond to the killed clients are valid (selectionKey.isValid() == true), but they all have their interest ops and their ready ops set to zero (I have no clue how they get into this state). I was thinking of implementing a workaround that iterates through all of the selector's keys to see whether any of them are in this state, with no interest or ready ops. I would do this only when select() returns zero. If anyone has any thoughts on this problem or my proposed workaround, I would really appreciate hearing them.
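
For anyone following along, the proposed workaround might look roughly like the sketch below. It is not the server's actual code: the class name PhantomKeyScan and both method names are made up for illustration, and it assumes the scan runs on the selector thread, only after select() has returned zero, and that the application never deliberately leaves a key registered with an empty interest set.

```java
import java.io.IOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names are illustrative, not part of any library.
final class PhantomKeyScan {

    /** Returns the registered keys whose interest set is empty. */
    static List<SelectionKey> findSuspectKeys(Selector selector) {
        List<SelectionKey> suspects = new ArrayList<SelectionKey>();
        // keys() is not safe against concurrent registration, so call this
        // from the thread that owns the select loop.
        for (SelectionKey key : selector.keys()) {
            try {
                if (key.interestOps() == 0) {
                    suspects.add(key);
                }
            } catch (CancelledKeyException e) {
                // Already cancelled; the next select() will drop it.
            }
        }
        return suspects;
    }

    /** Cancels and closes every suspect key; returns how many were closed. */
    static int closeSuspectKeys(Selector selector) {
        List<SelectionKey> suspects = findSuspectKeys(selector);
        for (SelectionKey key : suspects) {
            key.cancel();
            try {
                key.channel().close();
            } catch (IOException e) {
                // Nothing useful to do here beyond logging.
            }
        }
        return suspects.size();
    }
}
```

The select loop would call closeSuspectKeys(selector) only in the branch where readyChannelsCount is zero, so the normal path pays nothing extra.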

[2635 byte] By [paulrslgsa] at [2007-10-3 4:47:37]
# 1

I'd have a look at the Bug Parade about this, but your proposed workaround does make sense.

I generally like to do a scan of all registered keys whenever select() returns zero, as it gives you an opportunity to detect things like outbound connections that have never completed, channels that are blocked waiting for an OP_WRITE that never comes, etc., and you can take action depending on what the interestOps are. In most cases the action is to close the channel with some logging.
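
A scan like this probably wants to branch on the interest set before closing anything. The helper below is only a sketch of that idea: IdleKeyDiagnosis and describe are invented names, the message strings are placeholders for whatever you log, and leaving OP_ACCEPT keys alone is my own assumption (closing the listening channel is rarely what you want).

```java
import java.nio.channels.SelectionKey;

// Hypothetical helper: maps an interest set to a log-friendly diagnosis.
final class IdleKeyDiagnosis {

    static String describe(int interestOps) {
        if ((interestOps & SelectionKey.OP_ACCEPT) != 0) {
            return "listening; leave open";      // assumption: never close the acceptor
        }
        if ((interestOps & SelectionKey.OP_CONNECT) != 0) {
            return "connect never completed";
        }
        if ((interestOps & SelectionKey.OP_WRITE) != 0) {
            return "blocked waiting to write";
        }
        if ((interestOps & SelectionKey.OP_READ) != 0) {
            return "idle reader";
        }
        return "no interest ops";
    }
}
```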

ejpa at 2007-7-14 22:52:00 > top of Java-index,Core,Core APIs...
# 2

ejp et al,

Thanks for the reply. My workaround is having a large negative impact on performance. It seems that there are lots of times on Linux when Selector.select() returns 0. I haven't found this exact issue in the Bug Parade, but there seem to be a few issues that smack of the same problem: 6429790, 6371630 and 6403933. Unfortunately, I still don't really have a solution, and this worries me greatly. I can't have an entire server basically shut down or dramatically slow down because a client app is terminated.

paulrslgsa at 2007-7-14 22:52:00
# 3

I'm also very interested in determining whether this is a known bug (on Linux only, as Windows/Mac functionality appears to work). I could not find an exact match in the list of known bugs other than the following bug, which has evidently already been fixed:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4850373

Does anyone have further ideas?

- Eric

oppositereactiona at 2007-7-14 22:52:00
# 4

The hint in there about OP_CONNECT and OP_WRITE is interesting.

IMO Sun have made a major blunder trying to separate these two things, as you can tell from this bug report and many others.

It means that a client socket channel must go through all the following stages:

1. creation

2. set non-blocking

3. connect

4. register for OP_CONNECT only

5. get OP_CONNECT, call finishConnect(); if it returns true, register for OP_READ and/or OP_WRITE only, getting rid of OP_CONNECT at this stage.

6. When you read -1 or get a SocketException or IOException, close the channel.

7. If that would terminate your select loop, call selectNow() once to really close the channel.

Does your code do all that?
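
The stages above can be sketched as follows. This is a minimal illustration under my own assumptions, not anyone's production code: ClientLifecycle and readAll are invented names, the client only reads (no OP_WRITE), and it goes straight to OP_READ if connect() happens to complete immediately, which can occur on a loopback connection.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Hypothetical example class; names are illustrative only.
final class ClientLifecycle {

    /** Connects to addr, reads until the peer closes, returns bytes read. */
    static int readAll(InetSocketAddress addr) throws IOException {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();            // 1. creation
        int total = 0;
        try {
            channel.configureBlocking(false);                    // 2. non-blocking
            boolean connected = channel.connect(addr);           // 3. connect
            channel.register(selector, connected                 // 4. OP_CONNECT only,
                    ? SelectionKey.OP_READ                       //    unless already
                    : SelectionKey.OP_CONNECT);                  //    connected
            ByteBuffer buffer = ByteBuffer.allocate(4096);
            while (channel.isOpen()) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    try {
                        if (key.isConnectable()) {               // 5. finish, then swap
                            if (channel.finishConnect()) {       //    interest to OP_READ
                                key.interestOps(SelectionKey.OP_READ);
                            }
                        } else if (key.isReadable()) {
                            buffer.clear();
                            int n = channel.read(buffer);
                            if (n < 0) {
                                channel.close();                 // 6. EOF: close
                            } else {
                                total += n;
                            }
                        }
                    } catch (IOException e) {
                        channel.close();                         // 6. error: close
                    }
                }
            }
            selector.selectNow();                                // 7. flush cancelled key
        } finally {
            channel.close();
            selector.close();
        }
        return total;
    }
}
```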

ejpa at 2007-7-14 22:52:00
# 5

FYI, I believe I fixed this problem by AVOIDING calls to "isValid()". I was skipping certain socket operations whenever "isValid()" returned false, on the assumption that I should not operate on a selector/channel that was no longer valid. However, this is the wrong thing to do. It stopped the IOException from firing when a given selector/channel was no longer valid, and that prevented the channel from closing down properly, giving me the error(s) I reported above.

In fact, I don't understand why Sun even provides an "isValid()" method in the first place. The method is not synchronized with the other methods on the selector/channel, so even if the object is valid when you check, it might not be valid on the very next line of code. So I avoid calling isValid() altogether and just handle the exceptions appropriately. All seems to be well now. Thanks for all the suggestions.
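
Roughly, the pattern is: attempt the operation and let the exception drive the cleanup, rather than testing isValid() first. The sketch below shows the shape of it, not my exact code; ReadHandler and handleRead are names invented for this post.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectionKey;

// Hypothetical handler; names are illustrative only.
final class ReadHandler {

    /**
     * Reads from the key's channel without checking isValid() first.
     * Returns the number of bytes read (0 if the key was not readable),
     * or -1 after closing the channel on EOF or any failure.
     */
    static int handleRead(SelectionKey key, ByteBuffer buffer) {
        try {
            if (!key.isReadable()) {
                return 0;
            }
            int n = ((ReadableByteChannel) key.channel()).read(buffer);
            if (n >= 0) {
                return n;
            }
            // n == -1: the peer closed the connection; fall through to cleanup.
        } catch (CancelledKeyException | IOException e) {
            // The key or channel went away underneath us; clean up below.
        }
        key.cancel();
        try {
            key.channel().close();
        } catch (IOException ignored) {
            // Already closing; nothing more to do.
        }
        return -1;
    }
}
```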

oppositereactiona at 2007-7-14 22:52:00
# 6
Thanks, this is a very good point.
ejpa at 2007-7-14 22:52:00