Java Telnet client crashes TCP stack

We have a problem in which, on certain machines, a Java Telnet client connected to a specific host will lock up, and crash the TCP stack of the machine it's running on.

On the test box we're running it on (Win2k running Java SE 1.6.0_01-b06), the problem occurs during initial negotiations. After the "NEW-ENVIRON" (including auto-logon) and "TERMINAL-TYPE" (a TN5250-specific type, probably "IBM-3477-FC") subnegotiations are completed, at the point where negotiations would normally continue with WILL/DO negotiations on "end of record" and "binary," it locks up trying to do a SocketInputStream.read(). The machine then spends a minute or more alternating between several seconds in which the mouse pointer is frozen and several seconds in which it is responsive. By the time this is over, the TCP stack is toast.

On another test box that's also Win2k running SE 1.6.0_01-b06, everything works normally.

On a customer machine, it gets through the negotiation fine, but fails somewhat later in the session, with the same symptoms.

Has anybody here seen anything like this?

[1121 byte] By [hbquikcomjamesla] at [2007-11-27 8:35:53]
# 1
Impressive. What is this telnet library? Did you write it yourself or get it from somewhere?
cotton.ma at 2007-7-12 20:32:35 > top of Java-index,Core,Core APIs...
# 2

My first inclination would be to ask how you have decided it's crashing the TCP stack.

For example poorly written network IO code can cause the processor to spin quite a bit, perhaps to the point of crashing.

I think you'll need to post your formatted code for ejp to look through.

cotton.ma at 2007-7-12 20:32:35
# 3

> My first inclination would be to ask how you have decided it's crashing the TCP stack.

Well, I suppose it could be crashing the Ethernet card or its driver instead.

At any rate, though, once it's happened, the machine is totally bottled. It can't access the Internet; it can't access shared network drives, and that state continues until it's at least warm-booted.

--

JHHL

hbquikcomjamesla at 2007-7-12 20:32:35
# 4
Well, I think we need to see code. As far as your basic question goes, no, this is not a common complaint around here.
cotton.ma at 2007-7-12 20:32:36
# 5

I FOUND SOMETHING!

Both the transmitString method, used in initial negotiations, and the transmitData method, used once the TN5250/TN5250e protocol is established, use the transmitChar() method, given below (where "os" is the socket's output stream):

    /* Transmit a character */
    private void transmitChar(int ch) {
        try {
            os.write(ch);
            os.flush();
            if (debug) System.out.println("Sent char " + ch + " (hex " + Util.toHex(ch) + ")");
        }
        catch (IOException e) {
            Util.p("Client.transmitChar(): " + e.getMessage());
            e.printStackTrace();
        }
    }

Either way, once there's a buffer of data ready to send, this method is called repeatedly until the entire transmission is sent. Note also that the socket's input stream is monitored for incoming data on a completely separate thread from the one writing to the socket's output stream; the reader thread simply blocks until data arrives.
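For comparison, here is a minimal sketch of the per-character approach described above next to a batched alternative that writes the whole buffer and flushes once. The class and method names (BatchTransmit, CountingStream, transmitPerChar, transmitBatched) are hypothetical and not part of the poster's client; the counting stream only stands in for the socket's output stream so the number of flush() calls can be observed. On a real TCP socket, each flushed byte may end up as its own small segment (subject to Nagle's algorithm), which is one reason batching is usually preferred, though nothing in this thread shows that batching would cure the fault being discussed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class BatchTransmit {
    /** Stand-in for the socket stream (hypothetical): records how often flush() is called. */
    static final class CountingStream extends ByteArrayOutputStream {
        int flushes;
        @Override public void flush() { flushes++; }
    }

    /** Mirrors the pattern in the post: one write() and one flush() per byte. */
    static void transmitPerChar(OutputStream os, byte[] data) {
        try {
            for (byte b : data) {
                os.write(b & 0xFF);
                os.flush();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Alternative: hand the whole buffer to the stream, then flush once. */
    static void transmitBatched(OutputStream os, byte[] data) {
        try {
            os.write(data, 0, data.length);
            os.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xFF, (byte) 0xFB, 0x19}; // IAC WILL END-OF-RECORD
        CountingStream perChar = new CountingStream();
        CountingStream batched = new CountingStream();
        transmitPerChar(perChar, data);
        transmitBatched(batched, data);
        System.out.println("per-char flushes: " + perChar.flushes);
        System.out.println("batched flushes:  " + batched.flushes);
    }
}
```

Both methods put the same bytes on the stream; only the number of write/flush calls differs.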

It occurred to me over the weekend that if the data somehow didn't get completely sent, the host might time out, causing the "Connection reset" exception on the socket. So I commented out the "if(debug)" on the System.out.println() call, so that I could see whether we were at least *trying* to send all the bytes.

Lo and behold, with the diagnostic turned on, it refused to fail.

Could I in some way be transmitting the bytes faster than the machine's Ethernet card could handle them? Or something similar?

hbquikcomjamesla at 2007-7-12 20:32:36
# 6

Aha.

(a) A connection reset isn't a 'crash [of] the TCP stack'. It is the deliberate resetting by the stack of a single TCP connection in response to receiving a TCP RST segment. The TCP stack keeps running. You don't have to reboot. Java doesn't cause such faults in my experience and neither does anything else to the network. Next time I suggest you post the actual exception or error message, not what you think it means.

(b) It is encountered when writing to a socket which has already been closed by the other end of the connection.

(c) It indicates an error in the application protocol. In this case you are sending something that the server doesn't like so it is disconnecting you.
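To make the distinction in (a) concrete, the following is a small loopback sketch of what a connection reset looks like from Java. It is not the poster's code: ResetDemo and provokeReset are hypothetical names, and the "server" here is just a local ServerSocket that aborts the connection by closing with SO_LINGER set to zero, which on common platforms sends an RST instead of a graceful FIN. The client's next read() is then expected to fail with a SocketException, while the rest of the machine's networking carries on unaffected.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class ResetDemo {
    /**
     * Opens a loopback connection, has the accepting side abort it, and returns
     * the exception thrown by the client's read() (or null on a clean end of stream).
     */
    static IOException provokeReset() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            client.setSoTimeout(5000);     // safety net so the demo can never hang
            accepted.setSoLinger(true, 0); // linger 0: close() aborts the connection
            accepted.close();
            try {
                client.getInputStream().read();
                return null;               // peer closed gracefully, no reset observed
            } catch (IOException e) {
                return e;                  // the reset surfaces here as an exception
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // setup failure, unrelated to the reset itself
        }
    }

    public static void main(String[] args) {
        System.out.println("Client saw: " + provokeReset());
    }
}
```

The point of the sketch is only that a reset is scoped to one connection and arrives as an ordinary exception that can be logged and posted verbatim.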

ejpa at 2007-7-12 20:32:36
# 7

The connection reset was only part of the problem. If it's even that. And it's certainly not the most dramatic or obvious of the symptoms.

The entire system, once the fault occurs, begins alternating between a seemingly normal state and a totally unresponsive one, and by the time this oscillation ends, the computer is responsive, but it is completely cut off from the network.

Which is what I said in the initial message.

More specifically, as stated in another of my messages on this thread, it leaves the system in a state wherein

> It can't access the Internet; it can't access shared network drives,
> and that state continues until it's at least warm-booted.

And it is only known to occur on two machines: a customer box in which it happens once the TN5250 protocol is established, and one of our test boxes, in which it was occurring during initial protocol negotiation. And only when connecting to a specific host. And even on that client box, connected to that host, it only appears to occur if there's another connection going to the same server, but that may be a red herring.

Adding a half-millisecond delay ("wait(0,500000)") at the beginning of transmitChar() causes the affected test box to get through the negotiation fine, but fail somewhat later, much like the customer box. A one-millisecond delay ("wait(1)") had the same effect. The next step, tomorrow, is to try a 10-millisecond delay. But that still brings up the question, has anybody ever managed to overload a network card by writing to a socket output stream too fast for either the card or the driver?
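As an illustration of the kind of diagnostic delay described above, here is a minimal sketch that paces a per-byte transmit loop with an explicit nanosecond deadline. PacedWriter and writePaced are hypothetical names, not part of the client in this thread. One caveat worth checking on the JVM in use: Object.wait(long, int) must be called while holding the object's monitor, and in the JDK implementations I am aware of it rounds sub-millisecond values up to a whole millisecond, so "wait(0,500000)" and "wait(1)" may have been the same delay. Treat that as an assumption to verify rather than an established fact about the affected boxes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.locks.LockSupport;

final class PacedWriter {
    /** Writes data one byte at a time, waiting at least delayNanos before each byte. */
    static void writePaced(OutputStream os, byte[] data, long delayNanos) {
        try {
            for (byte b : data) {
                long deadline = System.nanoTime() + delayNanos;
                // parkNanos may return early, so loop until the deadline has really passed.
                while (System.nanoTime() - deadline < 0) {
                    LockSupport.parkNanos(deadline - System.nanoTime());
                }
                os.write(b & 0xFF);
                os.flush();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the socket stream
        long start = System.nanoTime();
        writePaced(sink, new byte[] {1, 2, 3, 4, 5}, 500_000L);   // half a millisecond per byte
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println(sink.size() + " bytes in " + elapsedMicros + " microseconds");
    }
}
```

A delay like this is only a probe for timing sensitivity; it does not explain why the fault occurs.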

Incidentally, on the affected box (running Win2k), if Task Manager is up at the time, the periods of unresponsiveness show a sharp spike in CPU usage, but whatever is causing the unresponsiveness takes over the machine so completely that nothing shows up on the task list (and indeed, the CPU usage graph also stops moving, and so, I suspect, does the TOD clock).

--

JHHL

hbquikcomjamesla at 2007-7-12 20:32:36
# 8
In that case I would either change the network card or its driver or reinstall the operating system. Most probably the hardware is at fault here. TCP/IP can't send too fast for the NIC; the driver should stop that from happening.
ejpa at 2007-7-12 20:32:36
# 9

Earlier, we tried pulling out every extraneous background task we could find on our test box, and running at least two different checks for malware.

It went from crashing TCP to crashing the operating system and causing a spontaneous reboot. But we did also determine that it was happening with a 1.5 JVM as well as 1.6.

We just did a scratch-reinstall of Win2k on the test box today, and it's still crashing the operating system and causing a spontaneous reboot.

And it's only certain machines.

Nothing makes sense.

hbquikcomjamesla at 2007-7-12 20:32:36
# 10
Well, at least you've eliminated some things. It wasn't the O/S, it wasn't Java, it wasn't the background tasks. I would replace all the NICs with a different brand.
ejpa at 2007-7-12 20:32:36
# 11

More:

We tried something that we should have tried much, much earlier. Last night, I wondered if the problem also occurred in Secured Telnet (i.e., SSL TN5250 over Port 992). So this morning, I checked to see if the customer in question had the Secured Telnet server running on their AS/400. It turns out they did, so we switched our application to use Secured mode, and ran another test.

Secured mode works fine.

But that still kind of raises the question of what could be causing the problem.

Could this in some way be analogous to the old "Ping of Death"?

hbquikcomjamesla at 2007-7-12 20:32:36
# 12
The problem is clearly a hardware problem. There is nothing you can send over TCP/IP that can cause an O/S crash. Otherwise the Internet wouldn't work 5% as well as it does.
ejpa at 2007-7-12 20:32:36