Buggy encoding implementation? Buffering issue?
I'm working on an Ajax application. I have a servlet that builds HTML, packs it into XML, and sends it out to a browser that uses JavaScript to unpack it. For some reason, at a similar position in the XML stream every time, there is an isolated error in the encoding.
More specifically... (and this is a test implementation, so please no model2 or overarching architecture critiques...)
1. a request comes in to a servlet
2. the servlet maps the request to an object, which is responsible for building a page
3. the page is packed into a single (fairly large) string
4. this string is encoded via the following method
    private String encode(String toEncode)
        throws UnsupportedEncodingException
    {
        String retString = new String(toEncode);
        retString = URLEncoder.encode(retString, "utf-8");
        Matcher matcher = Pattern.compile("[\\+]").matcher(retString);
        retString = matcher.replaceAll("%20");
        return retString;
    }
Note that the client-side decoder can't handle the (rather annoying) '+' encoding of space characters, so I am manually converting each '+' to the more standard '%20' encoding.
5. the resulting encoded string is concatenated into another string which contains the appropriate XML tags
6. this final string is then sent out of the servlet via a single println() call on the HttpServletResponse's PrintWriter.
Of course, I can't really see how the client is perceiving the HTML via its own decoder, but I've been manually decoding sample output streams with perl, and I've consistently been finding a single encoding artifact somewhere around the 93rd/94th line of the decoded HTML (even when I've changed the page being sent), and this artifact is consistent with the page distortions that I've been seeing in the browser. The HTML after the artifact also seems to be okay.
Is there a maximum size of string that I can use with Matcher or URLEncoder?
Is there a better way to get each '+' encoded as '%20'?
Is there some reason why these fairly basic, high-use Java classes should be choking, other than that I'm feeding them oversized strings (on the order of 5~6k)?
[2499 byte] By [dcaudella] at [2007-11-27 11:43:25]

There might be a size limit. Instead of sending it in one huge string, break it down and send it character by character and put it back together at the other end.
Did this help?
No: http://forum.java.sun.com/thread.jspa?forumID=31&threadID=259647
Your strings could be too large. How many characters long are they?
Look, I really appreciate that you're trying to help me, but I really don't know if you're quite at the level where you should be offering people advice, yet.
Just two threads up from this one or so, I just caught you trying to tell somebody that the solution to his socket code throwing exceptions is just to wrap it in a try/catch. You obviously are a little inexperienced.
Anyway...
The bottom line of my post specifies exactly how big the strings are... (5~6k) right now. According to other threads, there is apparently a very large string size limit, which I'm not exceeding, as the errors I'm seeing are appearing in the middle of an otherwise fine string.
Ignoring the business about string lengths, it seems to me that you are testing the output of this method by sending it across the network and maybe processing it through something else (although I couldn't really tell that). So you're trying to test too much. Have you tried testing the method in a stand-alone environment?
The only thing that stands out in your escape method is "utf-8" -- the docs say it should be capitalized. Oh, and there's no need for creating a new String in the first line.
Regardless, it should be easy to write a unit test that throws some large strings at it, and verifies that they get encoded properly.
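A minimal sketch of such a stand-alone check follows. The class name EncodeCheck, the sample() helper, and the 6000-character size are made up for illustration; encode() is meant to behave like the posted method, but uses String.replace instead of a regex Matcher.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EncodeCheck {
    // Intended to match the posted method: URL-encode, then turn '+' into "%20".
    static String encode(String toEncode) throws UnsupportedEncodingException {
        return URLEncoder.encode(toEncode, "UTF-8").replace("+", "%20");
    }

    // Hypothetical helper: builds HTML-like text of at least the requested length.
    static String sample(int length) {
        StringBuilder sb = new StringBuilder();
        while (sb.length() < length) {
            sb.append("<td class=\"a+b\">caf\u00e9 &amp; more</td>\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = sample(6000);
        String encoded = encode(original);

        // Decoding the encoded text must give back exactly the input.
        String decoded = URLDecoder.decode(encoded, "UTF-8");
        if (!decoded.equals(original)) {
            throw new AssertionError("round trip changed the string");
        }
        // No raw '+' or space may survive in the encoded form.
        if (encoded.indexOf('+') >= 0 || encoded.indexOf(' ') >= 0) {
            throw new AssertionError("unescaped '+' or space in output");
        }
        System.out.println("ok: " + original.length() + " chars round-tripped");
    }
}
```

If this passes for strings well beyond 6k, the encoder is very unlikely to be the culprit.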
However, the behavior that you describe -- malformed text at a specific point in the string -- does indicate a buffering problem. And it seems that the most likely point for this problem is elsewhere in your code:
> 3. the page is packed into a single (fairly large)
> string
How is it packed into that string? Do you perhaps have a buffer that gets reused, or written without regard to size of its contents? Are there perhaps NULs buried somewhere in the string?
> Is there some reason why these fairly basic, high use
> java classes should be choking
Nope. You'd be better off looking at the rest of your code.
It appears that the UTF encoding and "+" replacement have an interaction. Results are different depending on the placement of the "+" replacement. I second DrClap's suggestion of a test - here's some code I tried:
    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class Zz
    {
        public static void main(String[] args)
        {
            String retString = "asdf asdf+asdf";
            retString = retString.replaceAll("[+]", "%20");
            System.out.println(retString);
            try
            {
                retString = URLEncoder.encode(retString, "UTF-8");
            }
            catch (UnsupportedEncodingException ex)
            {
                ex.printStackTrace();
            }
            //retString = retString.replaceAll("[+]", "%20");
            System.out.println(retString);
        }
    }
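For comparison, here is a sketch of the two orderings side by side (the class name OrderDemo and its method names are made up for this example). URLEncoder turns a literal '+' into %2B, so any '+' left in its output can only stand for a space, which makes replacing after encoding safe. Replacing before encoding gets the inserted '%' encoded a second time, as %25.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class OrderDemo {
    // Replace '+' first, then encode: the "%20" we inserted is itself re-encoded.
    static String replaceThenEncode(String s) throws UnsupportedEncodingException {
        return URLEncoder.encode(s.replace("+", "%20"), "UTF-8");
    }

    // Encode first, then replace: every remaining '+' is an encoded space.
    static String encodeThenReplace(String s) throws UnsupportedEncodingException {
        return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String input = "asdf asdf+asdf";
        System.out.println(replaceThenEncode(input)); // prints asdf+asdf%2520asdf
        System.out.println(encodeThenReplace(input)); // prints asdf%20asdf%2Basdf
    }
}
```

The second ordering is the one used in the original encode method.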
You're right. Thanks for the good idea. I am just going over a loopback socket... but I wrote the stream down to a file, and then sent it out and captured it... there's a single character's difference. And it's some kind of whitespace.
Actually... it's pretty cool whitespace, too! It causes Windows' copy/paste to misbehave. This is probably overkill, but I took Axe to it, and it's an
0x 0D 0A 0D 0A
What on earth is that? And what causes java to inject strange characters into a servlet output stream?
That's two Windows CR LF pairs.
Okay, I'm pretty sure of this now; 0x0D0A0D0A is a double cr-lf, and it's being inserted sometime after I call HttpServletResponse's printwriter's println()...
> Okay, I'm pretty sure of this now; 0x0D0A0D0A is a
> double cr-lf, and it's being inserted sometime after
> I call HttpServletResponse's printwriter's
> println()...
Well, println() terminates the line with a platform-specific line terminator, which on Windows is <CR><LF>. The sequence <CR><LF><CR><LF> is also specifically used by the HTTP protocol to delimit the header and body of the message, so be sure that you're not getting confused there.
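A small sketch of the println() behavior, using a StringWriter in place of the servlet response so that it runs stand-alone (that substitution, and the class name LineEndDemo, are assumptions for this example):

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class LineEndDemo {
    // Writes body with println() and returns everything that was written.
    static String viaPrintln(String body) {
        StringWriter sink = new StringWriter();
        PrintWriter out = new PrintWriter(sink);
        out.println(body);
        out.flush();
        return sink.toString();
    }

    // Writes body with print(), which adds no line terminator.
    static String viaPrint(String body) {
        StringWriter sink = new StringWriter();
        PrintWriter out = new PrintWriter(sink);
        out.print(body);
        out.flush();
        return sink.toString();
    }

    public static void main(String[] args) {
        String body = "<xml/>";
        // Whatever println() appended after the body is the line terminator.
        String terminator = viaPrintln(body).substring(body.length());
        StringBuilder hex = new StringBuilder();
        for (char c : terminator.toCharArray()) {
            hex.append(String.format("%02X ", (int) c));
        }
        // On Windows this is expected to show 0D 0A; on Unix-like systems, 0A.
        System.out.println("println() appended: " + hex.toString().trim());
        System.out.println("print() appended nothing: " + viaPrint(body).equals(body));
    }
}
```

Using print() instead of println() for the final write keeps any trailing terminator out of the response body.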
I'm going to echo DrClap's comment from earlier today: you're trying to test too much at once. Go back, and insert debugging code at different places in your servlet. I'd suggest writing the contents of the string to a known file location before and after encoding, and then dumping those debug files.
You might find the Jakarta IO libraries helpful in this: http://jakarta.apache.org/commons/io/apidocs/index.html
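Without any extra library, the before/after dump could look roughly like this. It is only a sketch: the DumpDemo class, the dump() helper, the file names, and the sample page text are all hypothetical, and it relies on java.nio.file, which needs Java 7 or later.

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DumpDemo {
    // Hypothetical helper: writes content as UTF-8 bytes to dir/name and returns the path.
    static Path dump(Path dir, String name, String content) throws IOException {
        Path file = dir.resolve(name);
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        // A temp directory stands in for the "known file location".
        Path dir = Files.createTempDirectory("encode-debug");
        String page = "<p>hello world + more</p>";

        dump(dir, "before.txt", page);
        String encoded = URLEncoder.encode(page, "UTF-8").replace("+", "%20");
        dump(dir, "after.txt", encoded);

        System.out.println("wrote dumps to " + dir);
    }
}
```

Comparing the two files in a hex viewer or diff tool then shows whether the string is already damaged before it leaves the servlet.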
If you're going to do client-server debugging, you should probably use a tool that captures the actual network traffic and allows you to examine it. If you're using Firefox, try out Firebug ( http://www.getfirebug.com/ ). If you're using IE, try out Fiddler ( http://www.fiddlertool.com/ ) -- actually, I'd suggest using Fiddler regardless of what your browser is. You'll have to configure Firefox to use it as a proxy, but I think the various views that it provides of the request and response are incredibly useful.
> If you're going to do client-server debugging, you
> should probably use a tool that captures the actual
> network traffic
I've found ethereal to be quite useful.
jverda at 2007-7-29 17:49:53 >

Well, it looks like the extra CR-LFs were being inserted at some point during the process of copying and pasting buffer dumps from window to window. In the end, a combination of Axe, Windiff, and HTTPSpy (and of course dumping to disk before sending packets) verified without a doubt that I was wrong, and the buffer I was printing from the servlet was actually hitting Firefox on the client unchanged...
I'm pretty sure I'm packing and encoding the string properly, as when I use perl to unpack the buffer, it comes out clean.
So right now it looks like the UTF-8 decoder in Firefox's JavaScript interpreter isn't happy with overly large strings, but I'm still in the process of verifying this.
There's a lot of very useful feedback from a good number of well-experienced looking individuals on this thread. Thanks for your help, peoples!