MASSIVE Unix Encoding Problem
Hi!
Thanks in advance to anyone who has a suggestion for this problem:
I have a database (PostgreSQL, encoding UNICODE) that stores a TEXT value in a certain format. The problem with this format is that I used the section sign "§" as the end delimiter. This works fine under Windows, but under Unix the delimiter is not recognized.
Here are some code snippets to show you what's going on:
ResultSet rs;
[...]
String encodedText = rs.getString("text");
[transferred to the GUI]
public static final char DELIM_END = '§';
[that was pretty stupid]
BufferedReader reader = new BufferedReader(new StringReader(encodedText));
Then I scan the BufferedReader for the delimiter characters, but the end delimiter is never found, which makes things pretty ugly, as you can imagine. Under Unix (Linux or Mac), the character that is in encodedText when I debug my application is the "currency sign" (Unicode 00A4; you can see it at http://www.decodeunicode.org/ when you type in "00A4"). Interestingly, the section sign is Unicode 00A7.
I have now tried many different types of streams with different code pages and so on. Does anyone have an idea what I could do about this?
[1480 byte] By [jedesa] at [2007-10-2 21:19:47]

My Linux machine has no problem with the following code:
char DELIM_END = '§';
String test = "This is a test§of using the section symbol as a delimiter";
java.io.BufferedReader br = new java.io.BufferedReader(new java.io.StringReader(test));
StringBuffer sb = new StringBuffer();
int i;
// Read one character at a time; print and reset the buffer at each delimiter.
while ((i = br.read()) != -1) {
    if ((char) i != DELIM_END) {
        sb.append((char) i);
    } else {
        System.out.println(sb.toString());
        sb.delete(0, sb.length());
    }
}
// Print whatever follows the last delimiter.
System.out.println(sb.toString());
The output is:
This is a test
of using the section symbol as a delimiter
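One thing that may be worth ruling out is the delimiter literal itself. A non-ASCII character typed directly into a source file is interpreted according to whatever encoding the compiler assumes for that file, and that can differ between Windows and Unix. The sketch below writes the section sign as a Unicode escape so the constant has the same value on every platform. The class and method names are made up for illustration, and whether this is the actual cause of the problem in this thread is an assumption, not something the test above proves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named only for this illustration.
final class SectionDelimiterSketch {

    // The Unicode escape keeps the value independent of the encoding
    // that javac assumes for the source file.
    static final char DELIM_END = '\u00A7'; // SECTION SIGN

    // Splits the text at every DELIM_END; the text after the last
    // delimiter (possibly empty) becomes the final element.
    static List<String> splitAtDelimiter(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == DELIM_END) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        String test = "This is a test\u00A7of using the section symbol as a delimiter";
        for (String part : splitAtDelimiter(test)) {
            System.out.println(part);
        }
    }
}
```

If the source files must contain the literal character, passing an explicit `-encoding` option to javac on every platform is the other common way to keep the constant consistent.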
@Niceguy1:
Sorry, I wasn't quite clear on something. Formatted text that was saved to the database by a Windows client CANNOT be read under Linux because of the section sign: the "currency sign" appears where the section sign should be. However, when I save my formatted text to the database under Linux, the Linux client can open it correctly. The "currency sign" then also appears in the database. Of course, what I want is for every client to save the text the same way and to interpret it the same way.
@BIJ001 & @uncle_alice:
I'm also thinking of some kind of driver issue in the retrieval: by the time I get the string out of the ResultSet it is, I think, already too late. The places where the section sign should be are undisplayable under Linux (I tried all kinds of stream and code page combinations).
When I look at my database with the EMS SQL Manager for PostgreSQL, it says ENCODING=UNICODE, which maybe means UTF-8?
So the question could be: how do I get the JDBC driver or the ResultSet to handle the Windows encoding on Unix systems?
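To narrow down where the character changes, it might help to print the code points of the string right after rs.getString("text") on both Windows and Linux, together with the JVM's default charset. The following is only a diagnostic sketch; the class name and the sample string are invented, and in the real application the string from the ResultSet would be passed to describe() instead.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic helper, not part of the original application.
final class CodePointDump {

    // Renders each code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // These two values often differ between a Windows and a Unix client.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Stand-in for the string returned by rs.getString("text").
        String sample = "end\u00A7";
        System.out.println(describe(sample)); // U+0065 U+006E U+0064 U+00A7
    }
}
```

If the dump already shows U+00A4 on both platforms for text saved by one client, the character was probably altered before or during the insert rather than during retrieval.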
It really shouldn't matter what the database's actual encoding is. When they say it's "UNICODE", they're really just saying that it will always be the same and that it will handle any valid Unicode character. But at some point, some other part of the system appears to be expecting the text to be in the OS's default encoding when it's really in "UNICODE", or vice versa. I'm really just hoping to spark ideas in someone who knows more about this aspect of database programming than I do; I don't even know what questions to ask.
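As a generic illustration of that kind of mismatch (not a reproduction of the exact 00A4 symptom in this thread), here is a small sketch of what happens when bytes written in one encoding are decoded with another. The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows an encode/decode mismatch in isolation.
final class EncodingMismatchDemo {

    // Encodes as UTF-8 but decodes as ISO-8859-1, as a component with
    // the wrong expectation about the encoding would.
    static String decodeUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes and decodes with the same charset; the text survives intact.
    static String roundTripUtf8(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String section = "\u00A7";
        // The section sign is two bytes in UTF-8 (0xC2 0xA7), so the
        // mismatched decode yields two characters instead of one.
        System.out.println(decodeUtf8AsLatin1(section).length()); // 2
        System.out.println(roundTripUtf8(section).equals(section)); // true
    }
}
```

The point is only that both sides have to agree on one explicit charset; relying on the platform default on either side is where this sort of damage usually creeps in.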
One possible approach, though, is suggested by [url=http://linux.kieser.net/java_pg_unicode.html]this article[/url]: read the text as an array of bytes and convert it yourself. I don't know why the author expects the text to be in that particular encoding (looks like pure blue-sky optimism to me), but you can lock that down by the same technique: when you're inserting text, convert it to a byte array in the encoding of your choice (I would use UTF-8) and use the setBytes() method in your PreparedStatement.
But please don't tell anyone that I offered this as a solution. I don't know if it will work, and even if it does, it's a putrid hack. I don't know who's to blame, but you should not be having this problem.
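For what it's worth, the byte-array approach described above could look roughly like the sketch below. Everything specific in it is an assumption: the table name documents, its columns id and body, and the idea that body is a binary column (bytea in PostgreSQL) are all made up for illustration, and I have not run this against a real database.

```java
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper; table and column names are assumptions.
final class Utf8BytesStore {

    // Always encode with an explicit charset, never the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Stores the text as UTF-8 bytes in the assumed binary column "body".
    static void save(Connection con, int id, String text) throws SQLException {
        String sql = "INSERT INTO documents (id, body) VALUES (?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, id);
            ps.setBytes(2, encode(text));
            ps.executeUpdate();
        }
    }

    // Reads the bytes back and decodes them as UTF-8; null if not found.
    static String load(Connection con, int id) throws SQLException {
        String sql = "SELECT body FROM documents WHERE id = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;
                }
                byte[] bytes = rs.getBytes(1);
                return bytes == null ? null : decode(bytes);
            }
        }
    }
}
```

Because every client would encode and decode with the same explicit charset, the platform default no longer enters the picture, at the cost of losing text searching on that column.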