MASSIVE Unix Encoding Problem

Hi!

Thanks in advance to anyone who has a suggestion for this problem:

I have a PostgreSQL database (encoding UNICODE) that stores a TEXT value in a particular format. The problem with this format is that I used the section sign "§" as the end delimiter. This works fine under Windows, but under Unix the delimiter is not recognized.

Here are some code snippets to show you what's going on:

ResultSet rs;

[...]

String encodedText = rs.getString("text");

[transferred to the GUI]

public static final char DELIM_END = '§';

[that was pretty stupid]

BufferedReader br = new BufferedReader(new StringReader(encodedText));

Then I scan the BufferedReader for the delimiter characters, but the end delimiter is never found, which makes things pretty ugly, as you can imagine. Under Unix (Linux or Mac), when I debug my application, the character in encodedText where the delimiter should be is the "currency sign" (Unicode 00A4; you can see it at http://www.decodeunicode.org/ when you type in "00A4"). Interestingly, the section sign is Unicode 00A7.

I have now tried many different types of streams with different code pages and so on. Does anyone have an idea what I could do about this?

[1480 byte] By [jedesa] at [2007-10-2 21:19:47]
# 1

My Linux machine has no problem with the following code:

char DELIM_END = '§';
String test = "This is a test§of using the section symbol as a delimiter";
java.io.BufferedReader br = new java.io.BufferedReader(new java.io.StringReader(test));
StringBuffer sb = new StringBuffer();
int i;
while ((i = br.read()) != -1) {
    if ((char)i != DELIM_END) {
        sb.append((char)i);
    } else {
        System.out.println(sb.toString());
        sb.delete(0, sb.length());
    }
}
System.out.println(sb.toString());

The output is:

This is a test

of using the section symbol as a delimiter
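
A variant of the same test, as a rough sketch: it writes the delimiter as the Unicode escape \u00A7, so the encoding the compiler assumes for the source file cannot change the character, and it adds a helper that lists the non-ASCII code points a string really contains. The class name DelimiterCheck and both method names are made up for illustration, and nothing here confirms where the currency sign in the original problem comes from.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original application.
public class DelimiterCheck {
    // Section sign as a Unicode escape: independent of the source file's encoding.
    public static final char DELIM_END = '\u00A7';

    /** Splits text at DELIM_END, reading it character by character. */
    public static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            int c;
            while ((c = br.read()) != -1) {
                if ((char) c == DELIM_END) {
                    parts.add(sb.toString());
                    sb.setLength(0);
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        parts.add(sb.toString()); // whatever follows the last delimiter
        return parts;
    }

    /** Lists every non-ASCII character of text as U+XXXX, separated by spaces. */
    public static String describeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 127) {
                sb.append(String.format("U+%04X ", (int) c));
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String test = "This is a test\u00A7of using the section symbol as a delimiter";
        System.out.println(split(test));
        System.out.println(describeNonAscii(test));
    }
}
```

Calling describeNonAscii on the string that comes back from rs.getString("text") on each platform would show whether the database value itself differs (U+00A4 versus U+00A7) or only the constant in the compiled code does.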

Niceguy1a at 2007-7-14 0:29:08 > top of Java-index,Java Essentials,Java Programming...
# 2
Is it not a PostgreSQL or driver issue?
BIJ001a at 2007-7-14 0:29:09
# 3
There's no such thing as UNICODE encoding--perhaps you mean UTF-8? Do you ever specify the encoding that you expect the database to be using? (Can you? I'm not familiar with JDBC.) Anyway, it looks like the error is occurring at the retrieval stage, not during output.
uncle_alicea at 2007-7-14 0:29:09
# 4

@Niceguy1:

Sorry, I wasn't quite clear on something. Formatted text that was saved to the database from Windows CANNOT be read under Linux because of the section sign: the "currency sign" appears where the section sign should be. However, when I save my formatted text to the database from Linux, the Linux client can open it correctly. In that case the "currency sign" also appears in the database. Of course, what I want is for every client to save the text the same way and interpret it the same way.

@BIJ001 & @uncle_alice:

I also suspect some kind of driver issue during retrieval. By the time I get the string out of the ResultSet, I think it's already too late. The places where the section sign should be are undisplayable under Linux (I tried that with all kinds of stream and code page combinations).

When I look at my database with EMS SQL Manager for PostgreSQL, it says ENCODING=UNICODE, which maybe means UTF-8?

So the question could be: how do I get the JDBC driver or the ResultSet to handle the Windows encoding under Unix systems?

jedesa at 2007-7-14 0:29:09
# 5

I have to correct myself:

When I look at the properties of the database in pgAdmin III, it says that the encoding is UNICODE. It also gives the SQL statement:

CREATE DATABASE foo
  WITH OWNER = postgres
       ENCODING = 'UNICODE'
       TABLESPACE = pg_default;

But under EMS SQL Manager when I look at the Database Registration Info it says:

Client Encoding [NONE]

Font charset DEFAULT_CHARSET

I'm confused! Whom should I believe? Are there differences between "Client Encoding" and "Encoding"? And why is there also a "Font charset"? Can I specify the Client Encoding and the Font charset in the SQL statement?

jedesa at 2007-7-14 0:29:09
# 6

It really shouldn't matter what the database's actual encoding is. When they say it's "UNICODE", they're really just saying that it will always be the same and that it will handle any valid Unicode character. But at some point, some other part of the system appears to be expecting the text to be in the OS's default encoding when it's really in "UNICODE", or vice versa. I'm really just hoping to spark ideas in someone who knows more about this aspect of database programming than I do; I don't even know what questions to ask.
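
To make that kind of mismatch concrete, here is a small sketch of what happens when bytes written in one encoding are read back in another. It uses only UTF-8 and ISO-8859-1, which every JVM supports. It does not reproduce the exact substitution of the currency sign for the section sign reported in this thread, so treat it as an illustration of the general mechanism, not a diagnosis. The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original application.
public class EncodingMismatchDemo {
    /** Encodes text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1. */
    public static String utf8ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes text as ISO-8859-1, then (wrongly) decodes the bytes as UTF-8. */
    public static String latin1ReadAsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String section = "\u00A7"; // the section sign

        // UTF-8 stores U+00A7 as the two bytes C2 A7; ISO-8859-1 reads them
        // as two separate characters, U+00C2 followed by U+00A7.
        String twoChars = utf8ReadAsLatin1(section);
        System.out.println(twoChars.length());

        // ISO-8859-1 stores U+00A7 as the single byte A7, which is not valid
        // UTF-8 on its own, so decoding yields the replacement character U+FFFD.
        String replaced = latin1ReadAsUtf8(section);
        System.out.println(String.format("U+%04X", (int) replaced.charAt(0)));
    }
}
```

In both directions a comparison against '\u00A7' would fail or match at the wrong position, which is the sort of symptom a delimiter scan would run into.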

One possible approach, though, is suggested by [url=http://linux.kieser.net/java_pg_unicode.html]this article[/url]: read the text as an array of bytes and convert it yourself. I don't know why the author expects the text to be in that particular encoding (looks like pure blue-sky optimism to me), but you can lock that down by the same technique: when you're inserting text, convert it to a byte array in the encoding of your choice (I would use UTF-8) and use the setBytes() method in your PreparedStatement.

But please don't tell anyone that I offered this as a solution. I don't know if it will work, and even if it does, it's a putrid hack. I don't know who's to blame, but you should not be having this problem.
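
For what it's worth, the conversion half of that byte-array technique could look roughly like the sketch below. The class Utf8Column and its two methods are hypothetical names. PreparedStatement.setBytes() and ResultSet.getBytes() are real JDBC methods, but they appear only in comments here, because whether the PostgreSQL driver accepts them for a TEXT column is an assumption this sketch does not verify.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original application.
public class Utf8Column {
    /** Converts text to bytes in a fixed encoding, independent of the platform default. */
    public static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Converts bytes back to text using the same fixed encoding. */
    public static String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "field one\u00A7field two";

        // Assumed usage when writing (not executed here):
        //   preparedStatement.setBytes(1, Utf8Column.toBytes(original));
        // Assumed usage when reading (not executed here):
        //   String text = Utf8Column.fromBytes(resultSet.getBytes("text"));

        String restored = fromBytes(toBytes(original));
        System.out.println(original.equals(restored));
    }
}
```

The point of the design is only that both directions name the encoding explicitly, so a Windows client and a Linux client cannot disagree about it through their platform defaults.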

uncle_alicea at 2007-7-14 0:29:09