Localisation of strings with UTF8

Hi all,

I have a string that comes into an app from a URL with internationalised characters in it, say the string "déni". It arrives at the server as the string "d%E9ni", which is all well and good. But if I then decode it using URLDecoder.decode(string, "utf8"), the string gets garbled to "d?ni".

I thought that UTF8 handled all character encodings and should handle this situation. Should I use a different character encoding? Or will I be able to convert the string from this garbled form back to its original form at a later stage using some decode method?

thanks,

Justin

[614 byte] By [justinlawlera] at [2007-11-27 5:08:20]
# 1

Well, I don't want to hurt your feelings, but I think you should read more about encodings.

Your URL string is "d%E9ni". This alone tells us that the string was encoded using Latin1 (or Windows-1252), and certainly NOT UTF-8.

Why?

Because the character "é" has the value E9 in Latin1 and in UTF16 (leading zeros omitted). If it is encoded in UTF8, it becomes the two bytes C3 A9. Thus, the corresponding URL string should have been "d%C3%A9ni".
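If it helps to see this in code, here is a minimal Java sketch that prints the bytes of "é" under each encoding as percent-escapes. The class name EAcuteBytes and the helper percentHex are made up for illustration; only String.getBytes and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class EAcuteBytes {
    // Hypothetical helper: formats bytes as percent-escapes, e.g. {0xC3, 0xA9} -> "%C3%A9".
    static String percentHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // Mask with 0xFF so bytes above 0x7F are not printed as negative values.
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 is "é"; written as an escape so the source file's own encoding does not matter.
        String eAcute = "\u00e9";

        // One byte in Latin1.
        System.out.println(percentHex(eAcute.getBytes(StandardCharsets.ISO_8859_1))); // %E9

        // Two bytes in UTF8.
        System.out.println(percentHex(eAcute.getBytes(StandardCharsets.UTF_8)));      // %C3%A9
    }
}
```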

As a general rule, all characters whose Latin1 (or UTF16) values lie between A0 and FF are encoded as TWO bytes in UTF8. So when you see a single %xx in the URL string standing for one character in this range, you can be 100% sure that the string was NOT encoded as UTF8.
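A small Java sketch of what this means for URLDecoder, assuming the two URL strings discussed here. The class DecodeDemo and its helpers are hypothetical names for illustration. As far as I know, URLDecoder replaces malformed byte sequences with U+FFFD rather than throwing, which is most likely the "?" seen in the garbled output.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Hypothetical wrapper so callers need not handle the checked exception.
    static String decode(String s, String charsetName) {
        try {
            return URLDecoder.decode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // True if every character from U+00A0 to U+00FF takes exactly two bytes in UTF8.
    static boolean rangeIsTwoBytesInUtf8() {
        for (char c = 0xA0; c <= 0xFF; c++) {
            if (String.valueOf(c).getBytes(StandardCharsets.UTF_8).length != 2) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String latin1Url = "d%E9ni";   // what the server actually received
        String utf8Url = "d%C3%A9ni";  // what a UTF8 client would have sent
        String expected = "d\u00e9ni"; // "déni"

        // Decoding with the charset the client really used gives the right string.
        System.out.println(decode(latin1Url, "ISO-8859-1").equals(expected)); // true
        System.out.println(decode(utf8Url, "UTF-8").equals(expected));        // true

        // A lone E9 byte is malformed UTF8, so it comes back as U+FFFD.
        System.out.println(decode(latin1Url, "UTF-8").equals("d\uFFFDni"));   // true

        System.out.println(rangeIsTwoBytesInUtf8());                          // true
    }
}
```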

horiniusa at 2007-7-12 10:27:46
# 2

No, you're not hurting my feelings at all. All this encoding stuff is really new to me; thanks for your help.

I was wondering, how would a servlet or server application get the encoding of the string passed into it from a web browser? We have our own server that can take HTTP requests, so we're using sockets and not a servlet container. Is the encoding embedded in the connection string at all? I examined it and couldn't find it. There were a load of header fields, but none that looked like an encoding type.

thanks again,

Justin

justinlawlera at 2007-7-12 10:27:46
# 3

> I was wondering, how would a servlet or server
> application get the encoding of the string passed
> into it from a web browser.

The simple answer is that the encoding is, unfortunately, implicit! Or should I rather say, assumed! The lack of an explicit encoding in HTTP/HTML is a sad fact. Not even HTML4 settled this! Well...

So, if you can make sure the client is using a certain encoding, you're OK. But read on...

> We have our own server
> that can take http requests, so we're using sockets,
> and not a servlet container. Is the encoding embedded
> in the connection string at all? I examined it, and
> couldn't find it. There were a load of header fields,
> but none that looked like an encoding type.

There is none, so there's no need to look for it.

Back to your implied question of how to get the correct string: you have to ask yourself one essential question.

Do you have full control over the clients?

If you wonder what my question means, think about this. If you're making a web page meant to be accessed by everybody, the answer is NO, because a percentage of users (even if it's small) will be using a browser that is not HTML3 or HTML4 compliant, or some "exotic" browser. In this case, you can't control 100% of what's sent in the URL string or HTTP header.

However, if you're making your own HTTP clients, then of course the answer is YES.

But from your description, I'm not sure what kind of HTTP client you're targeting.

Then on the server side, well... I don't know much about sockets in this case. However, I know how to cope with encoding problems if you're using JSP (or servlets). Do you want some explanations/examples?
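For a plain socket server where you can't control the clients, one workaround some people use is to guess: percent-decode to raw bytes first, try a strict UTF8 decode, and fall back to Latin1 if the bytes are not valid UTF8. The sketch below is only that, a heuristic under the assumption that clients send either UTF8 or Latin1; the class LenientUrlDecoder and its methods are hypothetical names, not part of any library. It can guess wrong when Latin1 text happens to form valid UTF8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientUrlDecoder {
    // Turns "d%E9ni" into raw bytes without committing to a charset yet.
    static byte[] percentDecodeToBytes(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else if (c == '+') {
                out.write(' '); // form-encoding convention, same as URLDecoder
            } else if (c < 0x80) {
                out.write(c);
            } else {
                // Assumption: a raw URL string should contain only ASCII.
                throw new IllegalArgumentException("Non-ASCII character at index " + i);
            }
        }
        return out.toByteArray();
    }

    // Tries strict UTF8 first; falls back to Latin1 when the bytes are not valid UTF8.
    static String decode(String s) {
        byte[] raw = percentDecodeToBytes(s);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Every byte sequence is valid Latin1, so this fallback cannot fail.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        String expected = "d\u00e9ni"; // "déni"
        System.out.println(decode("d%E9ni").equals(expected));    // true (Latin1 fallback)
        System.out.println(decode("d%C3%A9ni").equals(expected)); // true (valid UTF8)
    }
}
```

If you do control the clients, it is simpler to make them all send UTF8 and decode with that alone.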

horiniusa at 2007-7-12 10:27:46