UTF8 binary stream of characters to ASCII characters.
Hello,
I need advice on how to translate a UTF-8 binary stream of characters to ASCII characters. The translation will depend on the Locale (language) used.
For example, if the UTF-8 character Á (C381 in hex) is used in Czech, I will need to translate it to the two ASCII characters Ae; if the same character Á is used in French, I will need to translate it to the single character A. The binary stream will also contain some ASCII characters, which will not need any translation.
Please, advise.
Thank you.
A Mickelson
U+C381 is "HANGUL SYLLABLE SSEOT"; surely it's not used in Czech?
Anyway, I don't think there's anything in the standard API for such a conversion. For removing accents you can use a "normalizer" from ICU4J, but for other kinds of conversions you may have to write the code yourself.
http://icu.sourceforge.net/userguide/normalization.html
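A minimal sketch of the accent-removal idea follows. It is an assumption on my part that java.text.Normalizer (included in the JDK since Java 6) is an acceptable stand-in for the ICU4J normalizer; the class and method names are made up for illustration. The text is decomposed to NFD, so that an accented letter becomes a base letter plus combining marks, and the marks are then dropped.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of any library.
final class AccentStripper {

    // Decompose to NFD (e.g. U+00C1 becomes 'A' + U+0301),
    // then delete every combining mark (Unicode category M).
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // U+00C1 is LATIN CAPITAL LETTER A WITH ACUTE
        System.out.println(stripAccents("\u00C1")); // prints "A"
    }
}
```

This only handles letters that have a canonical decomposition. Characters without one pass through unchanged, and it cannot produce language-specific replacements such as "Ae".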
> U+C381 is "HANGUL SYLLABLE SSEOT"; surely it's not
> used in Czech?
Probably C381 is the two bytes used by UTF-8 to represent Á.
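To check that reading, here is a small sketch that decodes the two bytes C3 81 as UTF-8 and looks at the resulting code point. The class name is hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class Utf8BytesDemo {

    // Decoding goes from bytes to a String.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xC3, (byte) 0x81 };
        String s = decode(bytes);
        System.out.println(s.length());                        // prints 1
        System.out.println(Integer.toHexString(s.charAt(0)));  // prints c1, i.e. U+00C1

        // getBytes goes the other way: it encodes a String back into bytes.
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(encoded.length);                    // prints 2
    }
}
```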
I suggested in the (probably accidental) cross-post to convert the bytes to Unicode using an InputStreamReader before converting the Unicode characters to their language-specific replacements.
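Roughly, that suggestion could look like the sketch below. Everything named here (LocaleTransliterator, TABLES, toAscii) is hypothetical, the two table entries are just the example from the question, and writing "?" for unmapped characters is an arbitrary choice. It assumes Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class LocaleTransliterator {

    // Hypothetical tables: language code -> (character -> ASCII replacement).
    private static final Map<String, Map<Character, String>> TABLES = new HashMap<>();
    static {
        Map<Character, String> czech = new HashMap<>();
        czech.put('\u00C1', "Ae"); // mapping taken from the question's example
        TABLES.put("cs", czech);

        Map<Character, String> french = new HashMap<>();
        french.put('\u00C1', "A");
        TABLES.put("fr", french);
    }

    static String toAscii(InputStream utf8Bytes, Locale locale) {
        Map<Character, String> table =
                TABLES.getOrDefault(locale.getLanguage(), new HashMap<>());
        StringBuilder out = new StringBuilder();
        // Step 1: the reader decodes UTF-8 bytes into Unicode chars.
        try (Reader reader = new InputStreamReader(utf8Bytes, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                char ch = (char) c;
                if (ch < 0x80) {
                    out.append(ch); // already ASCII, copy through
                } else {
                    // Step 2: language-specific replacement; "?" if unmapped.
                    String replacement = table.get(ch);
                    out.append(replacement != null ? replacement : "?");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] input = { 'X', (byte) 0xC3, (byte) 0x81, 'y' };
        System.out.println(toAscii(new ByteArrayInputStream(input),
                Locale.forLanguageTag("cs"))); // prints XAey
        System.out.println(toAscii(new ByteArrayInputStream(input),
                Locale.FRENCH));               // prints XAy
    }
}
```

The loop works one char at a time, so a character outside the Basic Multilingual Plane would come out as two "?" in this sketch.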
And personally I'm used to seeing Czech words like "Neumannová" transliterated as "Neumannova", i.e. just without the accent, rather than "Neumannovae". Not my problem though as long as I don't have to read the result.
Á to Ae was an example only. I have to do the translation for multiple languages, so for some of them one UTF-8 character must be translated to multiple ASCII characters. I will try InputStreamReader to get Unicode, because String.getBytes("UTF-8") didn't do any good for me. Thank you!