Convert UTF-8 to UTF-16
How do I convert UTF-8 to UTF-16? I want the code to look like this:
public String convert(String string)
{
    // do something with the String
    // ...
    return string;
}
Thanks in advance.
From Unicode standard:
<quote>
UTF-8
UTF-8 encoding form: The Unicode encoding form which assigns each Unicode
scalar value to an unsigned byte sequence of one to four bytes in length, as
specified in Table 3-5.
- In UTF-8, the code point sequence <004D, 0430, 4E8C, 10302> is represented
as <4D D0 B0 E4 BA 8C F0 90 8C 82>, where <4D> corresponds to U+004D, <D0 B0>
corresponds to U+0430, <E4 BA 8C> corresponds to U+4E8C, and <F0 90 8C 82>
corresponds to U+10302.
- Any UTF-8 byte sequence that does not match the patterns listed in Table 3-6
is ill-formed.
- Before the Unicode Standard, Version 3.1, the problematic non-shortest form
byte sequences in UTF-8 were those where BMP characters could be represented in
more than one way. These sequences are ill-formed, because they are not allowed
by Table 3-6.
- Because surrogate code points are not Unicode scalar values, any UTF-8 byte
sequence that would otherwise map to code points D800..DFFF is ill-formed.
Table 3-5 specifies the bit distribution for the UTF-8 encoding form, showing
the ranges of Unicode scalar values corresponding to one-, two-, three-, and
four-byte sequences. For a discussion of the difference in the formulation of
UTF-8 in ISO/IEC 10646, see Section C.3, UCS Transformation Formats.
Table 3-6 lists all of the byte sequences that are well-formed in UTF-8. A range
of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive)
is well-formed in that position. Any byte value outside of the ranges listed is
ill-formed. For example:
- The byte sequence <C0 AF> is ill-formed, because C0 is not well-formed in the
1st Byte column.
Table 3-5. UTF-8 Bit Distribution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Scalar Value                | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
----------------------------+----------+----------+----------+---------
00000000 0xxxxxxx           | 0xxxxxxx |          |          |
00000yyy yyxxxxxx           | 110yyyyy | 10xxxxxx |          |
zzzzyyyy yyxxxxxx           | 1110zzzz | 10yyyyyy | 10xxxxxx |
000uuuuu zzzzyyyy yyxxxxxx  | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx
----------------------------+----------+----------+----------+---------
- The byte sequence <E0 9F 80> is ill-formed, because in the row where E0 is
well-formed as a first byte, 9F is not well-formed as a second byte.
- The byte sequence <F4 80 83 92> is well-formed, because every byte in that
sequence matches a byte range in a row of the table (the last row).
As a consequence of the well-formedness conditions specified in Table 3-6,
the following byte values are disallowed in UTF-8: C0..C1, F5..FF.
Table 3-6. Well-Formed UTF-8 Byte Sequences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Code Points         | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
--------------------+----------+----------+----------+---------
U+0000..U+007F      | 00..7F   |          |          |
U+0080..U+07FF      | C2..DF   | 80..BF   |          |
U+0800..U+0FFF      | E0       | A0..BF   | 80..BF   |
U+1000..U+CFFF      | E1..EC   | 80..BF   | 80..BF   |
U+D000..U+D7FF      | ED       | 80..9F   | 80..BF   |
U+E000..U+FFFF      | EE..EF   | 80..BF   | 80..BF   |
U+10000..U+3FFFF    | F0       | 90..BF   | 80..BF   | 80..BF
U+40000..U+FFFFF    | F1..F3   | 80..BF   | 80..BF   | 80..BF
U+100000..U+10FFFF  | F4       | 80..8F   | 80..BF   | 80..BF
--------------------+----------+----------+----------+---------
Cases where a trailing byte range is not 80..BF are
in bold italic to draw attention to them. These occur
only in the second byte of a sequence.
</quote>
The code below only supports hex-dump strings of three-byte CJK characters:
public class UTF8toUC16{
    public static void main(String[] args){
        String utf8 = "e799be"; // default input: hex dump of a three-byte UTF-8 sequence
        String bin, binrep, uchex;
        String[] bins, uc;
        if (args.length > 0){
            utf8 = args[0];
        }
        // A three-byte sequence starts with 1110zzzz, so the first hex digit is 'e'.
        if (utf8.charAt(0) != 'e' || utf8.length() != 6){
            System.err.println("This program accepts utf8 hex-string for CJK");
            System.exit(1);
        }
        // 24 binary digits; no leading zeros are lost because the top nibble is 1110.
        bin = Integer.toBinaryString(Integer.parseInt(utf8, 16));
        // Split the bit string into space-separated groups of four.
        binrep = "";
        for (int i = 0; i < bin.length(); ++i){
            binrep += (bin.charAt(i));
            if ((i + 1) % 4 == 0 && (i != bin.length() - 1)){
                binrep += ' ';
            }
        }
        System.out.println(binrep);
        // bins: [1110] [zzzz] [10yy] [yyyy] [10xx] [xxxx]
        bins = binrep.split("\\s");
        // Drop the marker bits (1110, 10, 10) and regroup the 16 payload bits.
        uc = new String[4];
        uc[0] = bins[1];
        uc[1] = bins[2].substring(2) + bins[3].substring(0, 2);
        uc[2] = bins[3].substring(2) + bins[4].substring(2);
        uc[3] = bins[5];
        // Each regrouped nibble becomes one hex digit of the code point.
        uchex = "";
        for (int i = 0; i < 4; ++i){
            System.out.print(uc[i] +" ");
            uchex += Integer.toHexString(Integer.parseInt(uc[i], 2));
        }
        System.out.println();
        System.out.println(uchex);
        System.out.println((char)(Integer.parseInt(uchex, 16)));
    }
}
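For comparison, here is a rough sketch of a decoder that follows Table 3-5 and Table 3-6 for sequences of one to four bytes, working on raw bytes rather than on a hex string. The class name Utf8Decoder and the method decode are made up for this illustration, and the error handling (throwing IllegalArgumentException) is just one possible choice.

```java
public class Utf8Decoder {
    // Decode UTF-8 bytes into a Java String (UTF-16 code units).
    // Rejects any sequence that is not allowed by Table 3-6.
    public static String decode(byte[] in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            int len, cp;
            int lo = 0x80, hi = 0xBF; // allowed range for the 2nd byte
            if (b0 <= 0x7F) {
                len = 1; cp = b0;
            } else if (b0 >= 0xC2 && b0 <= 0xDF) {
                len = 2; cp = b0 & 0x1F;
            } else if (b0 >= 0xE0 && b0 <= 0xEF) {
                len = 3; cp = b0 & 0x0F;
                if (b0 == 0xE0) lo = 0xA0;       // rules out non-shortest forms
                else if (b0 == 0xED) hi = 0x9F;  // rules out surrogates D800..DFFF
            } else if (b0 >= 0xF0 && b0 <= 0xF4) {
                len = 4; cp = b0 & 0x07;
                if (b0 == 0xF0) lo = 0x90;       // rules out non-shortest forms
                else if (b0 == 0xF4) hi = 0x8F;  // rules out values above U+10FFFF
            } else {
                throw new IllegalArgumentException("ill-formed first byte at index " + i);
            }
            if (i + len > in.length) {
                throw new IllegalArgumentException("truncated sequence at index " + i);
            }
            for (int k = 1; k < len; k++) {
                int b = in[i + k] & 0xFF;
                int min = (k == 1) ? lo : 0x80;
                int max = (k == 1) ? hi : 0xBF;
                if (b < min || b > max) {
                    throw new IllegalArgumentException("ill-formed byte at index " + (i + k));
                }
                cp = (cp << 6) | (b & 0x3F); // append the 6 payload bits
            }
            out.appendCodePoint(cp); // emits a surrogate pair above U+FFFF
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example sequence from the quoted standard text.
        byte[] bytes = {0x4D, (byte) 0xD0, (byte) 0xB0, (byte) 0xE4, (byte) 0xBA,
                        (byte) 0x8C, (byte) 0xF0, (byte) 0x90, (byte) 0x8C, (byte) 0x82};
        String s = decode(bytes);
        s.codePoints().forEach(cp -> System.out.print(String.format("U+%04X ", cp)));
        System.out.println();
    }
}
```

The second-byte range is the only one that varies per row of Table 3-6, which is why a single lo/hi pair is enough.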
> How do I convert UTF-8 to UTF-16. I want code to be like this..
>
> public String convert (String string)
> {
> //do something with the String
> //...
> return string;
> }
>
> Thanks in advance..
This is fundamentally wrong. The characters of a Java String are always Unicode code point values encoded as UTF-16, and no character conversion is going to change them from UTF-16. You can convert them to a UTF-8 byte array using
byte[] bytesOfString = string.getBytes("utf-8");
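A small sketch of that round trip, in case it helps. The class and method names (Utf8RoundTrip, toUtf8, fromUtf8) are invented for the example; it uses StandardCharsets.UTF_8 instead of the charset name string only because that overload does not throw a checked exception.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Encode a Java String (UTF-16 internally) as UTF-8 bytes.
    public static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a Java String.
    public static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u767e"; // U+767E, a three-byte character in UTF-8
        byte[] utf8 = toUtf8(s);
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        System.out.println(hex);                      // e799be
        System.out.println(fromUtf8(utf8).equals(s)); // true
    }
}
```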
> From Unicode standard:
> <quote>
> Below code only supports three byte CJKs hex dumps
> character string:
Thanks! It solved my problem. Thanks :-)
> This is fundamentally wrong. The characters of a Java String are always UNICODE point values implemented as UTF16 and no character conversion is going to change them from UTF16. You can convert them to a utf-8 byte array using
> byte[] bytesOfString = string.getBytes("utf-8");
This:
byte[] bytesOfString = string.getBytes("utf-8");
doesn't work; that's why I'm looking for a manual or other way. Thanks anyway.
> this :-
> byte[] bytesOfString = string.getBytes("utf-8");
>
> doesnt work, thats why im looking for manual/other way.. thanks btw..
I'm sorry, but it does convert Java Unicode strings to a byte array using UTF-8 encoding. It does not convert the String to a UTF-8 string, because there is no such thing as a UTF-8 string. I say again: there is no such thing as a UTF-8 string.
> @hiwa, There are medicines that can help you control
> that little problem. Dramamine is supposed to work
> pretty well, I hear.
No. For this particular problem, we should not depend on the fact that a particular language currently happens to use a particular encoding, UTF-16, for its internal representation of characters and text strings. That is nothing but an implementation detail of the language, and it could change over time. I think that is the basic nature of the problem in this forum thread.
> No. For this particular problem, we should not depend
> on the fact that a particular language currently
> happens to use a particular encoding, UTF16, for its
> inner representation of character and text string. It
> is nothing but an implementation detail of the
> language that could change in time. I think that is
> the basic nature of the problem on this forum thread.
But the OP's question was how to convert a UTF-16 String object to a UTF-8 String object. There's no such thing as a UTF-8 String; Strings are represented as UTF-16 only. He wanted to do something impossible. If he had asked how to write the String to a stream in a certain encoding or how to convert a file's encoding, then he'd have gotten lots of good answers.
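For what it's worth, converting a file's encoding might look roughly like the sketch below. The class name FileTranscoder and the method utf8ToUtf16 are hypothetical, and the temp files in main are only there to make the example self-contained. Note that Java's "UTF-16" charset writes big-endian code units with a byte order mark; use UTF_16BE or UTF_16LE if no mark is wanted.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileTranscoder {
    // Read a text file as UTF-8 and write its characters back out as UTF-16.
    public static void utf8ToUtf16(Path source, Path target) throws IOException {
        try (Reader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_16)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("utf8-", ".txt");
        Path target = Files.createTempFile("utf16-", ".txt");
        Files.write(source, "\u767e".getBytes(StandardCharsets.UTF_8));
        utf8ToUtf16(source, target);
        System.out.println(Files.size(source) + " bytes in, "
                + Files.size(target) + " bytes out");
    }
}
```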
> There's no such thing as a UTF-8 String
There may be no such thing as a UTF-8 String in Java.
In good, or bad, old generic CS terms, however, a string is a sequence of bytes we see in memory or on storage media. This forum thread is a simple code-converting exercise, in my view.
> > There's no such thing as a UTF-8 String
> There may be no such thing as a UTF-8 String in
> Java.
> In a good, or bad, old generic CS term, however, a
> string is a line of byte we see on memory or on
> storage media.
But the OP's code used Java Strings for both the input type and the return type. I explained that the OP could convert to bytes and showed him how to do it in one simple line.
It looks like the OP actually wants to convert UTF-8 bytes to UTF-16 bytes, but I don't know for sure.
> This forum thread is a simple
> code-converting exercise, in my view.
I agree, but not String to String; rather String to byte[], byte[] to String, or byte[] to byte[].
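The byte[] to byte[] case could be sketched like this, going through a String in the middle. The names Utf8ToUtf16Bytes and convert are made up for the example, and UTF_16BE (big-endian, no byte order mark) is an arbitrary choice. Be aware that new String(bytes, charset) silently replaces malformed input with U+FFFD rather than failing.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToUtf16Bytes {
    // Decode UTF-8 bytes to a String, then encode that String as UTF-16BE bytes.
    public static byte[] convert(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xE7, (byte) 0x99, (byte) 0xBE}; // U+767E in UTF-8
        StringBuilder hex = new StringBuilder();
        for (byte b : convert(utf8)) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        System.out.println(hex); // 767e
    }
}
```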