Convert UTF-8 to UTF-16

How do I convert UTF-8 to UTF-16? I want the code to look like this:

public String convert(String string)
{
    // do something with the String
    // ...
    return string;
}

Thanks in advance..

[406 byte] By [WhatTheHecka] at [2007-11-27 8:48:54]
# 1

From Unicode standard:

<quote>

UTF-8

UTF-8 encoding form: The Unicode encoding form which assigns each Unicode

scalar value to an unsigned byte sequence of one to four bytes in length, as

specified in Table 3-5.

- In UTF-8, the code point sequence <004D, 0430, 4E8C, 10302> is represented

as <4D D0 B0 E4 BA 8C F0 90 8C 82>, where <4D> corresponds to U+004D, <D0 B0>

corresponds to U+0430, <E4 BA 8C> corresponds to U+4E8C, and <F0 90 8C 82>

corresponds to U+10302.

- Any UTF-8 byte sequence that does not match the patterns listed in Table 3-6

is ill-formed.

- Before the Unicode Standard, Version 3.1, the problematic non-shortest form

byte sequences in UTF-8 were those where BMP characters could be represented in

more than one way. These sequences are ill-formed, because they are not allowed

by Table 3-6.

- Because surrogate code points are not Unicode scalar values, any UTF-8 byte

sequence that would otherwise map to code points D800..DFFF is ill-formed.

Table 3-5 specifies the bit distribution for the UTF-8 encoding form, showing

the ranges of Unicode scalar values corresponding to one-, two-, three-, and

four-byte sequences. For a discussion of the difference in the formulation of

UTF-8 in ISO/IEC 10646, see Section C.3, UCS Transformation Formats.

Table 3-6 lists all of the byte sequences that are well-formed in UTF-8. A range

of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive)

is well-formed in that position. Any byte value outside of the ranges listed is

ill-formed. For example:

- The byte sequence <C0 AF> is ill-formed, because C0 is not well-formed in the

1st Byte column.

Table 3-5. UTF-8 Bit Distribution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Scalar Value               | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
---------------------------+----------+----------+----------+---------
00000000 0xxxxxxx          | 0xxxxxxx |          |          |
00000yyy yyxxxxxx          | 110yyyyy | 10xxxxxx |          |
zzzzyyyy yyxxxxxx          | 1110zzzz | 10yyyyyy | 10xxxxxx |
000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx
---------------------------+----------+----------+----------+---------

- The byte sequence <E0 9F 80> is ill-formed, because in the row where E0 is

well-formed as a first byte, 9F is not well-formed as a second byte.

- The byte sequence <F4 80 83 92> is well-formed, because every byte in that

sequence matches a byte range in a row of the table (the last row).

As a consequence of the well-formedness conditions specified in Table 3-6,

the following byte values are disallowed in UTF-8: C0..C1, F5..FF.

Table 3-6. Well-Formed UTF-8 Byte Sequences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Code Points        | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
-------------------+----------+----------+----------+---------
U+0000..U+007F     | 00..7F   |          |          |
U+0080..U+07FF     | C2..DF   | 80..BF   |          |
U+0800..U+0FFF     | E0       | A0..BF   | 80..BF   |
U+1000..U+CFFF     | E1..EC   | 80..BF   | 80..BF   |
U+D000..U+D7FF     | ED       | 80..9F   | 80..BF   |
U+E000..U+FFFF     | EE..EF   | 80..BF   | 80..BF   |
U+10000..U+3FFFF   | F0       | 90..BF   | 80..BF   | 80..BF
U+40000..U+FFFFF   | F1..F3   | 80..BF   | 80..BF   | 80..BF
U+100000..U+10FFFF | F4       | 80..8F   | 80..BF   | 80..BF
-------------------+----------+----------+----------+---------

Cases where a trailing byte range is not 80..BF are

in bold italic to draw attention to them. These occur

only in the second byte of a sequence.

</quote>

The code below only supports a hex-dump string of a single three-byte CJK character:

public class UTF8toUC16{
    public static void main(String[] args){
        String utf8 = "e799be";
        String bin, binrep, uchex;
        String[] bins, uc;

        if (args.length > 0){
            utf8 = args[0];
        }
        if (utf8.charAt(0) != 'e' || utf8.length() != 6){
            System.err.println("This program accepts utf8 hex-string for CJK");
            System.exit(1);
        }

        bin = Integer.toBinaryString(Integer.parseInt(utf8, 16));
        binrep = "";
        for (int i = 0; i < bin.length(); ++i){
            binrep += (bin.charAt(i));
            if ((i + 1) % 4 == 0 && (i != bin.length() - 1)){
                binrep += ' ';
            }
        }
        System.out.println(binrep);

        bins = binrep.split("\\s");
        uc = new String[4];
        uc[0] = bins[1];
        uc[1] = bins[2].substring(2) + bins[3].substring(0, 2);
        uc[2] = bins[3].substring(2) + bins[4].substring(2);
        uc[3] = bins[5];

        uchex = "";
        for (int i = 0; i < 4; ++i){
            System.out.print(uc[i] +" ");
            uchex += Integer.toHexString(Integer.parseInt(uc[i], 2));
        }
        System.out.println();
        System.out.println(uchex);
        System.out.println((char)(Integer.parseInt(uchex, 16)));
    }
}

hiwaa at 2007-7-12 20:57:02 > top of Java-index,Java Essentials,Java Programming...
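
The same bit manipulation can be generalised to one- to four-byte sequences working on raw bytes instead of a hex string. The following is only a sketch under stated assumptions: the class name Utf8ToUtf16Sketch and the method decode are made up for illustration, and the range checks on the decoded value are meant to be equivalent to the second-byte restrictions of Table 3-6 rather than a literal transcription of the table.

```java
final class Utf8ToUtf16Sketch {

    /** Decodes well-formed UTF-8 into a Java String, whose chars are UTF-16 code units. */
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            int len;
            int cp;
            // The first byte selects the sequence length and carries the top payload bits (Table 3-5).
            if (b0 <= 0x7F) { len = 1; cp = b0; }
            else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
            else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
            else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
            else { throw new IllegalArgumentException("ill-formed first byte at index " + i); }

            if (i + len > in.length) {
                throw new IllegalArgumentException("truncated sequence at index " + i);
            }
            // Each trailing byte must look like 10xxxxxx and contributes six payload bits.
            for (int k = 1; k < len; k++) {
                int b = in[i + k] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    throw new IllegalArgumentException("ill-formed trailing byte at index " + (i + k));
                }
                cp = (cp << 6) | (b & 0x3F);
            }
            // Reject non-shortest forms, surrogate code points and values above U+10FFFF.
            int min = (len == 1) ? 0 : (len == 2) ? 0x80 : (len == 3) ? 0x800 : 0x10000;
            boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (cp < min || cp > 0x10FFFF || surrogate) {
                throw new IllegalArgumentException("ill-formed sequence at index " + i);
            }
            // appendCodePoint emits one char, or a surrogate pair for supplementary code points.
            out.appendCodePoint(cp);
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The sample sequence from the quoted standard: U+004D, U+0430, U+4E8C, U+10302.
        byte[] sample = {0x4D, (byte) 0xD0, (byte) 0xB0, (byte) 0xE4, (byte) 0xBA,
                         (byte) 0x8C, (byte) 0xF0, (byte) 0x90, (byte) 0x8C, (byte) 0x82};
        StringBuilder hex = new StringBuilder();
        for (char c : decode(sample).toCharArray()) {
            hex.append(String.format("%04x ", (int) c));
        }
        System.out.println(hex.toString().trim()); // 004d 0430 4e8c d800 df02
    }
}
```

In practice the built-in decoder (new String(bytes, "UTF-8")) does the same job; a hand-written version is mainly useful for seeing how the tables are applied.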
# 2

> How do I convert UTF-8 to UTF-16. I want code to be like this..
>
> public String convert (String string)
> {
> //do something with the String
> //...
> return string;
> }
>
> Thanks in advance..

This is fundamentally wrong. The characters of a Java String are always Unicode code points implemented as UTF-16, and no character conversion is going to change them from UTF-16. You can convert them to a UTF-8 byte array using

byte[] bytesOfString = string.getBytes("utf-8");

sabre150a at 2007-7-12 20:57:02
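
A small sketch of that point, assuming a Java version that has java.nio.charset.StandardCharsets (Java 7 or later; on older versions the "UTF-8" charset name works the same way but throws a checked UnsupportedEncodingException). The class and method names are made up for illustration. The encoding only comes into play when crossing between byte[] and String; the String in the middle has no UTF-8 or UTF-16 "flavour" to convert.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StringBytesSketch {

    /** byte[] holding UTF-8 data -> String. Malformed input is replaced with U+FFFD, not rejected. */
    static String fromUtf8(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** String -> byte[] holding the UTF-8 encoding of its characters. */
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xE7, (byte) 0x99, (byte) 0xBE}; // the e799be example from post 1
        String s = fromUtf8(utf8);
        System.out.println(s.length());                        // 1
        System.out.println(Integer.toHexString(s.charAt(0)));  // 767e
        System.out.println(Arrays.equals(utf8, toUtf8(s)));    // true
    }
}
```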
# 3
@hiwa, There are medicines that can help you control that little problem. Dramamine is supposed to work pretty well, I hear.
uncle_alicea at 2007-7-12 20:57:02
# 4
Thanks! That solved my problem. Thanks :-)
WhatTheHecka at 2007-7-12 20:57:02
# 5

> From Unicode standard:
> <quote>
> Below code only supports three byte CJKs hex dumps character string:
>
> public class UTF8toUC16{
>     public static void main(String[] args){
>         String utf8 = "e799be";
>         String bin, binrep, uchex;
>         String[] bins, uc;
>         if (args.length > 0){
>             utf8 = args[0];
>         }
>         if (utf8.charAt(0) != 'e' || utf8.length() != 6){
>             System.err.println("This program accepts utf8 hex-string for CJK");
>             System.exit(1);
>         }
>         bin = Integer.toBinaryString(Integer.parseInt(utf8, 16));
>         binrep = "";
>         for (int i = 0; i < bin.length(); ++i){
>             binrep += (bin.charAt(i));
>             if ((i + 1) % 4 == 0 && (i != bin.length() - 1)){
>                 binrep += ' ';
>             }
>         }
>         System.out.println(binrep);
>         bins = binrep.split("\\s");
>         uc = new String[4];
>         uc[0] = bins[1];
>         uc[1] = bins[2].substring(2) + bins[3].substring(0, 2);
>         uc[2] = bins[3].substring(2) + bins[4].substring(2);
>         uc[3] = bins[5];
>         uchex = "";
>         for (int i = 0; i < 4; ++i){
>             System.out.print(uc[i] +" ");
>             uchex += Integer.toHexString(Integer.parseInt(uc[i], 2));
>         }
>         System.out.println();
>         System.out.println(uchex);
>         System.out.println((char)(Integer.parseInt(uchex, 16)));
>     }
> }

Thanks! That solved my problem. Thanks :-)

WhatTheHecka at 2007-7-12 20:57:02
# 6

> > How do I convert UTF-8 to UTF-16. I want code to be like this..
> >
> > public String convert (String string)
> > {
> > //do something with the String
> > //...
> > return string;
> > }
> >
> > Thanks in advance..
>
> This is fundamentally wrong. The characters of a Java String are always UNICODE point values implemented as UTF16 and no character conversion is going to change them from UTF16. You can convert them to a utf-8 byte array using
>
> byte[] bytesOfString = string.getBytes("utf-8");

This:

byte[] bytesOfString = string.getBytes("utf-8");

doesn't work, and that's why I'm looking for a manual or other way. Thanks, by the way.

WhatTheHecka at 2007-7-12 20:57:02
# 7

> > > How do I convert UTF-8 to UTF-16. I want code to be like this..
> > >
> > > public String convert (String string)
> > > {
> > > //do something with the String
> > > //...
> > > return string;
> > > }
> > >
> > > Thanks in advance..
> >
> > This is fundamentally wrong. The characters of a Java String are always UNICODE point values implemented as UTF16 and no character conversion is going to change them from UTF16. You can convert them to a utf-8 byte array using
> >
> > byte[] bytesOfString = string.getBytes("utf-8");
>
> this :-
>
> byte[] bytesOfString = string.getBytes("utf-8");
>
> doesnt work, thats why im looking for manual/other way.. thanks btw..

I'm sorry but it does convert Java UNICODE strings to a byte array using utf-8 encoding. It does not convert it to a utf-8 string because there is no such thing as a utf-8 string. I say again - there is no such thing as a utf-8 string.

sabre150a at 2007-7-12 20:57:02
# 8

> @hiwa, There are medicines that can help you control that little problem. Dramamine is supposed to work pretty well, I hear.

No. For this particular problem, we should not depend on the fact that a particular language currently happens to use a particular encoding, UTF16, for its inner representation of character and text string. It is nothing but an implementation detail of the language that could change in time. I think that is the basic nature of the problem on this forum thread.

hiwaa at 2007-7-12 20:57:02
# 9

> > @hiwa, There are medicines that can help you control that little problem. Dramamine is supposed to work pretty well, I hear.
>
> No. For this particular problem, we should not depend on the fact that a particular language currently happens to use a particular encoding, UTF16, for its inner representation of character and text string. It is nothing but an implementation detail of the language that could change in time. I think that is the basic nature of the problem on this forum thread.

But the OP's question was how to convert a UTF-16 String object to a UTF-8 String object. There's no such thing as a UTF-8 String; Strings are represented as UTF-16 only. He wanted to do something impossible. If he had asked how to write the String to a stream in a certain encoding, or how to convert a file's encoding, then he'd have gotten lots of good answers.

hunter9000a at 2007-7-12 20:57:02
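
Converting the encoding of a stream, as mentioned above, could be sketched as follows. This assumes Java 7 or later for StandardCharsets and picks UTF-16BE (big-endian, no byte order mark) as the target purely for illustration; the class and method names are hypothetical. For files, the two streams would be a FileInputStream and a FileOutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class StreamTranscodeSketch {

    /** Reads UTF-8 bytes from in and writes the same text to out as UTF-16BE. */
    static void utf8ToUtf16be(InputStream in, OutputStream out) throws IOException {
        // The Reader decodes bytes to chars; the Writer encodes chars back to bytes.
        // Note: InputStreamReader replaces malformed input with U+FFFD instead of failing.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_16BE);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // the caller owns the streams, so flush rather than close
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = {(byte) 0xE7, (byte) 0x99, (byte) 0xBE};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        utf8ToUtf16be(new ByteArrayInputStream(utf8), out);
        for (byte b : out.toByteArray()) {
            System.out.print(String.format("%02x ", b & 0xFF)); // 76 7e
        }
        System.out.println();
    }
}
```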
# 10

> There's no such thing as a UTF-8 String

There may be no such thing as a UTF-8 String in Java.

In the good (or bad) old generic CS sense, however, a string is a sequence of bytes we see in memory or on storage media. This forum thread is a simple code-conversion exercise, in my view.

hiwaa at 2007-7-12 20:57:02
# 11

> > There's no such thing as a UTF-8 String
>
> There may be no such thing as a UTF-8 String in Java.
>
> In a good, or bad, old generic CS term, however, a string is a line of byte we see on memory or on storage media.

But the OP's code used Java Strings for both the input type and the return type. I explained that the OP could convert to bytes and showed him how to do it in one simple line.

It looks like the OP actually wants to convert UTF-8 bytes to UTF-16 bytes, but I don't know for sure.

> This forum thread is a simple code-converting exercise, in my view.

I agree, but NOT String to String. It is String to byte[], or byte[] to String, or byte[] to byte[].

sabre150a at 2007-7-12 20:57:02
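
If the goal really is byte[] to byte[], one possible sketch goes through chars in the middle. It assumes Java 7 or later for StandardCharsets, picks UTF-16BE without a byte order mark as the target, and uses made-up class and method names. A CharsetDecoder configured to REPORT is used so that ill-formed UTF-8 is rejected rather than silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class TranscodeSketch {

    /** UTF-8 bytes -> UTF-16BE bytes; throws IllegalArgumentException on ill-formed input. */
    static byte[] utf8ToUtf16be(byte[] utf8) {
        try {
            // Step 1: decode the UTF-8 bytes into chars, failing on malformed sequences.
            CharBuffer chars = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(utf8));
            // Step 2: encode the chars as UTF-16, big-endian, without a byte order mark.
            return chars.toString().getBytes(StandardCharsets.UTF_16BE);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not well-formed UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] utf16 = utf8ToUtf16be(new byte[]{(byte) 0xF0, (byte) 0x90, (byte) 0x8C, (byte) 0x82});
        for (byte b : utf16) {
            System.out.print(String.format("%02x ", b & 0xFF)); // d8 00 df 02
        }
        System.out.println();
    }
}
```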
# 12
Oh yes, I admit you're 100% right. ::::::::)
hiwaa at 2007-7-12 20:57:02