how to feed CharsetDecoder (nio Buffer problems)

Hi there,

I have no idea how to feed bytes into a CharsetDecoder in the presence of multibyte sequences. I am trying something along the following lines.

If a multibyte char is fed (I tried c3 a4 = ä), the first call to decode returns UNDERFLOW with bout.hasRemaining() == false, as expected, but the second call yields MALFORMED[1], no matter what rewind/flip method I throw in at // *1*

I probably "misgrok" something here with Buffers. Can someone point out what?

ByteBuffer bin = ByteBuffer.allocate(3);
CharBuffer bout = CharBuffer.allocate(1);
Charset cs = Charset.forName("UTF-8");
CharsetDecoder dec = cs.newDecoder();

public void addChar(int ch) throws Exception {
    bin.put((byte)ch);
    bin.flip();
    CoderResult res = dec.decode(bin,bout,false);
    bout.flip();
    if( bout.hasRemaining() ){
        emit( bout.get() );
        bin.clear();
        bout.clear();
    }else{
        bout.flip();
        //bin.flip(); // *1*
    }
}

[1501 byte] By [HolgerKa] at [2007-11-27 9:32:57]
# 1
I would try to feed it a lot more than a byte at a time. Regardless of anything else, feeding a byte or two at a time will come back with an UNDERFLOW result half the time, which can't be too efficient.
ejpa at 2007-7-12 22:52:43 > top of Java-index,Core,Core APIs...
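
As a rough illustration of feeding the decoder a larger chunk, the sketch below decodes a whole byte array in a single decode() call. The class name ChunkDecode, the method decodeAll and the sample bytes are made up for this example; it assumes the input is complete, which is why endOfInput is passed as true.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ChunkDecode {
    // Hypothetical helper: decodes one complete chunk of UTF-8 bytes.
    public static String decodeAll(byte[] bytes) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never produces more chars than it has bytes, so this is large enough.
        CharBuffer out = CharBuffer.allocate(bytes.length);
        CoderResult res = dec.decode(in, out, true); // true: no more input will follow
        if (res.isError()) {
            throw new IllegalArgumentException("bad input: " + res);
        }
        dec.flush(out);
        out.flip(); // switch from writing to reading
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = { 'b', (byte) 0xc3, (byte) 0xa4, 'h' };
        String s = decodeAll(bytes);
        System.out.println(s.length() + " chars decoded from " + bytes.length + " bytes");
    }
}
```

Because the complete two-byte sequence is present in the same call, the decoder never has to stop in the middle of a character.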
# 2

Unfortunately, reading larger chunks of data is not an option for me. In that case I could just continue using InputStreamReader anyway.

I have to read byte by byte to be able to switch to another charset at certain points.

By the way - why does decode return with an underflow even when a single byte has successfully been decoded?

HolgerKa at 2007-7-12 22:52:43
# 3
It probably hasn't been decoded. It's probably been consumed but the next byte is still required to decode it and generate the next character.
ejpa at 2007-7-12 22:52:43
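
One way to check what decode() actually does with a lone lead byte is to look at the buffer positions afterwards. The sketch below is only a probe under assumptions: the class name LeadByteProbe and its method are invented here, and it feeds nothing but the byte c3 to a fresh UTF-8 decoder.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class LeadByteProbe {
    // Feeds only the UTF-8 lead byte 0xc3 and returns how many bytes are left unread.
    public static int remainingAfterLeadByte() {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(3);
        CharBuffer out = CharBuffer.allocate(1);
        in.put((byte) 0xc3);
        in.flip(); // switch from writing to reading
        CoderResult res = dec.decode(in, out, false); // false: more input may follow
        System.out.println(res + ", chars written: " + out.position()
                + ", bytes left in input: " + in.remaining());
        return in.remaining();
    }

    public static void main(String[] args) {
        remainingAfterLeadByte();
    }
}
```

With the stock UTF-8 decoder this should report UNDERFLOW with no chars written and one byte still left in the input, which suggests the lead byte is not swallowed by the decoder and has to be kept in the buffer for the next call.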
# 4

But the result character is already in the output buffer?! Applying my program code to ISO-8859-1, all characters yield underflows and the data is converted successfully.

Here is a complete test program:

import java.io.*;
import java.nio.*;
import java.nio.charset.*;

public class DecoderTest {

    public static void main( String[] args ) throws Exception {
        dec = Charset.forName(args[1]).newDecoder();
        new DecoderTest().run(args[0]);
    }

    ByteBuffer bin = ByteBuffer.allocate(3);
    CharBuffer bout = CharBuffer.allocate(1);
    static CharsetDecoder dec;

    public void run( String filename ) throws Exception {
        FileInputStream in = new FileInputStream(filename);
        int ch;
        while( (ch=in.read())>=0 ) {
            addChar(ch);
        }
    }

    public void addChar( int ch ) throws Exception {
        System.out.print( "byte "+(0xff & ch) + " => " );
        bin.put((byte)ch);
        bin.flip();
        CoderResult res = dec.decode(bin,bout,false);
        System.out.print( res + " => " );
        if( res.isError() ) {
            bin.clear();
            bout.clear();
        } else {
            bout.flip();
            if( bout.hasRemaining() ) {
                System.err.print( "char " + ((int)bout.get()) );
                bin.clear();
                bout.clear();
            } else {
                bout.flip();
                // what to do with bin here?
            }
        }
        System.err.println();
    }
}

Passing in ISO Latin input looks like this:

byte 98 => UNDERFLOW => char 98
byte 228 => UNDERFLOW => char 228
byte 104 => UNDERFLOW => char 104
byte 32 => UNDERFLOW => char 32
byte 98 => UNDERFLOW => char 98
byte 228 => UNDERFLOW => char 228
byte 104 => UNDERFLOW => char 104
byte 10 => UNDERFLOW => char 10

UTF-8 input looks like this:

byte 98 => UNDERFLOW => char 98
byte 195 => UNDERFLOW =>
byte 164 => MALFORMED[1] =>
byte 104 => UNDERFLOW => char 104
byte 32 => UNDERFLOW => char 32
byte 98 => UNDERFLOW => char 98
byte 195 => UNDERFLOW =>
byte 164 => MALFORMED[1] =>
byte 104 => UNDERFLOW => char 104
byte 10 => UNDERFLOW => char 10

HolgerKa at 2007-7-12 22:52:43
# 5

Well according to the Javadoc 'Underflow is reported when there is no more input to be processed, or there is insufficient input and additional input is required.' So the fact that you passed 'false' would be enough to cause the underflow.

Instead of clearing 'bin' you should compact it, and do that regardless of the result. Where you have the second bout.flip(), that should be bout.clear(), and the correct operation on 'bin' is still bin.compact().

ejpa at 2007-7-12 22:52:43
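
The difference between clear() and compact() can be seen without any decoder at all. The following is a small sketch with invented names (CompactVsClear, appendAfter): it leaves the byte c3 unread in the buffer, as an UNDERFLOW on a lead byte would, then appends a4 and dumps what a reader would see.

```java
import java.nio.ByteBuffer;

public class CompactVsClear {
    // Leaves 0xc3 unread, prepares the buffer for writing with either
    // compact() or clear(), appends 0xa4 and returns the readable bytes as hex.
    public static String appendAfter(boolean useCompact) {
        ByteBuffer bin = ByteBuffer.allocate(3);
        bin.put((byte) 0xc3);
        bin.flip();              // read mode: position 0, limit 1, 0xc3 still unread
        if (useCompact) {
            bin.compact();       // keeps 0xc3 at index 0, position 1, limit 3
        } else {
            bin.clear();         // position 0, limit 3: the next put overwrites 0xc3
        }
        bin.put((byte) 0xa4);
        bin.flip();
        StringBuilder sb = new StringBuilder();
        while (bin.hasRemaining()) {
            sb.append(Integer.toHexString(bin.get() & 0xff)).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("compact: " + appendAfter(true));   // compact: c3 a4
        System.out.println("clear:   " + appendAfter(false));  // clear:   a4
    }
}
```

With clear(), or with no repositioning at all as in the test program, the decoder only ever sees the continuation byte a4 on its own, which matches the MALFORMED[1] result in the UTF-8 run.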
# 6

I missed compact() completely and it solved my problem. Thanks a lot!

Here is the working version (for the archive):

public void addChar( int ch ) throws Exception {
    System.err.print( "byte "+(0xff & ch) + " => " );
    bin.put((byte)ch);
    bin.flip();
    bout.clear();
    CoderResult res = dec.decode(bin,bout,false);
    System.err.print( res + " => " );
    if( res.isError() ) {
        bin.clear();
    } else {
        bout.flip();
        if( bout.hasRemaining() ) {
            bin.clear();
            System.err.print( "char " + ((int)bout.get()) );
        } else {
            bin.compact();
        }
    }
    System.err.println();
}

HolgerKa at 2007-7-12 22:52:43
# 7
Thanks for providing the working code. I would still use compact() everywhere you are presently using clear().
ejpa at 2007-7-12 22:52:43
# 8

Aye, this makes the code much more readable.

Thanks again for the NIO lesson!

Here is the final code:

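// bout has to start out empty and flipped; on a freshly allocated buffer
// the first compact() would leave no room for output
CharBuffer bout = (CharBuffer) CharBuffer.allocate(1).flip();

public void addChar( int ch ) throws Exception {
    bin.put((byte)ch);
    bin.flip();
    bout.compact();
    CoderResult res = dec.decode(bin,bout,false);
    if( res.isError() ) {
        // step over the malformed bytes so they are not decoded again
        bin.position( bin.position() + res.length() );
    }
    bin.compact();
    bout.flip();
    if( bout.hasRemaining() ) {
        emit( bout.get() );
    }
}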

HolgerKa at 2007-7-12 22:52:43
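
For the archive, the byte-by-byte approach from this thread, including the charset switch mentioned earlier, might be packaged roughly as follows. This is a sketch under assumptions, not the code from the posts above: the names ByteFeeder, addByte, switchCharset and text are hypothetical, a StringBuilder stands in for emit(), the buffer sizes are chosen so that a four-byte UTF-8 sequence and a surrogate pair fit, and malformed input is replaced by U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

public class ByteFeeder {
    private final ByteBuffer bin = ByteBuffer.allocate(8);  // stays in write mode between calls
    private final CharBuffer bout = CharBuffer.allocate(2); // room for a surrogate pair
    private final StringBuilder sink = new StringBuilder(); // stands in for emit()
    private CharsetDecoder dec;

    public ByteFeeder(String charsetName) {
        switchCharset(charsetName);
    }

    // Installs a fresh decoder; bytes not yet decoded in the old charset are dropped.
    public void switchCharset(String charsetName) {
        dec = Charset.forName(charsetName).newDecoder();
        bin.clear();
    }

    public void addByte(int b) {
        bin.put((byte) b);
        bin.flip();                               // read mode for the decoder
        while (true) {
            bout.clear();
            CoderResult res = dec.decode(bin, bout, false);
            bout.flip();
            sink.append(bout);                    // emit whatever was decoded
            if (res.isError()) {
                // Skip the offending bytes and mark the spot, then keep decoding.
                bin.position(bin.position() + res.length());
                sink.append('\uFFFD');
            } else if (res.isUnderflow()) {
                break;                            // decoder needs more input
            }
            // On OVERFLOW (or after an error) loop again to drain the rest of bin.
        }
        bin.compact();                            // keep an incomplete sequence for the next call
    }

    public String text() {
        return sink.toString();
    }

    public static void main(String[] args) {
        ByteFeeder f = new ByteFeeder("UTF-8");
        for (int b : new int[] { 0x62, 0xc3, 0xa4, 0x68 }) {
            f.addByte(b);
        }
        f.switchCharset("ISO-8859-1");
        f.addByte(0xe4);
        System.out.println(f.text().length() + " chars decoded");
    }
}
```

The inner loop is a deliberate choice: it drains bin completely on every call, so a byte that follows a malformed sequence is decoded right away instead of one call late.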