Unrecognised character in GB2312 character set using Java InputStreamReader?

I am reading the following Chinese GB2312 HTML file from

http://news.xinhuanet.com/local/2007-02/13/content_5732705.htm

using an InputStreamReader with the GB2312 encoding, as shown below:

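import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class ReadGB2312Html
{
    //........TmpText declarations.....
    // Assumed declaration; the original post elides it.
    static StringBuilder TmpText = new StringBuilder();

    public static void main( String[] args )
    {
        try
        {
            // Decode the raw bytes of the file as GB2312.
            FileInputStream is = new FileInputStream( args[0] );
            BufferedReader br = new BufferedReader(
                new InputStreamReader( is, "GB2312" ) );
            String strLine;
            // Collect every line, restoring a CRLF line ending after each.
            while ( (strLine = br.readLine()) != null )
            {
                TmpText.append(strLine);
                TmpText.append("\r\n");
            }
            br.close();
        }
        catch ( Exception e )
        {
            e.printStackTrace();
        }
    }
}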

The TmpText variable does not hold the last character of the article correctly: instead of (记者夏珺) it gives (记者夏?B).

Inside the HTML file, the unrecognised character is 珺, which appears in this text:

(记者夏珺)

An internet browser displays it and recognises it as a Chinese GB2312 character, so why is it not recognised by Java's InputStreamReader?

Any help or explanation would be much appreciated.

[1307 byte] By [ngstevea] at [2007-11-26 20:43:56]
# 1

I think the answer may be that the character is not actually a correctly encoded GB2312 character. When I look for it in this code page table, it is not there: http://ash.jp/code/cn/gb2312tbl.htm

All the other characters I have tried from the same xinhua page are found just fine. Also, when I run a file converter on the file (converting from GB2312 to UTF-8), the character in question is replaced by a question mark, while all other characters are converted correctly.

Browsers can provide very unpredictable results, so I would not conclude that the character is valid in the code page simply because a browser renders it as a Chinese character.
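One way to check this without relying on a browser or a file converter is to decode the bytes strictly, so that an invalid sequence raises an error instead of silently turning into a question mark. A minimal sketch, assuming a JDK that ships the GB2312 and GBK charsets; the class and method names are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

final class StrictDecodeCheck {
    /** Returns true if every byte sequence in data is valid in the named charset. */
    static boolean decodesCleanly(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName)
                   .newDecoder()
                   // REPORT makes the decoder throw instead of substituting a replacement character.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the saved HTML file.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("valid GB2312: " + decodesCleanly(data, "GB2312"));
        System.out.println("valid GBK:    " + decodesCleanly(data, "GBK"));
    }
}
```

If the first line prints false and the second prints true, the file contains sequences that GB2312 does not define but GBK does.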

one_danea at 2007-7-10 2:04:40 > top of Java-index,Desktop,I18N...
# 2
I think I found your problem: the page is encoded in GBK, not GB2312. If I use a GBK converter, the character is recognised just fine. This proves the point that you can never trust the charset information provided in web pages.
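The difference can be reproduced in a few lines by decoding the same bytes with both charsets. This is only a sketch: the sample text is built in memory rather than read from the page, the class name is hypothetical, and the Unicode escapes spell out 记者夏珺 so the source compiles under any file encoding.

```java
import java.nio.charset.Charset;

final class Gb2312VersusGbk {
    /** Decodes data with the named charset, substituting a replacement for invalid input. */
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Stand-in for the bytes as they are stored in the page.
        byte[] raw = "\u8bb0\u8005\u590f\u73fa".getBytes(Charset.forName("GBK"));
        System.out.println("GB2312: " + decode(raw, "GB2312"));
        System.out.println("GBK:    " + decode(raw, "GBK"));
    }
}
```

Only the GBK line should end with the intended character; the GB2312 line shows whatever the decoder substitutes for the bytes it cannot map.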
one_danea at 2007-7-10 2:04:40
# 3

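Yes, it is not a GB2312 character.

In GBK the 珺 character is encoded as the two bytes AC 42 (hex). The second byte, 0x42, lies outside the GB2312 trail-byte range (0xA1 to 0xFE), so the character exists only in GBK. This also fits the output in the question: 0x42 on its own is the ASCII letter B, so a GB2312 decoder that rejects the lead byte would print a question mark followed by a plain B.

Copied from Wikipedia: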
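Whether a given character belongs to a charset can also be asked of Java directly, which avoids looking it up in a table by hand. A small sketch using CharsetEncoder.canEncode; the helper names are hypothetical, and it assumes the JDK provides both charsets:

```java
import java.nio.charset.Charset;

final class CharsetMembership {
    /** Returns true if the named charset has an encoding for c. */
    static boolean inCharset(char c, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    /** Returns the encoded bytes of s as upper-case hex digits. */
    static String hexBytes(String s, String charsetName) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        char jun = '\u73fa'; // the character from the question
        System.out.println("in GB2312: " + inCharset(jun, "GB2312"));
        System.out.println("in GBK:    " + inCharset(jun, "GBK"));
        System.out.println("GBK bytes: " + hexBytes(String.valueOf(jun), "GBK"));
    }
}
```

The last line prints the two GBK bytes, which can then be compared against a code table.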

GBK is an extension of the GB2312 character set for simplified Chinese characters, used in the People's Republic of China.

GB stands for National Standard, while K stands for Extension. GBK not only extended the old standard GB2312 with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB2312 in 1981. With the arrival of GBK, certain names with characters formerly unrepresentable, like the "rong" (鎔) character in former Chinese Premier Zhu Rongji's name, are now representable.

Thanks a lot. I will use the GBK charset to read all GB2312 files, since GB2312 is a subset of GBK.
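For reference, this is roughly what the reading loop from the question looks like with the charset passed in, so that GBK can be used in its place. It is a sketch rather than the exact code I ended up with: the class and method names are made up, and it keeps the "\r\n" line ending from the original loop.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class GbkFileReader {
    /** Reads the whole stream as text in the named charset, ending each line with CRLF. */
    static String readAll(InputStream in, String charsetName) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the stream) even if reading fails.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, Charset.forName(charsetName)))) {
            String line;
            while ((line = br.readLine()) != null) {
                text.append(line).append("\r\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the saved HTML file.
        System.out.print(readAll(new FileInputStream(args[0]), "GBK"));
    }
}
```

Whether the characters then show up correctly on the console still depends on the console's own encoding and font.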

ngstevea at 2007-7-10 2:04:40