Unrecognised character in GB2312 character set using Java InputStreamReader?
I am reading the following Chinese GB2312 HTML file from
http://news.xinhuanet.com/local/2007-02/13/content_5732705.htm
using an InputStreamReader with the GB2312 encoding, as shown below:
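import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class ReadGB2312Html
{
    //........TmpText declarations.....
    // Assumed declaration (the original one is elided above): a buffer for the decoded text
    static StringBuilder TmpText = new StringBuilder();

    public static void main( String[] args )
    {
        try
        {
            FileInputStream is = new FileInputStream( args[0] );
            // Decode the file's bytes as GB2312 while reading it line by line
            BufferedReader br = new BufferedReader(
                new InputStreamReader( is, "GB2312" ) );
            String strLine;
            while ( (strLine = br.readLine()) != null )
            {
                TmpText.append(strLine);
                TmpText.append("\r\n");
            }
            br.close();
            // bw.close();  // bw is not declared in the code shown here
        }
        catch ( Exception e )
        {
            e.printStackTrace();
        }
    }
}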
The TmpText variable does not display the last character in the article properly: instead of (记者夏珺) it gives (记者夏?B).
Inside the HTML file, the unrecognised character is 珺. Why is this so?
(记者夏珺)
In the internet browser it is displayed and recognised as a Chinese GB2312 character, so why is it not recognised by the Java InputStreamReader?
Any help or explanation would be much appreciated.
[1307 byte] By [ngstevea] at [2007-11-26 20:43:56]

# 1
I think the answer may be that the character is not actually a correctly encoded GB2312 character. When I try to find it in this code page table, it is not found: http://ash.jp/code/cn/gb2312tbl.htm
All the other characters I have tried from the same xinhua page are found just fine. Also, when I run a file converter on the file (converting from GB2312 to UTF-8), the character in question is replaced by a question mark, while all other characters are converted correctly.
Browsers can provide very unpredictable results, so I would not conclude that the character is valid in the code page simply because a browser renders it as a Chinese character.
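Rather than looking the character up in a table by hand, the same check can be sketched in Java by asking each charset's encoder whether it can represent the character. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes the JVM ships the GB2312 and GBK charsets (standard full JDKs do, stripped-down runtimes may not).

```java
import java.nio.charset.Charset;

public class CanEncodeCheck {
    // Returns true if the named charset has a mapping for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char jun = '\u73FA'; // 珺, the character from the question
        System.out.println("GB2312 can encode it: " + canEncode("GB2312", jun));
        System.out.println("GBK can encode it:    " + canEncode("GBK", jun));
    }
}
```

If GB2312 reports that it cannot encode the character while GBK can, the page is not pure GB2312, whatever its declared encoding says.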
# 3
Yes, it is not a GB2312 character.
The 珺 character is AC42 in hex (the trail byte 0x42 is the "B" that shows up in the garbled output). That is outside the GB2312 range, whose trail bytes run from A1 to FE; it is a GBK character.
Copied from Wikipedia:
GBK is an extension of the GB2312 character set for simplified Chinese characters, used in the People's Republic of China.
GB stands for National Standard, while K stands for Extension. GBK not only extended the old standard GB2312 with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB2312 in 1981. With the arrival of GBK, certain names with characters formerly unrepresentable, like the "rong" (鎔) character in former Chinese Premier Zhu Rongji's name, are now representable.
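To see the difference at the byte level, here is a small hedged sketch that encodes 珺 with GBK, prints the resulting bytes in hex, and then decodes those same bytes with both charsets. The class and helper names are hypothetical, and the charset names are assumed to be available in the running JVM.

```java
import java.nio.charset.Charset;

public class DecodeComparison {
    // Decodes the bytes with the named charset; unmappable input becomes a replacement character.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Formats the bytes as upper-case hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String jun = "\u73FA"; // 珺
        byte[] gbkBytes = jun.getBytes(Charset.forName("GBK"));
        System.out.println("GBK bytes: " + toHex(gbkBytes));
        System.out.println("Round-trips via GBK:    " + decode(gbkBytes, "GBK").equals(jun));
        System.out.println("Round-trips via GB2312: " + decode(gbkBytes, "GB2312").equals(jun));
    }
}
```

The GB2312 decoder has no mapping for this byte pair, so it substitutes a replacement character, which is what shows up as "?" in the question.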
Thanks a lot. I will use the GBK charset to read all GB2312 files, since GB2312 is a subset of GBK.
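For reference, the change described above might look roughly like the following. It is a sketch rather than the poster's actual code: the class name and the readAll helper are invented for illustration, and the file path is taken from the first command-line argument as in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class GbkFileReader {
    // Reads the whole stream with the named charset, ending each line with "\r\n"
    // as the code in the question does.
    static String readAll(InputStream in, String charsetName) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, Charset.forName(charsetName)))) {
            String line;
            while ((line = br.readLine()) != null) {
                text.append(line).append("\r\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // GBK instead of GB2312: GBK decodes every GB2312 byte pair the same way
        // and also covers the extension characters such as 珺.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(readAll(in, "GBK"));
        }
    }
}
```

Whether the characters then print correctly also depends on the console's own encoding, which is a separate matter from how the file is decoded.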