Prevent double-byte / multi-byte characters
I have a J2EE web application using Struts. I want to accept ASCII characters as well as other European Languages that have Single-Byte Character Sets (SBCS). But I want to prevent Double-Byte / Multi-Byte Characters (DBCS).
I want to prevent the user from entering double byte / multi byte characters on a Web Form. So if the user enters any characters other than single-byte characters then when the user submits the form, I want to show a validation error message that only single-byte characters are allowed.
In order to do the above, I first need to determine whether the user entered single-byte or double-byte characters. Say the user typed ONE Chinese character (double byte) in a text box and submitted the form. On receiving the form, when I call request.getParameter() to retrieve the value, the String variable contains TWO characters instead of one. If I check each of these two characters with a regular expression to see whether it is in the range \x00 to \xFF, it always is. That is because each character individually falls within that range, but the two chars together represent one double-byte character.
I went through various i18n (internationalization) articles, forums, etc. but nowhere does it mention how to determine if the data entered by the user in a web / html form is single-byte or double-byte.
Does anyone have any information / knowledge on this? Any help / suggestions will be appreciated.
By AMAa at 2007-11-26 14:22:35

# 1
If you get 2 characters when you do request.getParameter(), then you are using an encoding such as 8859-1, and you have irrecoverably lost your chance of figuring out what the user intended to enter (since the individual code points will be valid in 8859-1).
If you want to do this, use UTF-8 encoding everywhere, then you will know exactly which character was entered, and you can then use something like UnicodeSet (http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html) in ICU to figure out which Unicode block it belongs to (using the East_Asian_Width property to detect what you call "double-byte" characters: http://icu.sourceforge.net/userguide/properties.html).
However, if you are actually using 8859-1 (or another legacy encoding), then you probably want to exclude not just what you call "double-byte" characters, but also a lot of other characters that do not exist in your legacy code page - characters from Indic languages, Vietnamese, Thai, Arabic, Russian, Greek, etc. In which case you probably want to create a very small Unicode set of allowed characters and test against that - making everything else invalid.
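A minimal sketch of the "small allowed set" idea, assuming the allowed set is exactly what ISO-8859-1 can represent. It uses only the JDK's CharsetEncoder rather than ICU; the class and method names are made up for illustration, and a different legacy code page could be swapped in if that is what the back end really uses.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Latin1Check {
    // Returns true only if every character of the input can be
    // represented in ISO-8859-1 (assumed here to be the allowed set).
    static boolean isLatin1(String input) {
        // Encoders are not thread-safe, so create one per call.
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(input);
    }

    public static void main(String[] args) {
        System.out.println(isLatin1("Z\u00FCrich")); // Latin-1 text
        System.out.println(isLatin1("\u56FD"));      // a Chinese character
        System.out.println(isLatin1("\u0416"));      // a Cyrillic letter
    }
}
```

Note that this only gives a meaningful answer once the parameter has been decoded correctly; on a string that was already mangled into Latin-1 characters it will always return true.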
# 2
I want to use UTF-8 encoding and I don't want to or have to use ISO-8859-1.
I have put UTF-8 page encoding on the JSP that actually displays the HTML Data Entry form:
<%@ page language="java" contentType="text/html; charset=UTF-8"%>
I have also configured Struts to UTF-8 and have made the following entry in the struts-config.xml:
<controller processorClass="org.apache.struts.action.RequestProcessor" contentType="text/html;charset=UTF-8">
</controller>
In my TARGET ACTION class, I have tried the following (simple example below):
request.setCharacterEncoding("UTF8");
String s= request.getParameter("t1");
System.out.println("s = "+ s);
if (s != null && !"".equals(s)) {
System.out.println("s.length() = "+ s.length()); //<-- still displays length=2
char[] charArr = s.toCharArray();
System.out.println("charArr.length = "+ charArr.length); //<-- still displays length=2
byte[] byteArr = s.getBytes("UTF-8");
System.out.println("byteArr.length = "+ byteArr.length);//<-- still displays length=2 or more
}
I read through the documentation links that you suggested but I am still not clear how to use it (UnicodeSet / ICU userguide). My requirement is to display a message to the user if anything other than Single Byte Character Set is entered. So ASCII and some of the European language characters (that are within the range of SBCS) are permissible.
I would really appreciate it if you could share an example / sample code showing how to restrict anything other than SBCS.
AMAa at 2007-7-8 2:14:41 >

# 3
Well, this code works for me:
String newname = request.getParameter("name");
int length = newname.length();
out.println (""
+ "Length of name entry: " + length);
UnicodeSet EASet = new UnicodeSet("[:ea=Wide:]");
out.println (""
+ "Name entry not contained in East Asian Wide Set? " + EASet.containsNone(newname));
It correctly shows a string length of 1 if I input a Chinese or Japanese character like 跋, 涉 or 水 (and counts them correctly if I enter several or combine them with ASCII letters). So somewhere in your code you are getting your parameter mangled into 2 code points (a legacy code page being used by default somewhere, most likely).
And the sample UnicodeSet snippet using ICU4J also correctly returns "false" if I use Chinese or Japanese characters in the name entry, but "true" in all other cases (that I have tested so far).
So it should be relatively simple to implement something that suits your needs.
# 4
First of all thanks for the inputs.
Which application server do you use? I am using WebSphere 5.0 Runtime through IBM RAD 6 IDE. I wonder if there is a setting to be made on the server to retain the correct behavior where if one Chinese / Japanese / Korean character is submitted then it still stays as 1 char in Java when I do request.getParameter("submittedChar").length().
I had mentioned earlier that I set page and request encoding to UTF-8 and I assumed that by default UTF-8 will always break one double-byte character into two or more Java chars but that there would be a way to determine in Java that those 2 or more UTF-8 chars together indicate that it's one Char.
Based on what you have described above, I think things will be resolved for me if I can get the C/J/K character to stay as ONE char.
Will appreciate if you can give more details about your environment...
- What app server do you use?
- Did you have to configure it so that one char remained one char?
AMAa at 2007-7-8 2:14:41 >

# 5
You are mixing up bytes and characters:
"I assumed that by default UTF-8 will always break one double-byte character into two or more Java chars but that there would be a way to determine in Java that those 2 or more UTF-8 chars together indicate that it's one Char."
No, UTF-8 never 'breaks one double-byte character into two or more Java chars'.
There is no such thing as a 'double-byte character' in Java - that is a term from legacy code pages.
UTF-8 is a variable-length Unicode encoding, where each character may take up from 1 to 4 bytes. But UTF-8 does not 'break one double-byte character into two or more Java chars' - it simply uses a variable number of bytes to encode a single character, but it is always one character, where each character uses a unique code point sequence. It is only if you take a string encoded in UTF-8 and read it as for instance 8859-1 that it can get converted to 2 characters.
When you use UTF-8 in your JSP pages and perform operations on strings etc., Java takes the UTF-8 byte arrays and converts them to UTF-16 for internal processing, but as long as you treat them as strings or characters, that should be handled for you.
So when I grab a parameter (in the code I posted) and create a string from it, a Chinese character may have taken up 3 bytes in UTF-8, but that is irrelevant - it is one Java character, which is processed as UTF-16 internally in Java (and always was one character).
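To make the byte/character distinction concrete, here is a small sketch using only the JDK. The class and method names are illustrative; the second method deliberately reproduces the failure mode described above, where UTF-8 bytes are read as if they were 8859-1.

```java
import java.nio.charset.StandardCharsets;

final class BytesVersusChars {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8ByteCount(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Simulates the mistake: UTF-8 bytes wrongly decoded as ISO-8859-1,
    // which turns every byte into a separate character.
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String guo = "\u56FD"; // the character 国
        System.out.println(guo.length());                  // 1 character
        System.out.println(utf8ByteCount(guo));            // 3 bytes in UTF-8
        System.out.println(misreadAsLatin1(guo).length()); // 3 characters after the wrong decode
    }
}
```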
Take a look at the FAQs on the various Unicode encoding forms, that may help clear up some things for you:
http://www.unicode.org/faq/utf_bom.html
There may be many reasons for the problem you are seeing (which seems to be caused by an incorrect conversion to a legacy encoding happening somewhere, as stated in my previous post), you really need to step through this yourself and verify at which point your Chinese character becomes garbled.
This tech tip by John O'Conner would be a good starting point:
http://java.sun.com/developer/EJTechTips/2005/tt1220.html#2
# 6
Thanks for clarifying.
I have tried to step through and even removed Struts from the picture by testing one sample jsp with page encoding UTF submitting to another JSP where I do request.getParameter(). My String length is still TWO although I submitted only ONE Chinese character.
Based on your comments, it really makes me think that there may be a setting in my WebSphere server environment to ensure that UTF-8 doesn't wrongly get converted into something else, thereby resulting in a String length of TWO in my example above.
That's why I was wondering if you had to make any specific setting on your app server?
AMAa at 2007-7-8 2:14:41 >

# 7
I am only using tomcat, no appserver, as I am using it only as a small test environment. So sorry, can't offer any WAS settings.
But, in looking at your latest comment and your previous postings, I did realize that there is one odd thing about your result: you get a string length of 2 for one Chinese character. You should really be getting a string length of 3 if the theory of a conversion to 8859-1 somewhere holds true - since CJK characters take up 3 bytes in UTF-8 and thus would have been converted to 3 8859-1 characters.
Which specific Chinese characters are you testing with? And what are the hex values of the 2 characters that Java reports back to you?
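If it helps, a tiny helper along these lines would print the hex values I am asking about. It is only a sketch (the class name is made up), and it reports UTF-16 code units, which is what String.length() counts.

```java
final class CodeUnitDump {
    // Formats each UTF-16 code unit of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("\u56FD"));  // prints U+56FD
        System.out.println(dump("A\u00E9")); // prints U+0041 U+00E9
    }
}
```

Calling dump(request.getParameter("t1")) in the receiving JSP and posting the result would show exactly which two characters the container produced.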
# 8
Okay, so Tomcat didn't require any setting for this I guess.
My computer doesn't have any of the foreign language packs, so I picked some English text translated into Chinese (actually only one character) from http://babelfish.altavista.com. It appears as a rectangle to me, but I am pasting it here in case it appears correctly to you:
国
This is the output of my basic testing without Struts:
s.length = 2
char[] charArr = s.toCharArray();
charArr.length = 2
charArr = ?
charArr = ?
byte[] byteArr = s.getBytes("UTF-8");
byteArr.length = 4
byteArr2 = -62
byteArr2 = -115
byteArr2 = -62
byteArr2 = -111
******************************************
Here's the code of my first.jsp:
<%@page language="java" pageEncoding="UTF-8" contentType="text/html; charset=UTF8"%>
<html>
<body>
<form name="f1" action="second.jsp" method="post">
<input type="text" name="t1"/>
<input type="submit" value="submit"/>
</form>
</body>
</html>
******************************************
Here's the code of my second.jsp:
<html>
<body>
<%
//Try commenting and uncommenting this request.setCharacterEncoding line
//request.setCharacterEncoding("UTF8");
String s= request.getParameter("t1");
System.out.println("s = "+ s);
%>
s = <%=s%>
<%
if (s != null && !"".equals(s)) {
System.out.println("s.length() = "+ s.length());
%>
s.length = <%= s.length()%>
char[] charArr = s.toCharArray();
<%
char[] charArr = s.toCharArray();
System.out.println("charArr.length = "+ charArr.length);
%>
charArr.length = <%=charArr.length%>
<%
if (charArr != null && charArr.length > 0) {
for (int i = 0; i < charArr.length; i++) {
System.out.println("charArr = "+ charArr[i]);
%>
charArr = <%=charArr[i]%>
<%
}
}// end of if (charArr != null && charArr.length > 0)
%>
byte[] byteArr = s.getBytes("UTF-8");
<%
byte[] byteArr = s.getBytes("UTF-8");
if(byteArr != null && byteArr.length > 0) {
System.out.println("byteArr.length = "+ byteArr.length);
%>
byteArr.length = <%= byteArr.length%>
<%
for (int i = 0; i < byteArr.length; i++) {
System.out.println("byteArr = "+ byteArr[i]);
%>
byteArr2 = <%=byteArr[i]%>
<%
}
}// end of if (byteArr != null && byteArr.length > 0)
%>
<%
}//end of if (s != null && !"".equals(s))
%>
</body>
</html>
AMAa at 2007-7-8 2:14:41 >

# 9
The article at http://java.sun.com/developer/EJTechTips/2005/tt1220.html#2 is excellent and it comprehensively describes most of the scenarios that we faced and tried to resolve.
It talks about server settings for Tomcat and Sun:
Setting URIEncoding="UTF-8" in Tomcat's connector settings within the server.xml file will communicate the charset choice to the web server. On Sun Java System Application Server 8.1, you can include <parameter-encoding default-charset="UTF-8"/> in the sun-web.xml file.
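To see why these settings matter, it may help to look at what happens to the percent-encoded form data itself. The sketch below uses java.net.URLDecoder as a stand-in for the container's parameter decoding (an assumption for illustration, not the actual code of Tomcat or WebSphere); the class and method names are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class FormDecodingDemo {
    // Decodes a percent-encoded form value using the given charset name.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // A browser submitting 国 from a UTF-8 page sends these three bytes.
        String raw = "%E5%9B%BD";
        System.out.println(decode(raw, "UTF-8").length());      // 1
        System.out.println(decode(raw, "ISO-8859-1").length()); // 3
    }
}
```

The same bytes give one character or three depending purely on the charset the server assumes, which is what request.setCharacterEncoding (called before the first getParameter) or the server-level settings above control.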
I am trying to find WAS / WebSphere setting to see if I can get closer to resolving this. Meanwhile, any further inputs would be appreciated.
AMAa at 2007-7-8 2:14:41 >

# 10
OK, I implemented your 2 pages with the code you posted - with 2 changes in the second jsp:
Removed the comment from the line
//request.setCharacterEncoding("UTF8");
and added the following line:
<%@page language="java" pageEncoding="UTF-8" contentType="text/html; charset=UTF8"%>
With those changes, I get the following output, using your sample character below (国):
s = 国
s.length = 1
char[] charArr = s.toCharArray();
charArr.length = 1
charArr = 国
byte[] byteArr = s.getBytes("UTF-8");
byteArr.length = 3
byteArr2 = [B@3f2c3f2c
byteArr2 = [B@3f2c3f2c
byteArr2 = [B@3f2c3f2c
(not sure what the last part of your code is supposed to do, but the main results are OK).
You are saying that you get a length of 2 returned for the char tests - what is your result for the length of the byte array - also 2?
It would seem that it is indeed something in your environment setup (WAS) that is wrong - or with the way the data is input. However, since the Chinese character you pasted in here looks perfectly fine to me, the latter is unlikely.
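For reference, the signed byte values reported in post #8 (-62, -115, -62, -111) can be turned back into code points with a few lines of JDK-only code. This is a sketch with a made-up class name; it only decodes the numbers as posted and assumes they were copied correctly.

```java
import java.nio.charset.StandardCharsets;

final class ReportedBytes {
    // Decodes the given bytes as UTF-8, mirroring s.getBytes("UTF-8") in reverse.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] reported = {-62, -115, -62, -111}; // 0xC2 0x8D 0xC2 0x91
        String s = decodeUtf8(reported);
        for (int i = 0; i < s.length(); i++) {
            System.out.println(String.format("U+%04X", (int) s.charAt(i)));
        }
    }
}
```

This yields the two characters U+008D and U+0091, i.e. the server held the raw bytes 0x8D 0x91, each read as a separate 8859-1 character. The UTF-8 form of 国 is E5 9B BD, so the value does not seem to have reached the server as UTF-8 at all. One possible reading, not confirmed anywhere in this thread, is that the browser submitted the form in a legacy East Asian encoding (0x8D 0x91 appears to match the Shift_JIS encoding of 国); checking the browser's encoding setting for the form page would be a way to test that guess.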
# 11
I continue to get:
charArr.length = 2
byte[] byteArr = s.getBytes("UTF-8");
byteArr.length = 3
:-(
I am still searching for a way to set the encoding in WebSphere / WAS. If I am able to replicate the behavior of your environment in mine, then I too will get ONE char. My guess is that if I then validate that ONE char against my regular expression (through Struts validation.xml) ^[\x00-\xFF]*$, it will not fall in the \x00-\xFF range and will correctly give the validation error message that only SBCS is allowed.
AMAa at 2007-7-8 2:14:41 >

# 12
The range you list is a range from a legacy codepage, not a Unicode range (the Unicode range would be \u0000 to \u00FF). But if you apply it to Unicode, it will allow only 8859-1 characters and exclude a whole slew of other characters that exist in other single-byte legacy code pages. So you really should change your terminology - if you want to allow only Latin-1 characters, then that is what you should say.
Sorry to harp on this question of terms - but there is too much confusion going around on this subject already :-)
And I agree, if you can get the correct result from the test, then you should be able to implement a very simple regular expression (since you actually want to exclude all characters other than those in the Latin-1 set).
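As a rough sketch of such a regular expression in plain Java (outside Struts validation.xml, and with made-up class and method names), the Unicode range \u0000 to \u00FF can be tested like this:

```java
import java.util.regex.Pattern;

final class Latin1Pattern {
    // Matches strings made up only of code points U+0000 through U+00FF.
    private static final Pattern LATIN1_ONLY =
            Pattern.compile("^[\\u0000-\\u00FF]*$");

    static boolean isLatin1Only(String input) {
        return LATIN1_ONLY.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Only("caf\u00E9")); // true
        System.out.println(isLatin1Only("\u56FD"));    // false
    }
}
```

Be aware that this range also lets through control characters (U+0000 to U+001F and U+007F to U+009F), so a real validation rule may want a narrower set.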
# 13
Latin-1 set. Let me research on what exactly it means.
I really appreciate all your help in narrowing my problem. You are correct in identifying gap in my terminology and that is because it is being used very loosely by my Client too. My lack of knowledge on the exact nature of this problem together with loose terminology is really adding to all the confusion. Thanks for bearing with me through all this and helping me define my requirement correctly.
Btw... still no luck with my WAS settings. I am beginning to think that maybe I need a different installation version that supports i18n instead of the US version.
AMAa at 2007-7-8 2:14:41 >

# 14
There should be no difference between WebSphere language versions with respect to what they support, so you should not need to change your WAS version. I'll let you know if I find anything useful or get an epiphany.
# 15
Could you paste the first line of output you see on the second JSP page, where I get:
s = 国
I would like to see which characters your browser displays there, that might give a hint as to what is going on.
Also, which browser do you use to test this, Firefox or IE? Or something different?
# 16
> The article at http://java.sun.com/developer/EJTechTips/2005/tt1220.html#2
> is excellent and it comprehensively describes most of the
> scenarios that we faced and tried to resolve.
I'm glad that the tech tip helped you. I also suggest the more complete article here:
http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/
I hope that this helps. If you have suggestions for other articles, please let me know.
Regards,
John O'Conner
john at joconner dot com
# 17
Thanks John. That article comprehensively covers all cases, including the servlet filter / context params scenario.
I wonder if there is one common place where the encoding property settings for all the J2EE servers are available. I see that you had mentioned Tomcat and Sun Server. If I ever learn about the WebSphere setting, I will let you know. Maybe we can keep adding each major server's setting to your article so that it's available in one place on the Internet.
Again, thanks for your insightful article!
AMAa at 2007-7-21 16:09:36 >

# 18
> Again, thanks for your insightful article!
Agreed. I post that link at least once a week in response to this sort of question.