How to recognize if a String contains foreign characters

Hi guys,

I need to write a method that takes user input from a webpage and recognizes whether it contains foreign characters. By foreign characters I mean characters from any language other than English.

The only idea that comes to mind is to check the ASCII value of each character and, if it is anything other than [A-Z, a-z, 0-9], mark it as a foreign character.

Any thoughts?

Thanks

[413 byte] By [gubloooa] at [2007-10-3 7:36:38]
# 1

> I need to write a method that takes the user input
> from a webpage and recognize if it contains foreign
> characters. By foreign characters I mean characters
> in any other language other than English.
>
> The only idea that comes to mind is to check the
> ASCII value of the characters and if it is a value
> other than [A-Z,a-z,0-9] - mark it as foreign
> character.

Those characters are not "English only". Take "dos", for instance: is that the English acronym for Disk Operating System or the Spanish word for two? Since languages share characters, I'm not sure you can say 100% of the time whether a string is foreign.

Gary_Pinkhama at 2007-7-15 2:36:54 > top of Java-index,Desktop,I18N...
# 2

The method canEncode in the following code sample tests if a given string can be encoded in a specific encoding. You can test if your given text can be encoded in "Western European" encoding (windows-1252 or ISO-8856-1) or in the "US-ASCII" encoding scheme. I hope this solves your problem.

import java.io.*;

public class Test
{
    // tests if string s can be encoded in encoding cs
    public static boolean canEncode(String s, String cs)
    {
        try
        {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(baos, cs);
            w.write(s);
            w.close();
            String t = baos.toString(cs);
            return s.equals(t);
        }
        catch(Exception e)
        {
            return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(canEncode("Hello", "windows-1252")); // Should print true.
    }
}
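
For comparison, here is a minimal sketch of the same test using java.nio.charset.CharsetEncoder.canEncode instead of a write-and-read-back round trip. This is a swapped-in technique, not the poster's code. The class name EncoderCheck is made up for illustration, and the sample string uses only the first two characters of the Chinese example that appears later in the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncoderCheck {
    // Returns true if every character of s can be represented in the named charset.
    static boolean canEncode(String s, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(s);
    }

    public static void main(String[] args) {
        // Plain ASCII text fits in US-ASCII.
        System.out.println(canEncode("yahoo.com", "US-ASCII"));               // true
        // Chinese characters (written as Unicode escapes) do not.
        System.out.println(canEncode("\u4e8c\u6b21.com.cn", "US-ASCII"));     // false
        // An accented Latin letter fits in ISO-8859-1 but not in US-ASCII.
        System.out.println(canEncode("caf\u00e9", "ISO-8859-1"));             // true
        System.out.println(canEncode("caf\u00e9", "US-ASCII"));               // false
    }
}
```

Note that a "Western European" charset accepts accented letters, so it is a looser test than US-ASCII if "foreign" is meant to include anything beyond unaccented English letters.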

ghasemkiania at 2007-7-15 2:36:54
# 3

Well, according to my requirements DOS would still qualify as English. Only characters like these are considered foreign: 二次核试验

Would something as simple as this work instead of digging into ASCII values and stuff:

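import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One or more ASCII letters or digits. The upper-case range must be A-Z:
// A-z would also admit the punctuation between 'Z' and 'a' ([ \ ] ^ _ `).
static final String ENGLISH_LANG = "[a-zA-Z0-9]+";

public static boolean isEnglishLang(String field)
{
    Pattern mask = Pattern.compile(ENGLISH_LANG);
    Matcher matcher = mask.matcher(field);
    // matches() succeeds only if the entire input fits the pattern,
    // hence the trailing + to allow more than one character
    return matcher.matches();
}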

So if a user enters characters in any other language than English, it would return false.

Thanks

gubloooa at 2007-7-15 2:36:54
# 4
Thanks for the solution you provided. Let me check whether it works for different characters in Chinese and Japanese. But what do you think about the simple pattern-matching code I provided? Would that work? Thanks
gubloooa at 2007-7-15 2:36:54
# 5
Another possibility is to use the Character.UnicodeBlock class to detect whether the characters in a string are in the BASIC_LATIN block. BTW, is your code not required to allow punctuation, symbols, etc.? For example, how does it handle a word like "isn't" or "a.m."?
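
A minimal sketch of the UnicodeBlock idea, assuming "not foreign" simply means that every code point falls in BASIC_LATIN (U+0000 to U+007F). The class and method names are made up for illustration.

```java
class BasicLatinCheck {
    // Returns true if every code point of s lies in the BASIC_LATIN block.
    // An empty string trivially returns true.
    static boolean isBasicLatin(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.BASIC_LATIN) {
                return false;
            }
            // Step by one or two chars so surrogate pairs are handled correctly.
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBasicLatin("yahoo.com"));             // true
        System.out.println(isBasicLatin("isn't"));                 // true
        System.out.println(isBasicLatin("\u4e8c\u6b21.com.cn"));   // false
    }
}
```

BASIC_LATIN covers all of ASCII, including punctuation and control characters, so this only answers "is it ASCII?"; restricting which ASCII characters are allowed needs a separate check.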
naotoa at 2007-7-15 2:36:54
# 6
No, the only punctuation that is allowed is a dot ".". It's basically for domain names, so other punctuation marks are not allowed.
Examples:
yahoo.com - contains English characters
二次核试验.com.cn - contains foreign characters
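
With that requirement (letters, digits and the dot only), a regular expression check could look roughly like the sketch below. The names DomainCharCheck and hasOnlyAllowedChars are hypothetical, and the check only looks at the character set; it does not validate that the input is a well-formed domain name.

```java
import java.util.regex.Pattern;

class DomainCharCheck {
    // Letters, digits and dots only, following the requirement stated above.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9.]+");

    // Returns true if the whole input consists of allowed characters.
    static boolean hasOnlyAllowedChars(String domain) {
        return ALLOWED.matcher(domain).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasOnlyAllowedChars("yahoo.com"));             // true
        System.out.println(hasOnlyAllowedChars("\u4e8c\u6b21.com.cn"));   // false
    }
}
```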
gubloooa at 2007-7-15 2:36:54
# 7
There was a typo in the character set name in my earlier post. The correct name is ISO-8859-1 (Latin-1), not ISO-8856-1. Sorry.
ghasemkiania at 2007-7-15 2:36:54
# 8

Hi ghasemkiani

Thanks for your response. Based on your solution, I came up with this idea. Can you tell me whether this would work?

private boolean isEnglish(String field)
{
    boolean result=true;
    try
    {
        String domain=java.net.URLEncoder.encode(field, "UTF-8");
        if(domain.equals(field))
            result=true;
        else
            result=false;
    }
    catch(Exception e)
    {
        result=false;
    }
    return result;
}

gubloooa at 2007-7-15 2:36:54
# 9

The API documentation for URLEncoder class states the method of encoding:

When encoding a String, the following rules apply:

The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.

The special characters ".", "-", "*", and "_" remain the same.

The space character " " is converted into a plus sign "+".

All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.

http://java.sun.com/j2se/1.5.0/docs/api/java/net/URLEncoder.html

So this seems to be a suitable solution, unless you want to allow the space character or disallow the special characters mentioned above, in which case this solution will cause problems.
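
To make the quoted rules concrete, here is a small sketch that prints what URLEncoder.encode produces for a few inputs. The class name UrlEncoderDemo and the helper enc are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncoderDemo {
    // Encodes s as UTF-8; UTF-8 is always supported, so the checked exception is rethrown unchecked.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(enc("yahoo.com"));      // yahoo.com      (unchanged)
        System.out.println(enc("my-site_1.com"));  // my-site_1.com  ("-" and "_" also pass through)
        System.out.println(enc("a b"));            // a+b            (space becomes "+")
        System.out.println(enc("java,com"));       // java%2Ccom     (comma is escaped)
        System.out.println(enc("\u4e8c.com"));     // %E4%BA%8C.com  (three UTF-8 bytes)
    }
}
```

So comparing the encoded string with the input accepts "-", "*" and "_" as well as the dot, and rejects any other ASCII punctuation just as it rejects non-ASCII characters.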

ghasemkiania at 2007-7-15 2:36:54
# 10
But this solution will fail even when someone enters an English domain that contains an invalid character. E.g. java,com will be reported as not English because the comma is invalid, even though it is not a foreign-language character. Is there any other solution?
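
One possible way to separate the two cases, sketched under the assumption that "foreign" means any character outside ASCII and "invalid" means an ASCII character other than a letter, digit or dot. The class name, the enum and its constants are hypothetical and not taken from any reply in this thread.

```java
class DomainInputClassifier {
    enum Kind { ENGLISH, INVALID, FOREIGN }

    // FOREIGN: at least one non-ASCII character.
    // INVALID: ASCII only, but empty or containing something other than letters, digits or ".".
    // ENGLISH: non-empty and made up of ASCII letters, digits and "." only.
    static Kind classify(String input) {
        boolean invalid = input.isEmpty();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c > 127) {
                return Kind.FOREIGN; // outside the ASCII range
            }
            boolean allowed = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '.';
            if (!allowed) {
                invalid = true; // keep scanning: a later non-ASCII char still wins
            }
        }
        return invalid ? Kind.INVALID : Kind.ENGLISH;
    }

    public static void main(String[] args) {
        System.out.println(classify("yahoo.com"));             // ENGLISH
        System.out.println(classify("java,com"));              // INVALID
        System.out.println(classify("\u4e8c\u6b21.com.cn"));   // FOREIGN
    }
}
```

Under this assumption accented Latin letters such as "é" also count as foreign; whether that is wanted depends on the requirements.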
gubloooa at 2007-7-15 2:36:54