How to recognize if a String contains foreign characters
Hi guys,
I need to write a method that takes the user input from a webpage and recognizes whether it contains foreign characters. By foreign characters I mean characters in any language other than English.
The only idea that comes to mind is to check the ASCII value of each character and, if it is a value other than [A-Z,a-z,0-9], mark it as a foreign character.
Any thoughts?
Thanks
[413 byte] By [gubloooa] at [2007-10-3 7:36:38]

> I need to write a method that takes the user input
> from a webpage and recognize if it contains foreign
> characters. By foreign characters I mean characters
> in any other language other than English.
>
> The only idea that comes to mind is to check the
> ASCII value of the characters and if it is a value
> other than [A-Z,a-z,0-9] - mark it as foreign
> character.
Those characters are not "English only". For instance, "dos": is that the English acronym for Disk Operating System or the Spanish word for two? Since languages share characters, I'm not sure you can say 100% of the time whether it's foreign.
The method canEncode in the following code sample tests if a given string can be encoded in a specific encoding. You can test if your given text can be encoded in "Western European" encoding (windows-1252 or ISO-8856-1) or in the "US-ASCII" encoding scheme. I hope this solves your problem.
import java.io.*;

public class Test
{
    // tests if string s can be encoded in encoding cs
    public static boolean canEncode(String s, String cs)
    {
        try
        {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(baos, cs);
            w.write(s);
            w.close();
            String t = baos.toString(cs);
            return s.equals(t);
        }
        catch(Exception e)
        {
            return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(canEncode("Hello", "windows-1252")); // Should print true.
    }
}
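For comparison, the same encodability test can probably be written more directly with java.nio.charset.CharsetEncoder.canEncode, which avoids the write-and-read-back round trip. This is only a minimal sketch: the class name EncodableCheck and the sample strings are made up for illustration, and unlike the method above it lets an unknown charset name throw an unchecked exception instead of returning false.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, not part of the code posted above.
class EncodableCheck {
    // Returns true if every character of s can be represented in the named charset.
    // Charset.forName throws an unchecked exception if the name is illegal or unsupported.
    static boolean canEncode(String s, String charsetName) {
        // Encoders are not thread-safe, so a fresh one is created per call.
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(s);
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source file itself stays plain ASCII.
        String chinese = "\u4e8c\u6b21\u6838\u8bd5\u9a8c.com.cn";
        System.out.println(canEncode("yahoo.com", "US-ASCII"));   // true
        System.out.println(canEncode(chinese, "US-ASCII"));       // false
        System.out.println(canEncode("caf\u00e9", "ISO-8859-1")); // true
    }
}
```

Note that with "US-ASCII" this accepts every ASCII character, including punctuation, so it only separates ASCII from non-ASCII input.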
Well, according to my requirements, DOS would still qualify as English. Only characters like these are considered foreign: 二次核试验
Would something as simple as this work instead of digging into ASCII values and stuff:
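import java.util.regex.Pattern;

// One or more ASCII letters or digits. The range must be A-Z (A-z would also
// admit [ \ ] ^ _ `), and the + is needed because matches() tests the whole string.
private static final Pattern ENGLISH_LANG = Pattern.compile("[a-zA-Z0-9]+");

public static boolean isEnglishLang(String field)
{
    return ENGLISH_LANG.matcher(field).matches();
}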
So if a user enters characters in any other language than English, it would return false.
Thanks
Thanks for the solution you provided. Let me check if that works for different characters in Chinese and Japanese. But what do you think about the simple pattern matching code I provided? Would that work? Thanks
Another possibility is to use the Character.UnicodeBlock class to check whether the characters in a string are all in the BASIC_LATIN block. BTW, is your code not required to allow punctuation, symbols, etc.? For example, how does it handle a word like "isn't" or "a.m."?
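A rough sketch of the UnicodeBlock idea, assuming "English" simply means "every code point is in BASIC_LATIN" (the class and method names here are invented for the example):

```java
// Hypothetical helper illustrating the Character.UnicodeBlock suggestion.
class BasicLatinCheck {
    // Returns true if every code point of s lies in the BASIC_LATIN block (U+0000 to U+007F).
    static boolean isBasicLatin(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.BASIC_LATIN) {
                return false;
            }
            // Advance by one or two chars, depending on whether cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBasicLatin("isn't"));                  // true
        System.out.println(isBasicLatin("\u4e8c\u6b21.com.cn"));    // false
    }
}
```

BASIC_LATIN covers all of ASCII, so punctuation such as the apostrophe in "isn't" passes this check; any rule about which punctuation is allowed would need a separate test.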
No, the only punctuation that is allowed is a dot ".". It's basically for domain names, so other punctuation marks are not allowed.
Examples:
yahoo.com - contains English characters
二次核试验.com.cn - contains foreign characters
There was a typo in the character set name in my earlier post. The correct name is ISO 8859-1 (Latin 1), not ISO-8856-1. Sorry.
Hi ghasemkiani
Thanks for your response. Based on your solution, I came up with this idea. Can you tell me whether this would work?
private boolean isEnglish(String field)
{
    boolean result = true;
    try
    {
        String domain = java.net.URLEncoder.encode(field, "UTF-8");
        if (domain.equals(field))
            result = true;
        else
            result = false;
    }
    catch (Exception e)
    {
        result = false;
    }
    return result;
}
The API documentation for URLEncoder class states the method of encoding:
When encoding a String, the following rules apply:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
http://java.sun.com/j2se/1.5.0/docs/api/java/net/URLEncoder.html
So this seems to be a suitable solution, unless you want to allow the space character or disallow the above-mentioned special characters ("-", "*" and "_"), in which case this solution will cause problems.
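To make those caveats concrete, here is a small sketch of how the quoted rules play out. The class name UrlEncoderDemo and the sample inputs are only illustrative:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class showing which inputs URLEncoder leaves unchanged.
class UrlEncoderDemo {
    // True if URL-encoding s as UTF-8 gives back exactly the same string.
    static boolean unchangedByEncoding(String s) throws UnsupportedEncodingException {
        return URLEncoder.encode(s, "UTF-8").equals(s);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(unchangedByEncoding("yahoo.com"));            // true
        System.out.println(unchangedByEncoding("a_b-c*d.com"));          // true: - * _ pass through
        System.out.println(URLEncoder.encode("my site", "UTF-8"));       // my+site
        System.out.println(unchangedByEncoding("\u4e8c\u6b21.com.cn"));  // false: becomes %xy escapes
    }
}
```

So an input containing "-", "*" or "_" is still reported as English, while one containing a space is not.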
But this solution will fail if someone enters an English domain that contains an invalid character. E.g. java,com will be reported as not English because the comma is invalid, even though it is not a foreign-language character. Any other solution?
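One possible way to separate the two cases is to classify the input into three outcomes instead of two. The sketch below is only an illustration under stated assumptions: it treats any character above U+007F as foreign, treats ASCII characters other than letters, digits and "." as invalid (following the "only a dot is allowed" rule stated earlier in the thread), and lets "foreign" win when both kinds occur. The names DomainCharCheck, Result and classify are made up.

```java
// Hypothetical three-way classifier: English, invalid (ASCII but not allowed), or foreign.
class DomainCharCheck {
    enum Result { ENGLISH, INVALID, FOREIGN }

    static Result classify(String field) {
        // Assumption: an empty input is treated as invalid rather than English.
        if (field.isEmpty()) {
            return Result.INVALID;
        }
        boolean invalid = false;
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            if (c > 127) {
                // Anything outside ASCII (including surrogates) counts as foreign.
                return Result.FOREIGN;
            }
            boolean allowed = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '.';
            if (!allowed) {
                invalid = true; // keep scanning in case a foreign character follows
            }
        }
        return invalid ? Result.INVALID : Result.ENGLISH;
    }

    public static void main(String[] args) {
        System.out.println(classify("yahoo.com"));                      // ENGLISH
        System.out.println(classify("java,com"));                       // INVALID
        System.out.println(classify("\u4e8c\u6b21\u6838.com.cn"));      // FOREIGN
    }
}
```

Real domain names also permit the hyphen, so the allowed set would need adjusting if that matters for the application.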