Regex for international character sets like [a-zA-Z] + Ł etc.
Hi all!
Is there an internationalized equivalent for the pattern "[a-zA-Z]+" which would accept French, German, Polish or other local characters like "Ł"?
It's a bad idea to add all of them to the standard regex, like "[a-zA-ZŁ]+", since you never know what comes next.
I tried to use "\\p{Alpha}+" to check the string "Łukasz", but the string didn't match. Is there any possibility other than excluding "=", ">", "+", etc.?
Thanks for any help!
Lena
Lena_a at 2007-11-27 6:29:06

You shouldn't write non-ASCII characters in source code; it's not portable, and if you don't know what you are doing the compiler may misunderstand you. This is what has happened here: the Ł has been replaced by something that is not a letter. Characters outside of the ASCII range should be encoded using Unicode escapes, like this:
"\u0141ukasz".matches("\\p{L}+") // returns true
You can use the native2ascii tool to convert text files from any text encoding to ASCII with unicode escape codes.
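A minimal sketch of the suggestion above. The class name LetterCheck and the helper isLettersOnly are made up for illustration and do not come from the thread. Ł is written as the escape \u0141 so the source file stays pure ASCII, and the pattern uses \p{L}, the Unicode "letter" category. One detail worth knowing: in java.util.regex the POSIX class \p{Alpha} covers only US-ASCII letters by default, so it rejects Ł no matter how the source file is encoded.

```java
import java.util.regex.Pattern;

public class LetterCheck {
    // \p{L} matches any Unicode letter; the pattern is compiled once and reused.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Hypothetical helper: true if the whole string is one or more letters.
    static boolean isLettersOnly(String s) {
        return LETTERS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The Polish L-with-stroke is written as a Unicode escape in the literal.
        System.out.println(isLettersOnly("\u0141ukasz")); // true
        System.out.println(isLettersOnly("Lena"));         // true
        System.out.println(isLettersOnly("a+b"));          // false, '+' is not a letter

        // The POSIX class Alpha is ASCII-only by default, so this prints false.
        System.out.println("\u0141ukasz".matches("\\p{Alpha}+"));
    }
}
```

Since Java 7 the flag Pattern.UNICODE_CHARACTER_CLASS makes the POSIX classes Unicode-aware, but \p{L} works without any flag.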
My application will get the UTF-8 encoded string and has to ensure it represents a valid name.
To ensure the passed string is UTF-8 I used:
Charset.forName("UTF-8").encode("Łukasz")
for the Eclipse console tests, but the matching still fails:
System.out.println(Pattern.matches("\\p{L}+", Charset.forName("UTF-8").encode("Łukasz").toString()));
returns false
Lena_a at 2007-7-12 17:52:59 >
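
A short sketch of why that test prints false. The class name BufferToStringDemo and the helper describe are hypothetical. ByteBuffer.toString() returns a summary of the buffer's state, not its contents, so the regex is applied to that summary text rather than to the name:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class BufferToStringDemo {
    // Hypothetical helper: returns the text that the failing test fed to the regex.
    static String describe(String s) {
        ByteBuffer buffer = Charset.forName("UTF-8").encode(s);
        // toString() summarizes the buffer state (class name, position, limit, capacity).
        return buffer.toString();
    }

    public static void main(String[] args) {
        String described = describe("\u0141ukasz");
        System.out.println(described);
        // The summary contains brackets, digits and '=' signs, so it cannot match.
        System.out.println(described.matches("\\p{L}+"));
    }
}
```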

Sorry, sorry! I missed that encode() returns a byte buffer. With Charset.forName("UTF-8").encode("Łukasz").asCharBuffer().toString() the string matches. You are great! Thanks a lot!
Lena_a at 2007-7-12 17:52:59 >

> My application will get the UTF-8 encoded string and
> has to ensure it represents a valid name.
There is no such thing as a "UTF-8 encoded string." In Java, strings are internally encoded in UTF-16, one of the 16-bit Unicode transformation formats, but you don't need to know this most of the time.
The code you post works accidentally. It does not work in the general case. Just do what I said in my previous post.
Try this code for instance:
System.out.println(Charset.forName("UTF-8").encode(">8<>")
        .asCharBuffer().toString().matches("\\p{L}+"));
It will print "true" because the resulting CharBuffer matches the pattern even though the original string does not. To understand why this happens, I recommend reading the API documentation of the relevant classes.
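
The effect can be reproduced with a short sketch. The class name Utf8NameCheck and the helpers isValidName and reinterpret are hypothetical, and the sketch assumes the application receives the name as raw UTF-8 bytes. asCharBuffer() does not decode anything: it pairs up the bytes into 16-bit chars, and those chars can happen to be letters. Decoding the bytes into a String first and matching afterwards gives the expected results. StandardCharsets needs Java 7 or later; on older versions Charset.forName("UTF-8") does the same job.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Utf8NameCheck {
    // Hypothetical helper: decode UTF-8 bytes to a String, then test for letters only.
    static boolean isValidName(byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        return decoded.matches("\\p{L}+");
    }

    // What the code in the thread does: view each pair of bytes as one char.
    static String reinterpret(String s) {
        ByteBuffer bytes = StandardCharsets.UTF_8.encode(s);
        return bytes.asCharBuffer().toString();
    }

    public static void main(String[] args) {
        String garbage = reinterpret(">8<>");
        System.out.println(garbage.length());           // 2: four bytes became two chars
        System.out.println(garbage.matches("\\p{L}+")); // true, although ">8<>" has no letters

        byte[] symbols = ">8<>".getBytes(StandardCharsets.UTF_8);
        byte[] name = "\u0141ukasz".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidName(symbols)); // false
        System.out.println(isValidName(name));    // true
    }
}
```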