Regex for international character sets like [a-zA-Z] + Ł etc.

Hi all!

Is there an internationalized equivalent of the pattern "[a-zA-Z]+" that would accept French, German, Polish or other local characters like "Ł"?

It's a bad idea to add all of them to the standard regex, as in "[a-zA-ZŁ]+" extended with every other such letter, since you never know what comes next.

I tried to use "\\p{Alpha}+" to check the string "Łukasz", but the string didn't match. Is there any option other than explicitly excluding "=", ">", "+", etc.?

Thanks for any help!

Lena

[538 byte] By [Lena_a] at [2007-11-27 6:29:06]
# 1
Try using \\p{L}
Dalzhima at 2007-7-12 17:52:59 > top of Java-index,Java Essentials,Java Programming...
# 2
System.out.println(Pattern.matches("\\p{L}+", "Łukasz")); prints false :(
Lena_a at 2007-7-12 17:52:59
# 3
What character encoding is your source file saved in when you use Ł?
Dalzhima at 2007-7-12 17:52:59
# 4

You shouldn't write non-ASCII characters in source code; it's not portable, and if the compiler assumes a different encoding from the one the file was saved in, it will misread them. This is what has happened here: the Ł has been replaced by something that is not a letter. Characters outside the ASCII range should be written using Unicode escapes, like this:

"\u0141ukasz".matches("\\p{L}+") // returns true

You can use the native2ascii tool to convert text files from any text encoding to ASCII with unicode escape codes.
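A minimal sketch of the check described above, assuming the name arrives as an ordinary Java String. The class name LetterCheck and the method isAllLetters are made up for illustration. It also shows why "\\p{Alpha}+" failed in the original question: by default the POSIX classes in java.util.regex cover US-ASCII only.

```java
import java.util.regex.Pattern;

class LetterCheck {
    // \p{L} matches any Unicode letter; + requires at least one.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Hypothetical helper: true if s consists only of Unicode letters.
    static boolean isAllLetters(String s) {
        return LETTERS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The Ł is written as a Unicode escape so the source file stays pure ASCII.
        System.out.println(isAllLetters("\u0141ukasz"));          // true
        System.out.println(isAllLetters("\u0141ukasz="));         // false, '=' is not a letter
        // \p{Alpha} is ASCII-only unless UNICODE_CHARACTER_CLASS is set.
        System.out.println("\u0141ukasz".matches("\\p{Alpha}+")); // false
    }
}
```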

jsalonena at 2007-7-12 17:52:59
# 5

My application will get the UTF-8 encoded string and has to ensure it represents a valid name.

To ensure the passed string is UTF-8 I used:

Charset.forName("UTF-8").encode("Łukasz")

for the tests in the Eclipse console, but the match still fails:

System.out.println(Pattern.matches("\\p{L}+", Charset.forName("UTF-8").encode("Łukasz").toString()));

prints false

Lena_a at 2007-7-12 17:52:59
# 6
Sorry, sorry! I missed that encode() returns a byte buffer. With Charset.forName("UTF-8").encode("Łukasz").asCharBuffer().toString() the string matches. You are great! Thanks a lot!
Lena_a at 2007-7-12 17:52:59
# 7

> My application will get the UTF-8 encoded string and

> has to ensure it represents a valid name.

There is no such thing as a "UTF-8 encoded string." In Java, strings are internally encoded in UTF-16, one of the 16-bit transformation formats, but you don't need to know this most of the time.
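If the application really receives raw UTF-8 bytes (from a socket or file, say), one way to handle them is to decode the bytes into a String first and only then apply the pattern. The following is a sketch under that assumption; Utf8NameCheck and isValidName are hypothetical names, and StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class Utf8NameCheck {
    // Hypothetical helper: decode UTF-8 bytes, then test for letters only.
    static boolean isValidName(byte[] utf8Bytes) {
        // Malformed input is decoded to U+FFFD, which is not a letter.
        String name = new String(utf8Bytes, StandardCharsets.UTF_8);
        return name.matches("\\p{L}+");
    }

    public static void main(String[] args) {
        // "Łukasz" as UTF-8: Ł is the two bytes C5 81.
        byte[] wire = {(byte) 0xC5, (byte) 0x81, 'u', 'k', 'a', 's', 'z'};
        System.out.println(isValidName(wire));                                      // true
        System.out.println(isValidName(">8<>".getBytes(StandardCharsets.UTF_8)));   // false
    }
}
```

Decoding is the step that turns bytes into characters; once that is done, the regex sees real letters rather than byte pairs.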

The code you posted works only by accident. It does not work in the general case. Just do what I said in my previous post.

Try this code, for instance:

System.out.println(Charset.forName("UTF-8").encode(">8<>").asCharBuffer().toString().matches("\\p{L}+"));

It will print "true" because the resulting charbuffer does match the pattern even though the original string does not. To understand why this happens I recommend reading the API documentation of the relevant classes.
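To make the effect visible, here is a small sketch that prints what asCharBuffer() actually produces. The class name CharBufferPitfall and the method reinterpret are made up for illustration. It relies on the default big-endian byte order of ByteBuffer, so each pair of UTF-8 bytes is glued into one 16-bit char.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class CharBufferPitfall {
    // Reinterprets the UTF-8 bytes of s as 16-bit chars, the way asCharBuffer() does.
    // No decoding takes place; two bytes simply become one char.
    static String reinterpret(String s) {
        ByteBuffer bytes = StandardCharsets.UTF_8.encode(s);
        return bytes.asCharBuffer().toString();
    }

    public static void main(String[] args) {
        // ">8<>" is the bytes 3E 38 3C 3E, which become the chars U+3E38 and U+3C3E.
        String garbled = reinterpret(">8<>");
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X letter=%b%n", (int) c, Character.isLetter(c));
        }
        // Both chars fall in the CJK ideograph range, so \p{L}+ matches.
        System.out.println(garbled.matches("\\p{L}+")); // true
    }
}
```

The four punctuation bytes turn into two CJK ideographs, which is why the pattern matches even though the original string contains no letters at all.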

jsalonena at 2007-7-12 17:52:59
# 8
Nice, I didn't know there was a native2ascii tool provided with the SDK. So that's how you found the Unicode escape for that weird character! :D
Dalzhima at 2007-7-12 17:52:59
# 9
Actually I used BeanShell and just copied and pasted. http://beanshell.org/ It's very handy for quickly checking small things like results of function calls without actually making a test program.
jsalonena at 2007-7-12 17:52:59
# 10
I didn't know about BeanShell. Gotta love it. I always thought it would be cool to have such a console but I didn't know someone had already made my dream come true! I should have just ******* googled it... ;S
Dalzhima at 2007-7-12 17:52:59