regex and unicode

Hi,

I want to create a regex which I can use to validate a user inputted string (i.e. text entered in a JTextField). I'm familiar with the Pattern and Matcher classes, and I've put them inside of a new class that extends javax.swing.InputVerifier. I set the JTextField's input verifier to this new class. The problem is coming up with the regex to use when compiling the Pattern.

I know what the regex conditions are, but I can't seem to combine all of the elements into a single valid regex and it is driving me nuts!

(1) The object name cannot contain negative numbers, hyphens/dashes or periods. It can only contain:

[a-zA-Z0-9_]

(2) To further complicate matters, the first character in the name MUST be a letter.

\A[a-zA-Z]

(3) AND....the name cannot contain supplemental unicode chars. Since I'm not familiar with unicode, I did a search and found from the brief articles that I've read:

In Java 1.4, I think it would be denoted as:

[^\\p{InHighSurrogates}\\p{InLowSurrogates}]

And in Java 1.5, I think it would be denoted as:

[\ud800\udc00-\udbff\udfff]

I'm having trouble getting them all to behave as if I've programmatically "&&"'ed the above conditions together. Any help/suggestions would be GREATLY appreciated.

Thanks --

[1340 byte] By [blazaha] at [2007-11-27 5:19:49]
# 1
It seems to me that your first two conditions are all that matter, since if you meet them you must meet the third. I think your regex should be "\\p{Alpha}\\w*"
sabre150a at 2007-7-12 10:43:19 > top of Java-index,Java Essentials,Java Programming...
# 2

Somehow my SDN account got messed up so I'm finally replying -

I tested out the suggested expression and it works for all cases except when valid unicode chars are included in the string (e.g. capital Greek letter Gamma -- \u0393). The regex has to exclude unicode supplemental chars but be able to allow the pattern to match the string when it includes other unicode chars.

Is there a way to tweak the expression?

This is what I was having trouble with....the unicode stuff.
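A minimal sketch that reproduces this behaviour, assuming Java's default matching mode, in which \p{Alpha} and \w cover only ASCII. The class name AsciiNameDemo is made up for illustration:

```java
import java.util.regex.Pattern;

class AsciiNameDemo {
    // In the default mode, \p{Alpha} is [a-zA-Z] and \w is [a-zA-Z_0-9].
    static final Pattern ASCII_NAME = Pattern.compile("\\p{Alpha}\\w*");

    static boolean matches(String s) {
        return ASCII_NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        // ASCII letter first, then word characters
        System.out.println(matches("name_1"));      // prints true
        // Starts with the Greek capital letter Gamma, which is not ASCII
        System.out.println(matches("\u0393amma"));  // prints false
    }
}
```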

blazah_38a at 2007-7-12 10:43:19
# 3
See java.lang.Character.UnicodeBlock class documentation and use \p{...} notation for regex pattern.
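One way the \p{...} notation might be applied is sketched below. It uses the general categories \p{L} and \p{Nd} rather than a block name, and a character class intersection (&&) to stay inside the BMP. The rule set is an assumption, since the thread never pins down the valid characters: first character any BMP letter, the rest BMP letters, decimal digits or underscore. The class name BmpNameRegex is hypothetical.

```java
import java.util.regex.Pattern;

class BmpNameRegex {
    // Assumed rules (not confirmed by the original poster):
    //   first char: any letter in the BMP
    //   remaining:  BMP letters, decimal digits, or '_'
    // "&&[\u0000-\uFFFF]" intersects each class with the BMP, so a
    // surrogate pair (read by the matcher as one code point above
    // 0xFFFF) can never match.
    static final Pattern NAME = Pattern.compile(
        "[\\p{L}&&[\\u0000-\\uFFFF]]"
        + "[\\p{L}\\p{Nd}_&&[\\u0000-\\uFFFF]]*");

    static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("\u0393amma_1"));    // Greek letter, BMP
        System.out.println(isValid("a\uD835\uDC00"));   // contains U+1D400
    }
}
```

Under these assumptions the first call should accept the name and the second should reject it, because U+1D400 is a letter but lies outside the BMP.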
hiwaa at 2007-7-12 10:43:19
# 4
> Is there a way to tweak the expression?

"[a-zA-Z][a-zA-Z0-9_]*"
sabre150a at 2007-7-12 10:43:19
# 5

Do you have to do this in a single regex? I think the easiest way to filter out supplemental characters would be with a loop like this:

for (int i = 0; i < str.length(); i++)
{
    if (str.codePointAt(i) > 0xFFFF)
    {
        // bzzzt!
    }
}

In fact, you might as well do the whole validation this way, using the Character methods isLetter() and isLetterOrDigit() to check for valid characters.
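A rough sketch of what the whole validation could look like with the Character methods. The rules are assumed, not confirmed by the original poster: the first character must be a letter, and the rest may be letters, digits or underscore. The class and method names are made up for illustration.

```java
class NameValidator {
    // Assumed rules: first char is a letter; remaining chars are
    // letters, digits, or '_'. Works on char values, so it only
    // ever sees UTF-16 code units.
    static boolean isValidName(String s) {
        if (s == null || s.length() == 0) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            // Surrogate code units are neither letters nor digits, so
            // any supplementary character fails these checks without
            // a separate test.
            boolean ok = (i == 0)
                ? Character.isLetter(ch)
                : Character.isLetterOrDigit(ch) || ch == '_';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidName("\u0393amma_1"));
        System.out.println(isValidName("a\uD835\uDC00"));
    }
}
```

Note that Character.isLetterOrDigit() accepts every Unicode decimal digit, not only 0-9, so the check would need tightening if only ASCII digits are allowed.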

uncle_alicea at 2007-7-12 10:43:19
# 6
What am I missing? I still see the constraints imposed by parts 1 and 2 automatically enforce the constraint imposed by part 3. Parts 1 and 2 are easy to deal with using regex.
sabre150a at 2007-7-12 10:43:19
# 7

The way I read it, when the OP said [a-zA-Z0-9_] in the first post, he really meant any Unicode letter or digit, not just the ASCII subset. Either that or the requirement changed, because in reply #2 he said Greek letters were also valid. Aside from that, he wants to disallow characters outside the BMP, i.e., supplementary characters.

I just looked over the Character and String docs again, and I see that (as of JDK 1.5) they provide all the methods you need (short of String.hasSupplementaryCharacters(), that is):

int len = str.length();
if (len != str.codePointCount(0, len))
{
    // invalid: has supplementary characters
}

// this is probably still easier than using a regex
for (int i = 0; i < len; i++)
{
    char ch = str.charAt(i);
    // test with Character.isXXX() methods
}

uncle_alicea at 2007-7-12 10:43:19
# 8
Having read #2 again I can now see what I was missing. The only real problem I see is knowing which is the valid set of characters, because although the OP has hinted at them he has not defined them. I still think a regex will do the job but ...
sabre150a at 2007-7-12 10:43:19
# 9
> The only real problem I see is knowing which
> is the valid set of characters...

Yeah, that's one of my reasons for recommending a non-regex approach. Even with complete specs, though, it would be a pretty hideous regex.
uncle_alicea at 2007-7-12 10:43:19