regex and unicode
Hi,
I want to create a regex to validate a user-entered string (i.e. text entered in a JTextField). I'm familiar with the Pattern and Matcher classes, and I've put them inside a new class that extends javax.swing.InputVerifier. I set the JTextField's input verifier to an instance of this new class. The problem is coming up with the regex to use when compiling the Pattern.
I know what the regex conditions are, but I can't seem to combine all of the elements into a single valid regex and it is driving me nuts!
(1) The object name cannot contain negative numbers, hyphens/dashes or periods. It can only contain:
[a-zA-Z0-9_]
(2) To further complicate matters, the first character in the name MUST be a letter.
\A[a-zA-Z]
(3) AND....the name cannot contain supplemental unicode chars. Since I'm not familiar with unicode, I did a search and found from the brief articles that I've read:
In Java 1.4, I think it would be denoted as:
[^\\p{InHighSurrogates}\\p{InLowSurrogates}]
And in Java 1.5, I think it would be denoted as:
[\ud800\udc00-\udbff\udfff]
I'm having trouble getting them all to behave as if I've programmatically "&&"'ed the above conditions together. Any help/suggestions would be GREATLY appreciated.
Thanks --
[1340 byte] By [blazaha] at [2007-11-27 5:19:49]
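
For reference, conditions (1) and (2) can be combined into one ASCII-only pattern. This is a minimal sketch, not necessarily the expression suggested elsewhere in the thread; the class name AsciiNameRegex and the method isValidName are made up for illustration. Because the character classes list only ASCII characters, the pattern also rejects every supplementary character (along with all other non-ASCII characters), so condition (3) needs no extra term in this version.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not from the original post.
final class AsciiNameRegex {
    // First character: an ASCII letter.
    // Remaining characters: ASCII letters, ASCII digits, or underscore.
    private static final Pattern NAME = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*");

    static boolean isValidName(String s) {
        // matches() must consume the whole input, so explicit \A and \z anchors are not needed.
        return NAME.matcher(s).matches();
    }
}
```

Using Matcher.matches() rather than find() is what "ANDs" the conditions together: every character in the input has to satisfy its position's character class.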

Somehow my SDN account got messed up so I'm finally replying -
I tested out the suggested expression and it works for all cases except when valid unicode chars are included in the string (e.g. capital Greek letter Gamma -- \u0393). The regex has to exclude unicode supplemental chars but be able to allow the pattern to match the string when it includes other unicode chars.
Is there a way to tweak the expression?
This is what I was having trouble with....the unicode stuff.
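
One possible tweak, assuming "letter" and "digit" are meant in the Unicode sense (categories L and Nd) and that only characters in the Basic Multilingual Plane are acceptable: intersect each character class with the range \u0000-\uFFFF using Java's && class operator. The class and method names below are hypothetical, and the choice of \p{L} and \p{Nd} is my reading of the requirement rather than something stated in the thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names are assumptions made for illustration.
final class UnicodeNameRegex {
    // [X&&[\u0000-\uFFFF]] keeps only the members of X that lie in the BMP,
    // which drops supplementary characters (code points above U+FFFF).
    private static final Pattern NAME = Pattern.compile(
            "[\\p{L}&&[\\u0000-\\uFFFF]]"              // first char: any BMP letter
            + "[\\p{L}\\p{Nd}_&&[\\u0000-\\uFFFF]]*"); // rest: BMP letters, digits, underscore

    static boolean isValidName(String s) {
        // matches() requires the entire input to fit the pattern.
        return NAME.matcher(s).matches();
    }
}
```

With this sketch a name such as "\u0393amma1" (starting with capital Gamma) is accepted, while a name containing a supplementary letter such as U+1D538 is rejected.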
Do you have to do this in a single regex? I think the easiest way to filter out supplemental characters would be with a loop like this:
for (int i = 0; i < str.length(); i++)
{
    if (str.codePointAt(i) > 0xFFFF)
    {
        // bzzzt!
    }
}
In fact, you might as well do the whole validation this way, using the Character methods isLetter() and isLetterOrDigit() to check for valid characters.
The way I read it, when the OP said [a-zA-Z0-9_] in the first post, he really meant any Unicode letter or digit, not just the ASCII subset. Either that or the requirement changed, because in reply #2 he said Greek letters were also valid. Aside from that, he wants to disallow characters outside the BMP, i.e., supplementary characters.
I just looked over the Character and String docs again, and I see that (as of JDK 1.5) they provide all the methods you need (short of String.hasSupplementaryCharacters(), that is):
int len = str.length();
if (len != str.codePointCount(0, len))
{
    // invalid: has supplementary characters
}
// this is probably still easier than using a regex
for (int i = 0; i < len; i++)
{
    char ch = str.charAt(i);
    // test with Character.isXXX() methods
}
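
Filled in and wired into an InputVerifier, the approach above might look roughly like this. It is only a sketch: the class name NameInputVerifier and both static helpers are hypothetical (hasSupplementaryCharacters stands in for the String method that does not exist), and allowing any Unicode letter or digit plus underscore is an assumption based on the discussion above.

```java
import javax.swing.InputVerifier;
import javax.swing.JComponent;
import javax.swing.text.JTextComponent;

// Hypothetical verifier class; all names here are assumptions made for illustration.
final class NameInputVerifier extends InputVerifier {

    // A supplementary character takes two chars but counts as one code point,
    // so the two lengths differ exactly when a surrogate pair is present.
    static boolean hasSupplementaryCharacters(String s) {
        return s.length() != s.codePointCount(0, s.length());
    }

    static boolean isValidName(String s) {
        if (s.length() == 0 || hasSupplementaryCharacters(s)) {
            return false;
        }
        // The first character must be a letter.
        if (!Character.isLetter(s.charAt(0))) {
            return false;
        }
        // Remaining characters: letters, digits, or underscore.
        // Unpaired surrogates fail isLetterOrDigit and are rejected here.
        for (int i = 1; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (!Character.isLetterOrDigit(ch) && ch != '_') {
                return false;
            }
        }
        return true;
    }

    @Override
    public boolean verify(JComponent input) {
        // Only text components are checked; anything else is passed through.
        if (!(input instanceof JTextComponent)) {
            return true;
        }
        return isValidName(((JTextComponent) input).getText());
    }
}
```

It would be attached with something like textField.setInputVerifier(new NameInputVerifier()), where textField is the JTextField from the original question.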