Like a regex, but readable?
Not to raise the ire of uncle_alice, but try as I might to read regex
tutorials, any sufficiently long regex pattern string is indistinguishable
from magic to me. It's unreadable.
So I'm wondering if code that step-by-step constructs a regex *object*,
without resorting to string symbols, would be more comprehensible?
Are there any examples of this being done in any library? Or should I
just drink the Kool-Aid?
[457 byte] By [Hippolytea] at [2007-11-27 6:10:37]

I started reading one yesterday as well, and I can definitely feel your pain. But what would you say to someone who came up to you and stated that Java is too difficult to learn, and asked where all those computers are that are supposed to do what we tell them by voice command?
I think your best bet, as is mine, is to build slowly on what you do have and keep adding more and more pieces until you see the bigger picture. Either that, or until it starts to look like a pet ran across the keyboard, at which point you just write UNCLE !! ;-P
> Or should I
> just drink the Kool-Aid?
Drink the Kool-Aid, then roll your own...
> what would you say to someone who came up to you and
> stated that java is too difficult to learn
By that logic, we should all just go back to using the original vacuum-tube interface to a computer :). Who needs computer programming languages anyway?
> Not to raise the ire of uncle_alice, but try as I
> might to read regex
> tutorials, any sufficiently long regex pattern string
> is indistinguishable
> from magic to me. It's unreadable.
>
> So I'm wondering if code that step-by-step constructs
> a regex *object*,
> without resorting to string symbols, would be more
> comprehensible?
> Are there any examples of this being done in any
> library? Or should I
> just drink the Kool-Aid?
Something along the lines of, say, Hibernate's Criteria API? Yeh, that'd be handy. Project? One day I will learn this voodoo, I hate feeling dumb
> > what would you say to someone who came up to you
> and
> > stated that java is too difficult to learn
>
> By that logic, we should all just go back to using
> the original vacuum-tube interface to using a
> computer :). Who needs computer programming
> languages anyway?
I just meant that many things are very difficult when you first start out, but often worth it in the end. I'm not quite sure of what you are trying to convey unless you are going on the whole "standing on the shoulders of giants" approach.
I second that request... but I would be concerned about how much coding would be required to do something fairly simple, let alone something very complex.
> So I'm wondering if code that step-by-step constructs
> a regex *object*,
So, you mean like this?
// "^(\\p{Alnum}|)\\w+ ... etc. "
RE re = new RE();
re.addRule(RE.startsWith(RE.alphanum));
re.addLogicalOr();
re.addRule(RE.startsWith(""));
re.matchOneOrMore(RE.wordChar);
... etc.
I'd be surprised if you could find a scheme that carries the power of regex without becoming so verbose as to be even harder to read.
jverda at 2007-7-12 17:16:11 >
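For what it's worth, here is a rough sketch of how a builder along those lines might look if it produced an ordinary java.util.regex.Pattern at the end. RegexBuilder and all of its method names are made up for illustration; this is not an existing library, and it covers only a tiny corner of regex syntax.

```java
import java.util.regex.Pattern;

/** Hypothetical fluent builder; the class and every method name are invented for this sketch. */
final class RegexBuilder {
    private final StringBuilder sb = new StringBuilder();

    static RegexBuilder regex() { return new RegexBuilder(); }

    RegexBuilder startOfInput() { sb.append('^'); return this; }
    RegexBuilder endOfInput()   { sb.append('$'); return this; }
    RegexBuilder digit()        { sb.append("\\d"); return this; }
    RegexBuilder wordChar()     { sb.append("\\w"); return this; }

    // Pattern.quote escapes the text so it is matched literally.
    RegexBuilder literal(String text) { sb.append(Pattern.quote(text)); return this; }

    RegexBuilder oneOrMore(RegexBuilder inner)      { return group(inner, "+"); }
    RegexBuilder optional(RegexBuilder inner)       { return group(inner, "?"); }
    RegexBuilder exactly(int n, RegexBuilder inner) { return group(inner, "{" + n + "}"); }

    // Wrap the inner expression in a non-capturing group before quantifying it.
    private RegexBuilder group(RegexBuilder inner, String quantifier) {
        sb.append("(?:").append(inner.sb).append(')').append(quantifier);
        return this;
    }

    Pattern build() { return Pattern.compile(sb.toString()); }
}

class RegexBuilderDemo {
    // One or more word characters, a hyphen, then exactly four digits.
    static final Pattern PART_NUMBER = RegexBuilder.regex()
            .startOfInput()
            .oneOrMore(RegexBuilder.regex().wordChar())
            .literal("-")
            .exactly(4, RegexBuilder.regex().digit())
            .endOfInput()
            .build();

    public static void main(String[] args) {
        System.out.println(PART_NUMBER.pattern());
        System.out.println(PART_NUMBER.matcher("widget-2007").matches()); // true
        System.out.println(PART_NUMBER.matcher("widget-07").matches());   // false
    }
}
```

Even for a pattern this small the builder calls run to several lines where the string form fits on one, which is roughly the verbosity trade-off raised above.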

> > So I'm wondering if code that step-by-step
> constructs
> > a regex *object*,
>
> So, you mean like this?
> > // "^(\\p{Alnum}|)\\w+ ... etc. "
>
> RE re = new RE();
> re.addRule(RE.startsWith(RE.alphanum));
> re.addLogicalOr();
> re.addRule(RE.startsWith(""));
> re.matchOneOrMore(RE.wordChar);
> ... etc.
>
> I'd be surprised if you could find a scheme that
> carries the power of regex without becoming so verbose
> as to be even harder to read.
I expect you're right. For us mere mortal regex-dummies, things can only get so unreadable before we stop noticing it get any worse!
> Not to raise the ire of uncle_alice, but try as I
> might to read regex
> tutorials, any sufficiently long regex pattern string
> is indistinguishable
> from magic to me. It's unreadable.
For me I suppose it depends what you mean by 'long'.
At least sometimes people either get too clever with their regexes or they are creating a parser (or attempting to do so). The first does nothing more than demonstrate that they know nothing about maintenance. The second demonstrates that they know nothing about performance.
Actually, I agree: regexes cram too much meaning into too few characters, and end up looking like chicken scratch. Also, they were originally designed to match "regular" grammars (whatever they are), while we usually use them on non-regular grammars like natural languages, programming languages, HTML, etc. Thanks to that impedance mismatch, some tasks that are conceptually simple are difficult to perform, if not impossible.
Trouble is, there's nothing else out there that's as powerful, general, and convenient as regexes. The only thing I know of that could completely replace regexes is a parser generator like Antlr or JavaCC, but those are way too heavy for most tasks. There are projects, like the two listed below, that try to make it easier to generate parsers, but they still lose out to regexes on the convenience front.
http://bnf-for-java.sourceforge.net/
http://jtopas.sourceforge.net/jtopas/index.html
I'm afraid the only thing that's going to be able to supplant regexes is better regexes, like the ones the Perl community is working on:
http://dev.perl.org/perl6/doc/design/syn/S05.html
But Ipthar help the schmuck who gets the job of shoehorning those into Java! ^_^
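To make the regular vs. non-regular point concrete, nested parentheses are the usual example. The sketch below is only an illustration with made-up names (NestingDemo, TWO_LEVELS, balanced). As far as I know java.util.regex has no recursion construct, so a pattern can only spell out nesting to a fixed depth, while a plain counter copes with any depth.

```java
import java.util.regex.Pattern;

class NestingDemo {
    // Accepts text whose parentheses nest at most two levels deep.
    // Each additional level would need another copy of the inner pattern.
    static final Pattern TWO_LEVELS =
            Pattern.compile("(?:[^()]|\\((?:[^()]|\\([^()]*\\))*\\))*");

    // A simple depth counter handles arbitrary nesting.
    static boolean balanced(String s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth < 0) {
                return false; // a ")" with no matching "("
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        String shallow = "f(a, g(b))";
        String deep = "f(a, g(h(b)))";
        System.out.println(TWO_LEVELS.matcher(shallow).matches()); // true
        System.out.println(TWO_LEVELS.matcher(deep).matches());    // false: three levels
        System.out.println(balanced(deep));                        // true
    }
}
```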
> RE re = new RE();
>...
> I'd be surprised if you could find a scheme that
> carries the power of regex without becoming so verbose
> as to be even harder to read.
But it might be easier to learn. I think the terseness of regular expressions is what makes it difficult for newbies.
Also something like your example could be tooled pretty easily. Then maybe it could be tooled to show a graphical display of what it's doing (useful for newbies) or to provide some profiling info (useful for optimizers).
I learned regular expressions in college in the theoretical sense of programming language design, before I encountered them in any real depth in a real-world language. That seemed to make it easier. Maybe there ought to be a regular expression tutorial that takes that approach, to provide some background, or something.
> Maybe there ought to be a regular expression
> tutorial that takes that approach, to provide some
> background, or something.
Seems there's a bunch out there. Try this:
http://www.google.com/search?hl=en&ned=us&q=regular+expression+tutorial&btnmeta%3Dsearch%3Dsearch=Search+the+Web
I have no clue how good/bad they are. Regular expressions are truly a pain to master but when you do, you will feel like a wizard and you won't need the Kool-Aid anymore ;-) It's somewhat akin to reading reverse Polish notation as if reading a book. You need to train the eye to recognize the major groups and work your way into the expression.
Someone at the following thread:
http://forum.java.sun.com/thread.jspa?threadID=5178789&tstart=0
recommends this one:
http://www.regular-expressions.info/
Message was edited by:
orbach
> > RE re = new RE();
> >...
> > I'd be surprised if you could find a scheme that
> > carries the power of regex without becoming so
> verbose
> > as to be even harder to read.
>
> But it might be easier to learn. I think the
> terseness of regular expressions is what makes it
> difficult for newbies.
>
> Also something like your example could be tooled
> pretty easily. Then maybe it could be tooled to show
> a graphical display of what it's doing (useful for
> newbies) or to provide some profiling info (useful
> for optimizers).
>
> I learned regular expressions in college in
> the theoretical sense of programming language design,
> before I encountered it in any real depth in a
> real-world language. That seemed to make it easier.
> Maybe there ought to be a regular expression
> tutorial that takes that approach, to provide some
> background, or something.
There are graphical regex builders around. Never used 'em though
well i mean an alternative 'thing' to regex's is SQL, say.
i.e. have your data in some type of format.
c# is getting 'lambda' expressions; and LINQ. maybe that interests you.
but i don't see the big deal with regexs. i've used them every so often, and it's fun, really. i can't think of an instance where i've needed one that is amazingly complicated.
> but i don't see the big deal with regexs. i've used
> them every so often, and it's fun, really. i can't
> think of an instance where i've needed one that is
> amazingly complicated.
((?>[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*(?<angle><))?((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])(?(angle)>)$
Taken from http://regexlib.com/REDetails.aspx?regexp_id=711 .
> Taken from
> http://regexlib.com/REDetails.aspx?regexp_id=711 .
yeah, well
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:
(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.
|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=
[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(
?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \
x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\
[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])
*(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
))*|(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(
?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*
"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".
\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?
[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\
Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
\t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1F
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]
\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()
<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@
,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?
:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\
n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?
:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()
<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\
r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x
1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[
\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^
()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<
>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(
?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r
\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(
?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:
(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\
["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:
(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \
x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:
(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?
[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(
?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x
00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[
([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\]
\x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))
|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(
?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*
\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?
[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1F]
+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^(
)<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\
n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x
1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\
"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\
n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n
)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1
F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s
*(?:(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(
?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*
"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[
\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]
))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*
@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r
\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]
(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\
] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]])
)|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".
\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)
*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\
\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\
\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\
r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\
x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\
[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?
:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?
:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x0
0-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?
:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:(?:\r\n)?[
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \x00
-\x1F]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*
)?;\s*)
from http://www.regular-expressions.info/email.html
Perl 5's RE's allow a regex to be split over several lines, so you can insert meaningful comments.
I agree that RE's present the user with uniquely compact power... but I also agree that complex RE's present a maintenance problem, simply because they are notoriously difficult to read.
I always put comments in any RE which contains two or more "phrases" ... and I appreciate it when my colleagues are so thoughtful.
Pattern.COMMENTS (http://java.sun.com/docs/books/tutorial/essential/regex/pattern.html) permits whitespace and comments in the pattern. In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line. Comments mode can also be enabled via the embedded flag expression (?x).
Cheers. Keith.
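As a small illustration of comments mode in Java, here is a sketch that matches dates written the way this forum prints them (for example 2007-11-27). The class name and the pattern are just examples; the only real API used is Pattern.compile with the Pattern.COMMENTS flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedRegexDemo {
    // In COMMENTS mode, whitespace in the pattern is ignored and "#" starts
    // a comment that runs to the end of the line, so every line needs "\n".
    static final Pattern DATE = Pattern.compile(
            "(\\d{4})   # year: four digits\n" +
            "-          # separator\n" +
            "(\\d{1,2}) # month: one or two digits\n" +
            "-          # separator\n" +
            "(\\d{1,2}) # day: one or two digits\n",
            Pattern.COMMENTS);

    public static void main(String[] args) {
        Matcher m = DATE.matcher("2007-11-27");
        if (m.matches()) {
            System.out.println("year=" + m.group(1)
                    + " month=" + m.group(2)
                    + " day=" + m.group(3));
        }
    }
}
```

One thing to watch: because whitespace and # are swallowed in this mode, a literal space or # in the pattern has to be escaped.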
> (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \x00-\x1F]+(?:(?:
> (?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.
> |(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
.....
Don't they have a methadone program in your part of the world? :-)
> Perl 5's RE's allow a regex to be split over several
> lines, so you can insert meaningful comments.
I don't find the comments useful. I prefer to build the regex out of sub-regexs (see below).
>
> I agree that RE's present the user with uniquely
> compact power... but I also agree that complex RE's
> present a maintenance problem, simply because they
> are notoriously difficult to read.
Yep - that is why I like the sub-regex approach.
<snip/>
An email address regex I wrote some time ago (as an exercise) using the sub-regex approach. Please don't use this regex as it needs much better testing than I have given it.
private static final Pattern emailMatcher;
static
{
// Build up the regular expression according to RFC821
// http://www.ietf.org/rfc/rfc0821.txt
// <x> ::= any one of the 128 ASCII characters (no exceptions)
String x_ = "\u0000-\u007f";
// <special> ::= "<" | ">" | "(" | ")" | "[" | "]" | "\" | "."
// | "," | ";" | ":" | "@" """ | the control
// characters (ASCII codes 0 through 31 inclusive and
// 127)
String special_ = "<>()\\[\\]\\\\\\.,;:@\"\u0000-\u001f\u007f";
// <c> ::= any one of the 128 ASCII characters, but not any
// <special> or <SP>
String c_ = "[" + x_ + "&&" + "[^" + special_ + "]&&[^ ]]";
// <char> ::= <c> | "\" <x>
String char_ = "(?:" + c_ + "|\\\\[" + x_ + "])";
// <string> ::= <char> | <char> <string>
String string_ = char_ + "+";
// <dot-string> ::= <string> | <string> "." <dot-string>
String dot_string_ = string_ + "(?:\\." + string_ + ")*";
// <q> ::= any one of the 128 ASCII characters except <CR>,
//<LF>, quote ("), or backslash (\)
String q_ = "["+x_+"&&[^\r\n\"\\\\]]";
// <qtext> ::= "\" <x> | "\" <x> <qtext> | <q> | <q> <qtext>
String qtext_ = "(?:\\\\[" + x_ + "]|" + q_ + ")+";
// <quoted-string> ::= """ <qtext> """
String quoted_string_ = "\"" + qtext_ + "\"";
// <local-part> ::= <dot-string> | <quoted-string>
String local_part_ = "(?:(?:" + dot_string_ + ")|(?:" + quoted_string_ + "))";
// <a> ::= any one of the 52 alphabetic characters A through Z
// in upper case and a through z in lower case
String a_ = "[a-zA-Z]";
// <d> ::= any one of the ten digits 0 through 9
String d_ = "[0-9]";
// <let-dig> ::= <a> | <d>
String let_dig_ = "[" + a_ + d_ + "]";
// <let-dig-hyp> ::= <a> | <d> | "-"
String let_dig_hyp_ = "[-" + a_ + d_ + "]";
// <ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
// String ldh_str_ = let_dig_hyp_ + "+";
// RFC821 looks wrong since the production "<name> ::= <a> <ldh-str> <let-dig>"
// forces a name to have at least 3 characters and country codes such as
// uk,ca etc would be illegal! I shall change this to make the
// second term of <name> optional by make a zero length ldh-str allowable.
String ldh_str_ = let_dig_hyp_ + "*";
// <name> ::= <a> <ldh-str> <let-dig>
String name_ = "(?:" + a_ + ldh_str_ + let_dig_ + ")";
// <number> ::= <d> | <d> <number>
String number_ = d_ + "+";
// <snum> ::= one, two, or three digits representing a decimal
// integer value in the range 0 through 255
String snum_ = "(?:[01]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])";
// <dotnum> ::= <snum> "." <snum> "." <snum> "." <snum>
String dotnum_ = snum_ + "(?:\\." + snum_ + "){3}"; // + Dotted quad
// <element> ::= <name> | "#" <number> | "[" <dotnum> "]"
String element_ = "(?:" + name_ + "|#" + number_ + "|\\[" + dotnum_ + "\\])";
// <domain> ::= <element> | <element> "." <domain>
String domain_ = element_ + "(?:\\." + element_ + ")*";
// <mailbox> ::= <local-part> "@" <domain>
String mailbox_ = local_part_ + "@" + domain_;
emailMatcher = Pattern.compile(mailbox_);
System.out.println("Email address regex = " + emailMatcher);
}
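The same sub-regex idea on a much smaller scale, as a self-contained sketch: a dotted-quad pattern assembled from a named octet piece. The names (DottedQuadDemo, OCTET and so on) are made up for illustration, and unlike the snum above this octet rejects leading zeros, which is my own choice rather than anything taken from the RFC.

```java
import java.util.regex.Pattern;

class DottedQuadDemo {
    // One decimal octet, 0 through 255, without leading zeros.
    static final String OCTET =
            "(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";

    // Four octets separated by dots, built from the piece above.
    static final String DOTTED_QUAD = OCTET + "(?:\\." + OCTET + "){3}";

    // The bracketed form used for address literals, e.g. [10.0.0.255].
    static final String DOMAIN_LITERAL = "\\[" + DOTTED_QUAD + "\\]";

    static final Pattern IPV4 = Pattern.compile(DOTTED_QUAD);

    public static void main(String[] args) {
        System.out.println(IPV4.matcher("192.168.0.1").matches()); // true
        System.out.println(IPV4.matcher("256.1.1.1").matches());   // false
        System.out.println(Pattern.matches(DOMAIN_LITERAL, "[10.0.0.255]")); // true
    }
}
```

Each piece can be tested on its own before it is concatenated into something bigger, which is most of the appeal.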
sabre,
I simply never thought of doing that. Cool. Thank you.
Cheers mate, Keith.
> > but i don't see the big deal with regexs. i've
> used
> > them every so often, and it's fun, really. i can't
> > think of an instance where i've needed one that is
> > amazingly complicated.
>
> ((?>[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+\x20*|"((?=[\x01-\x
> 7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*(?<angle><))?((?!\.)
> (?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7
> f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)
> \.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01
> ]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[
> ^\\\[\]]|\\[\x01-\x7f])+)\])(?(angle)>)$
>
> Taken from
> http://regexlib.com/REDetails.aspx?regexp_id=711 .
Sabre, you do realize you could be shot for that in different parts of the
world, don't you?
REs have gone way beyond the limits they should've stayed in. Why
haven't they used cattle prods against those monsters?
kind regards,
Jos
JosAHa at 2007-7-21 21:44:20 >

> REs have gone way beyond the limits they should've
> stayed in. Why
> haven't they used cattle prods against those
> monsters?
Given any technology, someone will take it over its natural limit and if it does not actually fall over then that will become the norm.
One place I worked used Tasers on people who wrote regexes longer than 10 characters. I suspect that my brain got fried and that is why I had to retire.
Hippolyte, how about RegexBuddy?
http://www.regular-expressions.info/regexbuddy.html
I've never used it but it looks interesting.
Or PowerGrep?
http://www.powergrep.com/
There's nothing new under the sun anymore :(
Message was edited by:
orbach
Is there some sort of contest going on?
''=~('(?{'.('`'|'%').('['^'-')
.('`'|'!').('`'|',').'"'.'\\$'
.'=='.('['^'+').('`'|'/').('['
^'+').'||'.(';'&'=').(';'&'=')
.';-'.'-'.'\\$'.'=;'.('['^'(')
.('['^'.').('`'|'"').('!'^'+')
.'_\\{'.'(\\$'.';=('.'\\$=|'."\|".('`'^'.'
).(('`')|'/').').'.'\\"'.+('{'^'[').('`'|'"').('`'|'/'
).('['^'/') .('['^'/'). ('`'|',').( '`'|('%')). '\\".\\"'.( '['^('(')).
'\\"'.('['^ '#').'!!--' .'\\$=.\\"' .('{'^'['). ('`'|'/').( '`'|"\&").(
'{'^"\[").( '`'|"\"").( '`'|"\%").( '`'|"\%").( '['^(')')). '\\").\\"'.
('{'^'[').( '`'|"\/").( '`'|"\.").( '{'^"\[").( '['^"\/").( '`'|"\(").(
'`'|"\%").( '{'^"\[").( '['^"\,").( '`'|"\!").( '`'|"\,").( '`'|(',')).
'\\"\\}'.+( '['^"\+").( '['^"\)").( '`'|"\)").( '`'|"\.").( '['^('/')).
'+_,\\",'.( '{'^('[')). ('\\$;!').( '!'^"\+").( '{'^"\/").( '`'|"\!").(
'`'|"\+").( '`'|"\%").( '{'^"\[").( '`'|"\/").( '`'|"\.").( '`'|"\%").(
'{'^"\[").( '`'|"\$").( '`'|"\/").( '['^"\,").( '`'|('.')). ','.(('{')^
'[').("\["^ '+').("\`"| '!').("\["^ '(').("\["^ '(').("\{"^ '[').("\`"|
')').("\["^ '/').("\{"^ '[').("\`"| '!').("\["^ ')').("\`"| '/').("\["^
'.').("\`"| '.').("\`"| '$')."\,".( '!'^('+')). '\\",_,\\"' .'!'.("\!"^
'+').("\!"^ '+').'\\"'. ('['^',').( '`'|"\(").( '`'|"\)").( '`'|"\,").(
'`'|('%')). '++\\$="})' );$:=('.')^ '~';$~='@'| '(';$^=')'^ '[';$/='`';
http://www.99-bottles-of-beer.net/language-perl-737.html
oops, jumped the gun, powergrep doesn't analyze regex expressions...
> Is there some sort of contest going on?
lol. Some people have too much time... I liked the man page for Acme::EyeDrops where it says for one of the examples:
This is a Visual Programming breakthrough in that you can tell it is a Windows program and see its UML structure too, just by glancing at the code.
referring to this program below. As you can tell it is obviously a windows program and its UML is obvious too :-P
''=~('('.'?'.'{'.('`'|'%').('['^'-').(
((
((
((
((
((
'`'))))))))))|'!').('`'|',').'"'.('['^
(
( (
((
'+'))))
)
)
.('['^')').('`'|')').('`'|'.').(('[')^
((
((
'/'))))).('{'^'[').'\\'.('"').('`'|'(').('`'|'%').('`'|"\,").(
((((
((((
((((
((((
((((
'`'))))))))))))))))))))|"\,").('`'|'/').('{'^'[').('['^"\,").(
'`'|'/').('['^')').('`'|',').('`'|'$').'\\'.'\\'
.('`'|'.').'\\'.'"'.';'.('!'^'+').'"'.'}'."\)");
$:='.'^'~';$~='@'|'(';$^=')'^'[';$/='`'|"\.";$,=
"\("^(('}'))
;($\)=('`')|
"\!";$:=')'^
"\}";$~='*'|
"\`";$^='+'^
"\_";$/='&'|
"\@";$,='['&
"\~";$\=','^
"\|";$:='.'^
"\~";$~='@'|
"\(";$^=')'^
'[';$/='`'|'.';$,='('^'}';$\='`'|'!';$:=')'^'}';
($~)=(('*'))
|'`';$^='+'^
"\_";$/='&'|
"\@";$,='['&
"\~";$\=','^
"\|";$:='.'^
"\~";$~='@'|
"\(";$^=')'^
"\[";$/='`'|
"\.";$,='('^
"\}";$\='`'|
'!';$:=')'^'}';$~='*'|'`';$^='+'^'_';$/='&'|'@';
$,='['&'~';$\=','^'|';$:='.'^'~';$~='@'|"\(";$^=
')'^'[';$/='`'|'.';$,='('^'}';$\='`'|'!';$:=')';
On the subject of taking things too far, EditPad Pro (by the same author as RegexBuddy and PowerGrep) uses regexes to do both syntax highlighting and file navigation (i.e., code folding and tree view). It provides the framework plus pre-built schemes for the most common languages, and users are free to create their own schemes. The pre-built file navigation scheme for Java didn't take generics, enums and annotations into account, so I wrote one that does. Here's the regex for a simple field declaration:
^[\x20\t]*+(?:(?:final|private|protected|public|
static|transient|volatile|@(?!interface\b)[\w.]++)\s++)*+
\w++(?:\.\w++)*+(?:<(?:[^>]++|>(?=\s*+[,&>]))*+>|(?:\s*+
\[\s*+\])*+)\s++(\w++)(?:\s*+\[\s*+\])*+\s*+(?![\w(])
(?:[^;{'"/]++|'(?:[^'\\]++|\\.)*+'|"(?:[^"\\]++|\\.)*+
"|//.*+|/\*(?:[^*]++|\*(?!/))*+)*+;
For someone who is already pretty well-versed in regexes and the grammar of the language in question, writing these schemes is like taking an advanced course in both. For anyone else, it's a good way to do penance for your sins. ^_^
