Regular Expression Backtracking Problem

I have a complex regular expression along the lines of

[extremely_long_expression_with_lots_of_wildcards]abcdefg

It takes several minutes for the Java regex parser to determine that the string against which I'm matching the expression does not contain 'abcdefg'. It appears to be "backtracking" and trying to find some sort of match in the long expression, even though the complete expression cannot match due to the absence of 'abcdefg'.

Is there a way to make the parser look first for 'abcdefg', notice that it's not in the string, and fail immediately?

[589 byte] By [rcauvina] at [2007-11-26 22:24:29]
# 1
Ever tried String.endsWith() or indexOf(), instead of complicating that regex even more? :)
CeciNEstPasUnProgrammeura at 2007-7-10 11:24:35 > top of Java-index,Java Essentials,Java Programming...
# 2
I think you need to provide a bit more detail. You could use a 'positive lookahead' but I suspect the problem is more complex than you have described so this may not be effective.Why don't you post the regex here?
sabre150a at 2007-7-10 11:24:35 > top of Java-index,Java Essentials,Java Programming...
# 3
For regular expression pattern matchingThere are Greedy, Reluctant and Possessive which give different behavior and performance. Beside using endWith and indexOf, you may have a look on this.
rym82a at 2007-7-10 11:24:35 > top of Java-index,Java Essentials,Java Programming...
# 4

Here is one of the actual patterns:

<b>From:</b>\s*</td>\s*<td [^>]*?>\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){30,50}[^\n]*?<b>To:</b>\s*</td>\s*<td [^>]*?>\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){12,22}[^\n]*?- begin message -(?:[^\n]*?\n){111,131}[^\n]*?- received -..\s*<table [^>]*?>\s*<tr>\s*<td [^>]*?>\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){5,15}[^\n]*?- sent -..\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){9,15}[^\n]*?end.gif[^>]*?/>\s*</td>\s*<td [^>]*?>\s*(.+?)\s*</td>\s*<td [^>]*?>\s*(?:<img [^>]*?Msg_Images/)?(.+?)(?:.gif"\s*[^>]*?>)?\s*</td>\s*<td [^>]*?>\s*<img [^>]*?/>\s*</td>\s*<td [^>]*?>\s*(.+?)\s*</td><input type="hidden" name="numOfParts" value='1' />

The <input> element at the end of this pattern is not in the string I'm matching, so I want the pattern to fail immediately.

rcauvina at 2007-7-10 11:24:35 > top of Java-index,Java Essentials,Java Programming...
# 5
"Ever tried String.endsWith() or indexOf(), instead of complicating that regex even more? :)"I'm plugging into an existing framework that uses regular expressions from a configuration file. I can change these expressions but not the Java code that parses them.
rcauvina at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 6

There a HUGE potential for backtracking in that regex. Fragments like

(?:[^\n]*?\n){30,50}[^\n]*?<b>

have problems. This says 'greedily read 30 to 50 lines' so it will first read 50 lines and look for a line with <b> and if it does not find it after 50 it will backtrack a line and try again and so on. You would probably do better to make the {30,50} reluctant rather than the preceding non-capturing group.

Since the whole is made up of fragments like the above the potential for backtracking is enormous.

sabre150a at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 7

"There a HUGE potential for backtracking in that regex."

I know :-) Good point about the greediness. I'll try making them reluctant.

But it all keys off the last part of the pattern. If the <input> element occurs in the string, I know for sure the regular expression as a whole will match (quickly). If the <input> element does not occur in the string, I know for sure the regular expression will not match.

So all I need is a way to get the parser to check the end of the pattern (the <input> element) before the complex part that's prone to backtracking.

rcauvina at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 8

> "There a HUGE potential for backtracking in that

> regex."

>

> I know :-)

>

> But it all keys off the last part of the pattern.

I don't agree. Yes, you need that last part of the pattern but even if you know that that part exists in the text you will still do a huge amount of backtracking to get to that point. As I said, you can use 'positive lookahead' but I don't think it will solve you problems.

It is not obvious to me what you are trying to extract (or just match). I'm certain that the regex can be re-factored so as not to do the backtracking but the actual form will depend on the object of the regex.

sabre150a at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 9

"Yes, you need that last part of the pattern but even if you know that that part exists in the text you will still do a huge amount of backtracking to get to that point."

As the expression stands, yes.

But the whole point of this question is to find out how to modify the pattern to get the parser to check the last part first, recognize when it fails, and not bother to check the first part of the pattern that backtracks.

If I put the last part of the pattern first, for example, it fails immediately. This fact shows that the parser needn't necessarily do backtracking in order to recognize that the pattern fails.

rcauvina at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 10

> But the whole point of this question is to find out

> how to modify the pattern to get the parser to check

> the last part first,

Which is why I suggest 'positive lookahead'!

> recognize when it fails, and not

> bother to check the first part of the pattern that

> backtracks.

>

> If I put the last part of the pattern first, for

> example, it fails immediately. This fact shows that

> the parser needn't necessarily do backtracking in

> order to recognize that the pattern fails.

But my point is that even if you first check that the last part matches and it passes then the rest of your pattern will still backtrack heavily once you get past the initial check.

sabre150a at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 11

"Which is why I suggest 'positive lookahead'!"

Thanks, how do you suggest I incorporate positive lookahead?

"But my point is that even if you first check that the last part matches and it passes then the rest of your pattern will still backtrack heavily once you get past the initial check."

When the expression matches, it does so quickly. I've only experienced significant delays when the expression doesn't match.

rcauvina at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 12

> "Which is why I suggest 'positive lookahead'!"

>

> Thanks, how do you suggest I incorporate positive

> lookahead?

Along the lines of -

String page = "afjkdfa dfjasdjkf zzzz lkhafkjdsfkjsls";

System.out.print(page.matches("(?=.*?zzzz ).*"));

sabre150a at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 13
Hmm. Can you be more specific? Let's say the pattern was:a.+?f zzzzand I wanted it to match the page string you specified. How can I modify the pattern to eliminate the backtracking (i.e. check for 'zzzz' first and not bother with the 'a.+?f' part if it fails)?
rcauvina at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 14
System.out.print(page.matches("(?=.*?zzzz )a.+?f zzzz.*"));
sabre150a at 2007-7-10 11:24:36 > top of Java-index,Java Essentials,Java Programming...
# 15

Okay, so now I have

(?=.*?<input type="hidden" name="numOfLegs" value='1' />)<b>From:</b>\s*</td>\s*<td [^>]*?>\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){30,50}[^\n]*?<b>To:</b>\s*</td>\s*<td [^>]*?>\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){12,22}[^\n]*?- begin message -(?:[^\n]*?\n){111,131}[^\n]*?- received -..\s*<table [^>]*?>\s*<tr>\s*<td [^>]*?>\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){5,15}[^\n]*?- sent -..\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){9,15}[^\n]*?end.gif[^>]*?/>\s*</td>\s*<td [^>]*?>\s*(.+?)\s*</td>\s*<td [^>]*?>\s*(?:<img [^>]*?Msg_Images/)?(.+?)(?:.gif"\s*[^>]*?>)?\s*</td>\s*<td [^>]*?>\s*<img [^>]*?/>\s*</td>\s*<td [^>]*?>\s*(.+?)\s*</td>.*?<input type="hidden" name="numOfParts" value='1' />

Now, for an expression that DOES match, it takes several minutes. If I eliminate the lookahead, it matches immediately. But then nonmatching expressions take forever.

rcauvina at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...
# 16

> Okay, so now I have

>

> (?=.*?<input type="hidden" name="numOfLegs" value='1'

> />)<b>From:</b>\s*</td>\s*<td

> [^>]*?>\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){30,50}[^\n]*?

> <b>To:</b>\s*</td>\s*<td

> [^>]*?>\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){12,22}[^\n]*?

> - begin message -(?:[^\n]*?\n){111,131}[^\n]*?-

> received -..\s*<table [^>]*?>\s*<tr>\s*<td

> [^>]*?>\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){5,15}[^\n]*?-

> sent

> -..\s*([^\n]+?)\s*</td>(?:[^\n]*?\n){9,15}[^\n]*?end.g

> if[^>]*?/>\s*</td>\s*<td

> [^>]*?>\s*(.+?)\s*</td>\s*<td [^>]*?>\s*(?:<img

> [^>]*?Msg_Images/)?(.+?)(?:.gif"\s*[^>]*?>)?\s*</td>\s

> *<td [^>]*?>\s*<img [^>]*?/>\s*</td>\s*<td

> [^>]*?>\s*(.+?)\s*</td>.*?<input type="hidden"

> name="numOfParts" value='1' />

>

> Now, for an expression that DOES match, it takes

> several minutes. If I eliminate the lookahead, it

> matches immediately. But then nonmatching

> expressions take forever.

I'm surprised at this! I think you will have to totally re-factor your regex OR wait for uncle_alice to assist.

sabre150a at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...
# 17

(?=.*?<input type="hidden" name="numOfLegs" value='1' />) The reluctant quantifier is what makes this part so slow. You know the part you're looking for will be at the end of the string (that's the whole point), so you want the .* to consume the whole input initially, then backtrack: (?=.*<input type="hidden" name="numOfLegs" value='1' />) The way a reluctant quantifier works is, each time it's supposed to try to match, it first tries to let the next part of the regex match instead. So it's effectively doing a lookahead at the beginning of each iteration, which can get pretty expensive, especially when the quantified part only matches one character per iteration, like .*?. As a matter of fact, in most places where you're using reluctant quantifiers in your regex, the quantified part can't possibly match the same thing as the next part, for example: (?:[^\n]*?\n) The reluctant quantifier only serves to slow this regex down. What you should be using is a possessive quantifier: (?:[^\n]*+\n) Wherever you find yourself looking for "an unknown number of not-X, followed by X", you should be using a possessive quantifier. In such a situation a normal, greedy-but-accomodating quantifier will slow things down when no match is possible, but a reluctant quantifier makes a successful match much slower than it needs to be. Also, everywhere that you have \s* followed by a tag or by [^\n], you should change it to \s*+. Anytime you know the quantified part can never match the same text as the next part of the regex, you should use a possessive quantifier. However, in the case of [^>]*?/>, you should just remove the question mark. Making that quantifier possessive would break your regex, but you know [^>]*/> will only ever have to backtrack one character.

If you make those changes, I think you'll find that the lookahead isn't necessary after all.

uncle_alicea at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...
# 18

Great practical advice, uncle_alice. I revamped the expression and now have:

<b>From:</b>\s*+</td>\s*+<td [^>]*+>\s*+([^\n\r]++)\s*+</td>.*<b>To:</b>\s*+</td>\s*+<td [^>]*+>\s*+([^\n\r]++)\s*+</td>.*- begin message -.*- sent -..\s*+<table [^>]*+>\s*+<tr>\s*+<td [^>]*+>\s*+([^\n\r]++)\s*+</td>.*- received -..\s*+([^\n\r]++)\s*+</td>\s*+</tr>\s*+</table>\s*+</td>\s*+<td .*?></td>\s*+<td [^>]*+>\s*+(.+?)\s*+</td>\s*+<td [^>]*+>\s*+(.+?)\s*+</td>\s*+<td [^>]*+>\s*+<img [^>]*/>\s*+</td>\s*+<td [^>]*+>\s*+(.+?)\s*+</td>

It's slightly sluggish when matching but is probably good enough for my purposes. You'll notice that I kept a few reluctant quantifiers; I actually got OutOfMemory errors if I didn't make them reluctant.

rcauvina at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...
# 19

> "Ever tried String.endsWith() or indexOf(), instead

> of complicating that regex even more? :)"

>

> I'm plugging into an existing framework that uses

> regular expressions from a configuration file. I can

> change these expressions but not the Java code that

> parses them.

You should consider filing an RFE on that framework to suggest additional functionality. I suspect you are using it in a way that the authors did not invision. So to achieve acceptable performance something else is needed.

That you found a solution this time does not mean that it will happen the next time (particularly given the complexity of the regex.)

As an analogy database frameworks often (perhaps always these days) provide methods to pass SQL straight through rather than using the framework to construct what it normally does. The reason for this is that the generic framework is not capable of dealing with all possible scenarios while a custom SQL statement call allow significant performace improvements.

jschella at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...
# 20

"That you found a solution this time does not mean that it will happen the next time (particularly given the complexity of the regex.)"

Thanks. More flexibility in the framework would be nice.

However, the framework has worked for everything that I've thrown at it; it's just been a matter of getting the regular expressions right. We're now up to our 7th project and about 65 out of 65 regular expressions that have worked.

If I give up on regular expressions in a particular case, I want it to be after I've thoroughly understood why it's a poor solution, not because I don't understand regular expressions well enough to come up with one that works.

rcauvina at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...
# 21

Another way to boost performance of your regexes is to always use non-capturing parentheses if you're only using them for grouping. However, it looks like capturing is the the reason for all the parens in your regex. In fact, the purpose of <td [^>]*+>\s*+(.+?)\s*+</td> is obviously to capture the contents of the TD element, trimming any leading and trailing whitespace while you're at it. There are ways to avoid the perfomance penalty of the reluctant quantifier, but to find the best one, it helps to know what you're likely to find between those tags. If you know there will only be plain text (no tags), you can do this: <td\b[^>]*+>\s*+([^\s<]++(?:\s++[^\s<]++)*+)\s*+</td> If there might be tags, but not nested TD tags (i.e., no nested tables), it gets even uglier, but still doable. We're already pretty deep in "is it worth it" territory though, so I'll leave it at that unless you express a desire for more punishment. ^_^

uncle_alicea at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...
# 22

Very interesting. I see where you're going with the removal of reluctant and even greedy qualifiers.

It turns out that some of the contents of the TD elements can contain tags (but not nested tables). You've been very helpful and have both improved my understanding of regular expressions and solved my problem. I am curious how you'd deal with the TD elements with tags if it's not too much trouble. Punish me :-)

rcauvina at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...
# 23

Get ready for the regex equivalent of the cat o' nine tails! <td\b[^>]*+>\s*+((?:[^\s<]++|<(?!/?td\b))++(?:\s++(?:[^\s<]++|<(?!/?td\b))++)*+)\s*+</td> But seriously, let me break that down into a before-and-after shot of just the part that changed. Before: ([^\s<]++(?:\s++[^\s<]++)*+) After: ((?:[^\s<]++|<(?!/?td\b))++(?:\s++(?:[^\s<]++|<(?!/?td\b))++)*+) Most of that godawful complexity comes from the need to scrape off surrounding whitespace in the process of matching. Without that requirement, the Before regex would be ([^<]*+) and After, ((?:[^<]++|<(?!/?td\b))*+) I think a good candidate for inclusion in jschell's RFE would be automatic trimming. ;-)

uncle_alicea at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...
# 24
You are the master, uncle_alice :-)There actually is a mechanism for automatic trimming in the framework; for some reason I was masochistic and wanted to see what would happen if I did the trimming in the regular expression instead.
rcauvina at 2007-7-21 18:43:19 > top of Java-index,Java Essentials,Java Programming...