Improving Java Regular Expression Compile Time
Hi,
Just wondering if anyone here knows how can i improve the compile time of Java Regular Expression?
The following is fragment of my code which I tired to see the running time.
Calendar rightNow = Calendar.getInstance();
System.out.println("Compile Pattern");
startCompileTime = rightNow.getTimeInMillis();
Pattern p = Pattern.compile(reg, Pattern.CASE_INSENSITIVE);
rightNow = Calendar.getInstance();
endCompileTime = rightNow.getTimeInMillis();
Below is fragment of my regular expression:
(?:tell|state|say|narrate|recount|spin|recite|order|enjoin|assure|ascertain|demonstrate|evidence|distinguish|separate|differentiate|secern|secernate|severalize|tell apart) me (?:about|abou|asti|approximately|close to|just about|some|roughly|more or less|around|or so|almost|most|all but|nearly|near|nigh|virtually|well-nigh) java
My regular expression is a very long one and the Pattern.compile just take too long. The worst case that I experience is 2949342 milliseconds.
Any idea how can I optimise my regular expression so that the compilation time is acceptable.
Thanks in advance
[1162 byte] By [
ecco80a] at [2007-10-2 0:02:20]

How long are we talking about here?
> How long are we talking about here?Looks to me like it could take as long as 2949342 milliseconds. But I'm just guessing.To OP: Are you sure this is the problem? I.e., you profiled it and this is where the time is.~Cheers
The regular expression that I have copy into the discussion board is JUST a fragment. The regular expression that I have used to match is too long for the board.
I have tried to compile the same RE into Perl compiler and executed it.
The funny thing is that Perl ran the same RE faster than Java. And it is so much faster(i.e. return immediately)
Should compiled code ran faster than interpreted scripts
Any idea, anyone
Thanks
>
> Should compiled code ran faster than interpreted
> scripts
>
That isn't how it works. The regex in java is 'compiled' when the code runs not when the java code is compiled.
> Any idea, anyone
I suspect you should create a real parser rather than trying to use a regex. It sounds like your regex is going to be hideously hard to maintain. And a parser would probably be faster anyways.
If speed is the only criteria and you don't care about maintainance then use perl.
> > How long are we talking about here?> > Looks to me like it could take as long as 2949342> milliseconds. But I'm just guessing.I mean long in terms of characters.
> My regular expression is a very long one and the
> Pattern.compile just take too long. The worst case
> that I experience is 2949342 milliseconds.
Wow, that's pretty pathological. I was going to tell you that you were measuring something wrong, because I had written a test program that could compile a 1 Mb "or" pattern (10,000 words, 100 bytes per) in under 200 ms ... but then I noticed that your patterns have two "or" components, so reran my test, and got over 14 seconds to run with a smaller pattern.
My guess is that the RE compiler, rather than decomposing the RE into a tree, is taking the naive approach of translating it into a state machine, and replicating the second component for each path through the first component.
If you can create a simple hand-rolled parser, that may be your best option. However, it appears that your substrings aren't easily tokenized (some include spaces), so your best bet is to break the regexes into pieces at the "or" constructs, and use Pattern.split() to apply each piece sequentially.
import java.util.Random;
import java.util.regex.Pattern;
public class RegexTest
{
public static void main(String[] argv) throws Exception
{
long initial = System.currentTimeMillis();
String[] words = generateWords(10000);
//String patStr = buildRePortion(words);
//String patStr = buildRePortion(words) + " xxx ";
String patStr = buildRePortion(words) + " xxx " + buildRePortion(words);
long startCompile = System.currentTimeMillis();
Pattern pattern = Pattern.compile(patStr, Pattern.CASE_INSENSITIVE);
long finishCompile = System.currentTimeMillis();
System.out.println("Number of components = " + words.length);
System.out.println("ms to create pattern = " + (startCompile - initial));
System.out.println("ms to compile= " + (finishCompile - startCompile));
}
private final static String[] generateWords(int numWords)
{
String[] results = new String[numWords];
Random rnd = new Random();
for (int ii = 0 ; ii < numWords ; ii++)
{
char[] word = new char[20];
for (int zz = 0 ; zz < word.length ; zz++)
word[zz] = (char)(65 + rnd.nextInt(26));
results[ii] = new String(word);
}
return results;
}
private static String buildRePortion(String[] words)
{
StringBuffer sb = new StringBuffer("(?:");
for (int ii = 0 ; ii < words.length ; ii++)
{
sb.append(ii > 0 ? "|" : "")
.append(words[ii]);
}
sb.append(")");
return sb.toString();
}
}
Kind of similar to what kd is saying, I would suggest putting all the words/phrases in your OR conditions into HashSets and capture what ever is in that position and check to see if it's in the set. I'm not sure that will work but it seems you have some 'static' text that bounds these things you are looking for.
It seems to me that you are implementing a dictionary with an unordered list. The correct data structure should be HashMap, HashSet or trie.
Hi kdgregory,
thanks for your code,
I have tried your benchmarking code.
The same problem still happen.
From my understanding and after I had executed, your RE has only two group of OR. That is , (adfafasf|afafa|afasdfa|afafa|...) xxx (AFDASFD|AFAFDASF|ADSFASFD|...).
(... means that there are many elements)
The time to compile these two groups are fast, even if each group has 10000 different elements it can alternate but this is not the case for many groups.
For example , "(fasdf|afasdf|asdfaf|afdasfda|...) (fasdf|afasdf|asdfaf|afdasfda|...) (fasdf|afasdf|asdfaf|afdasfda|...) (fasdf|afasdf|asdfaf|afdasfda|...) (fasdf|afasdf|asdfaf|afdasfda|...)" will take very very long.
Any way I am trying to datamine some information from many document using RE. But the compilation time takes too long.
Any ideas?
> Any way I am trying to datamine some information from
> many document using RE. But the compilation time
> takes too long.
>
> Any ideas?
Why do you use RE in the first place? RE is for finding patterns in text. From the pieces of your examples, you are not looking for patterns, you are doing dictionary lookup. RE for a dictionary implementation is equivalent to an unordered list. To implement a dictionary, the choice of data structure is HashSet, HashMap or trie.
> > Any way I am trying to datamine some information
> from
> > many document using RE. But the compilation time
> > takes too long.
> >
> > Any ideas?
>
> Why do you use RE in the first place? RE is for
> finding patterns in text. From the pieces of your
> examples, you are not looking for patterns, you are
> doing dictionary lookup. RE for a dictionary
> implementation is equivalent to an unordered list. To
> implement a dictionary, the choice of data structure
> is HashSet, HashMap or trie.
Before jumping on this guy telling him not to do something, it would be good to know what he is doing or not. As I see his RE fragment, he is doing a frase matcher instead of a dictionary, and he wants to match them acording to meaning instead of exact words.
Neither HarshSet nor HarshMap will fulfill this purpose. But in this particular case, maybe implementing a state machine would be a good Idea.
But that depends on the need of the OP.
May the code be with you.
> Before jumping on this guy telling him not to do
> do something, it would be good to know what he is
> doing or not. As I see his RE fragment, he is doing a
> frase matcher instead of a dictionary, and he wants
> to match them acording to meaning instead of exact
> words.
The two long long long (...|...|...|...|...|) patterns ARE dictionaries. I am not saying that the code can be implemented with ONE dictionary lookup. It actually needs 2 dictionary lookups and 1 pattern matching.
First, match pattern "(.+) me (.+) java".
Then do 2 dictionary lookups for group(1) and group(2).
Wouldn't it be much slower to do it in 3 steps than just match 1 big pattern? The answer is yes and no. If the (...|...|...|...|...|) sub-patterns are short, yes, 3-step implementation is slower. But we are talking about very very very long sub-patterns. Matching this part is linear complexity O(n). The 3-step implementation is constant complexity O(1).
Try using the regular expression classes from the ORO project: http://jakarta.apache.org/oro/index.html. They aren't identical to the java.util.regex classes, but they're a hell of a lot faster. I had a similar problem to yours. I had a complex regular expression. It took 20+ minutes to compile. Using ORO it takes some 0.15 seconds.