count occurrences of a phrase

Hi, what is the best way to count occurrences of a phrase (multi-gram words) in a document (corpus)?

I am using String.split("regular expression here") to split the content of a document. For example, to count how many times "this noun phrase" occurs in a document, I do:

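String nounSingular = "this noun phrase";

String nounPlural = "this noun phrases";

String documentContent = "blahblah...";

// split(..., -1) keeps trailing empty pieces, so (pieces - 1) is the number of matches
int occur = documentContent.split("\\b" + nounSingular + "\\b", -1).length - 1;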

......

But my worry is that regex processing is heavy, so this method may scale badly over a large corpus and with long noun phrase strings.
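
If the regex approach is kept, one possible way to avoid building the array that split() returns is to compile the pattern once and step through the matches with Matcher.find(). The following is only a sketch under that assumption; the class name RegexPhraseCounter and the sample text are made up for illustration, and whether it is actually faster on a given corpus would have to be measured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPhraseCounter {
    // Counts whole-word matches of phrase in text with a single compiled pattern.
    static int count(String text, String phrase) {
        // Pattern.quote treats the phrase literally; \b requires a word boundary on each side.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b");
        Matcher matcher = pattern.matcher(text);
        int occurrences = 0;
        while (matcher.find()) {
            occurrences++;
        }
        return occurrences;
    }

    public static void main(String[] args) {
        // Hypothetical sample text: the plural form in the middle is not a whole-word match.
        String text = "this noun phrase, then this noun phrases, then this noun phrase";
        System.out.println(count(text, "this noun phrase")); // prints 2
    }
}
```

Because of the word boundaries, "this noun phrases" is not counted as an occurrence of the singular form, so singular and plural can be counted separately.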

Any better ideas please? Using String.indexOf() iteratively?

Many thanks!

[902 byte] By [zqzuka] at [2007-11-27 1:45:48]
# 1
I don't know if it would work, but you could also use StringTokenizer with your string as the delimiter, and then call countTokens().
calvino_inda at 2007-7-12 1:06:36
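
One caveat worth checking before relying on this suggestion: java.util.StringTokenizer treats its delimiter argument as a set of single characters, not as one phrase. The small sketch below (the class name TokenizerDelimiterDemo is hypothetical) shows that the token count is therefore not tied to the number of phrase occurrences.

```java
import java.util.StringTokenizer;

class TokenizerDelimiterDemo {
    // Returns how many tokens StringTokenizer produces for the given delimiter characters.
    static int tokenCount(String text, String delimiters) {
        return new StringTokenizer(text, delimiters).countTokens();
    }

    public static void main(String[] args) {
        // "banana" never contains the phrase "ab", yet 'a' and 'b' each act as a
        // delimiter, leaving the tokens "n" and "n".
        System.out.println(tokenCount("banana", "ab")); // prints 2
    }
}
```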
# 2
Is the document a flat text file on disk? If so, you could read the file line by line and scan each line for the occurrence. You can use the java.util.Scanner class to read a text file: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
prometheuzza at 2007-7-12 1:06:36
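
The line-by-line idea could look roughly like the sketch below. The class name LineScanCounter is hypothetical, and a Scanner over a String stands in for a Scanner over a file so that the example is self-contained. Note the limitation it exposes: a phrase that is broken across two lines is not found.

```java
import java.util.Scanner;

class LineScanCounter {
    // Counts non-overlapping occurrences of phrase, examining one line at a time.
    static int count(Scanner scanner, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // an empty phrase would otherwise never advance the search
        }
        int occurrences = 0;
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            int from = 0;
            int index;
            while ((index = line.indexOf(phrase, from)) >= 0) {
                occurrences++;
                from = index + phrase.length();
            }
        }
        return occurrences;
    }

    public static void main(String[] args) {
        // For a real file, the Scanner would be built from a java.io.File instead.
        Scanner scanner = new Scanner("this noun phrase here\nand this noun\nphrase there");
        // The second occurrence spans a line break, so only the first one is counted.
        System.out.println(count(scanner, "this noun phrase")); // prints 1
    }
}
```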
# 3

Hi, thanks guys. A tokenizer should work, I think, and it is usually quicker than split. The file is any kind of plain text file, not necessarily in a line-by-line format. Would it be slower to scan line by line?

What about using String.indexOf(), is it quicker? Any other suggestions, please?

Thanks a lot!

zqzuka at 2007-7-12 1:06:36
# 4

I don't know which one is quicker, but you can indeed call indexOf repeatedly, each time starting after the previous match, to count the occurrences.
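
A minimal sketch of that indexOf loop, assuming non-overlapping matches are wanted (the class name IndexOfCounter is made up for illustration):

```java
class IndexOfCounter {
    // Counts non-overlapping occurrences of phrase in text using String.indexOf.
    static int count(String text, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // an empty phrase would otherwise never advance the search
        }
        int occurrences = 0;
        int from = 0;
        int index;
        while ((index = text.indexOf(phrase, from)) != -1) {
            occurrences++;
            from = index + phrase.length(); // continue after the end of this match
        }
        return occurrences;
    }

    public static void main(String[] args) {
        // Plain substring search: the plural form also contains the singular phrase.
        System.out.println(count("this noun phrase and this noun phrases", "this noun phrase")); // prints 2
    }
}
```

Unlike a regex with word boundaries, this is a plain substring search, so "this noun phrases" is counted as an occurrence of "this noun phrase" as well.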

I suggest that you try the different solutions (indexOf, tokenizer, or split).

You can see which one is the fastest by using:

long beginning = System.currentTimeMillis();

// try a solution

long end = System.currentTimeMillis();

System.out.println("execution took: " + (end - beginning) + " ms");

Do this with the three solutions and see which one is best.

Note that for the result to be meaningful, you have to try it on a large text, not just a dozen words :p
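
Put together, such a comparison might look like the sketch below. Everything in it is an assumption for illustration: the class and method names, the generated filler text, and the choice of System.nanoTime() instead of currentTimeMillis(). A single timed run is also noisy (JIT warm-up, garbage collection), so the numbers are only a rough indication.

```java
import java.util.regex.Pattern;

class TimingSketch {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long elapsedMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Counts non-overlapping occurrences with repeated indexOf calls.
    static int countWithIndexOf(String text, String phrase) {
        int occurrences = 0;
        int from = 0;
        int index;
        while ((index = text.indexOf(phrase, from)) != -1) {
            occurrences++;
            from = index + phrase.length();
        }
        return occurrences;
    }

    // Counts occurrences via split; limit -1 keeps trailing empty pieces.
    static int countWithSplit(String text, String phrase) {
        return text.split(Pattern.quote(phrase), -1).length - 1;
    }

    public static void main(String[] args) {
        // Build a large artificial text so the timings are not dominated by noise.
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            builder.append("some filler and this noun phrase ");
        }
        String text = builder.toString();
        String phrase = "this noun phrase";

        long indexOfMs = elapsedMillis(() -> countWithIndexOf(text, phrase));
        long splitMs = elapsedMillis(() -> countWithSplit(text, phrase));
        System.out.println("indexOf took: " + indexOfMs + " ms");
        System.out.println("split took: " + splitMs + " ms");
    }
}
```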

calvino_inda at 2007-7-12 1:06:36
# 5
Thanks!
zqzuka at 2007-7-12 1:06:36