count occurrences of a phrase
Hi, what is the best way to count occurrences of a phrase (multi-gram words) in a document (corpus)?
I am using String.split("regular expression here") to split the content of a document. For example, to count how many times "this noun phrase" occurs in a document, I do:
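// requires: import java.util.regex.Pattern;
String nounSingular = "this noun phrase";
String nounPlural = "this noun phrases";
String documentContent = "blahblah...";
// split(regex, -1) keeps trailing empty pieces, so occurrences = pieces - 1
int occur = documentContent.split("\\b" + Pattern.quote(nounSingular) + "\\b", -1).length - 1;
......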
But my worry is that regex processing is heavy, so this method may scale badly over a large corpus and long noun phrase strings.
Any better ideas please? Using String.indexOf() iteratively?
Many thanks!
[902 byte] By [zqzuka] at [2007-11-27 1:45:48]

Hi, thanks guys. A tokenizer should work, I think, and it is usually quicker than split. The file is any kind of plain text file, not necessarily in a line-by-line format. Would it be slower to scan it line by line?
What about using String.indexOf(), is it quicker? Any other suggestions, please?
Thanks a lot!
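For comparison, a regex count does not have to go through split at all. Here is a minimal sketch that compiles the pattern once and counts hits with Matcher.find(), so no array of pieces is built. The class and method names (RegexPhraseCounter, phrasePattern, countMatches) are made up for illustration, and whether this is fast enough for a given corpus is something to measure rather than assume.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPhraseCounter {
    // Builds a whole-word pattern for a literal phrase (hypothetical helper).
    // Pattern.quote escapes any regex metacharacters inside the phrase.
    static Pattern phrasePattern(String phrase) {
        return Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b");
    }

    // Counts non-overlapping matches of the pattern in the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern p = phrasePattern("this noun phrase");
        String doc = "this noun phrase and this noun phrases and this noun phrase";
        // "this noun phrases" is not counted: there is no word boundary before the trailing "s"
        System.out.println(countMatches(p, doc));
    }
}
```

The compiled Pattern can be reused for every document in the corpus, so the compile cost is paid once per phrase.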
I don't know which one is quicker, but you can call indexOf repeatedly, each time starting the search just past the previous match, to count the occurrences.
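A rough sketch of that indexOf loop, in case it helps. The names (PhraseCounter, countOccurrences) are made up, and the whole-word check with Character.isLetterOrDigit is an assumption about what should count as a phrase boundary; drop it if plain substring hits are fine.

```java
public class PhraseCounter {
    // Counts non-overlapping, whole-word occurrences of phrase in text
    // using repeated indexOf calls (no regex involved).
    static int countOccurrences(String text, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // an empty phrase would otherwise loop forever
        }
        int count = 0;
        int from = 0;
        while (true) {
            int idx = text.indexOf(phrase, from);
            if (idx < 0) {
                break; // no more hits
            }
            int end = idx + phrase.length();
            // Assumed boundary rule: the hit must not be glued to a letter or digit.
            boolean startOk = idx == 0 || !Character.isLetterOrDigit(text.charAt(idx - 1));
            boolean endOk = end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
            if (startOk && endOk) {
                count++;
                from = end;     // continue after the accepted match
            } else {
                from = idx + 1; // rejected hit, move one character forward
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "this noun phrase, then this noun phrases, then this noun phrase";
        System.out.println(countOccurrences(doc, "this noun phrase")); // prints 2
    }
}
```

The plural "this noun phrases" is skipped here because of the boundary check; counting it would need a second call with the plural string.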
I suggest that you try the different solutions (indexOf, tokenizer, or split).
You can see which one is the fastest by using:
long beginning = System.currentTimeMillis();
// try a solution
long end = System.currentTimeMillis();
System.out.println("execution took :"+(end-beginning)+" ms");
Do this with the 3 solutions and see which one is better.
Note that for the result to be significant, you have to try it on a big text, not just a dozen words :p
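To make the timing idea concrete, here is a small sketch that runs two of the approaches on a generated text several times. Everything in it is an assumption for illustration: the class name CountBenchmark, the filler sentence, the repeat count, and the use of System.nanoTime() instead of currentTimeMillis(). The first rounds are usually slower because the JVM is still warming up, so look at the later ones, and treat the numbers as a rough guide only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CountBenchmark {
    // Plain substring count with indexOf; assumes phrase is non-empty.
    static int countWithIndexOf(String text, String phrase) {
        int count = 0;
        int from = text.indexOf(phrase);
        while (from >= 0) {
            count++;
            from = text.indexOf(phrase, from + phrase.length());
        }
        return count;
    }

    // Same count through a regex matcher on the quoted (literal) phrase.
    static int countWithRegex(String text, String phrase) {
        Matcher m = Pattern.compile(Pattern.quote(phrase)).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Builds a big test text; each repeat contains the phrase exactly once.
    static String buildText(int repeats) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeats; i++) {
            sb.append("some filler words and this noun phrase again. ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = buildText(200000);
        String phrase = "this noun phrase";
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            int a = countWithIndexOf(text, phrase);
            long t1 = System.nanoTime();
            int b = countWithRegex(text, phrase);
            long t2 = System.nanoTime();
            System.out.println("round " + round
                    + ": indexOf=" + a + " in " + (t1 - t0) / 1000000 + " ms"
                    + ", regex=" + b + " in " + (t2 - t1) / 1000000 + " ms");
        }
    }
}
```

Both methods should report the same count on the same text; if they do not, fix that before comparing the times.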