Best way to implement a word frequency counter (input => textfile)?
i had this for an interview question and basically came up with the solution where you use a hash table...
//create hash table
//bufferedreader
//read file in,
//for each word encountered, create an object that has (String word, int count) and push into hash table
//then loop and read out all the hash table entries
===skip this stuff if you dont feel like reading too much
then the interviewer proceeded to grill me on why i shouldn't use a tree or any other data structure for that matter... i was kidna stumped on that.
also he asked me what happens if the number of words exceed the capacity of the hash table? i said you can increase the capacity of the hash table, but it doesn't sound too efficient and im not sure how much you know how to increase it by. i had some ok solutions:
1. read the file thru once, and get the number of words in the file, set the hashtable capacity to that number
2. do #1, but run anotehr alogrithm that will figure out distinct # of words
3. separate chaining
===
anyhow what kind of answeres/algorithms would you guys have come up with? thanks in advance.
[1182 byte] By [
lapcherna] at [2007-10-3 1:28:32]

> i had this for an interview question and basically
> came up with the solution where you use a hash
> table...
>
> //create hash table
> //bufferedreader
> //read file in,
> //for each word encountered, create an object that
Well, first you need to check to make sure the word is not already in the hashtable, right? And if it is there, you need to increment the count.
> has (String word, int count) and push into hash
> table
>
> //then loop and read out all the hash table entries
>
>
> ===skip this stuff if you dont feel like reading too
> much
> then the interviewer proceeded to grill me on why i
> shouldn't use a tree or any other data structure for
> that matter... i was kidna stumped on that.
A hashtable has ammortized O(1) time for insert and search. A balanced binary search tree has O(log n) complexity for the same operations. So, a hashtable will be faster for large number of words. The other option is a so-called "trie" (google for more), which has O(log m) complexity, where m is the length of the longest word. So if your words aren't too long, a trie may be just as fast as a hashtable. The trie may also use less memory than the hashtable.
> also he asked me what happens if the number of words
> exceed the capacity of the hash table? i said you can
> increase the capacity of the hash table, but it
> doesn't sound too efficient and im not sure how much
> you know how to increase it by. i had some ok
> solutions:
The hashmap implementation that comes with Java grows automatically, you don't need to worry about it. It may not "sound" efficient to have to copy the entire datastructure, the copy happens quickly, and occurs relatively infrequently compared with the number of words you'll be inserting.
> 1. read the file thru once, and get the number of
> words in the file, set the hashtable capacity to that
> number
> 2. do #1, but run anotehr alogrithm that will figure
> out distinct # of words
> 3. separate chaining
> ===
>
> anyhow what kind of answeres/algorithms would you
> guys have come up with? thanks in advance.
I would do anything to avoid making two passes over the data. Assuming you're reading it from disk, most of the time will be spent reading from disk, not inserting to the hashtable. If you really want to size the hashtable a priori, you can make it so its big enough to hold all the words in the english language, which IIRC is about 20,000.
And relax, you had the right answer. I used to work in this field and this is exactly how we implemented our frequency counter and it worked perfectly well. Don't let these interveiewers push you around, just tell them why you thought hashtable was the best choice; show off your analytical skills!
thank you for the helpful reply!
> A hashtable has ammortized O(1) time for insert and search. A balanced
> binary search tree has O(log n) complexity for the same operations. So, a
> hashtable will be faster for large number of words. The other option is a
> so-called "trie" (google for more), which has O(log m) complexity, where m is
> the length of the longest word. So if your words aren't too long, a trie may be
> just as fast as a hashtable. The trie may also use less memory than the
> hashtable.
Uh, careful here. You can not neglect the length of the strings in one case and not in the other. The amortized cost of O(1) for the hashtable is only correct if you can compute the hash in constant time. The same goes for the binary tree, you have to compare the strings to each other. If you want to compare these datastructures with the trie, you should add the factor m, which leads to:
Hashtable: O(m) amortized
Binary Tree: O(m * log n)
As a sidenote, one should use the unsynchronized HashMap instead of Hashtable for a performance gain.
And I think I have to correct you on the trie as well. Iirc the cost would be O(m) which is the depth of the trie. You mixed that up with a balanced tree with m nodes.I would guess that for bigger problems the trie apporach is faster than the Hashtable approach.
If you had read the disclaimer you'll note that I'm not responsible for logical errors made before 9:00AM.
A trie could be used to count word frequency, however it's primary advantage or usage is being able to do fast prefix matches.
> I would guess that for bigger problems the trie apporach is faster than the Hashtable approach.
A simple trie makes a new node for every new letter on the path. So if the words don't share common prefixes then that means a lot of nodes are going to be created.
A chained hashtable on the other hand makes a new node for every word, and requires the computation of the hash function.
If there are lots of words that share common prefixes (or repeated words) then the trie will likely be faster. However, if there are lots of non-repeated words then a hashtable will probably be faster.
This is pretty much what you said, but with a little more elaboration.
okay let me get this straight...
are you guys suggesting tries as a method for counting the number of unique words in the file? otherwise how would i store the number of occurences (an object containing the word and # of times it shows up) for a particular word in the trie?
thanks.
edit:
on 2nd thought... i guess each node within the trie would also have similar data objects like the hash table i mentioned correct? (each node has a string and count value).
Message was edited by:
lapchern
> okay let me get this straight...
>
> are you guys suggesting tries as a method for
> counting the number of unique words in the file?
> otherwise how would i store the number of occurences
> (an object containing the word and # of times it
> shows up) for a particular word in the trie?
>
> thanks.
>
> edit:
>
> on 2nd thought... i guess each node within the trie
> would also have similar data objects like the hash
> table i mentioned correct? (each node has a string
> and count value).
>
> Message was edited by:
> lapchern
Well, each node only has the count, the string is inferred from the path you trace through the trie.
My suggestion is still with the HashMap. Sun provides the classes you need, and the performance of both approaches will be similar.