Optimization
Hi everybody,
I thought I'd done very well optimizing my phrase counter that I implemented, but when I started testing it with bigger things, say 100,000 (total) words and up, it lags considerably. By 130,000 words, it takes a solid 3 minutes to do all the work. Now, I know that's not too bad, but I was hoping to make it faster, because 2 other functionalities are not affected at all by length and another is not majorly affected, this is the only one that lags. I posted my code for the method in a self-contained program with helpful comments; can anyone see a place here where I might be able to make this faster? Of course, it may very well not be possible to speed this up any more, but if that's the case I'd really like to know. It seems to me, by testing it, that the slow part is somewhere in the first half of the program.
Thanks!
Jezzica85
publicstaticvoid main(String args[]){
try{
TreeMap[] phrasesBySize =new TreeMap[8];
for(int i = 0; i < 8; i++ ){
TreeMap<String, ArrayList><Integer>> map =new TreeMap<String, ArrayList><Integer>>();
phrasesBySize[i] = map;
}
TreeMap<String, ArrayList><Integer>> phrases =new TreeMap<String, ArrayList><Integer>>();
TreeMap<String, ArrayList><Integer>> phraseList =new TreeMap<String, ArrayList><Integer>>();
ArrayList<Integer> indexes =new ArrayList<Integer>();
ArrayList<Integer> markers =new ArrayList<Integer>();
int lengthOfPhrase = 2;
try{
// # and $ are chapter and sentence ending markers, not counted as words
for(int i = 0; i < words.size(); i++ ){
String ready = words.get( i );
String ready2 = words.get( i + 1 );
String ready3 = words.get( i + 2 );
if( ready.equalsIgnoreCase("#" ) ){
markers.add( i );
}
if( !ready.equalsIgnoreCase("$" ) && !ready.equalsIgnoreCase("#" ) ){
if( !ready2.equalsIgnoreCase("$" ) && !ready2.equalsIgnoreCase("#" ) ){
if( !ready3.equalsIgnoreCase("$" ) && !ready3.equalsIgnoreCase("#" ) ){
indexes.add( i );
}
}
}
}
}catch( IndexOutOfBoundsException i ){}
do{
lengthOfPhrase++;
TreeMap<String, ArrayList><Integer>> phraseRaw =new TreeMap<String, ArrayList><Integer>>();
ArrayList<Integer> newIndexes =new ArrayList<Integer>();
boolean okay =true;
int chapter = 1;
for(int i = 0; i < indexes.size(); i++ ){
Integer index = indexes.get( i );
String ready = words.get( index );
if( chapter < markers.size() ){
if( index > markers.get( chapter ) ){
chapter++;
}
}
if( !ready.equalsIgnoreCase("$" ) && !ready.equalsIgnoreCase("#" ) ){
okay =true;
for(int j = 0; j < lengthOfPhrase; j++ ){
if( j != 0 ){
ready = ready.concat(" " );
String next =null;
try{
next = words.get( index + j );
}catch( IndexOutOfBoundsException ix ){
okay =false;
break;
}
if( okay ){
if( next.equalsIgnoreCase("$" ) || next.equalsIgnoreCase("#" ) ){
okay =false;
break;
}else{
ready = ready.concat( next );
}
}
}
}
if( okay ){
newIndexes.add( indexes.get( i ) );
if( phraseList.containsKey( ready ) ){
ArrayList<Integer> values = phraseList.get( ready );
values.add( chapter );
phraseList.put( ready, values );
}else{
if( phraseRaw.containsKey( ready ) ){
ArrayList<Integer> values = phraseRaw.get( ready );
values.add( chapter );
phraseList.put( ready, values );
}else{
ArrayList<Integer> newValues =new ArrayList<Integer>();
newValues.add( chapter );
phraseRaw.put( ready, newValues );
}
}
}
}
}
indexes =new ArrayList<Integer>();
indexes = newIndexes;
}while( lengthOfPhrase < 10 );
// Get unique values in the map
Collection<ArrayList><Integer>> raws = phraseList.values();
ArrayList<ArrayList><Integer>> values =new ArrayList<ArrayList><Integer>>();
for(ArrayList<Integer> value: raws ){
if( !values.contains( value ) ){
values.add( value );
}
}
// Sort the map by values
for( ArrayList<Integer> element: values ){
ArrayList<String> splitPhrases =new ArrayList<String>();
Iterator splitIterator = phraseList.entrySet().iterator();
while( splitIterator.hasNext() ){
Map.Entry current = (Map.Entry)splitIterator.next();
ArrayList<Integer> value = (ArrayList<Integer>)current.getValue();
if( value.equals( element ) ){
String phrase = (String)current.getKey();
splitPhrases.add( phrase );
}
}
Object[] tempArray = splitPhrases.toArray();
String[] phraseArray =new String[tempArray.length];
for(int c = 0; c < tempArray.length; c++ ){
phraseArray[c] = (String)tempArray[c];
}
for(int a = 0; a < phraseArray.length; a++ ){
for(int b = 0; b < phraseArray.length; b++ ){
if( !phraseArray[a].equals( phraseArray[b] ) ){
if( phraseArray[b].contains( phraseArray[a] ) ){
phraseArray[a] ="";
}
}
}
}
for(int z = 0; z < phraseArray.length; z++ ){
if( !phraseArray[z].equals("" ) ){
phrases.put( phraseArray[z], element );
}
}
}
// Put phrases in the proper maps
Iterator sortIterator = phrases.entrySet().iterator();
while( sortIterator.hasNext() ){
Map.Entry current = (Map.Entry)sortIterator.next();
ArrayList<Integer> value = (ArrayList<Integer>)current.getValue();
String phrase = (String)current.getKey();
int spaceCounter = 0;
for(int i = 0; i < phrase.length(); i++ ){
if( phrase.charAt( i ) ==' ' ){
spaceCounter++;
}
}
for(int i = 2; i <= 9; i++ ){
if( spaceCounter == i ){
phrasesBySize[i-2].put( phrase, value );
}
}
}
}catch (Exception e{
e.printStackTrace();
}
}
[12065 byte] By [
jezzica85a] at [2007-11-27 9:42:54]

I didn't look closely at your code to investigate the lag but I will say this, whenever you see that many nested if statements and loops you should be thinking it's time to refactor. For example all your code is in the one (main) method. You are allowed to write more than one method.
This is all in one main method because I wanted to create something self-contained for posting, but in my code this actually is the body of one big method. So, you're saying that it would speed it up if I put all the looping parts into different methods?Jezzica85
There is no guarantee and I highly doubt performance will improve by splitting into smaller methods. It is just for maintenance reasons. It is just a good habit to get into. Imagine if you get a new job and the first task they give you is to modify a program written by someone else. This program is 10,000+ lines of code and it is all in the one method. It would be a nightmare to fix code like that. However, if you had to look at all those lines split into methods each containing about 10 - 20 lines of code, then it would be much easier to follow what the code is doing.
catch( IndexOutOfBoundsException i ) {}
This has nothing to do with performance, but this is a very, very bad idea.
You're saying: "When you encounter a bug in my code, instead of telling me about it, just ignore it and pretend everything's fine."
jverda at 2007-7-12 23:47:15 >

Why is that a bad idea? Eventually the scanning over the word list is going to run off the end. It's not technically an error then, it just happens because it gets to the end of the list and I don't want the program to blow up. So, I guess what you're saying for that is, I should be scanning the word list differently?
It is a bad idea because there have been numerous posts oh these fora wher a person has done the very same thing and wondered why there program does "nothing". Well it does do something, it stops after an error but they have no idea because nothing is reported about it.
I've had my fair share of those; I know exactly what you mean. I meant more specifically, why is it a bad idea to catch the exception like that when you're planning on the scan running off the end of the list?
What you should do is something like logging. That is you send a report to a log file and then afterwards you can examine the log file to see if anything has gone wrong. This is especially important on live 24/7 systems. It needs to continue working but someone also needs to know if anything went wrong so they can fix it.
OK, so basically, tell it to write a log file every time I run this program? I'm still not sure I understand the general objection. Since this program is looking for phrases of a certain length, it's bound to run off the end of the list once it gets too close to the end (say, it's looking for phrases three words long and it starts at the second to last word). I know it's going to do that, and it isn't technically an error, it should just stop reading there.
Are you saying I should tell it to stop reading earlier?
> program to blow up. So, I guess what you're saying> for that is, I should be scanning the word list> differently?Yes.I haven't looked at your code--there's too much of it--but there's almost certainly a better way to detect the end.
jverda at 2007-7-12 23:47:15 >

> What you should do is something like logging. That is
> you send a report to a log file and then afterwards
> you can examine the log file to see if anything has
> gone wrong. This is especially important on live 24/7
> systems. It needs to continue working but someone
> also needs to know if anything went wrong so they can
> fix it.
She's asking whether and why using an exception to detect the end.
jverda at 2007-7-12 23:47:15 >

> Are you saying I should tell it to stop reading
> earlier?
If it were me, yes. I would write code the made sure it never threw an out of bounds exception. As with all advice been given so far, it is just general advice
on how to write better code. You are quite free to ignore it all and do it how you like but if you plan on pursuing a career in the IT industry then it would be better
to follow the advice given as you are sure to be given guidleline on how it is done in the company you work for and if you fail to adhere, you could find
yourself looking for a new job.
Thanks, I understand now. That was an easy change--just telling the scanning to stop two indexes before the end.
I thought about what you all said and tried to put the code that's slowing down my method the most in this post--it's significantly less, so maybe it'll be easier to read. The for loop gets every valid index where a phrase of three or more words can be made, and the do/while loop goes through those valid indices, weeding out indices that can't make phrases of length 4, then 5, etc.
I hope that's clearer, and thanks for the help so far!
for (int i = 0; i < words.size() - 2; i++) {
String ready = words.get(i);
String ready2 = words.get(i + 1);
String ready3 = words.get(i + 2);
if (ready.equalsIgnoreCase("#")) {
markers.add(i);
}
if (!ready.equalsIgnoreCase("$") && !ready.equalsIgnoreCase("#")) {
if (!ready2.equalsIgnoreCase("$") && !ready2.equalsIgnoreCase("#")) {
if (!ready3.equalsIgnoreCase("$") && !ready3.equalsIgnoreCase("#")) {
indexes.add(i);
}
}
}
}
do {
lengthOfPhrase++;
TreeMap<String, ArrayList><Integer>> phraseRaw = new TreeMap<String, ArrayList><Integer>>();
ArrayList<Integer> newIndexes = new ArrayList<Integer>();
boolean okay = true;
int chapter = 1;
for( int i = 0; i < indexes.size(); i++ ) {
Integer index = indexes.get( i );
String ready = words.get( index );
if( chapter < markers.size() ) {
if( index > markers.get( chapter ) ) {
chapter++;
}
}
if( !ready.equalsIgnoreCase( "$" ) && !ready.equalsIgnoreCase( "#" ) ) {
okay = true;
for( int j = 0; j < lengthOfPhrase; j++ ) {
if( j != 0 ) {
ready = ready.concat( " " );
String next = null;
try {
next = words.get( index + j );
} catch( IndexOutOfBoundsException ix ) {
okay = false;
break;
}
if( okay ) {
if( next.equalsIgnoreCase( "$" ) || next.equalsIgnoreCase( "#" ) ) {
okay = false;
break;
} else {
ready = ready.concat( next );
}
}
}
}
if( okay ) {
newIndexes.add( indexes.get( i ) );
if( phraseList.containsKey( ready ) ) {
ArrayList<Integer> values = phraseList.get( ready );
values.add( chapter );
phraseList.put( ready, values );
} else {
if( phraseRaw.containsKey( ready ) ) {
ArrayList<Integer> values = phraseRaw.get( ready );
values.add( chapter );
phraseList.put( ready, values );
} else {
ArrayList<Integer> newValues = new ArrayList<Integer>();
newValues.add( chapter );
phraseRaw.put( ready, newValues );
}
}
}
}
}
indexes = new ArrayList<Integer>();
indexes = newIndexes;
} while( lengthOfPhrase < 10 );
I'm not sure how significant it would be (jverd probably would know better than me) but since # and $ do not have upper and lower case versions, you coulduse equals method instead of equalsIgnoreCase.
This is only a micro-optimization but there's a waste of time right here:
ready.equalsIgnoreCase( "$" ) && !ready.equalsIgnoreCase( "#" )
$ and # don't have case, so equalsIgnoreCase() is just a waste of time. Use equals().
ejpa at 2007-7-21 23:08:53 >

Ooh, I didn't know that! I'll definitely change that. Thanks!Can anyone see anything else?
I would also use a LinkedList or perhaps even a HashSet instead of an ArrayList, as all you're doing to it is searching it and adding to it. I'd have a good think about the other data structures too.
ejpa at 2007-7-21 23:08:53 >

I can't use a hashset because I need duplicate elements, but I'm kind of surprised about the linked list suggestion. Are linked lists more efficient than array lists? I always thought they were sort of the same. What is the difference? And which "other data structures" are you referring
Linked lists are more efficient to add to than ArrayLists, but they are slower in direct access. You only seem to be adding and calling contains() which is a linear search so I would use a Linked List. If you need direct access later on I would convert it to an ArrayList after you finish this phase.
Also there's something wrong here:
indexes = new ArrayList<Integer>();
indexes = newIndexes;
The first line is redundant.
ejpa at 2007-7-21 23:08:53 >

Ah...OK, let me see if I understand this.
So, when I first declare my index list, I should say something like:
List<Integer> indexes = new LinkedList<Integer>();
and then after I'm done adding the indices to it, create a new variable and say something like:
ArrayList<Integer> indexes2 = (ArrayList<Integer>)indexes;
And then I'd have to do the same thing with the newIndexes list (see my last post with the code if the name is confusing, sorry)?
Is that right?
Thanks!
This question doesn't make any sense to me.
If you want to turn 'indexes' into a new empty list, the first line I quoted is correct. If you want it to be replaced by 'newIndexes' the second line is correct.
I don't know which of these you want, if any. It's your code. But you can't rationally want both them both in immediate succession.
ejpa at 2007-7-21 23:08:53 >

> I can't use a hashset because I need duplicate> elements, Why? I thought you're just testing for contains(). A HashSet will definitely be faster than a List for that.
jverda at 2007-7-21 23:08:53 >

> Linked lists are more efficient to add to than
> ArrayLists,
I'd think an ArrayList would be faster if you didn't have to resize. If we know the number of elements ahead of time--or can put an upper bound on it--then you can create the ArrayList with that size to start with and avoid resizing.
jverda at 2007-7-21 23:08:53 >

Oops, sorry, I guess I said that wrong. What I meant was, basically what you were saying before about converting from a linked list to an array list and back meant that when I first create my list, it would need to be a linked list so I can add to it fast, and then I could have to create another variable to convert it into an array list to make for fast access?
Hopefully I said it right this time. You were right about those lines being redundant, I took the new ArrayList<Integer> line out.
In my program, resizing is important. I don't know ahead of time how big things have to be, because this is reading words from a text file. That's also why I need duplicate elements--in order to find repeated phrases, I need to know what every word in the file is, so I have to dynamically create a list of all the words, then access them.
You can convert the linked lists to array lists by iterating over your phraseList and phraseRaw maps, e.g.
for (Map.Entry<String, List><Integer> > entry : phraseList.entries())
entry.setValue(new ArrayList<Integer>(entry.getValue());
and also declare your maps as containing Lists, not ArrayLists, e.g.:
TreeMap<String, List><Integer> > phraseRaw = new TreeMap<String, List><Integer> >();
ejpa at 2007-7-21 23:08:53 >

Thanks, I'll have to try all this out. It's about 2 in the morning where I am, so I think I'm going to turn in, but thanks for all the help, I learned a lot. I think I definitely have the knowledge now to make this faster.Thank you again!Jezzica85
OK but can I recommend Lucene? It already does all this and more.
ejpa at 2007-7-21 23:08:53 >
