Massive Java Computation - Help Needed with multiprocessing!!!

Hi everyone, I realise that most of you are probably really busy but if anyone could spare a minute to help me with a solution to my problem i would greatly appreciate it.

I have written an app which creates a Collection object.

This Collection object, of which only one will be created has many TestCollection objects, which are stored as an array.

Each Test collection holds an array of comparison objects.

Each comparison holds an array of ints.

A comparison object will be used to compare two trees, one which i call a testing tree, and the other a training tree.

The array of ints which a comparison holds is created from the actual comparison between the two trees.

Now each test collection is concerned with only a single testing tree, so what i essentially have is a Test Collection which holds comparisons between this Test Collections tree against each of the training trees.

Now if you take all of the Test Collections, you have each training tree compared with each testing tree, and the results of all of these.

Hope you are all still with me.

Now each comparison is quick, around a second each. But the problem is that i have 6000 training trees and 2000 testing trees, hence 12 million comparisons to be made!!

Now the time taken to compute the results on this data set is unfeasable, being at least a few days.

So as i have 40 computers at my disposal, i should make the most of them. So what i think i need to do is get each computer to work on a test collection each and then return the results back to a central computer which will then hold the array of test collections and their results (possibly work on a comparison each but i think i may get a bottleneck when returning the results).

Now i am not too bad with Java but have no experience of networking and I am hoping that someone here can steer me in the right direction of how i can get each computer to work on a test collection each but for me to have all of the results in a central location afterwards as the results will then be written to xml for future processing.

Now all of the computers are already networked as they are in my university it is how i can make the most of them that i am unsure of, i have heard one suggestion of mpi, another of rmi and another of opening sockets and doing normal networking.

Any reply / help on this would be greatly appreciated as i am expected to have the results for these experiments very soon and i will be doing all of the above many many times.

Thanks

Danny =)

[2596 byte] By [dannythomas13a] at [2007-10-3 9:26:29]
# 1
if you can widdle that question down to 5 sentances, i will read it
mkoryaka at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 2
I am sorry but i cannot as I need to place quite a specific thread so that anyone who reads it realises what needs to be done....sorry
dannythomas13a at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 3

Now each comparison is quick, around a second each. But the problem is that i have 6000 training trees and 2000 testing trees, hence 12 million comparisons to be made!!

Now the time taken to compute the results on this data set is unfeasable, being at least a few days.

I think a few years, based on what you said. Even if you use 40 computers at there ideal speed it is still going to take days. You need to work a reducing the time to compare. Do you have to compare each int? Are there things you can do once instead of each time? etc.

zadoka at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 4
Can't you make it faster than 1 second per comparision?Kaj
kajbja at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 5

One word ... threading ...

Before I tried to load spread across multiple machines, I would try to get the most out of one machine by threading my testing processes so that I could have as many of them as I could manage running at one time.

Just my 2 krupplenicks on the subject.

PS

puckstopper31a at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 6
>> Now the time taken to compute the results on this>> data set is unfeasable, being at least a few>> days.> I think a few years, based on what you said. Years? It's about 139 days if I get it right.Kaj
kajbja at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 7

> One word ... threading ...

>

> Before I tried to load spread across multiple

> machines, I would try to get the most out of one

> machine by threading my testing processes so that I

> could have as many of them as I could manage running

> at one time.

>

> Just my 2 krupplenicks on the subject.

>

> PS

That might be faster of you have several processors (that is it might not be faster on a singel processor machine)

kajbja at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 8

No disagreement with that but it is the first place I would look for speed.

> > One word ... threading ...

> >

> > Before I tried to load spread across multiple

> > machines, I would try to get the most out of one

> > machine by threading my testing processes so that

> I

> > could have as many of them as I could manage

> running

> > at one time.

> >

> > Just my 2 krupplenicks on the subject.

> >

> > PS

>

> That might be faster of you have several processors

> (that is it might not be faster on a singel processor

> machine)

puckstopper31a at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 9
@OP: you've lost me; you talk about collections, trees, arrays and what have you.You also mention 12,000,000 comparisons which isn't an astonishing numberof comparisons. Care to elaborate and rephrase your question a bit?kind regards,Jos
JosAHa at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 10
> No disagreement with that but it is the first place I> would look for speed.I still don't think that it will help in this case since it sounds that the application is able to utilize 100% of the cpu as it is.
kajbja at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 11
Well. I would recommend using RMI if possible.Another option would be to divide the test set into 40 partitions, run each partition manually on the 40 machines and then add the results?Cheers
duckbilla at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 12

> @OP: you've lost me; you talk about collections,

> trees, arrays and what have you.

> You also mention 12,000,000 comparisons which isn't

> an astonishing number

> of comparisons. Care to elaborate and rephrase your

> question a bit?

>

> kind regards,

>

> Jos

I think that one comparision is a comparision of one training tree with one testing tree. One such comparision involves the comparision of the ints in some way....

Isn't it much clearer now? ;)

kajbja at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 13

thanks for all the replies...however, i cannot multithread as it does indeed use 100% cpu for the whole time and i'm sorry but believe me when i say that i am very sure the code cannot be made much faster than it currently is, i am using a LOT of data and a lot is done to find the results, these trees are very large but plz dont worry yourselves with how i get the results as that is my problem i just need help with solving this number of comparisons quicker than i currently am by using a single cpu.

thanks

dannythomas13a at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 14
actually...there is only one speed up with the current program and that is getting the contents of a text file into a String, that is it, i dont use a fast method for that but i am afraid that it is only a small part so though i will need to correct this it will not solve the problem
dannythomas13a at 2007-7-15 4:40:44 > top of Java-index,Java Essentials,Java Programming...
# 15

> I think that one comparision is a comparision of one training tree with

> one testing tree. One such comparision involves the comparision of

> the ints in some way....

>

> Isn't it much clearer now? ;)

Mwah, I'm thinking of a lexicographical ordering of those trees if the

comparison has to be exact. If trees are allowed to "look alike" the

problem will be much more complex.

kind regards,

Jos

JosAHa at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 16
i wont go into detail about the trees as i want to keep the post as simple as possible, all that matters is that i need to compute results for 2000 TestCollections so if i could use other computers to process them it should be faster, this is what i need help with doingthanks
dannythomas13a at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 17
not sure i understand all said, what about placing 'paired' collections to be compared on jms, and have several 'comparators' running on several machines, consume these and place 'result' objects on the 'completed' queue.
mchan0a at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 18

> i wont go into detail about the trees

Then I definitely can't help you any further. There are several open

questions considering your problem:

1) what exactly is a comparison of two trees? Does the match need to

be exact. i.e. 100% or nothing?

2) how are those trees stored?

3) are these trees large?

4) are those trees ordered in some way?

5) are all those trees pre-loaded? Do they all fit in memory?

6) what's all the stuff about those integer arrays?

7) I'm sure I'm forgetting a whole bunch of questions here ...

If you prefer to keep it vague, be my guest, but I'm out of here then.

Distributed computing is not a substitute for badly designed algorithms

or not well understood problems.

kind regards,

Jos

JosAHa at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 19
The key will be in reducing the time to compare trees. If it takes a second now, profile that comparison. Find the true bottlenecks, it might not be how you are pulling in the file. Once you find the things that are driving that comparison up to 1 sec, then you can work at reducing that.
zadoka at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 20

right...ok...the reason it is taking so long is because i first load in two 50kb text files, i then build a tree representation of them both and i then proceed to compute results which can be used to say how alike the two trees are, and these trees are large as i would say that each tree will have hundreds of thousands of nodes and i can assure you that i have already changed algorithms and profiled and profiled in order to get a great speed up.

dannythomas13a at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 21
also it is only possible to hold say ten trees in memory, any more and i go past my 2gb heap size
dannythomas13a at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 22

> Now each comparison is quick, around a second each.

> But the problem is that i have 6000 training trees

> and 2000 testing trees, hence 12 million comparisons

> to be made!!

>

> Now the time taken to compute the results on this

> data set is unfeasable, being at least a few days.

1.2e6 seconds is about 138 days.

> So as i have 40 computers at my disposal, i should

> make the most of them.

138 / 40 is about 3.5. So your best case will be "at least a few days."

jverda at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 23
can any of your computations be done once per tree and saved off somewhere rather than during each comparision?
zadoka at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 24

> right...ok...the reason it is taking so long is because i first load in two

> 50kb text files, [ ... ] i would say that each tree will have hundreds of

> thousands of nodes.

The orders of magnitude don't match here. Did you intend to write 50Mb?

Otherwise, what exactly did you do with those two small files?

kind regards,

Jos

JosAHa at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 25
It sounds like you're saying, "Trust me, I know what I'm doing, I've already made my code as fast as possible, now tell me how to make it faster as I don't know what I'm doing."
jverda at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 26
the trees are held in memory, compared and then only the int array (the results) are kept, the orders of magnitude are also correct
dannythomas13a at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 27

> right...ok...the reason it is taking so long is

> because i first load in two 50kb text files,

That will be a very fast operation.

> i then

> build a tree representation of them both

Nothing there suggests that it will take long.

> and i then

> proceed to compute results which can be used to say

> how alike the two trees are

Are these the computations that you're sure are already as fast as possible, or are they taking a long time?

>, and these trees are

> large as i would say that each tree will have

> hundreds of thousands of nodes

5e5 nodes is not necessarily a large tree.

> and i can assure you

> that i have already changed algorithms and profiled

> and profiled in order to get a great speed up.

Why don't you post some of what you're doing. I'm not a math or algorithms guy, but a lot of the folks here are very sharp about that kind of thing. You'll get better help if you show what you're doing than if you just say "trust me, you can't make it any better here than I already have."

jverda at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 28

i'm sorry to say this as i feel rude saying it but some of the replies i am receiveing are really quite rude, i only asked if someone was able to help and i'm sorry but i didn't intend to receive posts that suggest that i don't know what i am doing...considering all i asked for was help with distributed computing

dannythomas13a at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 29
ok, i will look at posting some code up...i will have to change things though as the code is owned by my university and even though it will be open source it can't be until i have finished it
dannythomas13a at 2007-7-21 12:48:30 > top of Java-index,Java Essentials,Java Programming...
# 30

> i'm sorry to say this as i feel rude saying it but

> some of the replies i am receiveing are really quite

> rude, i only asked if someone was able to help and

> i'm sorry but i didn't intend to receive posts that

> suggest that i don't know what i am

> doing...considering all i asked for was help with

> distributed computing

Nobody's being rude. It's just that your questions are rather vague. You've gotten some good general suggestions based on the little information you've provided, but they're just speculation. Without details about your requirements, constraints, and current approach, nobody can give you specific advice.

jverda at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 31

> i'm sorry to say this as i feel rude saying it but

> some of the replies i am receiveing are really quite

> rude, i only asked if someone was able to help and

> i'm sorry but i didn't intend to receive posts that

> suggest that i don't know what i am

> doing...considering all i asked for was help with

> distributed computing

Would it be rude if the sales person at Uhaul asked are you sure need to use a UHaul truck to take the trash down the driveway?

If the speed of comparisions can be improved you can do what you want with one computer instead of 40, that would be a good thing right?

zadoka at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 32

> the orders of magnitude are also correct

That implies that you've found a 50KB text file representation for more

than "hundreds of thousands" nodes in a tree. I'm curious now; please

elaborate on what exactly you want to do and what exactly you have done.

kind regards,

Jos

JosAHa at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 33
ok...it would be great if it can be made even faster so i shall post some code up shortlythanks
dannythomas13a at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 34

> ok...it would be great if it can be made even faster

> so i shall post some code up shortly

> thanks

This kind of question generates a lot of intrest (as you have seen by the rapid number of posts), because you are presenting a tough problem. Most of the questions on this forum go like this:

"Java gives me a NullPointerException, I think there is a bug in the compiler"

So if you can post some code and be detailed you will get a lot of help.

zadoka at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 35
can someone plz tell me what html tags i need to use so that the formatting of the code i post is correct plz?
dannythomas13a at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 36
> can someone plz tell me what html tags i need to use> so that the formatting of the code i post is correct> plz?[code]//There is also a code button next to "quote" ]when post[/code]
zadoka at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 37

right, each time i do a comparison i sort the training text string into sorted suffixes i.e

if the string is abra

the suffixes would be

abra

bra

ra

a

and once sorted it would become

a

abra

bra

ra

and the method i use to compute this is here:

public class CompareNodes implements Comparator {

/** Creates a new instance of CompareNodes */

public int compare( Object o1, Object o2 )

{

Node n1 = (Node)o1;

Node n2 = (Node)o2;

String inputString = n1.getRootNode().getInputString();

for( int i = 0; i < n1.getLength() && i < n2.getLength(); i++ )

{

if( inputString.charAt( n1.getIndex() + i ) < inputString.charAt( n2.getIndex() + i ) )

{

return -1;

}

else if( inputString.charAt( n1.getIndex() + i ) > inputString.charAt( n2.getIndex() + i ) )

{

return 1;

}

}

if( n1.getLength() < n2.getLength())

{

return -1;

}

else if( n1.getLength() > n2.getLength() )

{

return 1;

}

else

{

return 0;

}

}

}

now i have profiles the code and this has a self tiem % of 20% according to the netbeans profiler so i guess any speed up here would make a difference

i was actually about to test if the following would be faster or not:

ublic class StringDescending implements Comparator<String>

{

public int compare(String str1, String str2)

{

if(str1.equals(str2)) return 0;

else if(str1.compareTo(str2) < 0)

{

return 1;

}

return -1;

}

}

does anyone have a method which does the same thing but faster?

thanks

dannythomas13a at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 38

Not calling getLength every time through the loop might help a bit.

You might also get some benefit by calling toCharArray() and then looping over the resulting array, rather than calling charAt. However, if the strings are short, that probably won't help much, and may hurt, due to the overhead of creating the array.

Also, you should check for different lengths before trying to compare individual chars. Once you reach the part that compares individual chars, you only need to check your index against one length, not both, since you know both are the same length.

If you just want a descending comparator, you can use Collections.reverseOrder, or just return -str1.compareTo(str2);

I can't comment on the overall suffix approach.

jverda at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 39

Isn't this:

public class StringDescending implements Comparator<String>

{

public int compare(String str1, String str2)

{

if(str1.equals(str2)) return 0;

else if(str1.compareTo(str2) < 0)

{

return 1;

}

return -1;

}

}

The same as:

-1 * str1.compareTo(str2)

zadoka at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 40

You've got a lot of unnecessary method calls in your for loop. Also, why don't you move if( n1.getLength() < n2.getLength())

{

return -1;

}

else if( n1.getLength() > n2.getLength() )

{

return 1;

}

ahead of the loop?

mvantuyla at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 41

and my binary search algorithm takes 17% of the time...i also tried an interpolation search but this is faster, it is used to find the position within this depth of a tree that the current char should be placed:

public class BinarySearch

{

static String inputString;

static String toMatch;

public static int getInsertPosition( char target, ArrayList<Node> children )

{

int high = children.size() - 1;

int low = 0;

while( low <= high )

{

int mid = ( low + high ) / 2;

char compare = children.get( mid ).getMyChar();

if( compare == target )

{

return mid;

}

else if( compare < target )

{

low = mid + 1;

}

else

{

high = mid - 1;

}

}

return low;

}

}

and also this has a 10% self time so maybe this also has an affect on the binary search:

public char getMyChar()

{

return myRootNode.inputString.charAt( this.index );

}

maybe i should explain why i use the above, as the trees are large it uses a lot less memory if instad of storing string on nodes within my defaultmutabletree i instead store the index within the original string, which is stored within the RootNode, and also the length of the string, so from this i can recover all of the strings on the nodes

dannythomas13a at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 42

danny,

Don't take this the wrong way, but the number of obvious improvements that even I was able to suggest casts serious doubt on your claims of your code being highly optimized. This isn't meant to criticize you. I'm saying it for two reasons:

* To encourage you to show more of what you're doing to people here that probably know more about this than you and can probably offer more and better improvements than I've suggested. That can help you with this particular prolem.

* To drive home the point that you should always question your assumptions and always allow for the possibility that others may know more than you, no matter how sure you are of yourself. This will serve you well with this and future problems, and not just in programming.

jverda at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 43

> You've got a lot of unnecessary method calls in your

> for loop. Also, why don't you move if(

> n1.getLength() < n2.getLength())

> {

>return -1;

> else if( n1.getLength() > n2.getLength() )

> {

>return 1;

>

ahead of the loop?

i cant simply move them to the front as it is not only a case of which is the shortest, but i can use one bit of advice that was given about comparing the length first so that i only ever check against one length, if this is all that can be done i shall then compare that with the str1.compareTo method to see which is the quickest

thanks

dannythomas13a at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 44

Also, are you inserting into the middle of an arrayList? LinkedList is much more performant for non-tail inserts.

Of course, with LL, get(i) is slow, so your search might be slow. You might try using Collections.binarySearch to see if it's any faster tha yours, or at least looking at the code for ideas.

jverda at 2007-7-21 12:48:35 > top of Java-index,Java Essentials,Java Programming...
# 45

To reinforce what Jeff said, this codeif( inputString.charAt( n1.getIndex() + i ) < inputString.charAt( n2.getIndex() + i ) )

{

return -1;

}

else if( inputString.charAt( n1.getIndex() + i ) > inputString.charAt( n2.getIndex() + i ) )

{

return 1;

}

uses at least two unnecessary method calls each time. n1.getIndex() and n2.getIndex() will always return the same values. Why not do something like this:int n1Index = n1.getIndex();

int n2Index = n2.getIndex();

int n1Length = n1.getLength();

int n2Length = n2.getLength();

for( int i = 0; i < n1Length && i < n2Length; i++ )

{

if( inputString.charAt( n1Index + i ) < inputString.charAt( n2Index + i ) )

{

return -1;

}

else if( inputString.charAt( n1Index + i ) > inputString.charAt( n2Index + i ) )

{

return 1;

}

Message was edited by:

mvantuyl

mvantuyla at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 46

> > You've got a lot of unnecessary method calls in

> your

> > for loop. Also, why don't you move if(

> > n1.getLength() < n2.getLength())

> > {

> >return -1;

> > else if( n1.getLength() > n2.getLength() )

> > {

> >return 1;

> >

ahead of the loop?

>

> i cant simply move them to the front as it is not

> only a case of which is the shortest,

Oh, right.

What you can do is do int len = Math.min(s1.length(), s2.lengh());

for (int ix = 0; ix < len; ix++)

and then compare as usual.

If you get out of that loop, without having returned 1 or -1, then the string with the shorter length is "less", or they're equal.

But again, just returning the negative of (s1.compareTo(s2)) will probably be just as fast. (And of course, -s1.compareTo(s2) is the same as s2.compareTo(s1)).

jverda at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 47

Here's String.compareTo, FYI:

public int compareTo(String anotherString) {

int len1 = count;

int len2 = anotherString.count;

int n = Math.min(len1, len2);

char v1[] = value;

char v2[] = anotherString.value;

int i = offset;

int j = anotherString.offset;

if (i == j) {

int k = i;

int lim = n + i;

while (k < lim) {

char c1 = v1[k];

char c2 = v2[k];

if (c1 != c2) {

return c1 - c2;

}

k++;

}

} else {

while (n-- != 0) {

char c1 = v1[i++];

char c2 = v2[j++];

if (c1 != c2) {

return c1 - c2;

}

}

}

return len1 - len2;

}

jverda at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 48
sorry but is Collections.reverseOrder faster than my Collections.sort( list, new CompareNodes() ); then? even if i change so that the length is only compared once?and also sorry but will reverseOrder give me alphabetical order...or the opposite?
dannythomas13a at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 49

> sorry but is Collections.reverseOrder faster than my

> Collections.sort( list, new CompareNodes() ); then?

Don't know. Look at its source code. Run some tests and see.

> and also sorry but will reverseOrder give me

> alphabetical order...or the opposite?

It gives the opposite order of the natural order or the opposite of the provided comparator. Read the docs. Read the source code. Play with it and see.

jverda at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 50
btw, sorry but i hadnt actually ran my code since the speed improvements i made yesterday..it actually does 10 comparisons a second, i know it still takes a long time to do 12million but i am trying not to give incorrect information that's allthanks
dannythomas13a at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 51
Okay, and when you say "comparisons," which comparisons are we talking about?
jverda at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 52

hey bril, thanks, the compare method isnt even in the top ten hotspots anymore simply by removing the two length comparisons within the for loop of my comparitor..not sure if it is worth checking the compareTo method now but i will do anyway once i have got a few more of the big ones out of the way

dannythomas13a at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 53

> btw, sorry but i hadnt actually ran my code since the

> speed improvements i made yesterday..it actually does

> 10 comparisons a second, i know it still takes a long

> time to do 12million but i am trying not to give

> incorrect information that's all

> thanks

So this bring us down to about 8 hrs for 40 machines, or two weeks on one box.

Do we know that we really have to do 1.2e6 comparisons?

jverda at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 54

right...after profiling once more, this code here for retrieving the text from a file is second to top but i would like to sort this one as it is relatively simple (in that it doesnt do much):

public class GetTextFromFile

{

public static String getContents(File aFile)

{

StringBuffer buffer = new StringBuffer();

BufferedReader input = null;

try

{

input = new BufferedReader(new FileReader(aFile));

char[] chars = new char[1024];

int i = 0;

do

{

buffer.append(chars, 0, i);

i = input.read(chars, 0, chars.length);

}

while( i != -1 );

buffer.append((char)0);

}

catch (FileNotFoundException ex)

{

System.out.println("Cannot find specified file...");

ex.printStackTrace();

}

catch (IOException ex)

{

ex.printStackTrace();

}

finally

{

try

{

if (input!= null)

{

//flush and close both "input" and its underlying FileReader

input.close();

}

}

catch (IOException ex)

{

ex.printStackTrace();

}

}

return buffer.toString();

}

}

i have been told that if is set the size of the buffer to be length of file +1 then it should be faster but as i had good suggestions for the last problem, does anyone have any for this?

thanks

dannythomas13a at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 55

Using StringBuilder rather than StringBuffer might help a bit, as Builder is not synchronized.

Creating the Buffer/Builder with an initial size big enough to hold the entire file might help, as it will save the SB from having to keep resizing its backing array.

A bigger array to read into might help, but I wouldn't expect a big speed up there.

Not sure why you're putting the zero at the end of your SB.

Are you reading many files, or reading one file many times?

jverda at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 56

@OP: can you explain why you need all suffixes in alphabetical order?

The last ten posts or so suggested some (micro) optimisations w.r.t.

that sorting stuff. I'm curious about what you want to establish and how

you want to accomplish that ... care to elaborate on that?

kind regards,

Jos

JosAHa at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 57

ok, i build a tree representation of suffixes within a string i.e. for string abra

i will have at depth 1, children a, bra, and ra

at depth 2, a will have children eof and bra

this allows for quick searching for strings.

the reason i sort the suffixes first is because i found that my current implementation where i know that the next suffix will be inserted to the right of the last one is faster than the one i wrote where ordering does not matter. however the sorting doesnt seem to be a concern now as it does not show up as a hotspot.

i shall change the stringBuffer and the size of the buffer to see if retrieving the text is quicker, i add the eof file character because i need it as a child on my tree.

dannythomas13a at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 58
This is just a wild guess, because I still don't know what you want to accomplish, but google for a [url= http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA/]patricia tree[/url]. It might be of help.kind regards,Jos
JosAHa at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 59
that is a good wild guess! =) i shall have to have a read, this is indeed what i build, however i use my own code to create the trees.thanks, i shall do some reading
dannythomas13a at 2007-7-21 12:48:40 > top of Java-index,Java Essentials,Java Programming...
# 60
yes, it isnt quite a patricia tree but a suffix tree that i am building, not sure if any of the sudo code from that site will help though as building the trees is not slow in my app, it is comparing two trees that is the slow part
dannythomas13a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 61

from that link apparently the following is the quickest way to build a suffix tree:

function test_and_split(s, k, p, t)

{ if(k<=p)

{ // find the t_k transition g'(s,(k',p'))=s' from s

// k1 is k' p1 is p' in Ukkonen '95

var ((k1,p1), s1) = s[Txt.charAt(k)];

if (t == Txt.charAt(k1 + p - k + 1))

return new pair(true, s);

else

{ var r = new State()

s.addTransition(k1, k1+p-k,r);// s->r->s1

r.addTransition(k1+p-k+1, p1, s1);

return new pair(false, r)

}

}

else // k > p; ? is there a t-transition from s ?

return new pair(s[t] != null, s);

}//test_and_split

function canonize(s, k, p)// s>...

{ if(p < k) return new pair (s, k);

// find the t_k transition g'(s,(k',p'))=s' from s

// k1 is k', p1 is p' in Ukk' '95

var ((k1,p1), s1) = s[Txt.charAt(k)];// s--(k1,p1)-->s1

while(p1-k1 <= p-k)// s--(k1,p1)-->s1>...

{ k += p1 - k1 + 1; // remove |(k1,p1)| chars from front of (k,p)

s = s1;

if(k <= p)

{ ((k1,p1), s1) = s[Txt.charAt(k)];// s--(k1,p1)-->s1

}

}

return new pair(s, k);

}//canonize

function ukkonen95()// construct suffix tree for Txt[0..N-1]

{ var s, k, i;

var bt;

root = new State();

bt = new State();// bt (bottom or _|_)

// Want to create transitions for all possible chars

// from bt to root

for (i=0; i < Txt.length; i++)

bt.addTransition(i,i, root);

root.sLink = bt;

s=root; k=0;// NB. k=0, unlike Ukkonen our strings are 0 based

for(i=0; i < Txt.length; i++)

{ var (s,k) = upDate(s, k, i);// follow path from active-point

(s,k) = canonize(s, k, i);

}

}//ukkonen95

but i cant really follow all of the code and also my method is quite fast anyway.

can i just check that the fastest way to get all chars from a file into a string is the following:

public static String getText( File aFile )

{

BufferedReader input = null;

int len = (int)aFile.length();

StringBuilder sB = new StringBuilder( len + 1 );

try

{

input = new BufferedReader(new FileReader(aFile));

}

catch (FileNotFoundException ex)

{

ex.printStackTrace();

}

char[] chars = new char[ len + 1 ];

try

{

sB.append( input.read( chars, 0, len ) );

}

catch (IOException ex)

{

ex.printStackTrace();

}

return sB.toString();

}

Message was edited by:

dannythomas13

dannythomas13a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 62

final InputStream is = new BufferedInputStream(new FileInputStream(file));

final ByteArrayOutputStream baos = new ByteArrayOutputStream();

final byte[] buffer = new byte[65536];

for (int count = 0; (count = is.read(buffer)) >= 0;)

{

baos.write(buffer, 0, count);

}

is.close();

baos.close();

final String result = baos.toString("ASCII"); // Or whatever coding your file is in

System.out.println(result);

sabre150a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 63

right, well there is just one part of the program now that has a self time of 80% so obviously this part of the app is the problem, as of yet i was unable to find a way to speed it up but perhaps some of you guys with more knowledge can see some obvious improvements and then maybe i wont need to distribute the workload, there is quite a lot of code for this part so i am going to try and cut small irrelevant bits out so that the posting isn't too large

thanks

Danny =)

dannythomas13a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 64

> final InputStream is = new

> BufferedInputStream(new FileInputStream(file));

> final ByteArrayOutputStream baos = new

> ByteArrayOutputStream();

>final byte[] buffer = new byte[65536];

> for (int count = 0; (count = is.read(buffer)) >=

> 0;)

>{

>baos.write(buffer, 0, count);

>}

>is.close();

>baos.close();

> final String result = baos.toString("ASCII"); //

> Or whatever coding your file is in

>System.out.println(result);

sorry but why the 65536 here? any reason?

thanks, i shall try it later against my method

dannythomas13a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 65

is anyone here familiar with the netbeans profiler?

if i profile part of application and add only the single method which stood out when i did profile all of application... are all the methods that are then shown in profiling results called by the method i said to profile.

Basically what i can see is that the part i said to profile has a self time of 87%, but everything else in the list is very low and i am wondering if this means that none of the methods that are called are slow, its just the code within the method? is this right?

thanks

dannythomas13a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 66
> sorry but why the 65536 here? any reason?> thanks, i shall try it later against my methodThat's exactly 64KB, and most of your files will fit into that, so all data might be read at once.Kaj
kajbja at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 67

> sorry but why the 65536 here? any reason?

> thanks, i shall try it later against my method

You just need a buffer to try to minimise the number of reads. One normally tries various values until one gets the best.

You can also use something like

File file = new File("your file name");

final DataInputStream is = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));

byte[] buffer = new byte[(int)file.length()];

is.readFully(buffer);

is.close();

final String result =new String(buffer,"ASCII"); // Again, whatever coding your file is in

System.out.println(result);

What it comes down to is that using a StringBuffer or StringBuilder is probably much slower than reading the whole file as bytes befroe converting to a String.

Message was edited by:

sabre150

sabre150a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 68
great, thanks for the reply sabre150
dannythomas13a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 69
Is there any reason why you need to convert you file from bytes to characters?
sabre150a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 70
Hmm.. I thought about using FileChannel and map the file, but the files are probably too small so it isn't worth it. Another option would be to try to do a normal read through the channel.Kaj
kajbja at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 71
characters are what i am inserting into the tree, these may be any characters i.e. chinese but still charcaters...or as i am saying this i am wondering, is it possible to simply use bytes to build the suffix trees, si that what you are saying?
dannythomas13a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 72

> Another option would be to try to do

> a normal read through the channel.

>

Simple (very non-rigorous) experiments I have done indicate that all the methods are about the same speed if one reads the whole file as bytes first.

There may be a 'best' way but from what the OP is saying it seems that this is unlikely to be the limiting factor. I would want to try for a solution that did not use Strings, just byte arrays or even indexes of bytes.

sabre150a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 73

> or as i am saying this i am wondering,

> is it possible to simply use bytes to build the

> suffix trees, si that what you are saying?

Yes.

P.S. Obviously, if you have Chinese characters then the file encoding is not ASCII!

Message was edited by:

sabre150

sabre150a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 74
you all may or not believe this but if i profile and dont exclude core classes i can see that a large portion of the time is taken up by updating my JLabel next to my progres bar in the gui!!!
dannythomas13a at 2007-7-21 12:48:45 > top of Java-index,Java Essentials,Java Programming...
# 75

With large heaps you can run into lots of TLB misses. In one of our applications we are dealing with a 5g heap and we found that using -XX:+UseLargePages improved performance by 25%.

See http://www.devx.com/amd/Article/30529 and http://java.sun.com/docs/hotspot/VMOptions.html#largepages

It is a little trick to set up but if you follow the directions carefully it can be done. Also with our -Xmx5g heap we needed 2656 pages (5,378,048MB). So you need more pages than you specify with the -Xmx setting. So if you get an vm message saying it can't use large pages, try allocating more pages.

Also compare -server and -client vm options and see if they improve anything.

Caffeine0001a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 76

> you all may or not believe this but if i profile and

> dont exclude core classes i can see that a large

> portion of the time is taken up by updating my JLabel

> next to my progres bar in the gui!!!

I must have missed something! I thought you were trying to create a distributed program so what is the GUI doing other than slowing you down dramatically!

sabre150a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 77
> Also compare -server and -client vm options and see> if they improve anything.That's a good point.
kajbja at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 78

> you all may or not believe this but if i profile and

> dont exclude core classes i can see that a large

> portion of the time is taken up by updating my JLabel

> next to my progres bar in the gui!!!

If you insist on a GUI. Don't update the progress more than once per second. And maybe only once every 10 seconds. Do you really need lots of GUI feedback?

Caffeine0001a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 79

> you all may or not believe this but if i profile and

> dont exclude core classes i can see that a large

> portion of the time is taken up by updating my JLabel

Huh? What are you doing? What has a gui got to do with that process

you're trying to optimize? I smell something here.

Jos

JosAHa at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 80
> Huh? What are you doing? What has a gui got to do> with that process > you're trying to optimize? I smell something here.> Basically my comment and it smells very ripe!
sabre150a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 81

smell what, what i am trying to do is optimise my code now as i have seen that there are lots of possible speed ups and i am saying that i have found that the gui is slowing the app down, i do need the gui but i shall take the advice of only updating once every ten seconds, possibly each minute and take out the JLabel which says the current comparison as this only takes a second, thanks again, great advice!

dannythomas13a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 82
> > Huh? What are you doing? What has a gui got to do> > with that process > > you're trying to optimize? I smell something here.> > > > Basically my comment and it smells very ripe!sorry but am i missing something
dannythomas13a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 83
How do you intend to merge the GUIs from each distributed process? If you have 40 process then you can't possibly run round and view each one so what exactly is the GUI doing?
sabre150a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 84

> > > Huh? What are you doing? What has a gui got to

> do

> > > with that process

> > > you're trying to optimize? I smell something

> here.

> > >

> >

> > Basically my comment and it smells very ripe!

>

> sorry but am i missing something here?

They are trying to be funny.

Caffeine0001a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 85
> > > Basically my comment and it smells very ripe!> > > > sorry but am i missing something here?> > > They are trying to be funny.My Mother always said I was very trying!
sabre150a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 86

The GUI could report on what each server is doing. Matter of fact you could have one computer that is just the monitoring server that all the other servers report their progress to. As long as you don't send too many updates (one a minute each or longer), it wouldn't be affect performance much. You going to need to send results from each server anyways.

Caffeine0001a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 87

if i do use a distributed system then i will not use the gui but it has been very handy whilst i have been building it that is all.

I have some very interesting results now that i have profiled the code without excluding core classes, I am hoping that if i post up the code that you need, one of you java jedi's can reduce my code down from a few days to within a day, that would be amazing! with excluding these classes, i have been optimizing things that take less than 1% of the total time and ignoring the big things

dannythomas13a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 88

> take out

> the JLabel which says the current comparison as this

> only takes a second, thanks again, great advice!

That definitely would slow things down a lot. That would be way too many comparisons for a user to see anyways. If showing the comparision is important feedback then just show ever 1,000,000 comparison or so.

if (comparisonCount % 1000000 == 0) {

showComparison();

}

Caffeine0001a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 89

ok, coming in at number 1 is:

java.awt.EventQueue.postEventPrivate with 47% of the total time, but i know that this is because a JLabel is updated so often, i dont know how to change it so that it doesn't update so often but at least i know it can be done so that is a start.

more importantly at number 2 with 26% of the total time is:

java.util.Collections.list

and java.util.Vector.nextElement with 17%

both of which are called by me doing the following:

int insertPos = BinarySearch.getInsertPosition( searchFor, Collections.list( this.children() ) );

Now i did this because i read that a List has a faster random access time than a Vector, which is what i believe this.children() to be from the api as my Node class which this method is contained extends DefaultMutableTreeNode

please tell me that one of you knows how to dramatically reduce this time as it shoudl have a great effect on the time of the app

thanks

dannythomas13a at 2007-7-21 12:48:50 > top of Java-index,Java Essentials,Java Programming...
# 90

I'm sorry for saying so but I was not trying to be funny and speaking for

Sabre (sorry) I think he wasn't trying to be funny either when he first

replied. I smell a badly designed solution to a less than half understood

problem. Any optimisations are moot at this stage.

kind regards,

Jos

JosAHa at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 91
Vector.nextElement is a part of Collections.list, didnt realise this so i guess it is just the Collections.list part as a whole that can be speeded up
dannythomas13a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 92
actually Josah, the problem IS understood, it is the java techniques which can speed up the result that i am inexperienced at...thanks
dannythomas13a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 93

> Now i did this because i read that a List has a

> faster random access time than a Vector, which is

Depends on if the List is an ArrayList or a LinkedList. On a single threaded application, I doubt you have much of difference between ArrayList and Vector.

In Collections.list() it creates a whole new ArrayList and copies each element into it. That definitely would be slower than just using the Vector.

Why not just use ArrayList from the start and avoid Vector altogether?

Caffeine0001a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 94
i dont think i can as i am extending the defaultmutabletreenode and children of this are of type vector? unless you know of a way so that the DefaultMutableTreeNode will use an ArrayList as its children instead of a VectorMessage was edited by: dannythomas13
dannythomas13a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 95

> Now i did this because i read that a List has a

> faster random access time than a Vector, which is

> what i believe this.children() to be from the api as

> my Node class which this method is contained extends

> DefaultMutableTreeNode

Are you by any chance updating a JTree or are you just using DefaultMutableTreeNode as a container?

sabre150a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 96

> int insertPos = BinarySearch.getInsertPosition(

> searchFor, Collections.list( this.children() ) );

>

> Now i did this because i read that a List has a

> faster random access time than a Vector,

You should use either an array, an ArrayList, or a LinkedList, not a vector.

ArrayList and arrays are fast for random (indexed) access, as in get(i). However, they are slow for inserting anywhere except at the end because you have move every element that comes after where you're inserting.

LinkedList is slow for indexed access near the middle (get(i)) because you have start at the beginning or the end and walk through each element up to half the length of the list. However it is fast for insert and remove from any position (once you have a reference to that element) because you just have to update a few pointers.

Which one is more appropriate for you depends on what you're doing.

jverda at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 97

> doubt you have much of difference between ArrayList

> and Vector.

Let me clarify. ArrayList and Vector are very similiar. Vector's methods are synchronized and ArrayList are not. Both implement the List interface. In your case, you are not using multiple threads, so ArrayList would be just fine.If you can't for some reason just pass the Vector to that method and avoid the coping of the whole List.

Caffeine0001a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 98

i need to be able to access any position in the list as i dont know where the insert position will be and also be able to insert into any position because the position i find by the random access (binary search) is where it will be inserted, however i am using DefaultMutableTreeNode so i am using Vector as the children of a Node

dannythomas13a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 99

> i dont think i can as i am extending the

> defaultmutabletreenode and children of this are of

> type vector? unless you know of a way so that the

> DefaultMutableTreeNode will use an ArrayList as its

> children instead of a Vector

Not that I know of. Just pass the Vector and profile again. Unless you want to rewrite DefaultMutableTreeNode to use ArrayList.

Caffeine0001a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 100

should this code work, what i have done is instead of changing to an ArrayList, i am sticking with a Vector

int insertPos = BinarySearch.getInsertPosition( searchFor, this.children() );

and

public static int getInsertPosition( char target, Vector<Node> children )

as Node is the class that extends DefaultMutableTreeNode

because i get a compiler error saying that:

getInsertPosition(char,java.util.Vector<Tree.Node>) in search.BinarySearch cannot be applied to (char,java.util.Enumeration)

but from the api, children should be a Vector? obvously it is a school boy error but i dont know the answer

dannythomas13a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 101
ah, do i need to use .elementData and then parse the objects within the binary search?
dannythomas13a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 102
My bad, I didn't realize that this.children() was an enumeration.Message was edited by: Caffeine0001
Caffeine0001a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 103
the problem seems to be that children is protected and children() returns an Enumerator which i dont want
dannythomas13a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 104
> the problem seems to be that children is protected> and children() returns an Enumerator which i dont wantYes,Since you are in the class use "this.children" Notice no ().
Caffeine0001a at 2007-7-21 12:48:55 > top of Java-index,Java Essentials,Java Programming...
# 105
Or create your own access method that returns the children.public Vector getRawChildren() {return this.children;}
Caffeine0001a at 2007-7-21 12:49:00 > top of Java-index,Java Essentials,Java Programming...
# 106

thanks again, using the vector and eliminating the list has saved a lot, now the main culprit is again my compare method so i shall need to do some more work on that...after this nothing will stand out, dont know if this is a good thing yet or not as it may mean that everything needs speed up then lol...nah, hope not

thanks again for all the help, if i can solve a fast comparator i shall give an update of how it is doing on a single processor

dannythomas13a at 2007-7-21 12:49:00 > top of Java-index,Java Essentials,Java Programming...
# 107

I haven't followed the whole thread but it looks like you are constructing a suffix tree _from_ a suffix array. There are linear time suffix tree construction algorithms (you'll have to do some reading on that) and after constructing the suffix tree, you can transform it in linear time into a suffix array. (Lookup time should be the about the same for a suffix tree and a suffix array.)

javafora@gmail.coma at 2007-7-21 12:49:00 > top of Java-index,Java Essentials,Java Programming...
# 108
True; also "the Design and Analysis of Algorithms" by Aho, Hopcroft andUllman, chapter 9 (or 10, I'm not sure; I don't have that book at handright now) is invaluable.kind regards,Jos
JosAHa at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 109
@Op. How many comparisions can you now do per second?Kaj
kajbja at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 110

> thanks again, using the vector and eliminating the

> list has saved a lot, now the main culprit is again

> my compare method so i shall need to do some more

> work on that...after this nothing will stand out,

> dont know if this is a good thing yet or not as it

> may mean that everything needs speed up then

> lol...nah, hope not

> thanks again for all the help, if i can solve a fast

> comparator i shall give an update of how it is doing

> on a single processor

Eventually your file reading is going to take up a significant amount of your time. You will get into a pattern of Read from disk, low cpu usage, then you will process data with high cpu usage and repeat. A very simple change can allow you to process data while reading in the next file without having to do a lot of synchronizing.

Your program can be summerized:

for (File nextFile : files) {

String data = readFile(nextFile);

processData(data);

}

The leads to this type of pattern where each line represent some unit of fixed time.

1 -> Read from file

2 -> Proccess data

3 -> Continue proccessing data

4 -> Read next file

5 -> Process data

6 -> Continue processing data.

7 -> Read next file.

8 -> Process data

9 -> Continue processing data.

10 -> Read next file.

11 -> Process data

12 -> Continue processing data.

etc

Lines 1,4,7,10 do not use much cpu. The same amount of work could be done like this.

1 -> Read from file

2 -> Proccess data and Read next file.

3 -> Continue proccessing data

4 -> Process data and Read next file

5 -> Continue processing data.

6 -> Proccess data and Read next file.

7 -> Continue proccessing data

8 -> Proccess data and Read next file.

9 -> Continue proccessing data

In this example 9 units of time vs 12 for the same number of files. Of course the time saving varies with the ratio of file reading time compared to processing data time.

Now how to I change the code above. By using some classes from [url http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html]java.util.concurrent[/url]

BlockingQueue<String> blockingQueue = new ArrayBlockingQueue<String>(1);

new Thread(new Runnable() {

public void run() {

processData(blockingQueue.take());

}

}

).start();

for (File nextFile : files) {

String data = readFile(nextFile);

blockingQueue.put(data);

}

This works quite well on a single CPU system and with just a few minor changes you could take also take avantage of machines with additional processors/cores.

Caffeine0001a at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 111

> ok, coming in at number 1 is:

> java.awt.EventQueue.postEventPrivate with 47% of the

> total time, but i know that this is because a JLabel

> is updated so often, i dont know how to change it so

> that it doesn't update so often but at least i know

> it can be done so that is a start.

Why do you have a GUI? This is not necessary based on the requirements you gave us.Don't try to update it less frequently, get rid of it completely! It looks like it is taking at least half the time.

zadoka at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 112

Forgot a for loop:

new Thread(new Runnable() {

public void run() {

for (;;) {

processData(blockingQueue.take());

}

}

}

).start();

for (File nextFile : files) {

String data = readFile(nextFile);

blockingQueue.put(data);

}

Caffeine0001a at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 113

ok, as an update i now get nearly 100 comparisons a second which is loads faster so thanks guys but unfortunately it is still not within a day which would be ideal really. but thanks anyway to everyone who had advice, loads faster!

after profiling i now have the following:

14% of the time is spent doing the binary search:

public static int getInsertPosition( char target, Vector<Node> children )

{

int high = children.size() - 1;

int low = 0;

while( low <= high )

{

int mid = ( low + high ) / 2;

char compare = children.get( mid ).getMyChar();

if( compare == target )

{

return mid;

}

else if( compare < target )

{

low = mid + 1;

}

else

{

high = mid - 1;

}

}

return low;

}

but almost half of this time is spent doing getMyChar which is the following:

public char getMyChar()

{

return myRootNode.inputString.charAt( this.index );

}

now unfortunatley i have to use getMyChar as i cannot store the string on each node, it would use far more memory so what i do instead is store the index and also length within the original text.

the next is very similar to the getMyChar and takes 10% of the time and that is:

public String getNodeString ()

{

return myRootNode.getInputString().substring( this.index, this.index + this.length );

}

but again i cannot see a way around this either.

these are the small bits of code that wouuld make a difference if they could be made faster, would be very grateful if anyoen can spot any speed ups.

There is one other thing that takes a bit of time but it is more code and takes more thinking so i'm not sure if anyone could be bothered looking at it, may post it up if anyone fancies it

thanks

Danny

dannythomas13a at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 114

> Forgot a for loop:

>

> > new Thread(new Runnable() {

>public void run() {

>for (;;) {

>processData(blockingQueue.take());

> }

>}

>

> ).start();

>

> or (File nextFile : files) {

>String data = readFile(nextFile);

> blockingQueue.put(data);

> }

>

wow, thanks for that caffeine, you are dead right, that is what is happening, i actually created a multithreaded version and got a thread to load a file whilst the other thread was processing the current one but it didn't work out too well so i took it out, will definitely give that a shot over the weekend.

thanks

dannythomas13a at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 115
@Op. Why are you working on strings instead of char[] and views of that array?
kajbja at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 116

> > ok, coming in at number 1 is:

> > java.awt.EventQueue.postEventPrivate with 47% of

> the

> > total time, but i know that this is because a

> JLabel

> > is updated so often, i dont know how to change it

> so

> > that it doesn't update so often but at least i

> know

> > it can be done so that is a start.

>

> Why do you have a GUI? This is not necessary based

> on the requirements you gave us.Don't try to

> update it less frequently, get rid of it completely!

> It looks like it is taking at least half the time.

He already stated why he has a GUI. And it is very easy to fix the 47% total time spent on GUI update. He could also use [url http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/Timer.html]javax.swing.Timer[/url] to update the label every 5 seconds.

I do believe he has tied his code too tightly to the GUI and that should turn off his GUI while he tries to find hotspots in his algorithms. But as you can see if he were later to add the GUI back in and does it wrong then it can have a very large impact indeed. A lot of applications when doing long running tasks update their GUI's way to frequently. This is something beginner programmers need to learn.

Caffeine0001a at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 117

> @Op. Why are you working on strings instead of char[]

> and views of that array?

getNodeString() is called within my comparator as for that i need to sort the suffix array:

public class NewCompareNodes implements Comparator

{

public int compare(Object o1, Object o2)

{

String s1 = ((Node) o1).getNodeString();

String s2 = ((Node) o2).getNodeString();

return s1.compareTo(s2);

}

}

do you know of a quicker way?

dannythomas13a at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 118

> do you know of a quicker way?

No, I just saw that it looks like you often want to work on the characters in a String, so why do you have a string to begin with? Why not char[]

E.g. getMyChar() should just return a character from the array.

Btw. Have you tried to inline that code instead of calling the method?

kajbja at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 119
inline which code, i cant inline the comparator can i? if i am calling Arrays.sort....btw is char[ ] quicker than a String as i could set inputString within RootNode to be a char[ ] if it would be quicker
dannythomas13a at 2007-7-21 12:49:01 > top of Java-index,Java Essentials,Java Programming...
# 120

> inline which code,

getMyChar()

>

> btw is char[ ] quicker than a String as i could set

> inputString within RootNode to be a char[ ] if it

> would be quicker

charAt will become quicker, and also other operations where you wan't access the elements within the string.

There might however be other parts of the applications which might become slower.

kajbja at 2007-7-21 12:49:06 > top of Java-index,Java Essentials,Java Programming...
# 121

> inline which code, i cant inline the comparator can

> i? if i am calling Arrays.sort....

>

> btw is char[ ] quicker than a String as i could set

> inputString within RootNode to be a char[ ] if it

> would be quicker

i think in a future version i will be using bytes instead of char's as i will need to be able to use i.e. chinese symbols, any symbold should be accepted really

dannythomas13a at 2007-7-21 12:49:06 > top of Java-index,Java Essentials,Java Programming...
# 122
is a CharSequence quicker than a char[ ]?
dannythomas13a at 2007-7-21 12:49:06 > top of Java-index,Java Essentials,Java Programming...
# 123

> i think in a future version i will be using bytes

> instead of char's as i will need to be able to use

> i.e. chinese symbols, any symbold should be accepted

> really

In that case you should definitely use chars, as they support all Unicode characters. Using bytes would require you to decode the bytes to chars every time you wanted to do something with the chars.

And especially since you mentioned this in the context of sorting. Which ordering would you be using for the Chinese characters?

DrClapa at 2007-7-21 12:49:06 > top of Java-index,Java Essentials,Java Programming...
# 124
Which ordering would you be using> for the Chinese characters?lol, god knows, don't know anything about chinese chars, my supervisor mentioned that it will need to handle them that's all
dannythomas13a at 2007-7-21 12:49:06 > top of Java-index,Java Essentials,Java Programming...
# 125

> is a CharSequence quicker than a char[ ]?

Come on, you're starting to ask stupid questions here. A CharSequence is an interface so it doesn't do anything. You can't ask how fast an interface runs because it doesn't run. There can be many classes that implement CharSequence. You would have to ask the question about whatever class you chose.

DrClapa at 2007-7-21 12:49:06 > top of Java-index,Java Essentials,Java Programming...
# 126

> lol, god knows, don't know anything about chinese

> chars, my supervisor mentioned that it will need to

> handle them that's all

That itself isn't a problem. The problem is that you are trying to optimize a design before you even know what the design is. You're going to find you have made some decisions that conflict with the actual requirements, when you find out what those are.

And even that isn't really a problem. Once you find the requirements, you change the code to satisfy them. If that requires throwing away several days of work you spent on optimizing some other task you aren't actually going to be implementing, then so be it. That's just life in the computing world.

DrClapa at 2007-7-21 12:49:06 > top of Java-index,Java Essentials,Java Programming...
# 127

>

> String s1 = ((Node) o1).getNodeString();

>String s2 = ((Node) o2).getNodeString();

>

>return s1.compareTo(s2);

Instead of creating two new strings using the index of the node (just to use the compareTo method), you can write your own comparison code based on indexes on the original string.

javafora@gmail.coma at 2007-7-21 12:49:06 > top of Java-index,Java Essentials,Java Programming...
# 128
thanks, using char[] instead of string has made it over 30% faster, is that right if 11% of a small test was 7930ms but is now 5496ms.
dannythomas13a at 2007-7-21 12:49:06 > top of Java-index,Java Essentials,Java Programming...
# 129

the slowest part is again the comparitor though, this is the new code using a char[] instead of a String:

public class CompareNodes implements Comparator {

/** Creates a new instance of CompareNodes */

public int compare( Object o1, Object o2 )

{

Node n1 = (Node)o1;

Node n2 = (Node)o2;

int len = Math.min(n1.getLength(), n2.getLength());

char[] inputString = n1.getRootNode().getInputString();

for( int i = 0; i < len; i++ )

{

if( inputString[ n1.getIndex() + i ] < inputString[ n2.getIndex() + i ] )

{

return -1;

}

else if( inputString[ n1.getIndex() + i ] > inputString[ n2.getIndex() + i ] )

{

return 1;

}

}

if( n1.getLength() < n2.getLength())

{

return -1;

}

else if( n1.getLength() > n2.getLength() )

{

return 1;

}

else

{

return 0;

}

}

}

c