Quick sort related questions

I have three quicksort related questions that are all related to each other more or less so I am grouping them here.

1) Why does Collections.sort have to call toArray() ? This is just whining i suppose but I don't want my List implementation turned into an array because it defeats the purpose of my particular List implementation.

Now to solve the first one I wrote my own quicksort which works fine but...

2) I have heard there is some magic number whereby the number of elements in a List being less then the magic number it is less costly to do a insertion sort than a quick sort. I may be wrong on the insertion sort but another type of non recursive sort anyway. Is this true? What is this number? Where does it come from?

I already have an insertion type sort as part of my alogrithm which I call if I recurse to a certain level to avoid stack overflows. So I am wondering if there is further tuning I could do.

3) In my implementation of quicksort I maintain 3 buckets instead of 2 and it surprises me that the standard implementation is also not like this. Is there a reason for this?

Perhaps I should explain this last one better I maintain one bucket of equal to values as well which stops them from having to be resorted in the recursive call.

[1299 byte] By [] at [2007-10-2 23:19:36]
# 1

> 1) Why does Collections.sort have to call toArray() ?

To reduce it to a problem they've already solved.

> 2) I have heard there is some magic number

Implementation-specific. At some point the overhead of recursive calls (setting up and tearing down the stack) becomes too expensive.

> 3) In my implementation of quicksort I maintain 3 buckets

In place? Must be quite a bit trickier to prove correct.

YAT_Archivista at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 2

> > 1) Why does Collections.sort have to call toArray()

> ?

>

> To reduce it to a problem they've already solved.

>

But haven't they rather busted encapsulation to do this?

> > 2) I have heard there is some magic number

>

> Implementation-specific. At some point the overhead

> of recursive calls (setting up and tearing down the

> stack) becomes too expensive.

>

I see.

> > 3) In my implementation of quicksort I maintain 3

> buckets

>

> In place? Must be quite a bit trickier to prove

> correct.

Eh? What do you mean prove correct?

YAT_Archivista at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 3

> > > 1) Why does Collections.sort have to call

> toArray()

> > ?

> >

> > To reduce it to a problem they've already solved.

> >

>

> But haven't they rather busted encapsulation to do

> this?

>

I don't understand your comment/question. If they use publically available methods then how have they busted encapsulation?

sabre150a at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 4

> > > > 1) Why does Collections.sort have to call

> > toArray()

> > > ?

> > >

> > > To reduce it to a problem they've already

> solved.

> > >

> >

> > But haven't they rather busted encapsulation to do

> > this?

> >

>

> I don't understand your comment/question. If they use

> publically available methods then how have they

> busted encapsulation?

Because.

I guess it would be helpful to explain what I am doing. I have an implemenation of List that is partially (largely) cached onto disk. The point of this is that these lists contain millions of items and there can be a number of them (lists) in use by the application at any one time.

So I am willing to sacrifice some performance in exchange for being able to use as many of these super-sizes lists that I want. And to be honest it works better than I had originally expected.

At any rate for me toArray is a bad method and it seems to me they didn't think of something like this but it's too bad because the rest of it works like a normal List.

I guess in the end to me calling toArray assumes that the List can be produced as an array which assumes something about a List that in my opinion breaks encapsulation. Just my opinion I guess.

sabre150a at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 5

> At any rate for me toArray is a bad method and it

> seems to me they didn't think of something like this

> but it's too bad because the rest of it works like a

> normal List.

>

> I guess in the end to me calling toArray assumes that

> the List can be produced as an array which assumes

> something about a List that in my opinion breaks

> encapsulation. Just my opinion I guess.

I stil don't see an encapsulation issue. Your list implementation can always be converted to an array if you have enough memory available. Your problem is that it makes no sense because it defeats the object of having the data just on disk. You could always forbid the operation in your list by throwing an UnsupportedOperationException.

sabre150a at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 6
I don't see how sort calling toArray breaks encapsulation any more than the existence of toArray does in the first place (which I would say is not at all).
jverda at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 7
> I don't see how sort calling toArray breaks> encapsulation any more than the existence of toArray> does in the first place (which I would say is not at> all).Fundamentally to me because it assumes a List is backed by an array.I think.
jverda at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 8

> I stil don't see an encapsulation issue. Your list

> implementation can always be converted to an array if

> you have enough memory available. Your problem is

> that it makes no sense because it defeats the object

> of having the data just on disk. You could always

> forbid the operation in your list by throwing an

> UnsupportedOperationException.

I in fact do this exactly. Hence the commencing of whining when trying to sort the Collection bombed out.

jverda at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 9

Okay well thanks to all for the feedback on question #1. I appreciate your opinions. Can anyone comment on #3 because I am more interested in that (or two really) the first one was more of a whine like I said. It just seems wrong to me but that's not going to change anything and really I was just curious.

Does anyone know what the heck YATA was talking about when he said difficult to prove correct? Because I don't. The sort works. Why an equals bucket makes it more difficult to see that I don't know.

jverda at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 10

> > I don't see how sort calling toArray breaks

> > encapsulation any more than the existence of

> toArray

> > does in the first place (which I would say is not

> at

> > all).

>

> Fundamentally to me because it assumes a List is

> backed by an array.

>

No! LinkedList works just as well!

sabre150a at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 11

> > I don't see how sort calling toArray breaks

> > encapsulation any more than the existence of

> toArray

> > does in the first place (which I would say is not

> at

> > all).

>

> Fundamentally to me because it assumes a List is

> backed by an array.

I don't think it assumes that. I merely assumes that, as a sequential collection, it can be represented as an array.

The only hitch I see is that arrays have a hard upper limit on size, where as Lists don't. They do, however, become a tad crippled when they exceed Integer.MAX_VALUE items, because size() stops increasing.

So, do you think that toArray itself breaks encapsulation, or merely sort's use of it?

jverda at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 12

> Does anyone know what the heck YATA was talking about

> when he said difficult to prove correct? Because I

> don't. The sort works. Why an equals bucket makes it

> more difficult to see that I don't know.

Proving an algorithm or program correct, in the CS sense, is much more rigorous than just unit testing. It's more of a mathematical proof. I don't know any details, but my understanding is that it is difficult to prove even simple code correct. An in-place quicksort is not exactly what one would call simple, and so a formal proof of correctness may be quite intractable.

Not that most folks go around proving their code correct, but it may be that some common, core algorithms--such as sorts--do get proven correct.

jverda at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 13

>

> So, do you think that toArray itself breaks

> encapsulation, or merely sort's use of it?

Both.

But I am going to stop calling it breaking encapsulation at this point because I can't defend that point clearly enough.

I do think I can say that to me toArray is code hackery and smells. To me the point of an interface is that you are free from implementation specifics but the existence and use of the method suggest there are times one does care about the implementation specifics.

So I guess my argument is that toArray assumes that the List is implemented by an in memory structure.

jverda at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 14

> > Does anyone know what the heck YATA was talking

> about

> > when he said difficult to prove correct? Because I

> > don't. The sort works. Why an equals bucket makes

> it

> > more difficult to see that I don't know.

>

> Proving an algorithm or program correct, in the CS

> sense, is much more rigorous than just unit testing.

> It's more of a mathematical proof. I don't know any

> details, but my understanding is that it is difficult

> to prove even simple code correct. An in-place

> quicksort is not exactly what one would call simple,

> and so a formal proof of correctness may be quite

> intractable.

>

> Not that most folks go around proving their code

> correct, but it may be that some common, core

> algorithms--such as sorts--do get proven correct.

I see. I don't understand but I see.

Is this (difficulty of proving) really the answer to my question though?

I think a two bucket quicksort is horribly inefficient if all the items are not unique.

jverda at 2007-7-14 15:57:02 > top of Java-index,Other Topics,Algorithms...
# 15

Well I haven;t quite got answers for my questions but I did get some food for thought which for questions of this type is I think the best one can hope for.

So I am going to assign some of the dukes now but I'll leave a couple in case and if you would like to add something please do. I do appreciate it. And for the record I don't mean to be difficult or argumentative I just find this sort of discussion more educational.

at 2007-7-21 8:45:24 > top of Java-index,Other Topics,Algorithms...
# 16

> I think a two bucket quicksort is horribly

> inefficient if all the items are not unique.

But will it matter since you will be spending a significant time accessing the hard disk?

P.S. I use 'heap sort' when I cannot use the built in Java sort. On overage it takes twice as long as quick sort but it's worst case performance is very similar to it's nominal.

sabre150a at 2007-7-21 8:45:24 > top of Java-index,Other Topics,Algorithms...
# 17

> > I think a two bucket quicksort is horribly

> > inefficient if all the items are not unique.

>

> But will it matter since you will be spending a

> significant time accessing the hard disk?

>

Yes. Perhaps moreso even. My way means items are not repeatedly resorted.

> P.S. I use 'heap sort' when I cannot use the built in

> Java sort. On overage it takes twice as long as quick

> sort but it's worst case performance is very similar

> to it's nominal.

I see. I'll look into that.

sabre150a at 2007-7-21 8:45:24 > top of Java-index,Other Topics,Algorithms...
# 18
> Is this (difficulty of proving) really the answer to> my question though? Not really. I don't know the answer to your original question. I was just trying to explain what I think YATA meant by the "proving it correct" comment.
jverda at 2007-7-21 8:45:24 > top of Java-index,Other Topics,Algorithms...
# 19
On proving code correct: with most code I don't bother. However, there are a number of ways of writing partitioning code with subtle bugs, so I prefer to prove my implementations quicksort and quickselect correct.You didn't answer my question as to whether your sort is in place.
YAT_Archivista at 2007-7-21 8:45:24 > top of Java-index,Other Topics,Algorithms...
# 20

I guess I'll be the token dinosaur for this one.

QuickSort was developed to be quick on an array where you have random access to every element in the sort. It is an "internal" sort method meaning that it assumes that you can fit the entire array into memory.

It was quick in the days when people sorted arrays of integers and you could scan for elements that needed to be swapped with a single microcode assembly instruction. It has BAD worst case performance, like it is order n^2, also it is not stable, but that really fast scan generally made quick sort faster than routines like heapsort or mergesort that had better worst case behaviors. That's why they called it quick.

Once you move out of arrays, and or move away from compares that can be done by a single REP CMPSW instruction, some of the advantages of quicksort begin to fade.

The poster who said that the magic number is dependent upon the recursive call overhead was correct but just to stick my neck out, the magic number is 5.[ I looked it up - Sedgewick claims between 5 and 25 and that it leads to about a 20% performance gain]. What I mean is that the number is not big. None the less, a well written Quicksort library routine will often stop calling quick sort recursively when the subarray is tiny (5 or less) and will do one single insertion sort pass on the entire array when all the recursion is done. Note this improvement, using insertion sort for nearly sorted lists, works for arrays but not for singly linked lists. In the days when one wrote code for a single machine (because one worked for IBM) one could and would tune a library routine for a single machine. In these days of Virtual Machines chances are this optimization is either not done, or if it is done, some random number like 5 is selected and no tuneing is done.

The situation that you are dealing with is know as an "external" sort, where you have long lists that do not fit in memory all at once. A different set of techniques have been developed for dealing with this problem, most of them involving loading as much into memory as one can, doing an internal sort using quicksort or something else and finally merging the remaining sorted chunks.

It is no real surprise that a collections package assumes that you have collections that fit into memory. External collections would require some serialization notion that would tend to be problem specific.

It also sounds like you are dealing with linked lists rather than arrays. If that is the case then merge sort is generally preferable. It is stable, it has better worst case characteristics and written correcly it will run as fast as quicksort.

That is to say, it will run as fast as quick sort if your quicksort keeps the items as lists. On the other hand, If you have the memory to convert the list to an array, It still might be faster to convert your list to an array, quicksort the array and then convert back to a list. The reason this might be faster is just because list processing with its extra memory reads could be slower than array processing.

So the answer to your final question, why don't people keep 3 lists rather than 2 in quicksort, the answer is because quicksort is not a good choice for sorting lists and thus the problem never arises.

Enjoy

marlin314a at 2007-7-21 8:45:24 > top of Java-index,Other Topics,Algorithms...