"effecient" duplicate removal from an array
Hi,
I have looked at a lot of previous posts about how to effeciently remove duplicates OR detect duplicates in an array of integers, but unfortunately most of them are
of O(n^2) time complexity
or
of O(n log n) time complexity with o(n) space complexity
I was wondering if there is a faster alogrithm (although harder to understand) which beats this complexity. I am not looking to use any Objects like hashsets/map/arraylista for storage. Just C type array maniupulation.
Duplicate removal can be reduced to sort. There are O(n) sorting algorithms that work for integers. See Cormen, et al. Introduction to Algorithms, chapter 8, or Google for "radix sort", "counting sort", "bucket sort".
Dump List into a HashSet.Dump List out of the HashSet.Voila, done.~Cheers
> Dump List into a HashSet.
> Dump List out of the HashSet.
> Voila, done.
>
> ~Cheers
Oh, you can't use the right tools for the job. Sorry.
How big is this list, anyway? If it's reasonably small (I'd say under a million elements) then O( n log n) probably isn't going to be too bad. What's your reason for needing an order n algorithm?
use a set to hold the values and if the set does not contains the value, then insert it into the array. I believe the Set add(Object o) is O(n) then this should give you O(n). The downside if the requirement of extra memory space as big as the array itself.
> use a set to hold the values and if the set does not
> contains the value, then insert it into the array. I
> believe the Set add(Object o) is O(n)
> then this should give you O(n).
Yes, depending on which Set implementation you use... add() for HashSets is O(1) which would give O(n) total running time, but TreeSet (or whatever its called) is O(log n) .
Of course, this is the right way to do this, but the OP has that silly requirement to not use "anything like that"...
Easy. That youngsters they don't know even algorithms that were used few tenth years ago.
There is easy nice algorithm for that, it needs just a 3 times of original memory. (If you need to preserve original array.)
There is just a little problem, if you are university educated, from the last few years, you'd have a big problem to understand it, and implement it.
A(4*a+a)
This is better aproximation of complexity. Using aproximation based on upper bound of algorithm could cause problems.
Raghar
Here is an algorithm that while not exactly linear is fairly easy to understand, should have much better runtime than sorting the elements, and lastly requires no extra storage out side of the array other than a couple state variables.
Primarily we use the table itself as a hashtable. You can then define an element as being in the proper place if Hash(A(i)) == i.
Now make one pass through the list doing the following. Take each element and try to move it to its proper place, however there is one restriction: do not move it to its proper place if the element sitting there is already in its proper place.
During the attempt to move an element to its proper place, if you find that you have a duplicate, then remove it. By remove it, I mean, thread this location onto a list of empty locations, using the space in the array to hold the list pointer. Thus the empty list takes up no space outside the array (and yes, i know this will not work if you were talking about an array of bytes, so it goes)
During this pass, if you find that you can move an element, simply swap it with the element that did not belong in the target slot and now try to place this new element. Fear not, this process must terminate. Every step puts another element in its proper place.
By the time you are done with this pass you will now have the array in the following condition. There are some elements in the array that will be sitting in their proper place. These elements will be unique, ie all their duplicates will have been removed because the first guy that made it into his proper slot will never be displaced from that slot and anyone that matched will be removed as a duplicate.
You will have some elements that are not in their proper place (which you can always test by hashing them and seeing that they are not in place). These elements came about because of hash collisions, they hashed to a place that was already taken. As a result you have not processed them. There may still be duplicates in this batch.
And lastly, you have an empty list that is conveniently in sorted order (because you were going through the original array in order and thus the trail of empties is in sorted order.)
Now you need to unwind this mess, Sorting the known to be unique guys to the front, the collision dudes right behind them and the empties all the way at the end. This can be done in a single straight forward pass. You look at each element for processing. First you check if it is at the head of the empty list, and then if it isn't empty you hash it to see if it is in its proper place or not. Once you have identified the element as to which of the 3 groups it is in, you can roll it into the growing initial part of the array which is always kept in those three sections, Unique guys, followed by collision possibly non-unique guys, followed by empty guys.
This untangling of the lists is also clearly order N.
Needless to say, when you are done with the first pass, you can now safely ignore the initial and the final portions of the array and just do the same thing to the colliders in the center. The prudent observer will however notice that this is very wastefull of the empty space at the back. You can simply keep the empties in the later passes to further spread the hash function.
A single pass, hashing and then untangling, of this algorithm is clearly of order N. The next pass will be on a reduced array but the reduction will be based on the probabalistic nature of the hash function. For example it is entirely possible that on your first pass you had an array with a million copies of a single element and one single unique element and those two elements just happen to hash to the same value and the single unique one just happened to be in its proper place. So you process a million and one elements on the first pass and have a million to process on the second pass.
So it goes. In general however, the number of hash collisions will be a small fraction of the array.
Just for the record, personally I would just sort and prune and be done with it. Easy code is generally much preferable to theoretically fast algorithms. But it was fun to think of an alternative. Thanks for the question, what a nice way to start the new year. If anyone implements this and it actually runs, do let me know.
Enjoy!
> Andd lastly, you have an empty list that is conveniently in sorted orderI'm going to assume this clause doesn't accurately convey your intent.;~)~Cheers
You are correct.I meant to say And:)
> I meant to say AndNo, you said "And;" that extra 'd' was my typo.> And lastly, you have an empty list that is conveniently in sorted orderJust think about it for a second ;~)Time's up. When is an "empty list" not in sorted order?
we are talking code here, not linguistics.
a dog list is a list of dogs
a cat list is a list of cats
and an empty list is CLEARLY a list of empties.
you may have several empties on that list.
Had I meant to say that the list was empty, I would have said that the list was empty, wouldn't I?
However now that you mention is, what if the empty list is actually empty?
Humm...We seem to be getting within epsilon of a zen koan here. This might mean something... but then again... probably not. :P
Thanks for reading.
Can't you just quicksort the array? That's usually O(nlogn) time complexity (especially if you use the one in java.util.Arrays, which is extremely good at avoiding the worst case). It operates in place, so virtually no additional memory is required -- other than the memory required for a couple of operating variables and stack space and whatnot. Unfortunately, you can't do better than O(n) space, since you have to at least process the array itself.
In general, it is impossible to do better than O(n) space if the input data needs to be manipulated.
> There are O(n) sorting algorithms that work for integers.
Unfortunately, these are at least O(n) space complexity in best case and have horrible worst case space (try sorting {-1e6, 1e6} using a bucket sort...). However, if you know the range of values you will be working with, they may be plausible solutions.
> > There are O(n) sorting algorithms that work for
> integers.
> Unfortunately, these are at least O(n) space
> complexity in best case and have horrible worst case
> space (try sorting {-1e6, 1e6} using a bucket
> sort...). However, if you know the range of values
> you will be working with, they may be plausible
> solutions.
Radix sort uses O(n) space.
> > And lastly, you have an empty list that is
> conveniently in sorted order
>
> Just think about it for a second ;~)
>
> Time's up. When is an "empty list" not in sorted
> order?
When there could some nonempty lists that points on the empty list, prefferably to different locations of the empty list.
It's somewhat like cometo (jump from address, as opposed to jump to address) without its unceirtanity.
So could we call this YAT?
Raghar
Correction: RadixSort is O(nk), where k is the average key length. k could easily exceed log(n), meaning RadixSort isn't necessarily any better QuickSort.
> Radix sort uses O(n) space.
> > Correction: RadixSort is O(nk), where k is the
> > average key length. k could easily exceed log(n),
> > meaning RadixSort isn't necessarily any better
> > QuickSort.
Radix sort uses O(n) space.
Or should. I'm sure you could make on that doesn't.
> Radix sort uses O(n) space.> > Or should. I'm sure you could make on that doesn't.Aah, I misread your post. Well played.
radix sort is also O(n log n) time complexity. I think the point is that you can only get O(n) time complexity if you have a potentially large space complexity. If you're willing to go with O(n log n) time, there are a number of algorithms with O(n) space. I can't find anything that is a simple O(n) for both time and space. This makes sense though as most any theory professor will tell you there are no O(n) sorts out there for all cases.
looking more into it, you can get a radix sort with O(n) time complexity, but it has the same problem as the other two searches in that it requires a hard coded limit. So again, if you know the bounds on your problem space is small, you can utilize any of the algorithms you mentioned above in O(n) time. But if the bounds unknown, you're stuck with O(n log n) time sorts.
> This makes sense though as
> most any theory professor will tell you there are no
> O(n) sorts out there for all cases.
This is true for normal computers but for quantum computers that's not necessarily the case. Whether quantum computers become viable remans to be seen but it looks pretty likely that they will come to fruition.
> This is true for normal computers but for quantum
> computers that's not necessarily the case. Whether
> quantum computers become viable remans to be seen but
> it looks pretty likely that they will come to
> fruition.
There are no quantum-sorting algorithms. The only problems for which superior quantum algorithms have been discovered are the following: factoring, discrete-log, quantum mechanics simulation, and quantum database search.
> This is true for normal computers but for quantum
> computers that's not necessarily the case. Whether
> quantum computers become viable remans to be seen but
> it looks pretty likely that they will come to
> fruition.
There are no known quantum-sorting algorithms. The only problems for which superior quantum algorithms have been discovered are the following: factoring, discrete-log, quantum mechanics simulation, and quantum database search.
> There are no known quantum-sorting algorithms. The
> only problems for which superior quantum algorithms
> have been discovered are the following: factoring,
> discrete-log, quantum mechanics simulation, and
> quantum database search.
http://www-math.mit.edu/~sipser/ntt/quantum-computing.html
under papers (the paper is really difficult to read. Maybe becasue my Acrobat is old.)
"We consider the problem of inserting one item into a list of N-1 ordered items. We previously showed that no quantum algorithm could solve this problem in fewer than log N/(2 log log N) queries, for N large. Here we give a quantum algorithm that performs the insertion in fewer queries than is classically possible. Our result is that the insertion problem can be solved in 0.53 log_2 N quantum queries for large N (where log_2 N is the classical lower bound). One implication of this result is that N items can be sorted in 0.53 N log_2 N queries, better than possible classically."
> http://www-math.mit.edu/~sipser/ntt/quantum-computing.
> html
> ...
Err..., that kind of brings up the question of what we mean by "better" when it comes to algorithms. A constant factor speed-up is normally beneath the notice a computer scientist because of the linear-speedup theorem. But it's still an O(nlogn) algorithm, just a substantially superior one. But your point is well made -- there it definitely the suggestion there that an O(nlogn / 2loglogn) algorithm may exist.
That's why I like these forums -- so many smart people with interesting references. Lots to learn.
> looking more into it, you can get a radix sort with
> O(n) time complexity, but it has the same problem as
> the other two searches in that it requires a hard
> coded limit.
Part of your program could need a some hard coded limit for decision about algorithm.
> But if the bounds unknown, you're stuck with O(n log n) time
> sorts.
He said array of integers, for such situations is radix sort probably best., and stable on top of that. (Aside cache blasting) From my experience it wasn't too harsh even for BW transformation.
> Err..., that kind of brings up the question of what
> we mean by "better" when it comes to algorithms. A
> constant factor speed-up is normally beneath the
> notice a computer scientist because of the
> linear-speedup theorem. But it's still an O(nlogn)
> algorithm, just a substantially superior one. But
> your point is well made -- there it definitely the
> suggestion there that an O(nlogn / 2loglogn)
> algorithm may exist.
Yes I thought that was odd too. I just grabbed the first thing I could find. There were references to 'Oscar's Algorithm' not sure what that is though.
You would think that the ability to search an entire array in one operation would provide some benefit for sorting but I have to admit my interest in this is purely recreational. I actually know somone who is working on Quantum computers. I can't get him to tell me anything more than I already know, though.
Well given that this thread has devolved into discussions of the theoretical speed of sorting on non-existent machines it is unlikely that anyone cares about the OPs question anymore, namely if there exists an actual order N algorithm for eliminating duplicates from an array in place.
The answer is yes, the algorithm that I gave is in fact order N, it just took me a few days to see it. Unfortunately the algorithm as described did have one serious drawback, namely that it was wrong. It looked good through the beer goggles a couple days ago but a few days later it leaves a little something to be desired. Fortunately the fix simplifies the algorithm.
First the bug: When you hash an element to put it in its proper place, it is possible that that element landed on an empty slot. Since the empty list was kept as a list, you would need to run the list to check. While this is not an insurmountable problem the running of a list in the middle of a supposedly fast algorithm is never a good thing.
The fix is this: instead of keeping a list of empties, you instead take the first element at the top of the array before you start processing, pull him off into a variable and declare him to be the value that designates "empty". Viola, in one step you have possibly removed a ton of duplicates and also have marked the slots that used to hold that value as empty. Furthermore, you now no longer maintain a list at all. No list processing needed, just a marker for the empty slots.
The reason that this is an order N algorithm, is that on pass 2 I am not doing N amounts of work again, rather I am doing some small fraction of N amounts of work. What leads to the NlgN behavior in an algorithm like in a sort is where on pass 1 you touch all N items and then on pass 2 you process N/2 items PLUS N/2 items.
On the other hand if you do N units of work on pass 1, then N/2 units of work on pass 2, then N/4 units of work on pass 3 ... you are doing (1 + 1/2 + 1/4 + 1/8 + ...)N = 2N units of work. Any algorithm where in a single pass you eliminate at least a constant fraction of the elements from processing will be order N. This algorithm should actually decrease faster than by one half each time and thus converge in something less than 2N steps. Thus the algorithm I listed is actually of order N. Case closed.
Sorry for the sloppy mistakes; I was just not thinking straight when I first outlined it.
Enjoy
Hi,Do you mind giving more information on how you do that:> Primarily we use the table itself as a hashtable. You> can then define an element as being in the proper> place if Hash(A(i)) == i.regards,Cedric
Sure - the quick and dirty way to hash, which would be fine for triming a list of integers, would be to reduce the integer mod the table length.
Given that you will at later stages of the algorithm be working with smaller portions of the original array you will have something like this:
int s; // the starting index of where we are working in the array
int m; // the lenght of the sub array that we are processing
int[] a; // the initial array that we are reducing down
// when I say an element is in its proper place in a[] I mean
if ((a[i]%m + s) == i)
If you're actually interested, I could probably dig up the code and post it. I wrote an integer version just to verify that it would be as easy as I though. I didn't post it because I got the impression from the way that the thread was winding that no one actually cares about an order n solution to this problem.
And naturally I didn't do any of the hard work like beating the hell out of it with test routines, and comparing the speed to the quicksort alternatives or any of the other stuff that one would need to do if one were going to write a paper.
So if ya want it, you can have it.
> Sure - the quick and dirty way to hash, which would
> be fine for triming a list of integers, would be to
> reduce the integer mod the table length.
>
> Given that you will at later stages of the algorithm
> be working with smaller portions of the original
> array you will have something like this:
>
> > int s; // the starting index of where we are working
> in the array
> int m; // the lenght of the sub array that we are
> processing
> int[] a; // the initial array that we are reducing
> down
>
> // when I say an element is in its proper place in
> a[] I mean
>
> if ((a[i]%m + s) == i)
>
>
> If you're actually interested, I could probably dig
> up the code and post it. I wrote an integer version
> just to verify that it would be as easy as I though.
> I didn't post it because I got the impression from
> the way that the thread was winding that no one
> actually cares about an order n solution to this
> problem.
>
> And naturally I didn't do any of the hard work like
> beating the hell out of it with test routines, and
> comparing the speed to the quicksort alternatives or
> any of the other stuff that one would need to do if
> one were going to write a paper.
>
> So if ya want it, you can have it.
Hi,
it's me again (CrashOverride, forgot somehow my passwd ..).
Thanks for the explanation. I had thought of using the integer mod the length of the table but what if the integersin my array are all multiples of the length ? I have no information about the range of the integers in my table. It'd have dramatic consequences if I happen to lose one of the integers.
I'm actually interested by an order n solution. I dun really need the code. I just need to understand the concept.
First of all, I made a mistake in that little code snippet. I forgot that negative numbers have a negative modulus in most programming languages. You need to fold negatives up into range.
In the degenerate situation that you suggest, where the entire list consists of exactly N numbers that are all exact mulitples of N (with some duplicates that need removal of course), here is what will happen.
The first number in the list is delcared to mark the empty element and is thus removed from consideration. The number that is in the second slot will be hashed and being an exact multiple of N will land in slot 0 (which used to be empty). That number is now in its proper place.
Now in the rest of the first pass, since by design every remaining number also hashes to zero they will not be able to move there. Dups of the second number will be removed but everything else will collide and must wait for the next pass.
The result of the first pass is a very poor result that you have eliminated the dups of only two numbers and have nearly the entire rest of the array to deal with.
HOWEVER on pass two the size of the array you need to process is now reduced by at least 2 (more if there were any dups of the first two elements) as a result the hash function for pass 2 is now a reduction mod N-2 (you only apply the second pass to the strictly smaller array) and suddenly it distributes all those multiples of N very nicely. You do not use the entire array as a hash table, only the subportion that you are processing.
The result is that you may get one bad pass but it is hard to get more than one bad pass (think Chinese Remainder theorem) I mean, to get more than one bad pass the numbers have to be both multiples of some big N and some second number almost as big N-2. To get M bad passes the numbers have to all just happen to be multiples of some large N^M. If you list was only a thousand elements long, to get 3 bad passes, the numbers would all need to be exact multiples of some number around a billion. Very quickly you are out of any reasonable range for your integers.
so again in psuedo code
Part 1:
pull out the first number and declare it to mark empty slots
go down the list (skipping empty elements and properly placed elements)
and move each unproperly placed element into place if you can
(you are never allowed to displace an already properly placed element)
also when you go to place an element if it is a dup just mark it as empty and move on
Part2:
You now have the array consisting of 3 types of elements all randomly mixed
a) properly placed
b) improperly placed
c) empty
in one single sweep you rearrange the elements (you are not sorting so order does not matter) so that the array groups those elements in that order so that all properly placed elements are to the front of the array and all emptys to the back.
(Obviously during this rearranging the properly placed elements will no longer be properly placed)
At the end of Part 2, you can now ingore the front part of the array and the back part of the array and just do the same thing to the middle section.
Repeat until the middle section is essentially empty (one of less elements)
Sample code for integers:
public static class RemoveDups{
int[] a;// the array in 3 groups Finished, StillToDo, Empty
int start; // first element of StillToDo group
int end;// first element of Empty group
int arraySize;// actual size of original array
int modulus;
int empty;// value that marks a slot empty in stage 1
int hashCount = 0;
RemoveDups(int[] array){
a = array;
start = 0; end = a.length; arraySize = a.length;
setModulus();
}
void setModulus(){modulus = end - start;}
int hash(int value){
hashCount++; // for amusement
int t = value%modulus;
if (t<0) t += modulus;
return t + start;
}
public String toString(){
String res = "1st="+start+" empty="+end+" {";
for(int i = 0; i<end; i++){
res += a[i] + ", ";
}
return res + "}";
}
public void removeAllDups(){
while(modulus >= 2){
System.out.println("internal pass"+this);
removeDupsPass();
}
System.out.println("hashRatio = "+ (1.0f*hashCount)/arraySize); // for amusement
}
public void removeDupsPass(){
empty = a[start];
// stage 1 - hash the elements
for(int i = start + 1; i<end;){
if(a[i] == empty || hash(a[i]) == i){ //if islot empty or properly placed
i++; // we are done with this one
}else{ // need to try to place this one
int k = hash(a[i]); // k is where we want to place it
if(a[k] == empty || hash(a[k]) != k){// if target empty or target is NOT placed
int t = a[i]; a[i] = a[k]; a[k] = t; // it is suitable target so we swap it
} else { // couldn't swap because target was properly placed so is collision
if(a[i] == a[k]) { // ..either the collision was a dup in which case
a[i] = empty; // remove the dup
}else{// .. or collision was true collision so abandon this element
i++;
}
}
}
}
// stage 2 - rearrange the elements
// at this point every element is either empty, inPlace or a collider
// we move inPlacers to front, emptys to the back.
int firstCollider; // points to start of collider section
int firstEmpty;// points to start of empty section
firstCollider = start;
firstEmpty = start;
for(int i = start; i><end; i++){
int elt = a[i]; // pick it up so that we can swap other elements
if(elt != empty){
if(hash(a[i]) == i){ // was in place so move to front section
a[firstEmpty] = a[firstCollider];
a[firstCollider] = elt;
firstCollider++; firstEmpty++;
} else { // was a collider so move to middle section
a[firstEmpty] = elt;
firstEmpty++;
}
}
}
// finally, place the empty that you pulled out in the very first step back in
a[firstEmpty] = a[firstCollider];
a[firstCollider] = empty;
firstCollider++; firstEmpty++;
//last - update the structure and setup for the next pass
end = firstEmpty;
start = firstCollider;
setModulus(); // sets us up for the next pass
}
}
If you wanted to do EVEN MORE WORK. You could observe that once you have successfully removed duplicates on the very first pass you could use the upper empty portion of the hash table as a celler (an open chaining region for further resolution of hash collisions).
In the algorithm I outlined and implemented, any true hash collision (two different elements with the same hash code) results in an element that is not processed in this pass but must wait for the next pass.
If there was a celler, true hash collisions could be chained into the celler. This makes the hash algorithm more complicated but could in theory reduce the total number of passes that the algorithm must make. I elected to write the simple hash code to show that in principle the algorithm would work.
Enjoy!>
Great ! it rocks.
I appreciate the effort you did to answer my question.
I was using another algo (I dun think it was posted here):
- sort the array
- Then:
int removeDuplicates(int a[], int array_size) {
int last = a[0];
int j = 0;
for (int i = 1; i < array_size; i++) {
if (a[i] != a[j]) {
j++;
a[j] = a[i]; // move it to the front
}
}
return j + 1;
}
which returns the number of unique integers and they stand at the beggining of the array. But it's not in O(n) because we have to sort the array first.
Cedric
> But it's not in O(n) because we have to sort the array first.Thre are O(N) sorts for integers.Look for "Radix Sort"
Yes, Radix Sort is indeed O(n) for integers but as you no doubt recall the particular challange here was to remove duplicates WITHOUT the use of an auxillary structure, like a hash table, or any other extra space consuming methods.
If you have the extra space to work with, stuffing elements into a HashSet, which was your first suggestion, is both order N, obviously correct, and easy to implement as well. Why would you screw around with a Radix sort to remove dups if you can build a hash table?
And if you don't have any extra space to work with, how exactly does the fact that Radix Sort is O(N) help. Can you do a Radix sort in Place? If so, I'd love to see it.
> Yes, Radix Sort is indeed O(n) for integers but as
> you no doubt recall the particular challange here was
> to remove duplicates WITHOUT the use of an auxillary
> structure, like a hash table, or any other extra
> space consuming methods.
I disagree. In the very least case, you could have a preprocessing loop that checked to see how long the blocks would need to be for each radix. So it can be done in place.
> we are talking code here, not linguistics.
>
> a dog list is a list of dogs
> a cat list is a list of cats
> and an empty list is CLEARLY a list of empties.
>
I disagree. The word "empty" is a quanitifer, where as "dog" and "cat" are not.
>I disagree. In the very least case, you could have a preprocessing loop that checked to see how long the blocks would need to be for each radix. So it can be done in place.
OK - help me out here. I do a preprocess and know how large each bin needs to be... and then ...
And don't tell me how to do the first sort in place where order doesn't matter. I think I can do that. Tell me how I do the second pass where I need a stable sort in place.
>The word "empty" is a quanitifer, where as "dog" and "cat" are notYou got me there. Guess I'm in the dog house now.
> >I disagree. In the very least case, you could have a
> preprocessing loop that checked to see how long the
> blocks would need to be for each radix. So it can be
> done in place.
>
> OK - help me out here. I do a preprocess and know
> how large each bin needs to be... and then ...
>
> And don't tell me how to do the first sort in place
> where order doesn't matter. I think I can do that.
> Tell me how I do the second pass where I need a
> stable sort in place.
Okay, do a prepass to figure out how large the bins will be, and set ints for each digit at the start of their bin. Pass through the entire list and fill up each bin by swapping with bin_index[digit]++.
Re-run on each bin 'til sorted.
I remain unconvinced that you can do a radix sort in place.
I will gladly sing your praises as a genius if you show me the code that does this but I see a large wall looming and I tire of attempting to walk you into it so that you can have the learning experience of smacking into the wall.
I do not see how to write an in place radix sort from your suggestion. I am willing to entertain the possibility that it can be done, that you are right and I am wrong, but I don't believe it for a minute.
I am calling your bluff. Show me your hand.
> I remain unconvinced that you can do a radix sort in
> place.
>
> I will gladly sing your praises as a genius if you
> show me the code that does this but I see a large
> wall looming and I tire of attempting to walk you
> into it so that you can have the learning experience
> of smacking into the wall.
>
> I do not see how to write an in place radix sort from
> your suggestion. I am willing to entertain the
> possibility that it can be done, that you are right
> and I am wrong, but I don't believe it for a minute.
>
>
> I am calling your bluff. Show me your hand.
Blast, now I'm going to have to write the bloody thing...
I hate writing sorts...
=please hold=
*elevator music*
> If you have the extra space to work with, stuffing elements into a HashSet, which was your first
> suggestion, is both order N, obviously correct, and easy to implement as well.
Only will average kN if no collisions in the keys.
>Why would you screw around with a Radix sort to remove dups if you can build a hash table?
It's order N in worst case, not proportional to N in typical cases.
> Can you do a Radix sort in Place?
Only on a doubly linked list (O(1) insertion allows stablity without increasing cost).
Pete (I've a poorly wrist so probably won't be typing masses of code for a while)
> > Can you do a Radix sort in Place?
>
> Only on a doubly linked list (O(1) insertion allows
> stablity without increasing cost).
Nope, nope, it'll work on arrays too. Just give me my time (i.e. or kill Ovid so he won't be bugging me anymore). Should be twice as expensive as a regular radix sort, but in place...
*car commercial music*
> > > Can you do a Radix sort in Place?
> >
> > Only on a doubly linked list (O(1) insertion
> allows
> > stablity without increasing cost).
>
> Nope, nope, it'll work on arrays too. Just give me my
> time (i.e. or kill Ovid so he won't be bugging me
> anymore). Should be twice as expensive as a regular
> radix sort, but in place...
>
I don't think it's possible to do an inplace stable radix sort.
Suppose you were sorting the following column:
0
0
1
0
0
...
When you get to the '1' bit, you have to make sure you swap it
in a place such that it comes before any other next 1 bit that
is equal. If all you do is swap it to the end, you haven't
gauranteed the stable property.
If you try to start scanning and shifting the bits, then you've
turned it into an O(n^2) algorithm.
What it comes down to, is removing the 1 bits, compacting the 0
bits, and then reinserting the 1 bits in the order they were
removed at the end.
E.g.
00a0
a-b0
00c0
b-0
00a
c-b
00c
I think the problem is because the compaction phase is going to
overwrite data (unless you store it in a temporary array).
However, as mentioned before, this is possible using a doubly linked list.
All you would need to do is keep track of the tail 0 and scan through the bits.
When you encounter a 0 bit, remove it, add it to the tail zero bit, and then set it
to be the tail 0 bit. This is possible because removal and insertion are O(1)
operations for a doubly linked list.
Forgive me for quoting Knuth once again. I know, he's an old guy and we are way beyond anything those old cats ever dreamed of. Furthermore it kinda dates me as the sort of guy that reaches for a book on a shelf rather than just googling for the result, but from Vol 3 of the old testament, Sorting and Searching, in the section on internal radix sorting we find.:
[blockquote]
...In order to handle radix sorting inside a computer, we must decide what to do with the piles. Suppose that there are M piles; we could set aside M areas of memory, moving each record from an input area into its appropriate pile area. But this is unsatisfactory, since each area must be large enough to hold N items, and (M+1)*N record spaces would be required. Such an excessive memory requirement originally caused most people to reject the idea of radix sorting within a computer, until H.H. Seward [Masters Thesis, M.I.T. Digital Computer Labratory Report R-232 (Cambridge, Mass.: 1954), 25-28] pointed out that we can achieve the same effect with only 2N record areas and M count fields. We simply count how many elements will lie in each of the M piles, by making a preliminary pass over the data; this tells us precisely how to allocate memory for the piles. We have already made use of the same idea in the "distribution sort," Algorithm 5.2D.
[/blockquote]
So the observation that you can make a prepass was worth a masters degree 50 years ago. But the mention of the need for 2N + M memory does not argue well for being able to do it inplace. They use an extra area and pour the information back and forth.
Granted, the mere fact that Knuth and his old buddies did not find a way to do it does not constitute a proof that it cannot be done. The fact that those old trolls apparently did not notice that you could do a prepass does not argue that they were the brightest algorithm bulbs on the planet.
On the other hand... The fact that the old testament does not list inplace radix sort as a classic algorithm strongly suggests to me that it has never been done to date. IF you can do it in place there is probably a modern Master's Thesis lurking. In fact, a proof that you could not do it in place is probably worth and MS too.
From having tried to develop this algorithm long ago I remain fairly confident that it can not be done.
Here's what I have so far. It compiles, looks okay, but when it runs it throws an ArrayIndexOutOfBoundsException on the line with lots of comments around it. I don;t get it...
import java.security.SecureRandom;
import java.util.*;
public class RadixSort
{
public static void main(String[] args)
{
Random rng = new SecureRandom();
int[] i = new int[rng.nextInt(100)];
for(int x = 0 ; x < i.length ; x++)
i[x] = rng.nextInt() & (-1 >>> 1);//I don't want to mess with negative numbers...
//System.out.println(Arrays.toString(i));
sort(i);
//System.out.println(Arrays.toString(i));
System.out.println(isSorted(i) ? "List was sorted." : "You are a failure as a programmer.");
}
public static void sort(int[] i)
{
sort(i, 0, i.length, 1000000000);
}
private static void sort(int[] i, int start, int end, int radix)
{
/***************************/
/*evilln*/if(radix == 0 || start == end)
/***************************/
return;
int[] indices = new int[10];
boolean sorted = true;
for(int x = 0, N = i.length ; x < N ; x++)
{
indices[(i[x] / radix) % 10]++;
if(sorted && x + 1 != N && i[x] > i[x + 1])
sorted = false;
}
if(sorted)
return;
for(int x = 0, sum = 0 ; x < indices.length ; x++)
{
indices[x] = (sum += indices[x]);
}
for(int x = indices.length - 1 ; x > 0 ; x--)
{
indices[x] = indices[x - 1];
}
indices[0] = 0;
for(int x = 0 ; x < i.length ; x++)
{
swap(i, x, indices[(i[x] / radix) % 10]++);
}
}
private static void swap(int[] i, int a, int b)
{
int temp = i[a];
i[a] = i[b];
i[b] = temp;
}
private static boolean isSorted(int[] i)
{
for(int x = 0, N = i.length - 1 ; x < N ; x++)
if(i[x] > i[x + 1])
return false;
return true;
}
}
> Here's what I have so far. It compiles, looks okay,> but when it runs it throws an> ArrayIndexOutOfBoundsException on the line with lots> of comments around it. I don;t get it...It threw a what? Where? Check that line number again.
If you are talkign about the line marked evilln, that line cannot throw an indexoutofbounds exception.
> If you are talkign about the line marked evilln, that
> line cannot throw an indexoutofbounds exception.
That's what I said (to my computer). Oddly enough, it did the same thing on sun's Vm and jRockit (newest version, both)...
javac -g -source 1.5 -deprecation -nowarn RadixSort.java
java -server RadixSort
>>>
java.lang.ArrayIndexOutOfBoundsException
at RadixSort.sort(RadixSort.java:24)
at RadixSort.sort(RadixSort.java:20)
at RadixSort.main(RadixSort.java:12)
exit code for process is 1.
operation complete.
>>>
I get that the exception occurs in the swap method when called from here:
for(int x = 0 ; x < i.length ; x++)
{
swap(i, x, indices[(i[x] / radix) % 10]++);
}
> I get that the exception occurs in the swap method> when called from here:Yeah, I had a friend run it and he got the same. Mine's just screwy, I guess...
What are you attempting to do? (i[x] / radix) % 10 is only producing 3 values really. Mostly just 0 and 1.
> What are you attempting to do? (i[x] / radix) % 10
> is only producing 3 values really. Mostly just 0 and
> 1.
An in-place radix sort.
Notice, please, that I'm passing in random positive int values in a list. All the numbers fall between 0 and 2**31, exclusive. Thus, there are only three possible values for the the first digit (in this case the ninth digit from the decimal place). And the odds of a two are pretty small...
Aww, crup.
Swapping into the array makes for some messy things to deal with (like that AIOOBException). I'm still fairly well convinced that it can be done (a little extra management in the swapping loop), but I'm also fairly well convinced it's not worth the effort to try right now (especially with memory being so cheap and in-placeness not being a normal constraint).
In effect, conceded. For now.
~Cheers
I want to sincerely thank you for making the effort. Not all attempts to do new things succeed but it is only through attempts like these that progress is made.
Suppose I gave you a different challenge. Suppose the challenge was, given an array contain negative and positive integers, could you come up with an O(n)algorithm that would move the negatives to the front of the list and the positives to the end of the list, such that order of any equal integers is preserved. The goal is of course to write the algorithm using O(1) memory.
E.g.
-10, 75, -2, 43, 12, -55, 11, -10, 124, 75, -2, 113
Would be become:
-10, -2, -55, -10, -2, 75, 43, 12, 11, 124, 75, 113
In what way do you believe this is different to one pass of a radix sort with 2 buckets?
> In what way do you believe this is different to one
> pass of a radix sort with 2 buckets?
It's a simplified way of thinking of the problem he's trying to solve. He's not actually trying to solve radix sort, but the challenge of doing an inplace stable radix sort without using O(n) memory. I just wanted to give a problem which I think will show that it's not possible.
If you allow yourself to use O(n) memory, it's easy to solve. You don't even need two buckets, even a single bucket would do.
I wouldn't characterize it as a Radix sort with two buckets. I would characterize it more like one piece of a quick sort using 0 (the nummber, not the index) as the pivot.
// int[] i
int index = 0;
for(int x = 0 ; x < i.length ; x++)
{
if(i[x] < 0)
swap(i, x, index++);
}
> I wouldn't characterize it as a Radix sort with two
> buckets. I would characterize it more like one piece
> of a quick sort using 0 (the nummber, not the index)
> as the pivot.
There's no sorting involved! I want to make the point that the problem of trying to do an inplace stable radix sort has nothing to do with actually sorting. The problem is swapping values in the array.
> > // int[] i
> int index = 0;
> for(int x = 0 ; x < i.length ; x++)
> {
>if(i[x] < 0)
>swap(i, x, index++);
> }
>
THIS IS NOT STABLE. E.g. -1, 9, -2, 9, -3
> There's no sorting involved! I want to make the
> point that the problem of trying to do an inplace
> stable radix sort has nothing to do with actually
> sorting. The problem is swapping values in the
> array.
I never bothered with stability. There's no sense to it when it comes to primitives. I believe your "point" was that this sorting* could not be done without some extra space. You were wrong. A radix sort using base two can be done without swap space, but at that point it's just a quicksort with a different name.
~Cheers
*And it is a sort, it just sorts the items on negativity instead of total value.
> I never bothered with stability. There's no sense to
> it when it comes to primitives.
Yes, but using integers to illustrate the problem is easier. I supposed I could've attached a memory address to each one.
> I believe your "point" was that this sorting* could not be done
> without some extra space.
That's correct.
> You were wrong.
Doubtful.
> A radix sort using base two can be done without swap space,
> but at that point it's just a quicksort with a different name.
This is why you think I'm wrong? Quicksort is an on average O(nlogn) algorithm. Radix sort is O(n). I'm not sure what the "base two" constraint is, or the fact that you are using a different algorithm, but neither of these points support your argument that it is possible in the general case.
> *And it is a sort, it just sorts the items on
> negativity instead of total value.
No. More correctly it's referred to as a partition or classification. Sort in the computer science sense implies an ordering of all elements such that
a[i-1] <= a[i] <= a[i+1] or a[i-1] >= a[i] >= a[i+1].
> > I never bothered with stability. There's no sense
> > to it when it comes to primitives.
>
> Yes, but using integers to illustrate the problem is
> easier. I supposed I could've attached a memory
> address to each one.
If you wanted, sure... I don't really see why that matters. We're worried about sorting, stability is a nice property (and a mandatory property, sometimes), but whether or not a sort is stable has nothing to do with whether or not it's a sort...
> > I believe your "point" was that this sorting*
> > could not be done without some extra space.
>
> That's correct.
>
> > You were wrong.
>
> Doubtful.
The negativity sorting, not the whole thing. Complete sorting couldn't be done without extra space except using base two for the radix.
> > A radix sort using base two can be done without
> swap space,
> > but at that point it's just a quicksort with a
> different name.
>
> This is why you think I'm wrong? Quicksort is an on
> average O(nlogn) algorithm. Radix sort is O(n). I'm
> not sure what the "base two" constraint is, or the
> fact that you are using a different algorithm, but
> neither of these points support your argument that it
> is possible in the general case.
Base two radix sort is almost exactly like a quicksort. And that can be done in place.
> > *And it is a sort, it just sorts the items on
> > negativity instead of total value.
>
> No. More correctly it's referred to as a partition or
> classification. Sort in the computer science sense
> implies an ordering of all elements such that
> a[i-1] <= a[i] <= a[i+1] or a[i-1] >= a[i] >=
> a[i+1].
It puts the items in order based on some attribute, it's a sort.
A note on base two radix sort being approximately the same as a quicksort:
Picture the list:
1010
0111
0110
1111
Now picture quicksorting with the pivots 1000, 0100, 0010, and 0001 (without, of course, inserting them into the list afterwards). You will find after the first loop, all the numbers beginning with 1 will be at the end of the list, and the numbers beginning with 0 will be at the beginning. The same would be true of a radix sort (starting with the most significant digit). A radix sort using base two is almost the direct equivalent to a quicksort, and both can be done in place. Obviously, though, the higher the base the higher the speed, and speed is a very important factor in sorting ;~)
~Cheers
> An in-place radix sort.
I understand that. I mean with that code specifically.
> Notice, please, that I'm passing in random positive
> int values in a list. All the numbers fall between 0
> and 2**31, exclusive. Thus, there are only three
> possible values for the the first digit (in this case
> the ninth digit from the decimal place). And the odds
> of a two are pretty small...
So why mod them into an array of length 10?
Had some beers with my sensei last night and discussed this problem of O(n) removal of duplicates and the whole radix sort thing. He pointed out a few things that I found useful. I don't know if it will clear anything up here, it will probably just launch another fire storm. However since I am the one that introduced the notion of an inplace stable radix sort let me start there.
I introduced the term stability because I though it was necessary for a radix sort to work. Stability in the strict sense of not changing the order of equal keys in the array is of course irrelevent in the task of sorting integers. I probably should not have used that term.
He pointed out, as several of you have, that I was wrong, you can in fact do a radix sort in place in an array with no extra memory other than the ususal couple of variables in a stack frame.
You don't use dozens of bins and you don't start on the least signigicant bit, rather you work from the most significant bit, and by working with a single bit, you have only two bins and can get away with swapping.
This is of course just like quick sort, where you do your initial partition by swapping the elements with a 1 first bit to the front and with a 0 first bit to the back. The other partitions are straighforward complicated only by the way that the bits order in positive and negative integers.
So, I was wrong about the radix sort. It can be done in place.
This binary version of radix sort will not in general be as quick as just doing a quick sort because it will make 32 full passes when run using 32 bit integers and any array that you have internally on a 32 bit machine will be shorter than 2^32 elements and thus the lg(N) factor for quicksorting that list will be less than 32.
However the next thing he said was the most interesting to me. He denied that radix sort was O(n). His reasoning goes something like this:
It takes O(n lg(n)) to sort n "distinct" numbers. If your numbers are all distinct and you had N of them, you will require at least lg(N) bits to represent the numbers and thus your radix sort requires lg(N) passes. If you knew in advance that all your numbers were NOT distinct you could technically do any sort in order N time because all you are doing is counting. Recall, we were talking CS theory, not anything as grubby as actually writing programs that run on real machines.
He points out that on any computer with a fixed size numeric representaion any list of those numbers that you sort, over a certain size, MUST have duplicates and that is technically a different problem and of different complexity from sorting N distinct elements.
Think about it - you could produce a modified quick sort that "notices" when all the elements in one of its sub partitions are all the same value and it cuts short the sorting at that point. (no need to sort if all the values in the subpartition are all the same, right?) What is the runtime of that version of quicksort? It will be O(N), because there is an absolute bounds on the number of partitions that you will ever need to do.
His point was actually this: it is not particularly productive to declare radix sort to be O(n) because if you do so declare it, you might as well declare quicksort to also be O(n) on the same basic grounds.
I found that to be a very interesting viewpoint.
So again, I apologize for doubting the claim that one could perform a radix sort in place. I was wrong.
I will leave it to you all to debate the actual runtime of radix sort if you care to do so. I am comfortable that I now understand what I wanted to about this particular topic and will bow out of this particular thread.
Thanks for listening.
> It puts the items in order based on some attribute, it's a sort
Get over it, nobody with a computer science degree would call separating negatives from positives a sort during a discussion. It would only be confusing. The fact that you are trying to defend calling it a sort only goes to show you are trying to be confusing and care nothing about actually discussing the problem.
> He pointed out, as several of you have, that I was wrong, you can in fact do a radix sort in place in an array with no extra memory other than the ususal couple of variables in a stack frame.Yes, but not a stable one.
> > It puts the items in order based on some attribute,
> it's a sort
>
> Get over it, nobody with a computer science degree
> would call separating negatives from positives a sort
> during a discussion. It would only be confusing.
> The fact that you are trying to defend calling it a
> a sort only goes to show you are trying to be
> confusing and care nothing about actually discussing
> the problem.
I referred to it as a sort once. It was a rather ambiguous reference, and I only meant to clarify what I meant. Not defend it.
If you'd like to discuss whether or not sorting negatives from positives is sorting, please start a new thread so this one doesn't get more cluttered than it already is.
> > He pointed out, as several of you have, that I was
> > wrong, you can in fact do a radix sort in place in an
> > array with no extra memory other than the ususal
> > couple of variables in a stack frame.
>
> Yes, but not a stable one.
Again, that has nothing to do with the sortiness of a sort.
> > An in-place radix sort.
>
> I understand that. I mean with that code
> specifically.
>
> > Notice, please, that I'm passing in random
> positive
> > int values in a list. All the numbers fall between
> 0
> > and 2**31, exclusive. Thus, there are only three
> > possible values for the the first digit (in this
> case
> > the ninth digit from the decimal place). And the
> odds
> > of a two are pretty small...
>
> So why mod them into an array of length 10?
Reference counting, so I'd know how big the buckets'd need to be.
> Reference counting, so I'd know how big the buckets'd> need to be.I guess I just don't get how 7 unused array elements helps with that.
> > Reference counting, so I'd know how big the> buckets'd> > need to be.> > I guess I just don't get how 7 unused array elements> helps with that.It wouldn't waste space if it made it to the second pass.