Subset sorting

Given:A sorted array of integers AAn unsorted array of integers BB is a subset of AIs there a way to improve upon the O(nlogn) bound to sort B?
[178 byte] By [Learnablea] at [2007-10-2 2:00:53]
# 1
Yes, but it doesn't use A.
YAT_Archivista at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 2
Remove in a all elements that do no exist in B.
barbywarea at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 3
> Remove in a all elements that do no exist in B.This will take O(n-square) in the worst case, and the best case of O(n) is not likely.
Learnablea at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 4

As stated, I would have to say no.

My arugmet is this. Here is list A

1 2 3 4 5 6 7 8 9 ...

It is a sorted list of integers. B is a subset of that list (assuming elements of B are all distinct)

How does the fact that I know the sort order for the integers help me improve the sort time for a collection of integers? The sort order for the integers is something that I know in advance of every sort that I have ever done and it has never helped me before in reducing a sort from the generic nlogn

If I assume that all of B is not distinct, then my list A expands to be something like

11111222222233333334444444

where I repeat each element by the size of list B.

I am not saying that having list A can never help you improve the sorting of list B, for indeed, if you know some information about A you can use that to help you sort B.

For example, Hash the list B and keep a count on how many of each element was in B. Given that hashing is O(1) this is O(n). Then run the list A, hashing each element to see if it was in the table and and for each element that is in the hash table you now have the multiplicity of that element so you can build the list in one single pass through list A. Since list A is at least as long as list B the total runtime of this algorithm is O(M) where M is the size of list A. if M is on the order of N you can do your sort in better than NlogN.

So, A can help improve the sort time of B if there is some restriction on the nature of A, but without some qualification, I don't think it helps.

marlin314a at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 5

If |B|=m<n then do this:

Binary search in A for each element of B. Mark it. This is O(mlogn). Now run through A and pick the marked elements. This is O(n). Together this is something like O(mlogn + n) which should be smaller than O(nlogn) for some combinations of m and n.

Harald.>

pifpafpufa at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 6

A and B are roughly of the same size with distinct values. Unfortunately, nothing else is known about A or B.

I was wondering if finding the median of B (or something similar) with insertion sort could help. I don't see how, but it could be a good way to approach the problem.

Another approach was to use an elevator algorithm (go in one direction in A till the end and start over again) - the performance would then depend on how sorted B originally was.

Learnablea at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 7

It can be determined in O(n) steps whether the elevator algorithm will perform better than applying an nlogn sorting routine to B.

The idea is to somehow combine the two things mentioned in my previous reply to sort B in close to O(n) steps. (Though I think the average case will still be nlogn)

Learnablea at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 8
> A and B are roughly of the same size with distinct values.Distinct values? I thought B was supposed to be a subset of A?
YAT_Archivista at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 9
> Distinct values? I thought B was supposed to be a> subset of A?Array A has no repetition of values, and B is a subset of A.
Learnablea at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 10
Then see reply 4.
YAT_Archivista at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 11
> Then see reply 4.
Learnablea at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 12
> > Then see reply 4.:) It was worth a try.
Learnablea at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 13
> > > Then see reply 4.Doesn't the fact that the superset is much smaller in this case imply that there might exist a technique to sort B is less than nlogn? This problem is a special case of sorting, and not just looking at sorting from a different perspective.
Learnablea at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 14
Read the bit starting "For example". That's the suggestion for a O(n) algorithm in the case that |B| = O(|A|).
YAT_Archivista at 2007-7-15 19:42:08 > top of Java-index,Other Topics,Algorithms...
# 15

As mentioned above it is possible to sort integers in n*log(log(n))

time without help from the set A. This algorithm as well as the one

suggested above has little real world effect if not the B set is

extremely large. However, if your interest is academic, it is a

good idea to read what other people has done before starting to

invent. Take a look at

http://www.nada.kth.se/~snilsson/public/papers.html

and in particular

http://www.nada.kth.se/~snilsson/public/papers/sorting/

parza at 2007-7-20 17:22:41 > top of Java-index,Other Topics,Algorithms...
# 16

Thanks. The paper has some good formal information on sorting.

Though I initially posted the problem for integers, A and B both contain floating point numbers. So, signature sort is not applicable. I considered flashsort (which probably boils down to doing on interpolation search on A), but the numbers are not uniformly distributed.

Array A has 10000 elements and is fixed. B is sorted more than 2000 times during a run. During the first 200 or so iterations the size of B is around 0.7 times the size of A and during the rest of the iterations, the size of B is around 0.1 times the size of A. Currently I am using Collections.sort() but it is quite slow.

Learnablea at 2007-7-20 17:22:41 > top of Java-index,Other Topics,Algorithms...
# 17

This kind of problem tends to fall into one of following

two categories.

1. There are a huge number of elements to be sorted and

then a low run time is needed since the overhead work is

negligible.

2. There are a few elements and the simplest O(n log n)

should be chosen and optimized in respect to how to code

it because overhead is crucial.

Since you are sorting only 10000 elements, it sound like

your problem fall into category 2. The fact that you are

sorting them 2000 times does not matter since this

computation must be done in either case. I do not know

flash-sort, but the overhead must be minimum in order for

it to be faster then an algorithm like quick-sort.

parza at 2007-7-20 17:22:41 > top of Java-index,Other Topics,Algorithms...
# 18

> but the overhead must be minimum in order

> for

> it to be faster then an algorithm like quick-sort.

Thats why I am hesitating using radix sort. The precision of the sort is important to every digit, and the numbers are double precision.

Here is a good article on flashsort:

http://www.neubert.net/Flapaper/9802n.htm

Learnablea at 2007-7-20 17:22:41 > top of Java-index,Other Topics,Algorithms...
# 19

> Array A has 10000 elements and is fixed. B is sorted

> more than 2000 times during a run. During the first

> 200 or so iterations the size of B is around 0.7

> times the size of A and during the rest of the

> iterations, the size of B is around 0.1 times the

> size of A. Currently I am using Collections.sort()

> but it is quite slow.

How about telling us what you are doing so that we have a chance of helping. I can sort a 7000 long list (70% of 10000) using collections.sort 2000 times, in less than 10 seconds. This is too slow? For what? What's your hurry?

How does B change from run to run? How do you know B is a subset of A? was it sampled from A in the first place? Why are you sorting B over and over?

Give us a few hints.

marlin314a at 2007-7-20 17:22:42 > top of Java-index,Other Topics,Algorithms...
# 20

The Twos Complement Notation of floating point numbers has the property thata comparison uses the same operation in the CPU as for integer numbers. Therefore you could transform your floating point numbers into integer numbers that have the same bit representation, sort, and transform back. The additional transformation, of course, runs in O(n).

Methods to use would be Double.doubleToLongBits(double) and Double.longBitsToDouble(long)

horstmeyera at 2007-7-20 17:22:42 > top of Java-index,Other Topics,Algorithms...
# 21

> How does B change from run to run? How do you know B

> is a subset of A? was it sampled from A in the first

> place? Why are you sorting B over and over?

Thanks for the reply. Here are the details:

U : 10000 randomly generated numbers

f : nonlinear function

A : Apply f on all elements of U and sort it

B : Apply f on randomly chosen elements of U and sort it

Several instances of B are sorted more than 50000 times during a run (2000 was a really bad estimate). So, improvement in the sorting algorithm would be valuable. Generating B from A is not an option as it would change the entire process.

So, now its Collections.sort() or implementing a different sorting algorithm. (I found this link to be useful - http://www.devx.com/vb2themax/Article/19900)

Learnablea at 2007-7-20 17:22:42 > top of Java-index,Other Topics,Algorithms...
# 22
well I might as well sort set B on its own and get O(m log m), which is obviously better then O(n log n)!
AlexLamSLa at 2007-7-20 17:22:42 > top of Java-index,Other Topics,Algorithms...
# 23
> well I might as well sort set B on its own and get> O(m log m), which is obviously better then O(n log n)!Thanks for the reply. Please read reply 21 carefully.
Learnablea at 2007-7-20 17:22:42 > top of Java-index,Other Topics,Algorithms...
# 24

Still watching this topic, I see. Well if you are still interested I would suggest that you consider doing the following:

You say you have

U[k] an array of random values (I will assume double)

non-linear function f

Array A is f(([k]) sorted

Array B is subset of U[k] and you want to apply f to each element and sort.

create a class, Elt, to hold a single argument from U, the result of applying f to that element, a link to the next A Elt, and a marker int

class Elt{

double uk;;

double fofUk;

Elt next;

int marker = 0;

}

Create an array of Elts, one for each U[k], evaluate f on the U[k] element and srot that array by f of Uk. This is of course just your array A except that there is some extra stuff carried along with it.

Now thread the linked list. i.e. go throug the array of Elts and set

elt[k].next = elt[k+1]

set the last one to null.

Put elt[0] in a static location like Elt first; where you can always find it.

Now you can run the array A by following the next pointer. i.e. first is the first elt, first.next is the next elt, and so on.

Now you Hash the mess into a map. take each elt from the array and stuff into a Map the key will be the double uk, the data associated with it will be the Elt that has the value uk in it. So if you had 10000 elements in the array A you will have a HashMap that has 10000 Elts in it all indexed by their uk values.

So now you take a list B that is a subset of U[k]. Run the list B and for each U[k] that is on B look it up in the HashMap and find the Elt that has the same uk value. set the marker in that Elt to 1 (something different from the 0 that it was initialized with)

There is no need to calculate your non-linear function on the value from B because you already calcualted it once and stored it in the Elt. You just set the marker to 1 so that you know that this Elt was on list B.

When you have finished marking each element that was on B, now you go to first, the start of the A linked list, and run the list skipping every element that does not have the marker set to 1. Just read off the fofUk value. There you are - list B sorted by f.

For your second different list B, just set the marker to 2 this time - hash look up each element in B, set marker to 2, then start at first and copy the values from all Elts that have marker set to 2.

I haven't really suggested anything different from what I suggested back in my original reply, just spelled it out in more detail. Have you considered doing something like that? Did you try it? Is there some reason that you think that this method will not be superior to sorting the array and recalculating your non-linear f over and over?

marlin314a at 2007-7-20 17:22:42 > top of Java-index,Other Topics,Algorithms...
# 25
I am doing something very similar, but your technique of constructing B using the Elt class appears to use less memory and number of passes, and will be worth a try. Thanks for the reply.
Learnablea at 2007-7-20 17:22:42 > top of Java-index,Other Topics,Algorithms...