Matching ordered points to minimze distances
Hi folks.
I have to solve the following problem:
Given two sets P and Q of points in R^2 with |P| < |Q| find for all p in P the pairs (p, q) in P x Q, such that the sum of the distances d(p,q) is minimized.
This could probably be solved with a matching or flow algorithm, but I have a rather nasty additional restriction:
Let P and Q be ordered, i.e. P = {p_1,...,p_n} and Q = {q_1,...,q_m}, m > n.
Given two pairs (p_i,q_j) and (p_r,q_s) then for i > r, I need j > s.
My boss strongly suspects the problem to be NP-hard, so I am not interested in a optimal solution, but in a fast heuristic for a good or at least acceptable result, i.e. an algorithm that will not produce arbitrarily bad results.
Does anybody know this particular problem and a good strategy to solve it?
What is your distance measure and how do you compare points in order to sort them?
parza at 2007-7-13 14:56:42 >

> What is your distance measure
Currently, I use the standard euclidean distance d(p,q) = sqrt( (p_1-q_1)^2 + (p_2-q_2)^2 ) for points p=(p_1,p_2)^T, q=(q_1,q_2)^T in R^2.
I think the problem is independant of the applied metric, 'though I'm not required to implement a metric-independant solution.
> and how do you compare points in order to sort them?
I don't. I get two lists of points as input. The ordering of the points is implied by their position in the lists.
Since you do not have any additional information on the sorting,
my guess is also that it is NP-complete.
If you manage to find an polynomial algorithm that maps every
instance of the Minimum Point-To-Point Connection (see link below)
to at least one instance of your problem, then you have proven it
NP-complete as well as found an algorithm to approximate it with.
http://www.nada.kth.se/~viggo/wwwcompendium/node65.html
Please post what you find out on this problem to this thread.
I am curious.
parza at 2007-7-13 14:56:42 >

Thanks, I'll look into it.
I think I know how to solve it in O(n^2) given thatn~|P|~|Q| and computing the distance metric is constant.Before I spend more time on this writing the idea downI would like to know if you are still reading this topic.
parza at 2007-7-13 14:56:42 >

This is a fairly standard dynamic programming optimization problem. That means that the optimal solution will require N*M space and about that in runtime. So if your lists are each not much longer than a thousand points or so a perfect solution is tractable.
Without any information about the nature of the data sets it will be difficult to make any claims for a heruistic solution. Meaning only that it is trivial to construct data sets that match perfectly for one single specific alignment of the two lists and match arbitrarily poorly for any other alignment.
I am happy to outline the DP code, but if you are not familiar with dynamic programming it will make little sense and of course it also makes no sense to outline this method if the number of points you must deal with is too large for this technique.
I do not have any experience with DP, but I know the
basic idea. If I am not mistaken there are at least
three ways to solve this using DP which gives different
conditions and results.
1. Requires k*|P|*|Q| memory and N(n^2) runtime.
Computes the pairs and the minimum sum.
2. Requires k1*|P|+k2*|Q| memory and N(n^2) runtime.
Computes the sum.
3. Requires k1*|P|+k2*|Q| memory and N(n^3) runtime.
Computes the pairs and the minimum sum.
parza at 2007-7-13 14:56:43 >

> but I have a rather nasty additional restriction:
> Let P and Q be ordered, i.e. P = {p_1,...,p_n} and Q = {q_1,...,q_m}, m > n.
> Given two pairs (p_i,q_j) and (p_r,q_s) then for i > r, I need j > s.
I don't understand this restriction: you said the ordering was implied
by the position of the points in the list. Obviously it can't be just *any*
pair (p_i,q_j). Does that pair have a minimal distance?
kind regards,
Jos
ps. if you relax the restriction (which I don't understand), the problem
isn't in NPC, i.e. it's a simple O(m*n) problem at worst.
JosAHa at 2007-7-13 14:56:43 >

> I don't understand this restriction
Construct an matrix with |P| rows and |Q| columns where the
values in the matrix are the distances between all possible
pairs. The rows and columns should be sorted as the lists
given. Assume that the pair (p_i, q_j) has been chosen, plot
this pair in the matrix and see how this restricts other
possible pairs. This should make it easier to understand the
restriction.
parza at 2007-7-13 14:56:43 >

> I would like to know if you are still reading this topic.
Yes, I am. ;-)
> I don't understand this restriction: you said the ordering was implied
> by the position of the points in the list. Obviously it can't be just *any*
> pair (p_i,q_j). Does that pair have a minimal distance?
The restriction says, that if you draw your solution as bipartite graph G = ( P+Q, E = {pairs in the solution} ), then there are no edge crossings:
p_2
\
p_1 \
\ q_2
\
q_1
is acceptable where as
p_2
|
p_1 -+- q_2
|
q_1 is not acceptable.
> > I don't understand this restriction
>
> The restriction says, that if you draw your solution
> as bipartite graph G = ( P+Q, E = {pairs in the
> solution} ), then there are no edge crossings: [ nice ASCII art snipped ]
Ah, ok, I'm with you again. It you want to minimise the sum of all
chosen distances: sum(p_i <--> q_j) where |P| = n < m = |Q|, with
no edge crossing and if the order of the p_i and q_j is fixed, this is
quite easy to solve, because all you have to do is find all combinations
C(m, n) and pick the one with the minimum distance value from them.
kind regards,
Jos
JosAHa at 2007-7-13 14:56:43 >

> The restriction says, that if you draw your solution as bipartite
> graph G = ( P+Q, E = {pairs in the solution} ), then there are no
> edge crossings:
I understood your first definition, but now I am lost. What is the
definition of 'edge crossing' in graph theory?
If I assume that the first definition is right, then there are at
least three different solutions with different results and consumption.
Is any of following solutions acceptable?
1. Requires k*|P|*|Q| memory and N(n^2) runtime.
Computes the pairs and the minimum sum.
2. Requires k1*|P|+k2*|Q| memory and N(n^2) runtime.
Computes the sum.
3. Requires k1*|P|+k2*|Q| memory and N(n^3) runtime.
Computes the pairs and the minimum sum.
parza at 2007-7-13 14:56:44 >

As I said, it is a dynamic programming optimization problem.
The problem of aligning the points works like this:
assume you have p[] and q[]. You write the recursive routine that returns the minimum distance. It uses standard divide and conquer.
od(int ip,int iq){
// returns total distance of optimum match in the specific case
// where point ip is matched with iq or something less than iq
// first we assume ip matches iq and thus ip-1 must match something
// equal to iq-1 or less
best = p[ip].dist(q[iq]) + od(ip-1, iq-1);
// then we look at all other possibilities for the q point
for(int i = ip; i < iq; i++{ // note - q index always at least as large as p index
better = p[ip].dist(q) + od(ip-1, i-1);
if(better < best) best = better;
}
return best;
}
The recursive routine is the solution. If you run it od(n,m) it will tell you the total distance of the best match. It has only two problems 1) it is slow because you are continually reworking sub-problems and 2) while it returns the value of the optimum match it does not ever store and return the matching decisions that were made in the process of discovering that optimum.
Dynamic programming is the fix for problem 1. Read a chapter of an algorithms book or find something on the web about it. The basic idea is to rewrite od so that it will cache the result of every call to od(ip,iq) in an array and to also short cut the recursion in od by first looking in the array to see if the result has already been calculated.
The fix to the second problem is to also cache int that array the single last iq value that led to the optimal alignment for od(ip,iq) , so that when you are all done you can backtrack throught the array and see what result you got at each step. This should also be explained in any write up of dynamic programming.
I would just send you the entire solution, but it would not make sense and you would not be able to either tweek it or debug it if you don't understand DP. So after you have educated yourself on this very useful optimization technique, and have tried transforming the code which I have included if you are still having problems transforming my partial solution into the complete solution just ask for help.
Sorry - forgot the code tags and some data typing.
double od(int ip,int iq){
// returns total distance of optimum match in the specific case
// where point ip is matched with iq or something less than iq
// first we assume ip matches iq and thus ip-1 must match something
// equal to iq-1 or less
double best = p[ip].dist(q[iq]) + od(ip-1, iq-1);
// then we look at all other possibilities for the q point
for(int i = ip; i < iq; i++{ // note - q index always at least as large as p index
double better = p[ip].dist(q[i]) + od(ip-1, i-1);
if(better < best) best = better;
}
return best;
}
I agree with marlin314 that posting code will confuse more
then help. However, I would like to add that there are at
least three DP versions to consider. I think that option 4
is the most interesting solution. marlin314, you seem to have
DP experience, do you think it will perform as I have
predicted?
Option 1:
Following link is a simple tutorial of something similar to
what you are trying to do.
http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html
The algorithm above transfered to your problem will consume
O(|P|*|Q|) memory and have a runtime of O(|P|*|Q|).
However, there are at least three other options.
Option 2:
If you only need the minimum sum, you can reduce memory
consumption to O(|P|+k*|Q|) by skipping the traceback.
Option 3:
The third option gives the sum as well as the pairs and it
consumes only O(|P|+k*|Q|) memory but it has a runtime of
O(|P|^2*|Q|). The idea here is to replace the traceback with
computing the last pair without saving the path and then
reduce the problem by removing this pair and then compute
the last pair again, and so on.
Option 4:
Let k be a constant. While recursively moving through P, save
the k smallest sums at every p in P. Then when doing the
traceback, allow only the pairs which were saved, hoping that
the sum will add up to the minimum sum. If the sums are the
same, its done, else increase k and start over. The best
possible performance is O(|P|+k*|Q|) memory and a runtime of
O(|P|*|Q|). The worst performance is O(|P|*|Q|) memory and
a runtime of O(|P|^2*|Q|). The in data will decide performance,
but I believe that for most data it will be much closer to the
lower limit.
parza at 2007-7-20 12:23:51 >

Thank you all for your help on this.I will take your advice and read up on DP.