Finding closest array in a set of arrays

I have two sets of arrays, one which i will call the source set S, and another

set called A. I need to match every array in A to its closest array in S. Closeness is determined by the euclidean distance if you were to consider the arrays to be n-tuple coordinates. The arrays consist of bytes, and contain a few hundred entries.

My first attempt at this problem is the most obvious one. I compared every array in A to every array in S. If S had m amount of elements and A had k amount of elements then that would be m*k comparisons. This algorithm took way to long.

My second attempt was to have every array in S have references to its nearest neighbors. My alogorithm would start at one array and move from neighbor to neighbor till it found an array that was very closes to the desired array. This had some problems because it would sometimes hit a dead end and return a poor choice for a matching array.

I am currently improving this method to prevent the dead end problem and having some luck, but it is getting more complicated then it probably should be.

In my problem S doesn't ever change and I have alot of time to set up S and analyze it, but when I start matching up arrays in A it needs to happen very fast.

If anyone has any ideas that might help my problem then please let me know. A binary tree search or something like that would be perfect, if I could only figure out a good way to build the tree.

[1460 byte] By [tbpckisaa] at [2007-10-3 11:36:30]
# 1

In effect the set S is a set of points in n-space. Knuth called them "Post

Offices" and posed the problem of finding the nearest Post Office to a

given point - which is what you want to do for each point in the other set.

The cells of points closest to each Post Office form a tessellation: the

Voronoi tessellation. There's a nice paper here:

http://www.cs.cmu.edu/~guyb/real-world/triang/drysdale.txt

which describes how this tessellation might be used to obtain a

fast search for the nearest point.

"Once we have the Voronoi diagram, we can solve the post office

problem as follows. (This is not the best solution, but it's reasonably

simple.) Draw a vertical line through each of the Voronoi points.

These lines split the plane into vertical slabs. To locate a point p,

first do a binary search to find the slab containing p. Within each

slab, there are no Voronoi points, so the Voronoi edges that cross

each slab do so nicely. To find the Voronoi region containing p, we

just do another binary search, this time on the spaces between the

Voronoi edges in our slab. Altogether, the search takes O(log n)

steps, where n is the number of sites."

(When he talks about "vertical slabs" he has the 2 dimensional case in mind.)

pbrockway2a at 2007-7-15 14:04:41 > top of Java-index,Other Topics,Algorithms...
# 2

According to Skiena [1] Voronoi diagrams grow to large in higher dimensions.

He considers kd-trees as well, but again, they don't give you that much gain in higher dimensions.

He says: "Still, above 20 or so dimensions, you might as well do a linear search through the data points."

And you're working with a few hundred dimensions...

Sorry for not being more constructive.

[1] http://www2.toki.or.id/book/AlgDesignManual/BOOK/BOOK4/NODE188.HTM#SECTION03165000000000000000

horstmeyera at 2007-7-15 14:04:41 > top of Java-index,Other Topics,Algorithms...
# 3

Thanks for you responses.

I read up on Voronoi diagrams and they definitely don't scale well to high dimensions. The algorithm is still O(log n) at high dimensions but that is just describing the growth.

One method that I am working on right now is to construct a binary search tree. I take the two farthest points in the data set (or region) and construct two regions using those points. These regions each consist of half of the points in the parent region that are closest to one of the farthest points. I continue this method splitting up the regions until each region consists of just two points. Then build a tree from those regions to locate which region your desired point to match is in. I have not tried this method yet but it seems like it would work. It won't guarantee the closest point, just a close one because of points outside the region that might be a better choice. To resolve that problem you can use a voronoi diagram to find any points outside a region whose cells cross into the region and include those in the region.

tbpckisaa at 2007-7-15 14:04:41 > top of Java-index,Other Topics,Algorithms...
# 4

Did you look up kd-trees? They are not so different from what you're trying to do, a binary tree data structure to partition (multidimensional) space.

>These regions each consist of half of the points in the parent region

>that are closest to one of the farthest points.

You're making a mistake there. If you take the two farthest points, there is no guarantee that exactly half of the remaining points are closer to one of them. You could, however, take the two points "in the middle" and use them to partition space which would give you equal halves. But I'd really take a look at the kd-trees...

Btw, one other thing (not really algorithmic in nature, but with impact on speed): Do not compute the euclidean distance but its square. This way you avoid the costly sqrt-operation and (more importantly) you can use ints instead of floating point data types.

horstmeyera at 2007-7-15 14:04:41 > top of Java-index,Other Topics,Algorithms...
# 5

The basic problem is for Array A with N elements and array B with M elements find: for each element in A the closest match in B.

The Nieve way of doing this is an O(nm) operation. There is no simpler way of doing this unless you can find a way to: for each A(i) find a subset of B that contains the nearest match in less then (M) time.

kd trees, as mentioned earlier, are a good way to do this but only become faster for relatively large values of M.

see the Nearest Neightbour kd-tree http://en.wikipedia.org/wiki/Kd-tree for an approach to this.

matfud

matfuda at 2007-7-15 14:04:41 > top of Java-index,Other Topics,Algorithms...
# 6

I just looked up kd-trees and I will probably try to implement them. The wikipedia article did not say much on the nearest neighbor problem with kd trees, such as what is it's average complexity when searching for the exact nearest neighbor.

In my attempt I knew that I could not always half a region exactly so I made it account for that. I just finished coding this algorithm and it works pretty well and fast. It would usually return the nearest point, but not always, sometimes just a close one. It also doesn't work as well in very high dimensions where the space is sparse, but it is still useable.

I am using the squared distance, but I haft to be careful with it because it doesn't work for some equations. I checked the math for everything I did and I seem to worked it out correctly (based on the results of my tests).

Thanks for your replies. I am trying to get the fastest algorithm I can and I really appreciate your help.

tbpckisaa at 2007-7-15 14:04:41 > top of Java-index,Other Topics,Algorithms...
# 7
Will you show us your solution? I'd like to see it.
horstmeyera at 2007-7-15 14:04:41 > top of Java-index,Other Topics,Algorithms...
# 8

Sure. I am just going to be adding a depth-first search to my algorithm (which I picked up from the kd-tree approach) and that should guarantee the closest point. Hopefully this addition won't slow down my search much.

I am currently taking finals right now, so preparing this solution will have to wait till after them.

I do have notes and rough sketches but they are a little sloppy and hard to follow.

I am considering writing an article for a math newsletter at my previous college, so that will be much more refined and in detail than what I might be able to provide right now.

The more I try to find an efficient algorithm for this problem the more interesting it seems to get. I am finding alot of applications for it I didn't even know about.

tbpckisaa at 2007-7-15 14:04:41 > top of Java-index,Other Topics,Algorithms...
# 9

certainly - learning about KD trees is a worthy pastime and I encourage you to implement the Nearest Neighbor algorithm using KD trees, but the comments about difficulty of doing the search in large dimension are not wrong and what you will find is that the KD tree in your high dimensions doesn't do much for you. You will still end up looking at the distance to nearly every point in your set. There ain't no magic bullet!

The basic idea of NN on a KD tree is to descend doing a distance computation for each node in the tree until you hit a leaf. This gives you a candidate for the NN (and not necessarily a good candidate) and then you go back up the tree looking to see if the hyper rectangular block represented by a node intersects with the hypersphere (centered at the candidate point and radius of the distance between the search point and the candidate) IF you do intersect the hyper-rectangle then possibly that node contains a point that is closer and you must descend those branches to see if either child (who has a smaller hyper rectangle) intersects the sphere.

You only win, i.e. you only get to cut off nodes, when the intersection was void at some high level and you got to remove an entire branch of the tree without looking further.

The problem that you have is this: The hyper rectangle up near the top of a kd tree, like for example the very top node. Is a split along a single dimension. In all other dimensions, you have not yet bounded the rectangle so you permit the full range of values in any other dimension. When you do your distance to the hypershpere computation, a bounded dimension tells you something but an unbounded one tells you nothing. For example suppose your input data point was something like (19, 37, 250, 3, ......) Suppose your top split point was at 100. This means that nodes whose first coordinate was less than 100 are on the left and nodes with values greater than 100 are all on the right. So is a point with a first coordinate value of 119 far away from your input point? Sure the first coordinate contributed (119-19)^2 to the total distance but you still have 99 other coordinate calculations to contribute to the total sum. It is highly unlikely that the first coordinate determines the match.

When you look at the potential for intersection between a hypersphere and a hyper rectangle, the only coordinates that contribute a positive value to the distance are the coordinates that have been bounded. All unbounded coordinates contribute zero to the sum (because an unbounded coordinate could hold any value and in particular could hold exactly the same value as the input data at that coordinate so it's contribution could be zero.) The point is just this: hyperspheres appear to be close to hyper rectangles that have unbounded coordinates.

Now consider your 100 dimensional kd-tree. The top of the tree split one dimension. His two children split the second dimension, his 4 grandchildren split the third dimension. So by the time that you have 4 billion nodes in your tree, you have only descended to 32 levels, less than one third of the dimensions bounded. EVERY single node in your tree represents a hyper-rectangle where over two thirds of the coordinates are unbounded. Because of this, unless you data is extremely lopsided, you will find that when you go back up the tree looking for intersections, you will find them everywhere. You will investigate nearly all the children and do nearly all the calculations that you would have done had you just looked at every node.

Fancy algorithms are great, but there are none to my knowledge that rescue you from the problem of increasing dimension. This is why dimension reduction is considered an important step in pattern recognition. (Pattern recognition always boiling down to having a bunch of examples of things that you want to recognize, those are your original data points - a thing that you want to classify, that would be your input point - and a metric, those would be the dimensions along which you wish to compare your data for matching.) Dimension reduction is an attempt to analyze your original data to see if it really is spread out in all 100 dimension that you started with, or if it in fact lies mostly along some smaller dimensional sub-space. Factor analysis, Cluster analysis, Principal Component analysis are all tools used to attempt dimension reduction.

Since you have specified exactly nothing about where your data comes from or what you are trying to do it is impossible to say whether any of these tools will do you any good or whether dimension reduction might be possible in your case.

As a parting comment, if you like geometric arguments, I refer you to the leech lattice as the illustrative example of how large number of dimensions cause problems for nearest neighbor matches. In 2D space, you probably know that you can arrange exactly 6 pennies around one single other penny. 6 is the kissing number in 2D space - 6 circles can all touch a single other identical circle. In 3D space you can arrange 12 spheres around a single other sphere.

What do these kissing numbers mean? Well in a sense, they mean that in 2D space, you can expect a point to have about 6 neighbors clustered nearby (assuming that you data is pretty uniformly spread out - if it is not uniformly spread out - you can possibly do a dimension reduction) In 3D space you can expect about 12 neighbor. Maybe it would be better to say, that there is no problem in 3D with having 12 neighbors that are right next door.

The leech lattice is a particularly tight packing of spheres in 24D space. It packs a whopping 196560 unit spheres around a single unit sphere. In 24 dimensions you can have one hundred and ninety six thousand next door neighbors. This means that your little Voronoi polygon around the central point could have 196,000 faces. This is why the lg n Voronoi algorithm fails for high dimensional spaces because the size of the Voronoi tessellation is growing so fast.

In short, in a high dimensional space everything can be close to everything else, and thus algorithms that try to find what you are closest to, have difficulty teasing things apart.

So that's the bad news, high dimensionality is a real problem.

However all of this discussion of the problems of high dimemsion is predicated on the assumption that your data (your initial collection of points) are uniformly distributed throughout your 100 dimensions. Real data is rarely uniform.

So when you get done with your kd-tree and find that it doesn't do that much for you and are ready to explore dimension reduction, do come back and tell us what you are doing.

marlin314a at 2007-7-15 14:04:41 > top of Java-index,Other Topics,Algorithms...