## High dimensional nearest neighbour search

### High dimensional nearest neighbour search

In my pattern formation algorithm I have to do several ten thousand nearest
neighbour queries. Usually I end up with a dimension of about d = 50 to
d = 100, and the number of codewords (test vectors) is approx. k > 100,000.

AFAIK this means that a k-d tree or any other dimension-splitting algorithm
is not efficient, because I have too few codewords for too many dimensions.
IIRC the efficiency condition is k > 2^d.

I haven't found any other algorithm, neither in books nor via Google. At the
moment I do exhaustive searching, which takes a lot of time (approx. 1-2
hours).

Can someone point out a link or a reference where this matter is treated?

Is there an efficient algorithm for this kind of problem?
I don't want an approximate result, I need the exact nearest neighbour!

### High dimensional nearest neighbour search

> In my pattern formation algorithm I have to do several ten thousand nearest
> neighbour queries. Usually I end up with a dimension of about d = 50 to
> d = 100, and the number of codewords (test vectors) is approx. k > 100,000.
> Can someone point out a link or a reference where this matter is treated?

> Is there an efficient algorithm for this kind of problem?
> I don't want an approximate result, I need the exact nearest neighbour!

Problems like that arise in biochemistry as well, and are considered
difficult. I don't know too much about it, but please check this talk
from the Paris algorithmics group:

http://algo.inria.fr/seminars/sem96-97/cazals.pdf

Also, check this reference:

S.A. Nene and S.K. Nayar, "A simple algorithm for nearest neighbor search in
high dimension spaces", IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 19, no. 9, pp. 989-1003, 1997.

Jerzy Karczmarczuk
Caen, France

### High dimensional nearest neighbour search

> In my pattern formation algorithm I have to do several ten thousand nearest
> neighbour queries. Usually I end up with a dimension of about d = 50 to
> d = 100, and the number of codewords (test vectors) is approx. k > 100,000.

Yep, that's a tough one: roughly 10^4 queries x 10^5 codewords x 10^2
dimensions = 10^11 coordinate comparisons, and that is a lot.

> AFAIK this means that a k-d tree or any other dimension-splitting algorithm
> is not efficient because I have too few codewords for too many dimensions.
> IIRC the efficiency condition is k > 2^d.

Indeed.  The first "master layer" of a k-d tree in d dimensions has
2^d leaf nodes --- for d = 50 that is already about 10^15 cells against
only ~10^5 points.  So almost all of them will be empty, whereas the
core trick of a (balanced) k-d tree is to have a small, non-zero and
roughly equal number of entries in every leaf node.

Are these 50 to 100 features all equally descriptive?  If they aren't,
you may want to find those that have higher separative power and use
only those to subdivide on, in a lower-dimensional tree, and revert to
brute force searching for the others.
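One way to pick those higher-separative dimensions, as a rough sketch: rank them by variance over the point set. The function name, the use of `numpy`, and variance as the proxy for "separative power" are my own choices, not from the post:

```python
import numpy as np

def pick_separative_dims(points, n_dims):
    """Rank dimensions by variance (a crude proxy for separative power)
    and return the indices of the n_dims most spread-out ones."""
    variances = np.var(points, axis=0)
    # argsort ascending, reverse for descending, keep the top n_dims
    return np.argsort(variances)[::-1][:n_dims]

# Example: 1000 points in 50 dimensions; subdivide only on the best 10.
rng = np.random.default_rng(0)
points = rng.random((1000, 50))
dims = pick_separative_dims(points, 10)
```

A tree built on those 10 dimensions needs only 2^10 = 1024 cells on its first layer, which 10^5 points can actually populate.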

Whatever you do, you can't store that k-d tree as an actual tree, of
course.  Use coordinate mapping instead.  E.g., if you normalise all
coordinates to 0..1, isolate the most significant bit of each
coordinate and concatenate them all into a 50-100 bit long integer
(simulated, if your language doesn't support arbitrary precision).
You can then fairly quickly count how many coordinates differ in
their leading bit as

count_bits(bitwise_xor(bitstring(query), bitstring(set[i])))

to speed up the search.  Starting from this you can more quickly find
likely close candidates --- which in turn helps to quickly reject
others in the brute-force search to follow.
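This bit-signature trick can be sketched in Python (my sketch, not the poster's code; `numpy`, the function names, and the candidate ordering by signature Hamming distance are my assumptions). Visiting likely-close candidates first tightens the best-so-far bound early, so a partial-distance test rejects most remaining points after a few coordinates, while the scan stays exhaustive and therefore exact:

```python
import numpy as np

def signature(v):
    """Pack the most significant bit of each coordinate (normalised
    to [0, 1)) into a single arbitrary-precision integer."""
    sig = 0
    for x in v:
        sig = (sig << 1) | (1 if x >= 0.5 else 0)
    return sig

def nearest(query, points, sigs):
    """Exact nearest neighbour: visit candidates ordered by signature
    Hamming distance, rejecting each one as soon as its partial
    squared distance exceeds the best found so far."""
    qsig = signature(query)
    order = sorted(range(len(points)),
                   key=lambda i: bin(qsig ^ sigs[i]).count("1"))
    best_i, best_d2 = -1, float("inf")
    for i in order:
        d2 = 0.0
        for a, b in zip(points[i], query):
            d2 += (a - b) ** 2
            if d2 >= best_d2:      # partial-distance rejection
                break
        else:
            best_i, best_d2 = i, d2
    return best_i

rng = np.random.default_rng(1)
pts = rng.random((300, 50))
sigs = [signature(p) for p in pts]
q = rng.random(50)
idx = nearest(q, pts, sigs)
```

The signatures are computed once for the fixed point set; only the query's signature is built per lookup.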

> Is there an efficient algorithm for this kind of problem?

I doubt it --- in high dimensions the problem is intrinsically hard
(the curse of dimensionality), I'd say.
--

Even if all the snow were burnt, ashes would remain.

### High dimensional nearest neighbour search

> S.A. Nene and S.K. Nayar, "A simple algorithm for nearest neighbor search
> in high dimension spaces", IEEE Transactions on Pattern Analysis and
> Machine Intelligence, vol. 19, no. 9, pp. 989-1003, 1997.

> Jerzy Karczmarczuk
> Caen, France

Thank you, that did the job. Now I am down from a few hours to a few
minutes.
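For anyone reading along: as I recall it, the core of the Nene-Nayar algorithm is to trim the candidate set coordinate by coordinate down to an eps-hypercube around the query, then scan only the survivors. The paper does the trimming with pre-sorted projection arrays and binary search; the boolean-mask version below is a simplified sketch of mine, and it only returns the provably exact nearest neighbour after eps has grown large enough:

```python
import numpy as np

def nn_in_cube(points, query, eps):
    """Keep only points whose every coordinate is within eps of the
    query, then scan the survivors exactly.  Returns None when eps is
    too small to guarantee the true nearest neighbour."""
    mask = np.ones(len(points), dtype=bool)
    for d in range(points.shape[1]):
        mask &= np.abs(points[:, d] - query[d]) <= eps
        if not mask.any():
            return None                 # cube is empty; enlarge eps
    candidates = np.flatnonzero(mask)
    d2 = np.sum((points[candidates] - query) ** 2, axis=1)
    # Any point outside the cube is at distance > eps, so the winner
    # is provably global only if it is closer than eps.
    if np.sqrt(d2.min()) > eps:
        return None
    return int(candidates[np.argmin(d2)])

def nearest_exact(points, query, eps=0.05):
    """Grow eps until the cube provably contains the nearest neighbour."""
    while True:
        i = nn_in_cube(points, query, eps)
        if i is not None:
            return i
        eps *= 2

rng = np.random.default_rng(2)
pts = rng.random((400, 30))
q = rng.random(30)
idx = nearest_exact(pts, q)
```

When the starting eps is well tuned, most queries survive the trimming with only a handful of candidates, which is where the hours-to-minutes speedup comes from.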

I have about 32,000 fixed points in a vector space of dimension 100 and
higher. All my points are contained within an n-dimensional hypercube with
side length 1 in every dimension.  I want to implement some algorithm that
finds, for any given point in this space, its nearest neighbour. I tried
several approaches without success:

* Voronoi diagrams / Delaunay triangulation - in theory the best solution,
but since the complexity of this triangulation is O(N^2) and my N is rather
high, there is no way known to me that will finish before I die ;-)

* k-d trees (Sedgewick et al.) - since the average distance between any two
points in my vector space is very high, I found no way to construct and
search a tree for the nearest neighbour *without* having to visit at least
half of all my fixed points

* Hotelling transformation ("Hauptachsentransformation", i.e. principal
axis transformation) to reduce the dimension of my fixed points didn't
bring any valuable results

* A neural network (5 layers: 96 input, 50 hidden, 30 hidden, 12 output,
since I finally want to classify into 12 different classes) gets the job
96% right, but no change to the network layout has brought any improvement
yet - I need at least 99% ...

Any ideas beyond this?

Regards
Arnold Ude