I maintain an object relational database that uses set intersections to

prioritise the result set of a user query. A search for "cat dog mouse"

results in a list of objects presented to the user in this order:

Group 1 (100% relevancy)

List of objects with exactly 3 keywords

Group 2 (66% relevancy)

List of objects with exactly 2 keywords

Group 3 (33% relevancy)

List of objects with exactly 1 keyword

The interface is HTTP and we calculate the groups using set operations.

This is how we did it.

Let Set_cat be the set of all objects with the "cat" keyword, then:

Group 1 = Set_cat INTERSECT Set_dog INTERSECT Set_mouse

Group 2 = (

(Set_cat INTERSECT Set_dog)

UNION

(Set_cat INTERSECT Set_mouse)

UNION

(Set_dog INTERSECT Set_mouse)

) DIFFERENCE Group 1

Group 3 = (Set_cat UNION Set_dog UNION Set_mouse)

DIFFERENCE Group 2

DIFFERENCE Group 1

The 'A DIFFERENCE B' notation means remove all elements of B from A.

We limited the number of keywords to eight to cap the combinatorial

explosion that results from this approach. As someone pointed out, the

calculation of the later groups is dependent on the earlier groups.

The implementation used combinations and I was able to transcribe one of the

algorithms (n pick r) directly from

PROGRAMS AND DATA STRUCTURES IN C, Second Edition

Leendert Ammeraal

Wiley

ISBN 0-471-93123-3

Chapter 3.3 p 91

This approach was made possible by the existance of a SET primitive in the

database, where each object is represented as a bit in a bitfield. We were

dealing with around 30 000 objects at a time. The position of a bit

corresponded to an internal record number in the ORDB. The ODBMS equivalent

would be the OID.

___________

Here's my notes on the combinatorial explosion in the number of SET

operations based on eight keywords.

We need to create a list of sets

Ki, i = 1,..,k where Kn is the set of objects with exactly n keywords

Creation will need to proceed recursively.

Let Sa be the set corresponding to the bth keyword i.e. the set of all

objects with the bth keyword, then:

K8 = S1 ^ S2 ^ S3 ^ S4 ^ S5 ^ S6 ^ S7 ^ S8

where ^ is the symbol for set intersection

K7 = U{8 pick 7}(Si1 ^ Si2 ^ Si3 ^ Si4 ^ Si5 ^ Si6 ^ Si7) - K8

where {8 pick 7) means choose 7 from 8 and Sij is the jth such picked set

K6 = U{8 pick 6}(Si1 ^ Si2 ^ Si3 ^ Si4 ^ Si5 ^ Si6) - K8 - K7

K5 = U{8 pick 5}(Si1 ^ Si2 ^ Si3 ^ Si4 ^ Si5) - K8 - K7 - K6

K4 = U{8 pick 4}(Si1 ^ Si2 ^ Si3 ^ Si4) - K8 - K7 - K6 - K5

K3 = U{8 pick 3}(Si1 ^ Si2 ^ Si3) - K8 - K7 - K6 - K5 - K4

K2 = U{8 pick 2}(Si1 ^ Si2) - K8 - K7 - K6 - K5 - K4 - K3

K1 = U{8 pick 1}(Si1) - K8 - K7 - K6 - K5 - K4 - K3 - K2

Continuing with 8 as our sample number of keywords we arrive at a figure

for the number of set operations involved by consideration of the below

table:

pick nCr #ipc #tc #u #ai TOTAL_OPS

1 8 0 0 7 7 14

2 28 1 28 27 6 61

3 56 2 112 55 5 172

4 70 3 210 69 4 283

5 56 4 224 55 3 282

6 28 5 70 27 2 99

7 8 6 48 7 1 56

8 1 7 7 0 0 7

#ipc = number of INTERSECTION ops per combination

#tc = total number of INTERSECTION ops per number picked = nCr * #ipc

#u = number of UNION operations = nCr - 1

#ai = number of DIFFERENCE operations = 8 - pick

TOTAL_OPS = #tc + #u + #ai

___________

I am now moving the system away from the object relational database to an

object database. Has anyone done this kind of thing using the Java

Collections Framework? I would be interested to hear their accounts. I'm a

bit concerned about the performance of collections versus bitfields.

Yan

Quote:> I guess I wasn't very specific with my requests for information on my query

> about the implementation of search engine text indexing.

> Its easy enough to take a gazillion documents, and then for each word, have

> a Btree index to a large set of documents that contain that particular word.

> Lets say someone wants to search for all the documents that contain: "cat

> dog mouse". Now, cat returns 1,836,939 documents, dog returns 3,670,789

> documents, and mouse returns 251,590.

> Here's an example implementation:

> One implementation of the set intersection would be to have the many large

> sets of document ID's indexed in a BTree. So in this example, there would

> be 3 BTrees, one for cat, dog, and mouse. One possible algorithm is to

> iterate over every mouse document ID and look up each document ID in the cat

> index to see if there is a match. The remaining documents that succeeded

> could be intersected with the dog index.

> The above algorithm seems rather inefficient and there must be better ways

> of implementing or structuring the data other than how I've described. So

> I'm primarily looking for indexing algorithms or structures, that are

> designed primarily for being efficient at doing very large set

> intersections. I imagine that these large set intersections not only are

> used heavily in optimizing text search engines, but in other non-text

> related searches as well.

> Ian