I maintain an object relational database that uses set intersections to
prioritise the result set of a user query. A search for "cat dog mouse"
results in a list of objects presented to the user in this order:
Group 1 (100% relevancy)
List of objects with exactly 3 keywords
Group 2 (66% relevancy)
List of objects with exactly 2 keywords
Group 3 (33% relevancy)
List of objects with exactly 1 keyword
The interface is HTTP and we calculate the groups using set operations.
This is how we did it.
Let Set_cat be the set of all objects with the "cat" keyword, then:
Group 1 = Set_cat INTERSECT Set_dog INTERSECT Set_mouse
Group 2 = (
(Set_cat INTERSECT Set_dog)
UNION
(Set_cat INTERSECT Set_mouse)
UNION
(Set_dog INTERSECT Set_mouse)
) DIFFERENCE Group 1
Group 3 = (Set_cat UNION Set_dog UNION Set_mouse)
DIFFERENCE Group 2
DIFFERENCE Group 1
The 'A DIFFERENCE B' notation means remove all elements of B from A.
We limited the number of keywords to eight to cap the combinatorial
explosion that results from this approach. As someone pointed out, the
calculation of the later groups is dependent on the earlier groups.
The implementation used combinations and I was able to transcribe one of the
algorithms (n pick r) directly from
PROGRAMS AND DATA STRUCTURES IN C, Second Edition
Leendert Ammeraal
Wiley
ISBN 0-471-93123-3
Chapter 3.3 p 91
This approach was made possible by the existance of a SET primitive in the
database, where each object is represented as a bit in a bitfield. We were
dealing with around 30 000 objects at a time. The position of a bit
corresponded to an internal record number in the ORDB. The ODBMS equivalent
would be the OID.
___________
Here's my notes on the combinatorial explosion in the number of SET
operations based on eight keywords.
We need to create a list of sets
Ki, i = 1,..,k where Kn is the set of objects with exactly n keywords
Creation will need to proceed recursively.
Let Sa be the set corresponding to the bth keyword i.e. the set of all
objects with the bth keyword, then:
K8 = S1 ^ S2 ^ S3 ^ S4 ^ S5 ^ S6 ^ S7 ^ S8
where ^ is the symbol for set intersection
K7 = U{8 pick 7}(Si1 ^ Si2 ^ Si3 ^ Si4 ^ Si5 ^ Si6 ^ Si7) - K8
where {8 pick 7) means choose 7 from 8 and Sij is the jth such picked set
K6 = U{8 pick 6}(Si1 ^ Si2 ^ Si3 ^ Si4 ^ Si5 ^ Si6) - K8 - K7
K5 = U{8 pick 5}(Si1 ^ Si2 ^ Si3 ^ Si4 ^ Si5) - K8 - K7 - K6
K4 = U{8 pick 4}(Si1 ^ Si2 ^ Si3 ^ Si4) - K8 - K7 - K6 - K5
K3 = U{8 pick 3}(Si1 ^ Si2 ^ Si3) - K8 - K7 - K6 - K5 - K4
K2 = U{8 pick 2}(Si1 ^ Si2) - K8 - K7 - K6 - K5 - K4 - K3
K1 = U{8 pick 1}(Si1) - K8 - K7 - K6 - K5 - K4 - K3 - K2
Continuing with 8 as our sample number of keywords we arrive at a figure
for the number of set operations involved by consideration of the below
table:
pick nCr #ipc #tc #u #ai TOTAL_OPS
1 8 0 0 7 7 14
2 28 1 28 27 6 61
3 56 2 112 55 5 172
4 70 3 210 69 4 283
5 56 4 224 55 3 282
6 28 5 70 27 2 99
7 8 6 48 7 1 56
8 1 7 7 0 0 7
#ipc = number of INTERSECTION ops per combination
#tc = total number of INTERSECTION ops per number picked = nCr * #ipc
#u = number of UNION operations = nCr - 1
#ai = number of DIFFERENCE operations = 8 - pick
TOTAL_OPS = #tc + #u + #ai
___________
I am now moving the system away from the object relational database to an
object database. Has anyone done this kind of thing using the Java
Collections Framework? I would be interested to hear their accounts. I'm a
bit concerned about the performance of collections versus bitfields.
Yan
Quote:> I guess I wasn't very specific with my requests for information on my query
> about the implementation of search engine text indexing.
> Its easy enough to take a gazillion documents, and then for each word, have
> a Btree index to a large set of documents that contain that particular word.
> Lets say someone wants to search for all the documents that contain: "cat
> dog mouse". Now, cat returns 1,836,939 documents, dog returns 3,670,789
> documents, and mouse returns 251,590.
> Here's an example implementation:
> One implementation of the set intersection would be to have the many large
> sets of document ID's indexed in a BTree. So in this example, there would
> be 3 BTrees, one for cat, dog, and mouse. One possible algorithm is to
> iterate over every mouse document ID and look up each document ID in the cat
> index to see if there is a match. The remaining documents that succeeded
> could be intersected with the dog index.
> The above algorithm seems rather inefficient and there must be better ways
> of implementing or structuring the data other than how I've described. So
> I'm primarily looking for indexing algorithms or structures, that are
> designed primarily for being efficient at doing very large set
> intersections. I imagine that these large set intersections not only are
> used heavily in optimizing text search engines, but in other non-text
> related searches as well.
> Ian