lage set intersecting (was: search engine text indexing?)

lage set intersecting (was: search engine text indexing?)

Post by Ian Uprigh » Sun, 12 Nov 2000 04:00:00



I guess I wasn't very specific with my requests for information on my query
about the implementation of search engine text indexing.

Its easy enough to take a gazillion documents, and then for each word, have
a Btree index to a large set of documents that contain that particular word.

Lets say someone wants to search for all the documents that contain: "cat
dog mouse".  Now, cat returns 1,836,939 documents, dog returns 3,670,789
documents, and mouse returns 251,590.  

Here's an example implementation:

One implementation of the set intersection would be to have the many large
sets of document ID's indexed in a BTree.  So in this example, there would
be 3 BTrees, one for cat, dog, and mouse.  One possible algorithm is to
iterate over every mouse document ID and look up each document ID in the cat
index to see if there is a match.  The remaining documents that succeeded
could be intersected with the dog index.

The above algorithm seems rather inefficient and there must be better ways
of implementing or structuring the data other than how I've described.  So
I'm primarily looking for indexing algorithms or structures, that are
designed primarily for being efficient at doing very large set
intersections.  I imagine that these large set intersections not only are
used heavily in optimizing text search engines, but in other non-text
related searches as well.

Ian

 
 
 

lage set intersecting (was: search engine text indexing?)

Post by anh » Mon, 13 Nov 2000 09:02:47


I've noticed search engines refine the number of search results as one
progresses through the results.  My best guess is that they are doing
exactly what you described.  I.E. They are dynamically building the
interesection as the user iterates through it.

Aaron


> I guess I wasn't very specific with my requests for information on my query
> about the implementation of search engine text indexing.

> Its easy enough to take a gazillion documents, and then for each word, have
> a Btree index to a large set of documents that contain that particular word.

> Lets say someone wants to search for all the documents that contain: "cat
> dog mouse".  Now, cat returns 1,836,939 documents, dog returns 3,670,789
> documents, and mouse returns 251,590.

> Here's an example implementation:

> One implementation of the set intersection would be to have the many large
> sets of document ID's indexed in a BTree.  So in this example, there would
> be 3 BTrees, one for cat, dog, and mouse.  One possible algorithm is to
> iterate over every mouse document ID and look up each document ID in the cat
> index to see if there is a match.  The remaining documents that succeeded
> could be intersected with the dog index.

> The above algorithm seems rather inefficient and there must be better ways
> of implementing or structuring the data other than how I've described.  So
> I'm primarily looking for indexing algorithms or structures, that are
> designed primarily for being efficient at doing very large set
> intersections.  I imagine that these large set intersections not only are
> used heavily in optimizing text search engines, but in other non-text
> related searches as well.

> Ian


 
 
 

lage set intersecting (was: search engine text indexing?)

Post by akmal b chaudhr » Mon, 13 Nov 2000 04:00:00


Ian:

If you check some of the references given by people, you may find what you
are looking for. Note that B-Trees may not necessarily be used for this
type of searching. Here, for example, are some references I've taken from
Andrea Garratt's paper:

BAYER R., MCCREIGHT E., "Organisation and maintenance of large ordered
indexes", Acta Informatica, 1(3), pp. 173-189, 1972.

FALOUTSOS C., "Signature files", In: Frakes W.B., Baeza-Yates R. (eds.)
Information retrieval data structures and algorithms, Prentice Hall, New
Jersey, pp. 44-65, 1992

GONNET G. H., BAEZA-YATES R.A., SNIDER T., "New indices for text: PAT
trees and PAT arrays", In: Frakes W.B., Baeza-Yates R. (eds.) Information
retrieval data structures and algorithms, Prentice Hall, New Jersey, pp.
66-82, 1992

JACKSON M.S., BURDEN J.P.H., "WWLib-TNG - new directions in search engine
technology", IEE Informatics Colloquium: Lost in the Web, pp. 10/1-10/8,
1999

ZOBEL J., MOFFAT A., RAMAMOHANARAO K., "Inverted files versus signature
files for text indexing", ACM Transactions on Database Systems, 23(4), pp.
453-490, 1998.

HTH

akmal

Quote:> I guess I wasn't very specific with my requests for information on my query
> about the implementation of search engine text indexing.

> Its easy enough to take a gazillion documents, and then for each word, have
> a Btree index to a large set of documents that contain that particular word.

> Lets say someone wants to search for all the documents that contain: "cat
> dog mouse".  Now, cat returns 1,836,939 documents, dog returns 3,670,789
> documents, and mouse returns 251,590.  

> Here's an example implementation:

> One implementation of the set intersection would be to have the many large
> sets of document ID's indexed in a BTree.  So in this example, there would
> be 3 BTrees, one for cat, dog, and mouse.  One possible algorithm is to
> iterate over every mouse document ID and look up each document ID in the cat
> index to see if there is a match.  The remaining documents that succeeded
> could be intersected with the dog index.

> The above algorithm seems rather inefficient and there must be better ways
> of implementing or structuring the data other than how I've described.  So
> I'm primarily looking for indexing algorithms or structures, that are
> designed primarily for being efficient at doing very large set
> intersections.  I imagine that these large set intersections not only are
> used heavily in optimizing text search engines, but in other non-text
> related searches as well.

> Ian

--
[ ---- akmal at soi.city.ac.uk ---- ]
[ http://www.soi.city.ac.uk/~akmal/ ]
 
 
 

lage set intersecting (was: search engine text indexing?)

Post by Markus Schuman » Mon, 13 Nov 2000 04:00:00


Ian,

I give you a brief idea about a possible algorithm to index VL databases for
full text searches.

Akmal gave a good reference:

FALOUTSOS C., "Signature files", In: Frakes W.B., Baeza-Yates R. (eds.)
Information retrieval data structures and algorithms, Prentice Hall, New
Jersey, pp. 44-65, 1992

I think signature files is something you should look into.

How does it work to index a VL database for full text searches using
signature files?
Signatures a large bit strings (e.g. 200 bits).

INDEX CREATION

1. you partition the documents into smaller units ... let's say pages.
2. you remove all non significant words ("the", "and", "is" ... etc) using a
dictionary approach.
3. choose the length of the bit string
4. you hash each word into the a bit string
5. you logical OR all bit strings of a unit (page) into one bit string

in step 3-4 is important ...

step 3:
You need a big enough size of the bitstring in order NOT to end up with all
1's after step 5,
otherwise the bit string is useless.

step 4:
your hash function has to be good enough to spread nicely among your bit
string and not
to create to many collisions (... as usual)

INDEX SEARCH

A search is very simple ...
let's say you search for: "object", "database", "large", "index"

a. you hash each word into a bit string using the same approach as in step 4
(creation)
b. you logical OR the bitstring of "object", "database", "large", "index"
c. you search all your existing bitstring of your database by ANDing them
with the result
of step b.
d. you linear search all units (pages) that gave a TRUE in step c.

important ...

you have set up the system in away that you get only very few false hits in
c, by
choosing the hash function, length of the bit string and setting the unit
(here page).

There are a lot of different way to organize the bit strings of your
database:
    - multi level bit strings
    - partitioned approaches
    - etc.

The is a very quick intro into the topic of signature files ... I personally
think that it
is a very interesting topic.
My wild guess is that quite a few commercial products are based a similar
schemas.
But nobody really likes to disclose this internal details ... I wouldn't
too.

Markus.


Quote:> I guess I wasn't very specific with my requests for information on my
query
> about the implementation of search engine text indexing.

> Its easy enough to take a gazillion documents, and then for each word,
have
> a Btree index to a large set of documents that contain that particular
word.

> Lets say someone wants to search for all the documents that contain: "cat
> dog mouse".  Now, cat returns 1,836,939 documents, dog returns 3,670,789
> documents, and mouse returns 251,590.

> Here's an example implementation:

> One implementation of the set intersection would be to have the many large
> sets of document ID's indexed in a BTree.  So in this example, there would
> be 3 BTrees, one for cat, dog, and mouse.  One possible algorithm is to
> iterate over every mouse document ID and look up each document ID in the
cat
> index to see if there is a match.  The remaining documents that succeeded
> could be intersected with the dog index.

> The above algorithm seems rather inefficient and there must be better ways
> of implementing or structuring the data other than how I've described.  So
> I'm primarily looking for indexing algorithms or structures, that are
> designed primarily for being efficient at doing very large set
> intersections.  I imagine that these large set intersections not only are
> used heavily in optimizing text search engines, but in other non-text
> related searches as well.

> Ian

 
 
 

lage set intersecting (was: search engine text indexing?)

Post by Tim Mill » Tue, 14 Nov 2000 04:00:00



Quote:> I guess I wasn't very specific with my requests for information on my
query
> about the implementation of search engine text indexing.

> Its easy enough to take a gazillion documents, and then for each word,
have
> a Btree index to a large set of documents that contain that particular
word.

> Lets say someone wants to search for all the documents that contain: "cat
> dog mouse".  Now, cat returns 1,836,939 documents, dog returns 3,670,789
> documents, and mouse returns 251,590.

Typically, an IR system would use inverted lists.  For each term (cat, dog,
mouse)
an list of document IDs containing that term is built at index time.  This
list
may also contain positions of words within documents if phrase searching is
required.
If these lists are stored sorted by document ID, to find those documents
containing
more than one term, simply walk down each list looking for matching document
IDs.
This can be implemented very efficiently.  Take a look at:

(1995), Brown, Eric, "Execution Performance Issues in Full-Text Information
Retrieval," Ph.D. dissertation, UMass Technical Report TR95-81 .

http://ciir.cs.umass.edu/pubfiles/UM-CS-1995-081.pdf

 
 
 

lage set intersecting (was: search engine text indexing?)

Post by Janek Boguck » Wed, 15 Nov 2000 04:00:00


I maintain an object relational database that uses set intersections to
prioritise the result set of a user query. A search for "cat dog mouse"
results in a list of objects presented to the user in this order:

Group 1 (100% relevancy)
    List of objects with exactly 3 keywords

Group 2 (66% relevancy)
    List of objects with exactly 2 keywords

Group 3 (33% relevancy)
    List of objects with exactly 1 keyword

The interface is HTTP and we calculate the groups using set operations.

This is how we did it.

Let Set_cat be the set of all objects with the "cat" keyword, then:

Group 1 = Set_cat INTERSECT Set_dog INTERSECT Set_mouse

Group 2 = (
            (Set_cat INTERSECT Set_dog)
            UNION
            (Set_cat INTERSECT Set_mouse)
            UNION
            (Set_dog INTERSECT Set_mouse)
          )  DIFFERENCE Group 1

Group 3 = (Set_cat UNION Set_dog UNION Set_mouse)
            DIFFERENCE Group 2
            DIFFERENCE Group 1

The 'A DIFFERENCE B' notation means remove all elements of B from A.

We limited the number of keywords to eight to cap the combinatorial
explosion that results from this approach. As someone pointed out, the
calculation of the later groups is dependent on the earlier groups.

The implementation used combinations and I was able to transcribe one of the
algorithms (n pick r) directly from

  PROGRAMS AND DATA STRUCTURES IN C, Second Edition
  Leendert Ammeraal
  Wiley
  ISBN 0-471-93123-3
  Chapter 3.3 p 91

This approach was made possible by the existance of a SET primitive in the
database, where each object is represented as a bit in a bitfield. We were
dealing with around 30 000 objects at a time. The position of a bit
corresponded to an internal record number in the ORDB. The ODBMS equivalent
would be the OID.

___________

Here's my notes on the combinatorial explosion in the number of SET
operations based on eight keywords.

  We need to create a list of sets
  Ki, i = 1,..,k where Kn is the set of objects with exactly n keywords

  Creation will need to proceed recursively.
  Let Sa be the set corresponding to the bth keyword i.e. the set of all
  objects with the bth keyword, then:

  K8 = S1 ^ S2 ^ S3 ^ S4 ^ S5 ^ S6 ^ S7 ^ S8
  where ^ is the symbol for set intersection

  K7 = U{8 pick 7}(Si1 ^ Si2 ^ Si3 ^ Si4 ^ Si5 ^ Si6 ^ Si7) - K8
  where {8 pick 7) means choose 7 from 8 and Sij is the jth such picked set

  K6 = U{8 pick 6}(Si1 ^ Si2 ^ Si3 ^ Si4 ^ Si5 ^ Si6) - K8 - K7

  K5 = U{8 pick 5}(Si1 ^ Si2 ^ Si3 ^ Si4 ^ Si5) - K8 - K7 - K6

  K4 = U{8 pick 4}(Si1 ^ Si2 ^ Si3 ^ Si4) - K8 - K7 - K6 - K5

  K3 = U{8 pick 3}(Si1 ^ Si2 ^ Si3) - K8 - K7 - K6 - K5 - K4

  K2 = U{8 pick 2}(Si1 ^ Si2) - K8 - K7 - K6 - K5 - K4 - K3

  K1 = U{8 pick 1}(Si1) - K8 - K7 - K6 - K5 - K4 - K3 - K2

Continuing with 8 as our sample number of keywords we arrive at a figure
for the number of set operations involved by consideration of the below
table:

  pick     nCr     #ipc    #tc   #u    #ai   TOTAL_OPS
  1          8        0      0    7      7       14
  2         28        1     28   27      6       61
  3         56        2    112   55      5      172
  4         70        3    210   69      4      283
  5         56        4    224   55      3      282
  6         28        5     70   27      2       99
  7          8        6     48    7      1       56
  8          1        7      7    0      0        7

  #ipc = number of INTERSECTION ops per combination
  #tc = total number of INTERSECTION ops per number picked = nCr * #ipc
  #u = number of UNION operations = nCr - 1
  #ai = number of DIFFERENCE operations = 8 - pick
  TOTAL_OPS = #tc + #u + #ai

___________

I am now moving the system away from the object relational database to an
object database. Has anyone done this kind of thing using the Java
Collections Framework? I would be interested to hear their accounts. I'm a
bit concerned about the performance of collections versus bitfields.

Yan



Quote:> I guess I wasn't very specific with my requests for information on my query
> about the implementation of search engine text indexing.

> Its easy enough to take a gazillion documents, and then for each word, have
> a Btree index to a large set of documents that contain that particular word.

> Lets say someone wants to search for all the documents that contain: "cat
> dog mouse".  Now, cat returns 1,836,939 documents, dog returns 3,670,789
> documents, and mouse returns 251,590.

> Here's an example implementation:

> One implementation of the set intersection would be to have the many large
> sets of document ID's indexed in a BTree.  So in this example, there would
> be 3 BTrees, one for cat, dog, and mouse.  One possible algorithm is to
> iterate over every mouse document ID and look up each document ID in the cat
> index to see if there is a match.  The remaining documents that succeeded
> could be intersected with the dog index.

> The above algorithm seems rather inefficient and there must be better ways
> of implementing or structuring the data other than how I've described.  So
> I'm primarily looking for indexing algorithms or structures, that are
> designed primarily for being efficient at doing very large set
> intersections.  I imagine that these large set intersections not only are
> used heavily in optimizing text search engines, but in other non-text
> related searches as well.

> Ian

 
 
 

lage set intersecting (was: search engine text indexing?)

Post by Ian Uprigh » Fri, 17 Nov 2000 04:00:00



>(1995), Brown, Eric, "Execution Performance Issues in Full-Text Information
>Retrieval," Ph.D. dissertation, UMass Technical Report TR95-81 .

>http://ciir.cs.umass.edu/pubfiles/UM-CS-1995-081.pdf

Fantastic reference.. Thanks a bunch!!

IaN

 
 
 

1. Configuring MS Search Engine or Full Text Indexing

If the answer to the below is already posted... please point me to the
posting.

Inside a table, I need to store text in RTF(Rich Text Format) and take
advantage of full text indexing. I would like the search engine to use
the "\" character as a delimiter in the same manner as it uses the
"."character.  For example, if you have a column that is using Full
Text Indexing in SQL Server 2000 and the string in said column is
"Rich Text.", when you do a search using the string "Rich Text" you
will get a hit for the EXACT word search.  The search engine knows
that the period is not part of the last word.  Now, if I want to store
the data in RTF it looks something like this "\ul\b rich
text\ulnone\b0", if you search the field using the string "Rich Text"
you will NOT get a hit on the EXACT phrase "rich text" unless you
include the wild card "*" on the end of the search string.  Finally my
question.  Does anybody know how to configure the search engine or
full text indexing so that the "\" character will be used the same way
as a white space or the "." character?

I appreciate any helpful responses.

JB

2. Any performance strategies on how to Optimze reading FoxPro files from VB

3. Referntial Integrity can't be defined

4. How can I redirect the Oracle Report Output to EXCELL FILE

5. WebHosting MSSQL Help!

6. Nested Sets and Intersecting Sets

7. How to update a very lage table with a set of rows

8. Search engine like behaviour for result sets?

9. Preventing Search Engines from Indexing Contents of DropDown Menu

10. full text search engine