Full text searching, anyone interested?

Full text searching, anyone interested?

Post by ml » Mon, 04 Jun 2001 01:53:08



I frequently rant on this about full text searching.

The fulltextsearch package under contrib is an interesting approach. As anyone
rolling their eyes at my frequent posts, know that I am also working on a full
text search system.

The two packages offer two very different approaches.

"/contrib/fulltextsearch" uses a table of word/oid pairs to find words. Such
that a full text search query would look something like this:

select * from foo where fti_foo.string = 'word' and fti_foo.id = foo.oid;

(or something like that)

Mine has, as yet, has not been contributed. I am dubious that it will happen in
the near future for various reasons. However, it is designed around a
completely different strategy, it is more like a search engine.

Like the fti package, my system parses out a table into component words,
however, rather than storing them word/oid pairs, I use an external file format
which stores a searchable dictionary of words. Each word is associated with a
bitmap.

The bitmap contains a bit for each record. Multiple words in a query are parsed
and the bitmaps retrieved. The bitmaps are then combined with boolean
operations (and/or/not).

Each bit (0...count) is associated with an integer key. The integer key can be
an oid or some other unique table ID.

Being an external system, it is implemented using 'C' based extensions to
postgres. A query on my system looks something like:

select * from table, (select ftss_direct('all { word1 word2 }') as id) as
result, where table.id = result.id;

The advantages of fti is that it, by being tied closely to Postgres, updates
automatically on inserts and updates. The disadvantage is that it is relatively
slow to perform complex searches.

The advantage of my system is that it is very fast, a loaded system can search
the bitmap index of roughly 4 words within 4 million records in about 20ms~50ms
depending on the popularity of the words. (Performance lags at startup) The
disadvantage of my system is that it is not designed to be very dynamic, it
relies extensively on the virtual memory management of the system (and is thus
limited), and not being tied too closely to PostgreSQL, has been somewhat
frustrating to getting it working as well as I would like.

I would love to find a way to get a bitmap like index native to Postgres. I
would, in fact, love and expect to do an amount of it. The problem is to do it
"right" will require a lot of work at very low levels of Postgres.

Is anyone interested in pursuing this?
How should it look?
How deeply rooted in postgres does it need to be?

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

 
 
 

Full text searching, anyone interested?

Post by Gavin Sher » Mon, 04 Jun 2001 11:07:31


Hi guys,


> I frequently rant on this about full text searching.

[snip]
> I would love to find a way to get a bitmap like index native to Postgres. I
> would, in fact, love and expect to do an amount of it. The problem is to do it
> "right" will require a lot of work at very low levels of Postgres.

> Is anyone interested in pursuing this?

Yes. I think this would be an important feature of PostgreSQL. I have
hacked contrib/fulltextindex to bits in order to segment the index so that
I can better deploy it on a cluster of postgres machines. I have also
changed it to a score/rating style system. However, it is still a word ->
oid relationship and is not scaling as well as I had hoped.

Quote:> How should it look?

In terms of interface to SQL, the function call which activates your FTI
search is much neater than the way I do it -- build a query based on the
number of search terms and the boolean operator used.

In terms of how it is interfaced to Postgres backend, I think it should be
an index type which one can apply to any character orientated
columns(s). It would be important that the index capable of handling
multiple columns so that many textual fields in a single row could be
indexed.

The index itself is where troubles would be found. People expect a lot
from full text searches. My own implementations allow chronological
searching, score based searching and searching on similar words. It would
be hard to interface this to CREATE INDEX as well as select. So, if a
native full text index was to be build it would have to be able to:

a) Index multiple columns
b) Be configurable: score/frequency based sorting, sorting in terms of a
column in an index row?
c) Be interfaced to a user level fti() function
d) Be ignored by the planner (if we want searches to occur only through
fti())
e) honour insert/delete.

Something else which is an issue is the size of the index. Indices on text
columns are generally very large. In my applications I have managed to
reduce this through segmenting the indices along the following lines: case
sensitivity/insensitivty, leading characters. This dramatically reduces
the IO load of an index scan -- but it would be quite difficult to build
this into a dynamic framework for the backend. For one, a VACUUM or some
equivalent would need to evaluate the size of a given index and, based on
other configuration information (is the user allowing index
segmentation?) segment the index based on some criteria. The problem then
is that for large indices, this could take quite some time.

I've probably over looked a fair number of other things which would need
to be considered. However, it's safe to say that such a feature native to
Postgres would be greatly appreciated by many.

Gavin

---------------------------(end of broadcast)---------------------------


 
 
 

Full text searching, anyone interested?

Post by ml » Tue, 05 Jun 2001 02:27:22



> Hi guys,


> > I frequently rant on this about full text searching.

> [snip]
> > I would love to find a way to get a bitmap like index native to Postgres. I
> > would, in fact, love and expect to do an amount of it. The problem is to do it
> > "right" will require a lot of work at very low levels of Postgres.

> > Is anyone interested in pursuing this?

> Yes. I think this would be an important feature of PostgreSQL. I have
> hacked contrib/fulltextindex to bits in order to segment the index so that
> I can better deploy it on a cluster of postgres machines. I have also
> changed it to a score/rating style system. However, it is still a word ->
> oid relationship and is not scaling as well as I had hoped.

> > How should it look?

> In terms of interface to SQL, the function call which activates your FTI
> search is much neater than the way I do it -- build a query based on the
> number of search terms and the boolean operator used.

> In terms of how it is interfaced to Postgres backend, I think it should be
> an index type which one can apply to any character orientated
> columns(s). It would be important that the index capable of handling
> multiple columns so that many textual fields in a single row could be
> indexed.

> The index itself is where troubles would be found. People expect a lot
> from full text searches. My own implementations allow chronological
> searching, score based searching and searching on similar words. It would
> be hard to interface this to CREATE INDEX as well as select. So, if a
> native full text index was to be build it would have to be able to:

> a) Index multiple columns
> b) Be configurable: score/frequency based sorting, sorting in terms of a
> column in an index row?
> c) Be interfaced to a user level fti() function
> d) Be ignored by the planner (if we want searches to occur only through
> fti())
> e) honour insert/delete.

> Something else which is an issue is the size of the index. Indices on text
> columns are generally very large. In my applications I have managed to
> reduce this through segmenting the indices along the following lines: case
> sensitivity/insensitivty, leading characters. This dramatically reduces
> the IO load of an index scan -- but it would be quite difficult to build
> this into a dynamic framework for the backend. For one, a VACUUM or some
> equivalent would need to evaluate the size of a given index and, based on
> other configuration information (is the user allowing index
> segmentation?) segment the index based on some criteria. The problem then
> is that for large indices, this could take quite some time.

> I've probably over looked a fair number of other things which would need
> to be considered. However, it's safe to say that such a feature native to
> Postgres would be greatly appreciated by many.

Actually, I was sort of thinking.....

We could implement bitmap handling functions based on one dimentional arrays of
integers. That's how my stuff deals with them, and postgres already manages
them.

create table fubar_words (char string, bitmap int4 []);

Store various info in the bitmap.

Then we can create another table:

create table fubar_oids(bit integer, id oid);

It would work much like my system, but would use postgres to manage all the
data. It would be less efficient than the stand alone external system, but
perform much better than the fti package.

Hmmm. thinking.

------------------------
http://www.mohawksoft.com

---------------------------(end of broadcast)---------------------------

 
 
 

Full text searching, anyone interested?

Post by Teod » Thu, 07 Jun 2001 04:48:23


Quote:> > > I would love to find a way to get a bitmap like index native to Postgres. I

[skip]

Quote:> We could implement bitmap handling functions based on one dimentional arrays of
> integers. That's how my stuff deals with them, and postgres already manages
> them.

look at contrib/intarray. gist__intbig_ops is a variant of signature
tree (from each array  get bitmap signature).

Regards,
        Teodor

---------------------------(end of broadcast)---------------------------

 
 
 

1. Open source full text indexing, anyone interested?

Hi All,

I have been annoyed at the lack of string and full text handling in SQL
7.0 (yes I know it has a search service, but have you tried using it)
so I decided to create a generic plug and play set of routines for use
from T-SQL. Well, I have done just that and I am reasonably proud of
them.

My question is, would there be any interest from the group in placing
these under an open source agreement of some kind?

In return, I would require active feedback on the functions and any
optimisations that could possibly be made. This helps all of us, you
get a ready made set of functions and we all get a fully debugged
generic full text indexing engine.

I would also need some feedback on ease of use, adaptability, features
required and constructive criticism.

So, before I spend hours putting documentation together, publishing a
site and posting the results here, is there any real interest in this?


subject line of 'FTX' please. Please remove the anti spam from my email
address first.

By the way, it consists of five tables and six procedures and should
work with datasets of any size (in use on 1 million row book database).

Ryan

Sent via Deja.com http://www.deja.com/
Before you buy.

2. Pls verify www.sqldew.net

3. Want to full populate full text search through trigger

4. Transporter Crashers

5. Full-Text Search - Searching HTML

6. Append from another database

7. Boolean search using Full Text search

8. fmp5 Mac/Pc : script search date

9. Full-text searching (Microsoft Search service) goes nuts

10. Searching Word Documents with Full Text Searching

11. Need faster way to search full-text in a text column

12. Search over Text, Is full-text only solution?

13. Full Text Searches on TEXT datatype ?