contrib/tsearch

Post by Christopher Kings-Lynne » Fri, 06 Sep 2002 15:40:33



Hmmm...thinking about it, maybe 'herring' is being reduced to 'her' after
the stemming process and hence is thought to be a stopword?  This is a bug,
but how should it be fixed?

Although, tests don't support that:

usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx ## 'himring';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx ## 'hisring';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx ## 'hising';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx ## 'himing';
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)

All work...?

Chris

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
 
 
 

contrib/tsearch

Post by Christopher Kings-Lynne » Fri, 06 Sep 2002 15:35:43


Hi Oleg/Teodor,

I'm sorry to keep posting bugs without patches, but I'm just hoping you guys
know the answer faster than I...I know you're busy.

What does tsearch have against the word 'herring' (as in the fish)?  Why is
it considered a stopword?

Attached are example queries...

Chris

Attachment: result.txt

usa=# select food_id, brand,description,ftiidx from food_foods where description ilike '%herring%';
 food_id |         brand          |                 description                  |                         ftiidx
---------+------------------------+----------------------------------------------+---------------------------------------------------------
   66245 | Kosher/Deli Foods      | Herring, Smoked                              | 'food' 'smoke' 'kosher/deli'
   66246 | Kosher/Deli Foods      | Herring, in Sour Cream                       | 'food' 'sour' 'cream' 'kosher/deli'
    4590 | - Generic -            | Fish oil, herring                            | 'oil' 'fish' 'gener'
   70737 | - Average All Brands - | Finfish, herring, Pacific, raw               | 'raw' 'brand' 'pacif' 'averag' 'finfish'
   70738 | - Average All Brands - | Finfish, herring, Pacific, cooked, dry heat  | 'dri' 'cook' 'heat' 'brand' 'pacif' 'averag' 'finfish'
   70739 | - Average All Brands - | Finfish, herring, Atlantic, raw              | 'raw' 'brand' 'atlant' 'averag' 'finfish'
   70740 | - Average All Brands - | Finfish, herring, Atlantic, pickled          | 'brand' 'pickl' 'atlant' 'averag' 'finfish'
   70741 | - Average All Brands - | Finfish, herring, Atlantic, kippered         | 'brand' 'atlant' 'averag' 'kipper' 'finfish'
   70742 | - Average All Brands - | Finfish, herring, Atlantic, cooked, dry heat | 'dri' 'cook' 'heat' 'brand' 'atlant' 'averag' 'finfish'
   66026 | German                 | Herring, Pickled: Rollmops                   | 'pickl' 'german' 'rollmop'
   66027 | German                 | Herring, Pickled: w/Sour Cream               | 'cream' 'pickl' 'german' 'w/sour'
(11 rows)

usa=# select food_id, brand,description,ftiidx from food_foods where ftiidx ## 'herring';
NOTICE:  Query contains only stopword(s) or doesn't contain lexem(s), ignored
 food_id | brand | description | ftiidx
---------+-------+-------------+--------
(0 rows)


 
 
 

contrib/tsearch

Post by Oleg Bartunov » Fri, 06 Sep 2002 18:49:07



> Hmmm...thinking about it, maybe 'herring' is being reduced to 'her' after
> the stemming process and hence is thought to be a stopword?  This is a bug,
> but how should it be fixed?

How to use stop words is a difficult question. We'll see what we can
do. Probably Porter's stemming algorithm has a problem here:
'herring' -> 'her'~'ring'
(I have a demo of an English-Russian stemmer, so you can play with it)
http://intra.astronet.ru/db/lingua/snowball/
I'll ask Martin Porter whether there could be an error in the stemmer.
But I think the problem is in the concept of using stop words.
Should we check for stop words before stemming or after?
In the first case we have to collect all forms of the stop words, which is doable
but difficult to maintain; in the latter we get the current problem.

It's time for beta1 and I'm not sure we can work on this issue
right now, but I feel big pressure from tsearch users :-)
If people want to help us, why not work on a stop word list that includes
all forms? In any case, we are not native English speakers, so don't expect us to
create a decent list. The programming changes are trivial; for the moment
we'll probably end up just using a compile-time option.
As always, your patches are welcome!

btw, you may test your queries much easier:

list=# select 'herring'::mquery_txt;
ERROR:  Your query contained only stopword(s), ignored
list=# select 'herring'::query_txt;
 query_txt
-----------
 'herring'
(1 row)

        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)

phone: +007(095)939-16-83, +007(095)939-23-83


 
 
 

contrib/tsearch

Post by Martin Porter » Sat, 07 Sep 2002 01:31:53


Oleg,

The Porter stemmer stems 'herring' and 'herrings' to 'her', which is a bit
unfortunate. A quick fix is to put 'herring/herrings' in the exception list
in the English (porter2) stemmer, but I'll look at this case over the next
few days and see if I can come up with something a bit better.
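
[Editor's sketch] The quick fix described above, an exception list consulted before the stemmer runs, could look roughly like the following. This is a minimal illustration under stated assumptions and not the Snowball implementation: the table `exceptions`, the function `stem_with_exceptions`, and the stand-in `toy_stem` are all hypothetical names, and `toy_stem` only mimics the reported 'herring' -> 'her' behaviour.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical exception entry: a word and the stem it is forced to. */
struct stem_exception
{
    const char *word;
    const char *stem;
};

/* Assumed table contents, taken from the two words named in the thread. */
static const struct stem_exception exceptions[] = {
    { "herring",  "herring" },
    { "herrings", "herring" },
};

/* Toy stand-in for the real stemmer. It is NOT the Porter algorithm;
 * it only reproduces the reported result for these two words. */
const char *toy_stem(const char *word)
{
    if (strcmp(word, "herring") == 0 || strcmp(word, "herrings") == 0)
        return "her";
    return word;
}

/* Look the word up in the exception table first; fall back to the
 * stemmer only when no exception matches. */
const char *stem_with_exceptions(const char *word)
{
    size_t i;

    for (i = 0; i < sizeof(exceptions) / sizeof(exceptions[0]); i++)
    {
        if (strcmp(word, exceptions[i].word) == 0)
            return exceptions[i].stem;
    }
    return toy_stem(word);
}
```

With this arrangement 'herrings' comes out as 'herring' rather than 'her', so a later stop-word check on the stem no longer matches the pronoun.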

Interesting that no one has reported this before.

Martin


 
 
 

contrib/tsearch

Post by Oleg Bartunov » Sat, 07 Sep 2002 02:12:53



> Oleg,

> The Porter stemming stems herring and herrings to her, which is a bit
> unfortunate. A quick fix is to put 'herring/herrings' in the exception list
> in the english (porter2) stemmer, but I'll look at this case over the next
> few days and see if I can come up with something a bit better.

Unfortunately, we wrote the tsearch module before the Snowball project started,
so we used an implementation we found on the net (www.muscat.com), and
it has no exception list. OpenFTS uses Snowball stemming, so we'd like
to have a fix. I think we have enough arguments for using the Snowball stemmers
in tsearch as well.

> Interesting that no one has reported this before.

:-) Thanks to Christopher for his persistence.

        Regards,
                Oleg

 
 
 

contrib/tsearch

Post by Christopher Kings-Lynne » Sat, 07 Sep 2002 13:20:02


> Looking at the list of stopwords you sent me, Oleg, there are only about 1
> out of the list of 120 stopwords that need to have all word forms added.  I
> also don't think it'll be a maintenance problem.  The reason I think this is
> because stopwords in general don't have different word forms.

Actually, it just occurred to me that stuff like:

will
won't
it
it's
where
where's

Will all have to be in the list, right?

Chris


 
 
 

contrib/tsearch

Post by Christopher Kings-Lynne » Sat, 07 Sep 2002 13:59:46


There also seems to be a more complete list of English stopwords here:

http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/

However this list again does not include contractions.  I can take this
list, check it and submit it to you Oleg, but do you want me to add
contractions?

eg. wasn't, isn't, it's, etc.?

Chris

 
 
 

contrib/tsearch

Post by Christopher Kings-Lynne » Sat, 07 Sep 2002 13:04:24


> Should we check for stop words before stemming or after ?

I think you should check before stemming.

> In the first case we have to collect all forms of stop-words which is doable
> but difficult to maintain, in latter - we'll have current problem.

Looking at the list of stopwords you sent me, Oleg, only about 1 of the
120 stopwords in the list needs to have all its word forms added.  I
also don't think it'll be a maintenance problem, because stopwords in
general don't have different word forms.

E.g. her, his, i, and, etc.  They don't have different forms.  In fact, the
_only_ entry in the stopword list that needs a different form is yourself /
yourselves.  Actually, according to dictionary.com 'ourself' is also a word;
'themself' isn't, though.  There are some others I'm not sure about:

'veri' - I assume this is the stemmed form of 'very', so why not just use 'very'?

So, why don't you change tsearch to check for stop words _before_ stemming?
I can give you a revised list of stopwords that haven't been stemmed, with
all forms of the words.

> It's time for beta1 and I'm not sure if we could work on this issue
> right now, but I feel a big pressure from tsearch users :-)
> If people want to help us why not to work on stop words list including
> all forms ? In any case, we are not native  english, so don't expect we'll
> create more or less decent list. Programming changes are trivial, probably
> we'll end for the moment just using compile time option.
> As always, your patches are welcome !

I'm happy to work on the list of stopwords for you, Oleg.  I agree this
might be a 7.4 thing though...

Chris


 
 
 

contrib/tsearch

Post by Oleg Bartunov » Sat, 07 Sep 2002 18:48:30



> > Looking at the list of stopwords you sent me, Oleg, there are only about 1
> > out of the list of 120 stopwords that need to have all word forms
> > added.  I
> > also don't think it'll be a maintenance problem.  The reason I
> > think this is
> > because stopwords in general don't have different word forms.

> Actually, it just occurred to me that stuff like:

> will
> won't
> it
> it's
> where
> where's

> Will all have to be in the list, right?

Right, see my previous message. Teodor is our main developer; he should be
back from vacation very soon, but he already has many assignments for
our main project. Is there a smart programmer who could help?

        Regards,
                Oleg

 
 
 

contrib/tsearch

Post by Oleg Bartunov » Sat, 07 Sep 2002 18:43:32



> There also seems to be a more complete list of english stopwords here:

> http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/

Chris, I think we have to separate the stop word list from the tsearch package and
supply just some defaults. The reason is to let the user decide what is
a stop word - different domains should have different stop words.
This is how OpenFTS works.
Also, we probably need to let the user decide when to check for stop words -
before or after stemming. I'm waiting for Martin's fix for the English stemmer,
and we'll probably switch to the Snowball stemmers, which are of better quality.

Damn, we wanted to do this and much more a bit later, because we're under
big pressure from our work. We'll see if we can manage our plans.

We certainly need developers to help us with full text searching and
ltree (it has a chance to support XML). We also need to work
on adding concurrency support to GiST.

So, I can't promise we'll work on tsearch right now, but we provide
makedict.pl so you can build a dictionary with a custom list of stop words.
Did you try it?

> However this list again does not include contractions.  I can take this
> list, check it and submit it to you Oleg, but do you want me to add
> contractions?

> eg. wasn't, isn't, it's, etc.?

Hmm, our parser isn't smart enough to handle them as a single word, so
it won't help:


wasn't
lexeme:wasn:1:Latin word
lexeme:':12:Space symbols
lexeme:t:1:Latin word

But you could always add 'wasn', 'isn' ... and 't', 's' to your list of
stop words and be happy. Hmm, we could probably enhance our parser to
handle such words too.

Anyway, most problems are just a question of time, which we don't have :-(
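
[Editor's sketch] The workaround above (letting the parser split a contraction at the apostrophe, then listing the fragments as stop words) can be illustrated as follows. This is a simplified model and not the tsearch parser: it assumes every non-letter is a separator, and the names `stop_fragments`, `is_stop_fragment`, `keep_non_stop_words`, and `MAX_WORD` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD 64

/* Assumed stop list: the fragments suggested in the message above. */
static const char *const stop_fragments[] = { "wasn", "isn", "t", "s" };

/* Return 1 if the word is one of the listed fragments, else 0. */
int is_stop_fragment(const char *word)
{
    size_t i;

    for (i = 0; i < sizeof(stop_fragments) / sizeof(stop_fragments[0]); i++)
    {
        if (strcmp(word, stop_fragments[i]) == 0)
            return 1;
    }
    return 0;
}

/* Split text into runs of letters (so "wasn't" becomes "wasn" and "t"),
 * lower-case each run, drop the stop fragments, and copy the remaining
 * words into out. Returns the number of words kept (at most max_out). */
size_t keep_non_stop_words(const char *text, char out[][MAX_WORD], size_t max_out)
{
    size_t n = 0;
    const char *p = text;

    while (*p != '\0')
    {
        char word[MAX_WORD];
        size_t len = 0;

        /* Skip separators, including the apostrophe. */
        while (*p != '\0' && !isalpha((unsigned char) *p))
            p++;

        /* Collect one run of letters; overlong runs are truncated. */
        while (*p != '\0' && isalpha((unsigned char) *p))
        {
            if (len < MAX_WORD - 1)
                word[len++] = (char) tolower((unsigned char) *p);
            p++;
        }
        word[len] = '\0';

        if (len > 0 && !is_stop_fragment(word) && n < max_out)
            strcpy(out[n++], word);
    }
    return n;
}
```

One cost of this approach is that a single-letter fragment such as 't' or 's' is dropped wherever it appears, not only when it follows an apostrophe.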

        Regards,
                Oleg

 
 
 

contrib/tsearch

Post by Oleg Bartunov » Sat, 07 Sep 2002 18:54:57



> > Should we check for stop words before stemming or after ?

> I think you should.

> > In the first case we have to collect all forms of stop-words
> > which is doable
> > but difficult to maintain, in latter - we'll have current problem.

> Looking at the list of stopwords you sent me, Oleg, there are only about 1
> out of the list of 120 stopwords that need to have all word forms added.  I
> also don't think it'll be a maintenance problem.  The reason I think this is
> because stopwords in general don't have different word forms.

> eg. her, his, i, and, etc.  They don't have different forms.  In fact, the
> _only_ word in the stopword list that needs a different form is yourself and
> yourselves.  Actually, according to dictionary.com 'ourself' is also a word.
> 'themself' isn't tho.  Some others I don't know about are:

> 'veri' - I assume this is stemmed 'very', so why not just use 'very'?

That's because we currently check for stop words after stemming, and
I think Porter's algorithm converts 'very' to 'veri' :-)

> So, why don't you change tsearch to check for stop words _before_ stemming?
> I can give you a list of revised stopwords that haven't been stemmed, with
> all forms of the words.

I agree that the English list is probably easy to maintain, but what about
other languages? We don't have any volunteers - you're the first one.

> > It's time for beta1 and I'm not sure if we could work on this issue
> > right now, but I feel a big pressure from tsearch users :-)
> > If people want to help us why not to work on stop words list including
> > all forms ? In any case, we are not native  english, so don't expect we'll
> > create more or less decent list. Programming changes are trivial, probably
> > we'll end for the moment just using compile time option.
> > As always, your patches are welcome !

> I'm happy to work on the list of stopwords for you, Oleg.  I agree this
> might be 7.4 thing though...

We could always keep updates separately on our page and in CVS.

        Regards,
                Oleg

 
 
 

contrib/tsearch

Post by Teodor Sigaev » Tue, 10 Sep 2002 23:28:23


> Should we check for stop words before stemming or after ?

The current implementation supports both variants. Look at the dictionary interface
definition in morph.c:

typedef struct
{
        char        localename[NAMEDATALEN];
        /* init dictionary */
        void       *(*init) (void);
        /* close dictionary */
        void        (*close) (void *);
        /* find in dictionary */
        char       *(*lemmatize) (void *, char *, int *);
        /* stop-word check on the original word */
        int         (*is_stoplemm) (void *, char *, int);
        /* stop-word check on the stemmed word */
        int         (*is_stemstoplemm) (void *, char *, int);
}       DICT;

The 'is_stoplemm' method is called before 'lemmatize', and 'is_stemstoplemm' after.
At the end of dict/porter_english.dct:
TABLE_DICT_START
         "C",
         setup_english_stemmer,
         closedown_english_stemmer,
         engstemming,
         NULL,
         is_stopengword
TABLE_DICT_END

dict/russian_stemming.dct:
TABLE_DICT_START
         "ru_RU.KOI8-R",
         NULL,
         NULL,
         ru_RUKOI8R_stem,
         ru_RUKOI8R_is_stopword,
         NULL
TABLE_DICT_END

So the English stemmer decides whether a lexeme is a stop word after stemming, but the Russian one decides before.
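
[Editor's sketch] The call order described above can be modelled with a cut-down version of the hooks. This is an illustration only and not the code in morph.c: `SketchDict` drops the state pointer and length arguments of the real DICT, and `process_word`, `toy_lemmatize`, `her_is_stop`, `check_after`, and `check_before` are hypothetical names. As in the .dct tables, a NULL hook means the check is skipped.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical version of the three DICT hooks. */
typedef struct
{
    const char *(*lemmatize) (const char *word);
    int         (*is_stoplemm) (const char *word);     /* before stemming */
    int         (*is_stemstoplemm) (const char *stem); /* after stemming */
} SketchDict;

/* Return NULL if the word is dropped as a stop word, else its lexeme. */
const char *process_word(const SketchDict *d, const char *word)
{
    const char *stem;

    if (d->is_stoplemm != NULL && d->is_stoplemm(word))
        return NULL;

    stem = (d->lemmatize != NULL) ? d->lemmatize(word) : word;

    if (d->is_stemstoplemm != NULL && d->is_stemstoplemm(stem))
        return NULL;

    return stem;
}

/* Toy stemmer that only mimics the reported 'herring' -> 'her' result. */
const char *toy_lemmatize(const char *word)
{
    return (strcmp(word, "herring") == 0) ? "her" : word;
}

/* Toy stop list containing only 'her'. */
int her_is_stop(const char *word)
{
    return strcmp(word, "her") == 0;
}

/* Same slot layout as the English table: stop check after stemming. */
const SketchDict check_after = { toy_lemmatize, NULL, her_is_stop };

/* Same slot layout as the Russian table: stop check before stemming. */
const SketchDict check_before = { toy_lemmatize, her_is_stop, NULL };
```

In this model `process_word(&check_after, "herring")` returns NULL, which matches the reported behaviour, while `check_before` keeps the word (still stemmed to 'her') and drops only a literal 'her'.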

--
Teodor Sigaev


 
 
 
