Full Text Density


Post by Jeremy H » Thu, 01 Nov 2001 05:07:24



We have a full text index that covers over 13 million
rows. We have change tracking enabled and it updates in
the background, which works wonderfully.

We have recently modified the stored procedure that
searches our database so that we can specify words that
are just too expensive to bother searching on; the
procedure simply skips them.

My question is this: is there a way to get a "report" or
something of that nature from the full-text indexer so we
can more easily identify which of these words has the most
hits? For example, we have a field in the full-text index
called company name; it would not make sense, and would be
a waste of resources, to search on the word "company" in
that field, as it would bring back far too many results
to be useful.

I know some of you might suggest adding these words to the
noise word file, and indeed we are, but I am trying to
find an easier way to identify these words.

Thank you in advance for your time!

Jeremy Hinkle
Austin-Tetra


Full Text Density

Post by Hilary Cotter » Thu, 01 Nov 2001 11:44:22


I suggest you limit your result set using the TOP operator with
CONTAINSTABLE or FREETEXTTABLE. This returns only a subset of the total
matches.
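
For example, something like this (the Companies table, its CompanyName
column, and the CompanyId key below are placeholder names of mine, not
from your schema):

SELECT TOP 100 c.CompanyId, c.CompanyName, k.[RANK]
FROM CONTAINSTABLE(Companies, CompanyName, 'widgets', 100) AS k
JOIN Companies AS c
    ON c.CompanyId = k.[KEY]    -- [KEY] is the table's full-text key column
ORDER BY k.[RANK] DESC

The fourth argument to CONTAINSTABLE (top_n_by_rank) caps the rows the
full-text engine returns before the join even starts, which is what saves
you on expensive words.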

There is no published way to get the data you are looking for.  What I
suggest is writing a Perl script that counts the words in your table and
orders them by occurrence.  I have done this, and my noise word list is now
in excess of 14,000 words.  Indexing takes just as long as before, but I
see dramatic increases in query speed (especially with FREETEXT), and the
catalog size is significantly smaller.

Here is the Perl script that I use.  I run it in a directory where I have
spat all the rows out, row by row, into separate text files (one text file
per row).  There is no real reason why you couldn't push everything into a
single text file and run the word_count.pl script against that instead;
with 14,000,000 rows, one file per row leaves you with 14,000,000 files to
manage.

#make sure you run this inside the directory containing your files.

foreach $file (<*>) {                  # loop over every file in the directory
    print $file."\n";
    open(FILE, $file) or next;         # skip anything we cannot open
    while (<FILE>) {
        s/-\n//g;                      # rejoin words hyphenated across a line break
        tr/A-Z/a-z/;                   # fold everything to lower case
        foreach $word (split(/\W+/)) {
            $wordcount{$word}++ if length($word);   # count the words
        }
    }
    close(FILE);
}

# print out the totals sorted by occurrence

open(FILE, ">c:\\out.out");
foreach $word (sort by_count keys(%wordcount)) {
    printf "%8d\t\t%s\n", $wordcount{$word}, $word;
    printf FILE "%8d\t\t%s\n", $wordcount{$word}, $word;
}
close(FILE);

sub by_count
{
    $wordcount{$b} <=> $wordcount{$a};
}

You will end up with a file named out.out in the root of your C: drive
that will look something like this (snipped from a sample run):

    3134    the
    1779    sql
    1699    server
    1187    odbc
     998    driver
     826    version
     703    to
     676    with
     626    of
     618    or
     583    in
     497    3.7
     480    is
     453    occurs
     453    times
     431    a
     362    statement
     333    for

As you can see, a couple of good candidates are the, to, with, of, or, in,
3.7, is, occurs, times, a, statement, for, etc.  Most of these are already
in the noise word list that ships with SQL Server.  Before you add words to
your noise files, you must consider their usefulness in a search.  Normally
prepositions, adjectives, and pronouns can be safely removed without much
consideration.  Nouns should be left in, unless the noun is your own
company name.  Product names should never be removed (otherwise how will
your customers find your products?).  Self-reflexive verbs can go (walk,
wash, run, feed, etc.), but make careful decisions about most other verbs,
e.g. install, remove, take, etc.  SQL, server, ODBC, driver, and version
are all probably words that people would search on.

One problem with this script is that it is not very intelligent: run, runs,
and ran are all counted as separate words.  You might also want to print
out an alphabetical list to spot these word forms.  To do this, change the
line

foreach $word (sort by_count keys(%wordcount)) {

to

foreach $word (sort keys(%wordcount)) {

You might want to take the generated word list, remove the words you think
are significant, and use the remainder as your noise word list.

Be careful, though, about removing too many words from your noise word
list: performance degrades and the index swells in proportion to the number
of words you remove from the list.

Your noise word lists can be found in C:\Program Files\Microsoft SQL
Server\MSSQL\FTData\SQLServer\Config.  You have to stop the MSSearch
service to make changes to your noise word lists, and you should rebuild
your full-text catalogs after making the changes.  On Win2K systems you
must edit all the noise files: the ones in the SQL Server directory as well
as the ones in c:\winnt\system32.
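
A minimal sketch of the rebuild step, run from Query Analyzer once MSSearch
is back up (the catalog name YourCatalog is a placeholder):

EXEC sp_fulltext_catalog 'YourCatalog', 'rebuild'      -- drop and recreate the catalog
EXEC sp_fulltext_catalog 'YourCatalog', 'start_full'   -- then kick off a full population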

--
Hilary
www.iisfaq.com


FILLFACTOR (SCAN DENSITY VS. AVERAGE PAGE DENSITY)

Scenario:
I created a clustered index and nonclustered indexes on a table with
5,953,675 rows, using a fillfactor of 95 for the clustered index and 98 for
the nonclustered indexes.  After 82,119 inserts (roughly 1% of the table),
the nonclustered indexes showed a scan density of 83, logical fragmentation
of 4, and an average page density of 96; the clustered index showed a scan
density of 96, logical fragmentation of 1, and an average page density of
88.  I think I should have used a fillfactor of 96 for both the
nonclustered indexes and the clustered index.  The table and indexes span
multiple files.  I am confused about the trade-off between scan density and
average page density with nightly inserts.  I am in an OLAP (data
warehouse) environment and have other tables with similar scenarios
(roughly one percent of the table inserted nightly).  I am open to
anything.  Thank you.
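
For what it's worth, those numbers look like DBCC SHOWCONTIG output; a
minimal sketch of the measure-and-rebuild cycle (the table and index names
here are placeholders):

-- report Scan Density, Logical Scan Fragmentation and Avg. Page Density
DBCC SHOWCONTIG ('BigFactTable', 'IX_BigFactTable_NC1')

-- rebuild the index with the new fillfactor
DBCC DBREINDEX ('BigFactTable', 'IX_BigFactTable_NC1', 96)

Rerunning DBCC SHOWCONTIG after the next nightly load shows whether the
extra free space absorbed the inserts without page splits.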

2. Large XML files into SQL

3. Scheduled full-text incremental population for full-text catalog

4. rescue table help

5. SQLMail Error..

6. Full-text catalog catalog count is 0 event after full-populate

7. Connecting to Access 2000/97

8. Want to full populate full text search through trigger

9. Full-text: how to speed up full population

10. Full-text full-population running slow

11. Search over Text, Is full-text only solution?

12. Full Text Indexing of Rich Text/ Word Documents