Indexing technique needed for large table lookup

Indexing technique needed for large table lookup

Post by David Daniel » Thu, 20 Jul 2000 04:00:00



Problem. Large customer table (400,000) records that has a 40 character
'lookup/match string'. I want to be able to lookup this customer by
'ANY' of the 'words/characters' in the match-string. Does anyone know of
an indexing technique of putting multiple indexes of individual words
within 1 field???
(I have created a 2nd subordinate customer file by parsing the
customer-search string and making custnum and multiple 1-word cust-names
as an index and 'joinging' to the master file but i am looking for a
solution without creating additional tables).

 
 
 

Indexing technique needed for large table lookup

Post by Stephen Russel » Fri, 21 Jul 2000 04:00:00


Bad Idea. There is a 3rd party product.  Try PHdBase to index this field and
give you hits.  It's 300 +

__Stephen


Quote:> Problem. Large customer table (400,000) records that has a 40 character
> 'lookup/match string'. I want to be able to lookup this customer by
> 'ANY' of the 'words/characters' in the match-string. Does anyone know of
> an indexing technique of putting multiple indexes of individual words
> within 1 field???
> (I have created a 2nd subordinate customer file by parsing the
> customer-search string and making custnum and multiple 1-word cust-names
> as an index and 'joinging' to the master file but i am looking for a
> solution without creating additional tables).


 
 
 

Indexing technique needed for large table lookup

Post by Glenn Nava » Tue, 25 Jul 2000 04:00:00


David, i understand your problem since in our audit projects, we usually
have fields like a check description field where it would have a p.o.
number, invoice number, department no, etc.

like u, we need a way to search this field for values whether it's an
invoice number or po number. we do this because in auditing, we need to know
if an overpayment on an invoice has been adjusted already by our client.
otherwise, we have  a claim.

i researched the problem months ago and concluded there's no way u could do
it in VFP.

but we found a workaround by exporting the concerned tables to SQL Server
which has a text indexing feature.

then we just use excel or vfp or access to search for it in SQL Server.

hope this helps...

Glenn Naval


Quote:> Problem. Large customer table (400,000) records that has a 40 character
> 'lookup/match string'. I want to be able to lookup this customer by
> 'ANY' of the 'words/characters' in the match-string. Does anyone know of
> an indexing technique of putting multiple indexes of individual words
> within 1 field???
> (I have created a 2nd subordinate customer file by parsing the
> customer-search string and making custnum and multiple 1-word cust-names
> as an index and 'joinging' to the master file but i am looking for a
> solution without creating additional tables).

 
 
 

Indexing technique needed for large table lookup

Post by Reind Brackma » Thu, 27 Jul 2000 04:00:00


It seems there are indeed some very good 3d party indexers (Ive heard
of...).
I can't really help you with 3d party stuff, but I can remember that I did
some optimization
within foxpro (a long time ago) with pretty good results.
To see if the approach fit:

A) What is the problem with your current approach? (Speed, Size, Indexing
procedure?)
B) Howmany words are there maximum in the 40-character field?
C) Any idea howmany different words alltogether? (probably less then
400.000?)
D) Can you change the original table (add flag, triggers etc) or is it
untouchable for you?
E) is adding tables impossible, or just out of conveniance...

Reind


Quote:

> Problem. Large customer table (400,000) records that has a 40 character
> 'lookup/match string'. I want to be able to lookup this customer by
> 'ANY' of the 'words/characters' in the match-string. Does anyone know of
> an indexing technique of putting multiple indexes of individual words
> within 1 field???
> (I have created a 2nd subordinate customer file by parsing the
> customer-search string and making custnum and multiple 1-word cust-names
> as an index and 'joinging' to the master file but i am looking for a
> solution without creating additional tables).

 
 
 

Indexing technique needed for large table lookup

Post by Reind Brackma » Thu, 27 Jul 2000 04:00:00


One more question.

Do you mean by ANY of the words that you want also to search for multiple
words (B)?
Or do you just search for 1 word, and want outcomes of records containing
the word in the field regardless it's position (A)?
Or alternatively, looking at a search string as to individual searches (C).
Or even enabling sentences (D)

So, do you want to search for:

A)        'Smith John'            Outcome:        'John Smith'
&& both match
                                                                'Smith
Albert John'         && both match
                                                                'Jim John
Smith'

B)        'Smith'                    Outcome:        'John Smith'
&& just one searchword
                                                                'Smith
Albert John'
                                                                'Carla
Smith'
                                                                'Smith'
                                                                'Jim John
Smith'

C)    'Smith John'                Outcome:        'John Smith'
&& multiple searchwords, any match
                                                                'Smith
Albert John'
                                                                'Carla
Smith'
                                                                'Smith'
                                                                'John'
                                                                'Jim John
Smith'

D)    'Smith John'                Outcome:        'John Smith'
&& multiple searchwords, any match
                                                                'Jim John
Smith'

The approach depends on above.


Quote:

> Problem. Large customer table (400,000) records that has a 40 character
> 'lookup/match string'. I want to be able to lookup this customer by
> 'ANY' of the 'words/characters' in the match-string. Does anyone know of
> an indexing technique of putting multiple indexes of individual words
> within 1 field???
> (I have created a 2nd subordinate customer file by parsing the
> customer-search string and making custnum and multiple 1-word cust-names
> as an index and 'joinging' to the master file but i am looking for a
> solution without creating additional tables).

 
 
 

Indexing technique needed for large table lookup

Post by David Daniel » Sat, 29 Jul 2000 04:00:00


Reind,
Here is the answer to both of your responses.
1)B. We want to match only 1 word when searching. ie) "smith" will find
"John Smith", "Fred Smith", and ideally but not mandatory to also find
"Joe Smithton".

2) A: the problem with our current approach without the second table is
SPEED, Problem with the 2nd table is 'maintenance'. We wrote the
front-end of a POS(point-of-sale) system and all file maintenance is
done via COMMS (that we didn't write). We have a nightly maintenance
program the rebuilds the 2nd table with correct custnum&search-words. We
could add an insert/update trigger to do this but that is not desirable.
B.Maximum words may be 5 or so.
C. I don't know how many differnt words but I think < 400,000.
D. We could add a trigger and even add flags if we pressure our clients
MIS department.
E. Adding tables is possible but just inconvienient.

David

-----Original Message-----

It seems there are indeed some very good 3d party indexers (Ive heard
of...).
I can't really help you with 3d party stuff, but I can remember that I
did
some optimization
within foxpro (a long time ago) with pretty good results.
To see if the approach fit:

A) What is the problem with your current approach? (Speed, Size,
Indexing
procedure?)
B) Howmany words are there maximum in the 40-character field?
C) Any idea howmany different words alltogether? (probably less then
400.000?)
D) Can you change the original table (add flag, triggers etc) or is it
untouchable for you?
E) is adding tables impossible, or just out of conveniance...

Reind



.

> Problem. Large customer table (400,000) records that has a 40
character
> 'lookup/match string'. I want to be able to lookup this customer by
> 'ANY' of the 'words/characters' in the match-string. Does anyone know
of
> an indexing technique of putting multiple indexes of individual words
> within 1 field???
> (I have created a 2nd subordinate customer file by parsing the
> customer-search string and making custnum and multiple 1-word
cust-names
> as an index and 'joinging' to the master file but i am looking for a
> solution without creating additional tables).

 
 
 

Indexing technique needed for large table lookup

Post by Reind Brackma » Tue, 01 Aug 2000 04:00:00


Hmm...

I tried to find my old code over the weekend, couldn't find it (> 10 yr
old).
Anyway, I remember doing different things, amongst which:

1) Indexing based on words,
2) Doing periodically a rescan of the full index.

I    - Building lookup-table based on words.
        ---------------------------------------
        The idea was that the user had direct respons when typing a
character (knowing if there are hits).
        This approach brought good results, though I didn't try with so many
records as you work with.
        + Relatively fast respons on finding the word,
        + Showing the user quickly the number of hits.
        - Maybe slower in generating the final results
        +/- Order in results where in ascending/descending order of entry
(recent changes show first)

        'A' = Main table, having 400.000 records, thus approx 400.000 * 5
words.
        'B' = Lookup table, holding much less words, probably no more than
20.000. (just a guess)
        Table 'B' is the lookup-table.
        Table 'B' had just one field for words and a MEMO holding the
pointers (unique tag) of table 'A'.
        Typically it looked as follows:

        CWORD               MWHERE
NHITS
        Smile                     ,3024,2055,3569,55643,                   4
        Smith                     ,4045,33789,9076,2246,33784,        5

        The lookup table was indexed on CWORD (obviously.)
        The MEMO holds the pointers seperated with a seperator-character,
making AT(), SUBSTR() and STRTRAN() able to         find/parse the pointer.
        You can parse the pointers two ways, left-to-right, in order of
'oldness', or right-to-left, (recent changes first).
        The NHITS are updated whenever a record-pointer is added, and on a
monthly refresh, done with OCCURS()
        Table B was updated when table A was updated (now you would do it
with the update/insert triggers), parsing new
        words and/or pointers into the B table.
        (This is a 2-step action, you have to take care of 'removing
pointers when words disappear from a string).

        Later, I added to this some more MEMO fields, to show in different
orders.
        This was done by a programm that ran every month. It would reindex
the table A, putting all pointers in concequent fields         based on the
desired order:

        CWORD               M_DATEORDER                M_ALPHAORDER
M_NEWENTRIES        NHITS
        Smile                     ,3024,2055,3569,55643,       blablabla
,400001,                         5
        Smith                     ,4045,33789,9076,2246        blablabla
4

II - Building a lookup table based on phonetic.

Later, I experimented with bringing this a step further by adding SOUNDEX()
field. It eliminated typing errors during lookup by the user a bit, I
suppose SOUNDEX() works better in English than in Dutch...

'LEARNING WORDS':

In general, I found that building a 'tree' of lookup activities speeds up
performance and makes the search more effective.
From above, you see that I took the words itselve to work from, since the
amount of words appeared to be much less than the amounts of records.
(When I started to experiment with speech-interfaces, I came accross similar
techniques.)

Therefore, it may be an idea to split up words even further than words.
If you find you have about 20.000 words, it may be worthwhile trying to
split the lookup-table up again and add yet another table that holds
'particles' of words.
Since you'll find that when you have a lot of words, you also have a lot of
'longer' words that can look like 'AUTOMOBILE'.
This word could be split up in 'AUTO' and 'MOBILE'.

This is why I experimented also with my parser that would maintain the
lookup table (adding new words).
I never implemented it, but remember doing following:

IV - Splitting words into discrete particles:

When a new word needed to be added, and the length of the word was bigger
then 5 characters, the word-table (B) was first searched for matching
records with UNIQUE=OFF.
If a word was found, the rest of the long word was searched again. (Using
LEN(ALLTRIM()) and SUBSTR() to find the
rest of the word)
If also this brought a hit, the word was assumed to be typically a 'combined
word'

So, when to table A (your big database) a string is added holding
'AUTOMOBILE', the parser would during the index-activity find:
1)    AUTO.   && then looking for the reminder of the word, being 'MOBILE'
2)     If 'MOBILE' was found also, the word AUTOMOBILE was added to table
'B', but also added in discrete particles
        to a new table 'C'

        CPARTICLE              CMAINWORD
NWHERE_IN_WORDBASE (TABLE 'B')
        AUTO                        AUTOMOBILE                    1003
        MOBILE                    AUTOMOBILE                    1003

The 'split' could split upto more than 2 particles. Like in 'GASWORKSTATION'
(we have in dutch quite some crazy
long words like "Rioolwaterzuiveringsinstallatie" allthough the Germans beat
us...)

Any way, the user can search for 'AUTOMOBILE', finding the records directly
from table B,
the user can also look for 'MOBILE', finding all 'MOBILE' directly from
table B, but also table C would be searched
and hits on 'AUTOMOBILE' would be found as well.

The neet thing about this approach was that the system 'learns' how to split
words as they go along.
AUTOMOBILE is split due to the fact that both 'MOBILE' and AUTO appears to
be a valid words.
Both AUTO and MOBILE are valid particle words due to the fact that they
exist as seperate words in table A.

During this process, a word like 'THEATRE' would only be split in 'THE' and
'ATRE' if 'ATRE' is a word that appears to exist as a 'stand-alone'word in
one of the other records. (By the way, I bothered on words >= 5 characters
only, so this word woudn't be split anyhow...)
A word as 'THEATRECITY' would obviously be split in THEATRE and CITY.

Note: If your write your table smart, both B en C could be contained within
the same table. Adding some field to indicate
it is a 'B' or 'C'-type record.

IV - Showing fast results.

The nice way to do things is to search on every character's press.
The search therefore took place after every character typed.

e.g.  User typed:            MO..
--> doing a seek for MO.. in table B, and C,  plus a SCAN to sum number of
hits.
if number of hits exceed more then 1000, abort the scan. and do nothing
further
---> doing a seek for MO... in table C. (When 'B' and 'C' is the same table,
the search is done in one go).

Hope it helps.
It'll give you some days to experiment I suppose.

Reind

David Daniels <d...@mainstreetcomp.com> wrote in message

news:38F675F48B36D4118E4D00508B5EBA37CD6E44@cpmsftmsgv21.microsoft.com...

> Reind,
> Here is the answer to both of your responses.
> 1)B. We want to match only 1 word when searching. ie) "smith" will find
> "John Smith", "Fred Smith", and ideally but not mandatory to also find
> "Joe Smithton".

> 2) A: the problem with our current approach without the second table is
> SPEED, Problem with the 2nd table is 'maintenance'. We wrote the
> front-end of a POS(point-of-sale) system and all file maintenance is
> done via COMMS (that we didn't write). We have a nightly maintenance
> program the rebuilds the 2nd table with correct custnum&search-words. We
> could add an insert/update trigger to do this but that is not desirable.
> B.Maximum words may be 5 or so.
> C. I don't know how many differnt words but I think < 400,000.
> D. We could add a trigger and even add flags if we pressure our clients
> MIS department.
> E. Adding tables is possible but just inconvienient.

> David

> -----Original Message-----
> It seems there are indeed some very good 3d party indexers (Ive heard
> of...).
> I can't really help you with 3d party stuff, but I can remember that I
> did
> some optimization
> within foxpro (a long time ago) with pretty good results.
> To see if the approach fit:

> A) What is the problem with your current approach? (Speed, Size,
> Indexing
> procedure?)
> B) Howmany words are there maximum in the 40-character field?
> C) Any idea howmany different words alltogether? (probably less then
> 400.000?)
> D) Can you change the original table (add flag, triggers etc) or is it
> untouchable for you?
> E) is adding tables impossible, or just out of conveniance...

> Reind

> David Daniels <d...@mainstreetcomp.com> wrote in message
> news:38F675F48B36D4118E4D00508B5EBA378CF82C@cpmsftmsgv21.microsoft.com..
> .

> > Problem. Large customer table (400,000) records that has a 40
> character
> > 'lookup/match string'. I want to be able to lookup this customer by
> > 'ANY' of the 'words/characters' in the match-string. Does anyone know
> of
> > an indexing technique of putting multiple indexes of individual words
> > within 1 field???
> > (I have created a 2nd subordinate customer file by parsing the
> > customer-search string and making custnum and multiple 1-word
> cust-names
> > as an index and 'joinging' to the master file but i am looking for a
> > solution without creating additional tables).

 
 
 

Indexing technique needed for large table lookup

Post by Reind Brackma » Tue, 01 Aug 2000 04:00:00


A couple of more things.

-    I think you will find yourselve adding files anyhow, so why not a
table.
-    Your example does not require splitting words in particles.
      (In my last letter I stated 'Exclusive, this obviously must be Exact'
and SET NEAR ON)
      Seek on 'Smith' with SET NEAR ON will find Smithy

Good luck.
Reind


> Reind,
> Here is the answer to both of your responses.
> 1)B. We want to match only 1 word when searching. ie) "smith" will find
> "John Smith", "Fred Smith", and ideally but not mandatory to also find
> "Joe Smithton".

> 2) A: the problem with our current approach without the second table is
> SPEED, Problem with the 2nd table is 'maintenance'. We wrote the
> front-end of a POS(point-of-sale) system and all file maintenance is
> done via COMMS (that we didn't write). We have a nightly maintenance
> program the rebuilds the 2nd table with correct custnum&search-words. We
> could add an insert/update trigger to do this but that is not desirable.
> B.Maximum words may be 5 or so.
> C. I don't know how many differnt words but I think < 400,000.
> D. We could add a trigger and even add flags if we pressure our clients
> MIS department.
> E. Adding tables is possible but just inconvienient.

> David

> -----Original Message-----
> It seems there are indeed some very good 3d party indexers (Ive heard
> of...).
> I can't really help you with 3d party stuff, but I can remember that I
> did
> some optimization
> within foxpro (a long time ago) with pretty good results.
> To see if the approach fit:

> A) What is the problem with your current approach? (Speed, Size,
> Indexing
> procedure?)
> B) Howmany words are there maximum in the 40-character field?
> C) Any idea howmany different words alltogether? (probably less then
> 400.000?)
> D) Can you change the original table (add flag, triggers etc) or is it
> untouchable for you?
> E) is adding tables impossible, or just out of conveniance...

> Reind



> .

> > Problem. Large customer table (400,000) records that has a 40
> character
> > 'lookup/match string'. I want to be able to lookup this customer by
> > 'ANY' of the 'words/characters' in the match-string. Does anyone know
> of
> > an indexing technique of putting multiple indexes of individual words
> > within 1 field???
> > (I have created a 2nd subordinate customer file by parsing the
> > customer-search string and making custnum and multiple 1-word
> cust-names
> > as an index and 'joinging' to the master file but i am looking for a
> > solution without creating additional tables).

 
 
 

Indexing technique needed for large table lookup

Post by Anders Altber » Sat, 05 Aug 2000 04:00:00


Hi
Using standard SQL you SELECT lookupfield FROM table ;
WHERE UPPER(lookupfield) LIKE '%SMITH%'
or
LOCATE FOR LIKE("*SMITH*,UPPER(lookup))
? recno()
DO WHILE FOUND()
  CONTINUE
    IF FOUND()
        ? RECNO()
    ENDIF
ENDDO

You could also try VFP's fast text searching

COPY lookupfield TO temp.txt TYPE SDF
CREATE CURSOR Temp (lookfor M)
APPEND BLANK
APPEND MEMO lookfor FROM temp.txt
SET MEMOWIDTH TO 100
rowlength=40
s = 'SMITH'
n = OCCURS(M.s, UPPER(Temp.lookfor))
IF M.n > 0
DIMENSION aRows(M.n,1)
FOR i = 1 TO M.n
    aRows(M.i,1) = CEILING(ATC(M.s, Temp.lookfor)/M.rowlength)
NEXT
ENDIF

The array will contain the record numbers of rows with 'SMITH, 'Smith'
,'Smithsonian'
Instead of a  memo you could use a big string created with FILETOSTR()
SELECT lookupfield FROM <Table> TO FILE temp.txt
string = FileToStr('temp.txt')
You can SELECT UPPER(..) and then use AT instead of ATC for a possible
marginal difference.
-Anders


Quote:> Problem. Large customer table (400,000) records that has a 40 character
> 'lookup/match string'. I want to be able to lookup this customer by
> 'ANY' of the 'words/characters' in the match-string. Does anyone know of
> an indexing technique of putting multiple indexes of individual words
> within 1 field???
> (I have created a 2nd subordinate customer file by parsing the
> customer-search string and making custnum and multiple 1-word cust-names
> as an index and 'joinging' to the master file but i am looking for a
> solution without creating additional tables).

 
 
 

1. indexing a large, non-indexed table

We've got a table that quickly filled up with over 30 million rows
before we realized that our software installers neglected to index the
table. Now they tell us we have to index it.

I assume it will either not work or will take too long to just flip
the index in Enterprise Manager.

Any suggestions?

My plan barring any clearer suggestions will be to run hundreds of
"insert into select from" statements by selecting ids between 10 and
50 first, then moving on to 51-100 etc. But this is not a great
solution - since the table is not currenlty indexed, any where clause
takes a long time to process.

I would appreciate any cleaner solutions any of you may have.

Thanks,
Dave

2. PooledConnection and Connection question

3. Lookup Tables to Lookup Tables

4. FKs and indexes

5. Getting Good Performance On Large Lookup Tables

6. Help, I need a SQL Server DBA

7. need temp table technique

8. CLARIS SDK HELP PLEASE

9. name/value pairs in a table, need help with reporting techniques

10. Use IDENTITY index on Lookup Table?

11. fastest way to lookup an indexed table entry???

12. Lookup Field in a Table, add an index

13. Table lookup sort by sec index.