I tried to find my old code over the weekend, but couldn't find it (it's > 10 yrs old).
Anyway, I remember doing different things, amongst which:
1) Indexing based on words,
2) Periodically doing a rescan of the full index.
I - Building a lookup table based on words.
The idea was that the user got a direct response when typing a
character (knowing immediately whether there are hits).
This approach brought good results, though I didn't try it with as many
records as you work with.
+ Relatively fast response on finding the word,
+ Showing the user quickly the number of hits.
- Maybe slower in generating the final results.
+/- Order in results was in ascending/descending order of entry
(recent changes show first).
'A' = Main table, having 400,000 records, thus approx 400,000 * 5 words to index.
'B' = Lookup table, holding far fewer words, probably no more than
20,000 (just a guess).
Table 'B' is the lookup table. It had just one field for the word and a
MEMO holding the pointers (the unique tag) into table 'A'.
Typically it looked as follows:

CWORD   (MEMO with pointers)           NHITS
Smile   ,3024,2055,3569,55643,         4
Smith   ,4045,33789,9076,2246,33784,   5
The lookup table was indexed on CWORD (obviously.)
The MEMO holds the pointers separated by a separator character, so
AT(), SUBSTR() and STRTRAN() can find/parse the pointers.
You can parse the pointers two ways: left-to-right, in order of
'oldness', or right-to-left (recent changes first).
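The mechanics of such a MEMO-based lookup table can be sketched in Python rather than the original FoxPro; the separator character and the dictionary layout are my assumptions, and `str.count()` stands in for OCCURS():

```python
# Sketch of the word lookup table: each word maps to a separator-delimited
# string of record pointers into the main table, plus a hit count (NHITS).
SEP = ","  # assumed separator character

lookup = {
    "SMILE": {"pointers": ",3024,2055,3569,55643,", "nhits": 4},
    "SMITH": {"pointers": ",4045,33789,9076,2246,33784,", "nhits": 5},
}

def parse_pointers(memo, recent_first=False):
    """Split the MEMO into record pointers.
    Reading right-to-left gives the most recent changes first."""
    ptrs = [p for p in memo.split(SEP) if p]
    return ptrs[::-1] if recent_first else ptrs

def count_pointers(memo):
    """Recompute NHITS from the MEMO itself, like OCCURS() on the separator."""
    return memo.count(SEP) - 1

print(parse_pointers(lookup["SMITH"]["pointers"]))
print(parse_pointers(lookup["SMITH"]["pointers"], recent_first=True))
```

The trailing and leading separators (as in the original example rows) make every pointer searchable as `,3024,`, so a substring search never matches a partial pointer.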
NHITS is updated whenever a record pointer is added, and on a
monthly refresh, done with OCCURS().
Table B was updated whenever table A was updated (nowadays you would do
it with insert/update triggers), parsing new words and/or pointers into
the B table.
(This is a 2-step action: you also have to take care of removing
pointers when words disappear from a string.)
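That 2-step action (remove pointers for words that disappeared, add pointers for words that appeared) might look like this in outline; the whitespace tokenizer and the word-to-pointer-set layout are simplified assumptions:

```python
def update_lookup(lookup, recno, old_string, new_string):
    """Keep the word -> set-of-record-pointers lookup in sync
    when one record's search string changes."""
    old_words = set(old_string.upper().split())
    new_words = set(new_string.upper().split())
    # Step 1: remove this record's pointer from words that disappeared
    for word in old_words - new_words:
        lookup[word].discard(recno)
        if not lookup[word]:
            del lookup[word]  # word no longer occurs in any record
    # Step 2: add this record's pointer to words that are new in the string
    for word in new_words - old_words:
        lookup.setdefault(word, set()).add(recno)

lookup = {"JOHN": {1}, "SMITH": {1}}
update_lookup(lookup, 1, "John Smith", "John Smythe")
print(lookup)  # {'JOHN': {1}, 'SMYTHE': {1}}
```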
Later, I added some more MEMO fields to this, to show the results in
different orders. This was done by a program that ran every month. It
would reindex table A, putting all pointers in the corresponding fields
based on the order wanted:

CWORD   M_DATEORDER              M_ALPHAORDER
Smile   ,3024,2055,3569,55643,   blablabla
Smith   ,4045,33789,9076,2246,   blablabla
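The monthly rebuild of those per-order MEMO fields can be sketched like this; the field names M_DATEORDER and M_ALPHAORDER come from the text above, but the concrete sort keys (entry date, customer name) are my assumptions:

```python
def rebuild_order_memos(postings):
    """postings: word -> list of (recno, entry_date, name) tuples.
    Returns word -> dict with the same pointer list pre-sorted in
    more than one order, stored as separator-delimited MEMO strings."""
    memos = {}
    for word, hits in postings.items():
        by_date = sorted(hits, key=lambda h: h[1])   # oldest entry first
        by_alpha = sorted(hits, key=lambda h: h[2])  # alphabetic on name
        memos[word] = {
            "M_DATEORDER": "," + ",".join(str(h[0]) for h in by_date) + ",",
            "M_ALPHAORDER": "," + ",".join(str(h[0]) for h in by_alpha) + ",",
        }
    return memos

postings = {"SMITH": [(4045, "1999-01-02", "Anna"), (2246, "1998-05-01", "Fred")]}
print(rebuild_order_memos(postings)["SMITH"])
```

The point of the monthly batch is that displaying results in any supported order becomes a straight read of one MEMO, with no sorting at lookup time.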
II - Building a lookup table based on phonetics.
Later, I experimented with taking this a step further by adding a SOUNDEX()
field. It eliminated some of the user's typing errors during lookup; I
suppose SOUNDEX() works better in English than in Dutch...
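FoxPro's SOUNDEX() implements the classic American Soundex code (first letter plus three digits). A minimal Python equivalent for experimenting with the idea, following the commonly documented rules for vowels and for H/W:

```python
def soundex(word):
    """Classic American Soundex: first letter + up to three digit codes."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    word = word.upper()
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":          # H and W do not separate equal codes
            continue
        if ch in "AEIOUY":      # vowels separate equal codes but emit nothing
            prev = ""
            continue
        code = codes.get(ch, "")
        if code and code != prev:
            result += code      # skip runs of the same code
        prev = code
    return (result + "000")[:4]

print(soundex("Smith"), soundex("Smyth"))  # S530 S530
```

Because "Smith" and "Smyth" collapse to the same code, a Soundex column in the lookup table absorbs exactly the spelling variations that defeat a plain word index.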
In general, I found that building a 'tree' of lookup activities speeds up
performance and makes the search more effective.
From the above, you see that I took the words themselves to work from, since
the number of words appeared to be much smaller than the number of records.
(When I started to experiment with speech interfaces, I came across similar
ideas.)
Therefore, it may be an idea to split the words up even further.
If you find you have about 20,000 words, it may be worthwhile trying to
split the lookup table up again and add yet another table that holds
'particles' of words.
You'll find that when you have a lot of words, you also have a lot of
'longer' words, like 'AUTOMOBILE'.
This word could be split up into 'AUTO' and 'MOBILE'.
This is why I also experimented with my parser that maintained the
lookup table (adding new words).
I never implemented it fully, but remember doing the following:
III - Splitting words into discrete particles.
When a new word needed to be added, and the length of the word was bigger
than 5 characters, the word table (B) was first searched for matching
records with UNIQUE=OFF.
If a word was found, the rest of the long word was searched again (using
LEN(ALLTRIM()) and SUBSTR() to find the rest of the word).
If this also brought a hit, the word was assumed to be a typically
'combined' word.
So, when a string holding 'AUTOMOBILE' is added to table A (your big
database), the parser would, during the index activity, find:
1) AUTO && then look for the remainder of the word, being 'MOBILE'
2) If 'MOBILE' was found as well, the word AUTOMOBILE was added to table
'B', but also added in discrete particles to a new table 'C':
(particle)   (word)        NWHERE_IN_WORDBASE (TABLE 'B')
AUTO         AUTOMOBILE    1003
MOBILE       AUTOMOBILE    1003
The 'split' could go up to more than 2 particles, like in 'GASWORKSTATION'.
(We have in Dutch quite some crazy long words like
"Rioolwaterzuiveringsinstallatie", although the Germans beat us there.)
Anyway, the user can search for 'AUTOMOBILE', finding the records directly
from table B.
The user can also look for 'MOBILE', finding all 'MOBILE' records directly
from table B; but table C would also be searched,
and hits on 'AUTOMOBILE' would be found as well.
The neat thing about this approach was that the system 'learns' how to split
words as it goes along.
AUTOMOBILE is split due to the fact that both 'AUTO' and 'MOBILE' appear to
be valid words.
Both AUTO and MOBILE are valid particle words because they exist as
separate words in table A.
During this process, a word like 'THEATRE' would only be split into 'THE' and
'ATRE' if 'ATRE' appears to exist as a stand-alone word in
one of the other records. (By the way, I only bothered with words >= 5
characters, so this word wouldn't be split anyhow...)
A word such as 'THEATRECITY' would obviously be split into THEATRE and CITY.
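The splitting rule described above (only split long words, and only into particles that already exist as words) can be sketched recursively; the function name and the minimum-length parameter are illustrative choices, not the original implementation:

```python
def split_compound(word, known, min_len=6):
    """Split `word` into 2 or more particles that all exist in `known`.
    Only words longer than 5 characters are considered, as in the text.
    Returns the particle list, or None if no full decomposition exists."""
    if len(word) < min_len:
        return None

    def parts(s):
        if s in known:
            return [s]
        for i in range(len(s) - 2, 1, -1):   # try the longest known prefix first
            if s[:i] in known:
                rest = parts(s[i:])
                if rest:
                    return [s[:i]] + rest
        return None

    for i in range(len(word) - 2, 1, -1):    # force at least two particles
        if word[:i] in known:
            rest = parts(word[i:])
            if rest:
                return [word[:i]] + rest
    return None

known = {"AUTO", "MOBILE", "THEATRE", "CITY", "THE", "GAS", "WORK", "STATION"}
print(split_compound("AUTOMOBILE", known))  # ['AUTO', 'MOBILE']
print(split_compound("THEATRE", known))     # None: 'ATRE' is not a known word
```

Because `known` is just the set of words already seen in table A, the splitter "learns" as the word base grows, exactly as described: THEATRE stays whole until some record makes 'ATRE' a valid stand-alone word.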
Note: if you design your table smartly, both B and C could be contained
within the same table, adding a field to indicate whether it is a
'B'- or 'C'-type record.
IV - Showing fast results.
The nice way to do things is to search on every keypress; the search
therefore took place after every character typed.
e.g. the user has typed: MO..
--> do a SEEK for MO.. in table B, plus a SCAN to sum the number of hits;
if the number of hits exceeds 1,000, abort the scan and show nothing yet.
--> do a SEEK for MO.. in table C. (When 'B' and 'C' are the same table,
the search is done in one go.)
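The per-keystroke count with an abort threshold can be sketched over a sorted word list, with `bisect` playing the role of SEEK and the `while` loop the role of SCAN WHILE; the 1,000-hit limit comes from the text, the rest is an assumed layout:

```python
import bisect

def count_hits(words, nhits, prefix, limit=1000):
    """words: sorted list of lookup words; nhits: parallel hit counts.
    Sum the hits for all words starting with `prefix`; abort past `limit`."""
    prefix = prefix.upper()
    i = bisect.bisect_left(words, prefix)               # SEEK equivalent
    total = 0
    while i < len(words) and words[i].startswith(prefix):  # SCAN WHILE
        total += nhits[i]
        if total > limit:
            return None  # too many hits: abort the scan, show nothing yet
        i += 1
    return total

words = ["MOBILE", "MOPED", "SMITH"]
nhits = [12, 3, 5]
print(count_hits(words, nhits, "MO"))  # 15
```

Since the word index is ordered, the scan touches only the words sharing the typed prefix, which is what keeps the per-keystroke feedback cheap even on a large main table.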
Hope it helps.
It'll give you some days to experiment I suppose.
David Daniels <d...@mainstreetcomp.com> wrote in message
> Here are the answers to both of your responses.
> 1) B. We want to match only 1 word when searching, i.e. "smith" will find
> "John Smith", "Fred Smith", and ideally (but not mandatorily) also
> "Joe Smithton".
> 2) A: The problem with our current approach without the second table is
> SPEED; the problem with the 2nd table is maintenance. We wrote the
> front-end of a POS (point-of-sale) system and all file maintenance is
> done via COMMS (that we didn't write). We have a nightly maintenance
> program that rebuilds the 2nd table with the correct custnum & search-words.
> We could add an insert/update trigger to do this but that is not desirable.
> B. Maximum words may be 5 or so.
> C. I don't know how many different words, but I think < 400,000.
> D. We could add a trigger and even add flags if we pressure our client's
> MIS department.
> E. Adding tables is possible but just inconvenient.
> -----Original Message-----
> It seems there are indeed some very good 3rd-party indexers (I've heard).
> I can't really help you with 3rd-party stuff, but I can remember that I did
> some optimization within FoxPro (a long time ago) with pretty good results.
> To see if the approach fits:
> A) What is the problem with your current approach? (Speed, size, ...?)
> B) How many words are there maximum in the 40-character field?
> C) Any idea how many different words altogether? (Probably much less than
> the number of records?)
> D) Can you change the original table (add flags, triggers, etc.) or is it
> untouchable for you?
> E) Is adding tables impossible, or just out of convenience?
> David Daniels <d...@mainstreetcomp.com> wrote in message
> > Problem: large customer table (400,000 records) that has a 40-character
> > 'lookup/match string'. I want to be able to look up a customer by
> > 'ANY' of the 'words/characters' in the match-string. Does anyone know
> > an indexing technique for putting multiple indexes of individual words
> > within 1 field???
> > (I have created a 2nd subordinate customer file by parsing the
> > customer-search string and making custnum and multiple 1-word records
> > as an index, 'joining' to the master file, but I am looking for a
> > solution without creating additional tables.)