Phonetic search codes

Post by Michael Ku » Thu, 30 Nov 1995 04:00:00



The following describes a replacement for the traditional "Soundex" coding
used to create a "phonetic" string for people's names.

I have produced a C version of this routine and sent it for inclusion
in the Informix archive. You can find it there.

Over the last 3 weeks I have looked at the differences between using
"Soundex" and "Metaphone" for retrieval of possible matches of persons
in the database.

I used 20,000+ unique last names and found that for the most part
Metaphone does group together names that Soundex would not. The primary
reason for this is that Soundex ALWAYS uses the first letter of the
name as part of the code.

However, with Metaphone you find MANY instances where the following
occurs:
               FALLIS & VALLIS    DIFFERENT Soundex, SAME Metaphone
               KLINE  & CLINE     DIFFERENT Soundex, SAME Metaphone

I personally have not implemented it yet because...

   KLINE & CLINE in the same "group" means that there are many, many,
   many, many more names that are going to be retrieved for each
   target group that you are going after.
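
To make the comparison above concrete, here is a minimal sketch of classic
Soundex in Python. It is not the C routine mentioned earlier. The function
name `soundex` and the table `CODES` are my own choices, and the handling of
H and W follows the common American Soundex convention, which is an
assumption on my part.

```python
# Minimal sketch of classic (American) Soundex, for illustration only.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code for a name ("" if no letters)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]                  # the first letter is ALWAYS kept
    last = CODES.get(letters[0], "")   # digit of the previous coded letter
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        if ch not in "HW":             # assumption: H and W do not break a run
            last = digit
    return (code + "000")[:4]          # pad or truncate to 4 characters


for pair in (("FALLIS", "VALLIS"), ("KLINE", "CLINE")):
    print(pair, [soundex(n) for n in pair])
```

Because the first letter is copied straight into the code, both pairs end up
in different Soundex groups even though the rest of each code is identical.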

My particular problem is to find matching names that are SOMETIMES
misspelled slightly. The input records usually have the first letter
of the last name right. It's the stuff in the middle that is screwed up.

Basically, I am looking for a "fuzzy filter". A couple of years
ago in Byte magazine and Unix Review there were articles on "agrep"
(approximate grep) that used some new algorithms and expanded adaptations
of old algorithms.

I took collections of names with the same Metaphone group and put them
in an ASCII file and used agrep to search for names by misspelling them, etc.

The results were "stunning" in my opinion. As an example, 58 last names
from a Metaphone "group" were reduced to a set of 14 BEST GUESSes by agrep.

I looked at the C code for agrep. Getting the basic "fuzzy filter"
routine into a form that can be called from 4GL would require someone to
get really familiar with these algorithms. That is more time than I have
right now.
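
As a rough idea of what such a "fuzzy filter" does, here is a minimal sketch
in Python. It is NOT agrep's algorithm. It swaps in plain Levenshtein edit
distance, which is simpler but answers the same question: how many
single-letter errors separate two names. The names `edit_distance`,
`best_guesses` and `max_errors`, and the sample group, are made up for
illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: inserts, deletes and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))     # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def best_guesses(target, candidates, max_errors=1):
    """Return candidates within max_errors of target, closest first."""
    target = target.upper()
    scored = [(edit_distance(target, c.upper()), c) for c in candidates]
    return [c for d, c in sorted(scored) if d <= max_errors]


# Hypothetical group of names that share one phonetic code.
group = ["KLINE", "CLINE", "KLEIN", "COLLINS"]
print(best_guesses("KLNE", group, max_errors=1))
print(best_guesses("KLNE", group, max_errors=2))
```

The idea is to use the phonetic code as a coarse first pass and then rank
what comes back by how many typing errors it would take to get from the
input to each candidate.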

While y'all are driving around on the net, keep a lookout for something
like this that can be embedded in 4GL.

Thanks.

 Metaphone Algorithm

   Created by Lawrence Philips (location unknown). Metaphone was presented
   in an article in the December 1990 issue of "Computer Language".

             *********** BEGIN METAPHONE RULES ***********

 Lawrence Philips' RULES follow:

 The 16 consonant sounds:
                                             |--- ZERO represents "th"
                                             |
      B  X  S  K  J  T  F  H  L  M  N  P  R  0  W  Y

 Exceptions:

   Beginning of word: "ae-", "gn-", "kn-", "pn-", "wr-" ----> drop first letter
                      "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright"

   Beginning of word: "x"                                ----> change to "s"
                                      as in "Deng Xiaoping"

   Beginning of word: "wh-"                              ----> change to "w"
                                      as in "Whalen"
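
The three beginning-of-word exceptions above can be sketched in a few lines
of Python. The function name `apply_initial_exceptions` is hypothetical, and
I am assuming the name has already been reduced to plain letters.

```python
def apply_initial_exceptions(name):
    """Apply the beginning-of-word exceptions and return the upper-cased word."""
    w = name.upper()
    if w[:2] in ("AE", "GN", "KN", "PN", "WR"):
        return w[1:]                   # drop the first letter
    if w.startswith("X"):
        return "S" + w[1:]             # leading "x" sounds like "s"
    if w.startswith("WH"):
        return "W" + w[2:]             # leading "wh" sounds like "w"
    return w


for n in ("Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright", "Whalen"):
    print(n, "->", apply_initial_exceptions(n))
```

The letter-by-letter transformations are then applied to whatever this
step returns.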

 Transformations:

   B ----> B      unless at the end of word after "m", as in "dumb", "McComb"

   C ----> X      (sh) if "-cia-" or "-ch-"
           S      if "-ci-", "-ce-", or "-cy-"
                  SILENT if "-sci-", "-sce-", or "-scy-"
           K      otherwise, including in "-sch-"

   D ----> J      if in "-dge-", "-dgy-", or "-dgi-"
           T      otherwise

   F ----> F

   G ---->        SILENT if in "-gh-" and not at end or before a vowel
                         or in "-gn" or "-gned"
                         or in "-dge-" etc., as in the D rule above
           J      if before "i", "e", or "y", and not in double "gg"
           K      otherwise

   H ---->        SILENT if after vowel and no vowel follows
                         or after "-ch-", "-sh-", "-ph-", "-th-", "-gh-"
           H      otherwise

   J ----> J

   K ---->        SILENT if after "c"
           K      otherwise

   L ----> L

   M ----> M

   N ----> N

   P ----> F      if before "h"
           P      otherwise

   Q ----> K

   R ----> R

   S ----> X      (sh) if before "h" or in "-sio-" or "-sia-"
           S      otherwise

   T ----> X      (sh) if "-tia-" or "-tio-"
           0      (th) if before "h"
                  SILENT if in "-tch-"
           T      otherwise

   V ----> F

   W ---->        SILENT if not followed by a vowel
           W      if followed by a vowel

   X ----> KS

   Y ---->        SILENT if not followed by a vowel
           Y      if followed by a vowel

   Z ----> S
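
For anyone who wants to experiment, here is a rough Python sketch of the
transformation table above. It is NOT the C routine sent to the archive. The
name `metaphone_body` is hypothetical. The table does not say what to do
with vowels or doubled letters, so the sketch ASSUMES that a vowel is kept
only as the first letter and that a doubled letter (other than "cc") counts
once. The "-gh-" rule is coded according to my reading of its wording. The
beginning-of-word exceptions are not applied here and are expected to have
been handled first.

```python
VOWELS = frozenset("AEIOU")
FRONT = frozenset("EIY")               # letters that soften C and G
SAME = frozenset("FJLMNR")             # letters that map to themselves


def metaphone_body(word):
    """Encode a word with the transformation rules listed above (a sketch)."""
    w = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    n = len(w)

    def at(i):
        # Letter at position i, or "" when i falls outside the word.
        return w[i] if 0 <= i < n else ""

    out = []
    for i, c in enumerate(w):
        prev, nxt, after = at(i - 1), at(i + 1), at(i + 2)
        if c == prev and c != "C":
            continue                   # assumption: doubled letters count once
        if c in VOWELS:
            if i == 0:
                out.append(c)          # assumption: keep a leading vowel only
        elif c in SAME:
            out.append(c)
        elif c == "B":
            if not (prev == "M" and i == n - 1):
                out.append("B")        # silent only in final "-mb"
        elif c == "C":
            if prev == "S" and nxt in FRONT:
                pass                   # "-sci-", "-sce-", "-scy-"
            elif prev == "S" and nxt == "H":
                out.append("K")        # "-sch-"
            elif nxt == "H" or (nxt == "I" and after == "A"):
                out.append("X")
            elif nxt in FRONT:
                out.append("S")
            else:
                out.append("K")
        elif c == "D":
            out.append("J" if nxt == "G" and after in FRONT else "T")
        elif c == "G":
            if nxt == "H" and i + 1 != n - 1 and after not in VOWELS:
                pass                   # "-gh-" not at end, not before a vowel
            elif w[i:] in ("GN", "GNED"):
                pass                   # final "-gn" or "-gned"
            elif prev == "D" and nxt in FRONT:
                pass                   # "-dge-" etc., already coded by D
            elif nxt in FRONT:
                out.append("J")
            else:
                out.append("K")
        elif c == "H":
            if prev in frozenset("CSPTG"):
                pass                   # second half of ch, sh, ph, th, gh
            elif prev in VOWELS and nxt not in VOWELS:
                pass                   # after a vowel with no vowel following
            else:
                out.append("H")
        elif c == "K":
            if prev != "C":
                out.append("K")
        elif c == "P":
            out.append("F" if nxt == "H" else "P")
        elif c == "Q":
            out.append("K")
        elif c == "S":
            if nxt == "H" or (nxt == "I" and after in ("O", "A")):
                out.append("X")
            else:
                out.append("S")
        elif c == "T":
            if nxt == "I" and after in ("A", "O"):
                out.append("X")
            elif nxt == "H":
                out.append("0")        # zero stands for "th"
            elif not (nxt == "C" and after == "H"):
                out.append("T")        # silent in "-tch-"
        elif c == "V":
            out.append("F")
        elif c in ("W", "Y"):
            if nxt in VOWELS:
                out.append(c)
        elif c == "X":
            out.append("KS")
        elif c == "Z":
            out.append("S")
    return "".join(out)


for name in ("FALLIS", "VALLIS", "KLINE", "CLINE"):
    print(name, "->", metaphone_body(name))
```

Under these assumptions FALLIS and VALLIS land in one group and KLINE and
CLINE in another, which is exactly the widening effect described earlier.
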
--
Michael J. Kuhn  Computer Systems Consultant  phone:410-254-7060


       c/o Baltimore Rh Typing Laboratory, Inc.  phone:410-225-9595