SOUNDEX?

SOUNDEX?

Post by Olav Tollefse » Sun, 16 Nov 2003 11:10:43



I would like to search on strings that sound similar and to do this I'm
experimenting with the SOUNDEX and DIFFERENCE functions.

One problem that I have found is that the SOUNDEX function is not capable of
detecting rather huge differences, like in this example:

SOUNDEX('Margretes vei') = M626
SOUNDEX('Merkurveien') = M626
DIFFERENCE(''Margretes vei', 'Merkurveien') = 4 (lowest possible difference)

How can I get this more accurate?
--
Olav Tollefsen

 
 
 

SOUNDEX?

Post by Joe Celk » Mon, 17 Nov 2003 02:01:08


This is taken from my book on advanced SQL:

 6.3.5.  SOUNDEX FUNCTIONS

 People's names are a special problem for designers of databases.  There
are often several ways to spell the same name, as anyone who has gotten
junk mail addressed to some mutation of his name can testify.  I get
mail addressed to "Selco", "Selko", "Celco" as well as "Celko".  I also
have some near misses such as "Cellro", "Chelco" and "Chelko" in my mail
stack.  These are all phonetic errors.  

 To solve this problem, we need a algorithm which will group names with
a similar sound together under one code.  These family of algorithms are
called Soundex algorithms, after the original algorithm of that name.  A
Soundex algorithm takes a person's name as input and produces a
character string which identifies a set of names that are (roughly)
phonetically alike.  

 A few versions of SQL, such as WATCOM SQL, and some other 4GL products
have a Soundex algorithm in their library functions.  It is also
possible to computer a Soundex in SQL, using some of the string
functions in the SQL-92 standard, but it is difficult.  The different
steps in the algorithms have to be done as multiple updates on the same
column and it is simply not worth the effort.

 Programmers will usually be compute the Soundex in a host language
outside of the database, then insert it into a column next to the name
which it indexes.

 6.3.5.1.  THE ORIGINAL SOUNDEX

 The original Soundex algorithm was patented by Margaret O'Dell and
Robert C.  Russell in 1918.  The method is based on the phonetic
classification of sounds by how they are made.  In case you wanted to
know, the six groups are bilabial, labiodental, dental, alveolar, velar,
and glottal.  The algorithm is fairly straight forward to code and
requires no backtracking or multiple passes over the input word.  This
should not be too surprising, since it was in use before computers and
had to be done by hands by clerks.  Here is the algorithm:

 (1.0) Capitalize all letters in the word.  Pad the word with rightmost
blanks as needed during each procedure step.  

 (2.0) Retain the first letter of the word.

 (3.0) Drop all occurrence of the following letters after the first
position: 'A', E', 'H', 'I', 'O', 'W', 'Y'.

 (4.0) Change the following sets of letters into the digit given:

 1  =  'B', 'F', 'P', 'V'
 2  =  'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'
 3  =  'D', 'T'
 4  =  'L'
 5  =  'M', 'N'
 6  =  'R'

 (5.0) Remove all pairs of digits which occur beside each other from the
string that resulted after step (4.0).

 (6.0) Pad the string that resulted from step (5.0) with trailing zeros
and return only the first four positions, which will be for the form
<uppercase letter> <digit> <digit> <digit>.

 An alternative version of the algorithm, due to Russell, changes the
letters in step (3.0) to 9's, retaining them.  Then step (5.0) is
replaced by steps (5.1) which removes duplicates as before, followed by
step (5.2), which removes all 9's and closes up the spaces.  This allows
pairs of digits to appear in the result string.  The version has more
granularity and will work better for a larger sample of names.  

 6.3.5.2.  AN IMPROVED SOUNDEX

 The Soundex algorithm given here is a procedure which will take a name
and return a four letter code.  The codes tend to group those English
names which are phonetically close together in smaller and more accurate
groups than the original algorithm.  It is not perfect by any means, but
does a good job given a large database.  Its main advantage is that it
considers groups of letters which can have a different phonetic value
from the letters taken separately.  It also maps the names into a code
space with more granularity - that means there are more possible codes
than provided by the original Soundex.  Here is the algorithm:

 (1.0) Capitalize all letters in the word.  Pad the word with rightmost
blanks as needed during each procedure step.  

 (2.0) Replace all non-leading vowels with 'A'.

 (3.0) Use this table to change prefixes:

 Prefix    Transform
      MAC       MCC
      KN        NN
      K         C
      PF        FF
      SCH       SSS
      PH        FF

 (4.0) Phonetic changes are made on the part of the word after the first
letter, according to these sub-rules:

 (4.1)Transform certain letter combinations:
      text      Transform    
      DG        GG
      CAAN      TAAN
      D         T
      NST       NSS
      AV        AF
      Q         G
      Z         S
      M         N
      KN        NN
      K         C

 (4.2) Replace 'H' with 'A', unless it is preceded and followed by 'A'
(that is, 'AHA').
 (4.3) Replace 'AW' with 'A'.
 (4.4) Replace 'PH' with 'FF'.
 (4.5) Replace 'SCH' with 'SSS'.

 (5.0) Perform clean up functions.
 (5.1) Drop all terminal 'A' and 'S' characters.  Pad the word with
rightmost blanks as needed.  
 (5.2) Replace terminal 'NT' with 'TT'.
 (5.3) Strip out all 'A' characters, except for the leading 'A'.  Pad
the word with rightmost blanks as needed.  
 (5.4) Strip all but the first of repeating adjacent character
substrings.  Pad the word with rightmost blanks as needed.  

 (6.0) The result is the first four characters of the resulting string.

 6.3.5.3.  METAPHONE

Metaphone is another improved Soundex which first appeared in COMPUTER
LANGUAGES magazine (Miller-Freeman Publications, vol. 7 no. 12).  A
Pascal version was written by Terry Smithwick, based on the original 'C'
version by Lawrence Phillips, which we reproduce with permission here:

 FUNCTION Metaphone (p : STRING) : STRING;
 CONST
 VowelSet = ['A', 'E', 'I', 'O', 'U'];
 FrontVSet = ['E', 'I', 'Y'];
 VarSonSet = ['C', 'S', 'T', 'G'];  { variable sound -
            modified by following 'h' }

 FUNCTION SubStr (A : STRING; Start, Len : INTEGER) : STRING;
 BEGIN
 SubStr := Copy (A, Start, Len);
 END;

 FUNCTION Metaphone (p : STRING) : STRING;
 VAR
   i, l, n   : BYTE;
   silent, new  : BOOLEAN;
   last, this, next, nnext : CHAR;
   m, d : STRING;
 BEGIN { Metaphone }
 IF (p = '')
 THEN BEGIN
      Metaphone := '';
      EXIT;
      END;
 { Remove leading spaces }
 FOR i := 1 TO Length (p)
 DO p[i] := UpCase (p[i]);
 { Assume all alphas }
 { initial preparation of string }
 d := SubStr (p, 1, 2);
 IF d IN ('KN', 'GN', 'PN', 'AE', 'WR')
 THEN p := SubStr (p, 2, Length (p) - 1);
 IF (p[1] = 'X')
 THEN p := 'S' + SubStr (p, 2, Length (p) - 1);
 IF (d = 'WH')
 THEN p := 'W' + SubStr (p, 2, Length (p) - 1);
 { Set up for Case statement }
 l := Length (p);
 m := '';      { Initialize the main variable }
 new := TRUE;     { this variable only used next 10 lines!!! }
 n := 1;     { Position counter }
 WHILE ((Length (m) < 6) AND (n <> l))
 DO BEGIN { Set up the 'pointers' for this loop-around }
    IF (n > 1)
    THEN last := p[n-1]
    ELSE last := #0;  { use a nul terminated string }
    this := p[n];
    IF (n < l)
    THEN next := p[n+1]
    ELSE next := #0;
    IF ((n+1) < l)
    THEN nnext := p[n+2]
    ELSE nnext := #0;
    new := (this = 'C') AND (n > 1) AND (last = 'C');
    { 'CC' inside word }
   IF (new)
   THEN BEGIN
        IF ((this IN VowelSet) AND (n = 1))
        THEN m := this;
    CASE this OF
    'B' : IF NOT ((n = l) AND (last = 'M'))
      THEN m := m + 'B';  { -mb is silent }
  'C' : BEGIN     { -sce, i, y = silent }
     IF NOT ((last = 'S') AND (next IN FrontVSet))
     THEN BEGIN
        IF (next = 'i') AND (nnext = 'A')
        THEN m := m + 'X'{ -cia- }
        ELSE IF (next IN FrontVSet)
             THEN m := m + 'S' { -ce, i, y = 'S' }
             ELSE IF (next = 'H') AND (last = 'S')
                  THEN m := m + 'K' { -sch- = 'K' }
                  ELSE IF (next = 'H')
                       THEN IF (n = 1) AND ((n+2) < = l)
                              AND NOT (nnext IN VowelSet)
                            THEN m := m + 'K'
                            ELSE m := m + 'X';
        END { Else silent }
     END; { Case C }
 'D' : IF (next = 'G') AND (nnext IN FrontVSet)
       THEN m := m + 'J'
       ELSE m := m + 'T';
 'G' : BEGIN
    silent := (next = 'H') AND (nnext IN VowelSet);
    IF  (n > 1) AND (((n+1) = l) OR ((next = 'n') AND
      (nnext = 'E') AND (p[n+3] = 'D') AND ((n+3) = l))
 { Terminal -gned }
    AND (last = 'i') AND (next = 'n'))
    THEN silent := TRUE; { if not start and near -end or -gned.) }
    IF (n > 1) AND (last = 'D'gnuw) AND (next IN FrontVSet)
    THEN { -dge, i, y }
    silent := TRUE;
    IF NOT silent
    THEN IF (next IN FrontVSet)
         THEN m := m + 'J'
         ELSE m := m + 'K';
    END;
 'H' : IF NOT ((n = l) OR (last IN VarSonSet)) AND (next IN VowelSet)
       THEN m := m + 'H';  { else silent (vowel follows) }
 'F', 'J', 'L', 'M', 'N', 'R' : m := m + this;
 'K' : IF (last <> 'C')
       THEN m := m + 'K';
 'P' : IF (next = 'H')
       THEN BEGIN
            m := m + 'F';
            INC (n);
            END  { Skip the 'H' }
       ELSE m := m + 'P';
 'Q' : m := m + 'K';
 'S' : IF (next = 'H')
          OR ((n > 1) AND (next = 'i') AND (nnext IN ['O', 'A']))
       THEN m := m + 'X'
       ELSE m := m + 'S';
 'T' : IF (n = 1) AND (next = 'H') AND (nnext = 'O')
       THEN m := m + 'T' { Initial Tho- }
       ELSE IF (n > 1) AND (next = 'i') AND (nnext IN ['O', 'A'])
            THEN m := m + 'X'
            ELSE IF (next = 'H')
                 THEN m := m + '0'
                 ELSE IF NOT ((next = 'C') AND (nnext = 'H'))
                      THEN  m := m + 'T'; { -tch = silent }
 'V' : m := m + 'F';
 'W', 'Y' : IF (next IN VowelSet)
            THEN m := m + this;  { else silent }
 'X' : m := m + 'KS';
 'Z' : m := m + 'S';
 END; { Case }

 INC (n);
 END; { While }
 END; { Metaphone }
 Metaphone := m
 END;

 6.3.5.4.  CUTTER TABLES

 Another encoding scheme for names has been used for libraries for over
100 years. The catalog number of a book often needs to reduce an
author's name to a simple fixed length code, ...

read more »

 
 
 

1. pgsql/contrib/soundex (soundex.c soundex.sql.in)


Author: tgl

Update of /home/projects/pgsql/cvsroot/pgsql/contrib/soundex
     from hub.org:/home/projects/pgsql/tmp/cvs-serv51753/contrib/soundex

Modified Files:
        soundex.c soundex.sql.in

-----------------------------  Log Message  -----------------------------

Revise handling of oldstyle/newstyle functions per recent discussions
in pghackers list.  Support for oldstyle internal functions is gone
(no longer needed, since conversion is complete) and pg_language entry
'internal' now implies newstyle call convention.  pg_language entry
'newC' is gone; both old and newstyle dynamically loaded C functions
are now called language 'C'.  A newstyle function must be identified
by an associated info routine.  See src/backend/utils/fmgr/README.

2. Oracle on NT - Thread ID of DBWR

3. pgsql/contrib/soundex (soundex.c)

4. Newbie import from interbase to SQL

5. select soundex(string1), soundex(string2)

6. pgsql/src/test/regress (README)

7. SOUNDEX with 2 srings?

8. ADO + Access querry problem

9. Soundex and "look-alike" searches

10. soundex function

11. Soundex in a non-english Database

12. Using SOUNDEX and CONTAINS

13. SoundEx ?