community.borland.com

Article #19301: How to do a "sounds like" query in Interbase

 Technical Information Database

TI4301C.txt   :How to do a "sounds like" query in Interbase
Category   :Database Issues
Platform    :Win95/NT
Product    :All-CBuilder,   

Description:
Imagine you could run a query on an Interbase table that would return a set
of values that _sound_ like your query. It is possible through the magic of 
UDFs (user defined functions) and an algotrithm called SoundEx. 
Using a very simple algorithm in a DLL, you can
declare a function that assigns a numeric code based on the way a word "sounds".
This code can be compared against your query string to return an accurate
(based on the simplicity of the algorithm) set. 

----How it works----
SoundEx classifies words for how they sound by asssigning a short numeric
code based on a group of letters.

Take the first letter. 
Then, for each of the subsequent letters, score a code for the group it 
belongs to. There are 6 groups : 

Group      Letters
1          BFPV

2          CGJKQSXZ

3          DT

4          L

5          MN

6          R

Ignore all other letters and characters (vowels, h, punctuation, etc.)
If a code has already occurred, ignore it. Continue until you have 4 characters,
if there are less, add zeroes until you have 4.

Thus 
Smith becomes S5300
Smitt becomes S5300
Smythe becomes S5300

And they all sound the same or similar.

---------Implementation---------
In order to get Interbase to use the Soundex function we must do the following:

Write the function in C++ or Delphi.
Encapsulate the function in a DLL.
Declare the function and DLL to interbase as a UDF.
Create a column on our table to keep track of the SoundEx codes.
Add a trigger to update the data.

The Delphi function was originally written by John Midwinter and was implemented to
return not the SoundEx 4 character code, but a number in a Byte (SmallInt).
He did this by realising that, though the natural Soundex Code has 4 characters, 
it only has 26 * 6 * 6 * 6 (=5,616) possible values which can be held by a small int.
And therefore chose to store it as a number rather than a character string. Below
I have provided both Delphi and C++Builder source code.

---------C++ Function-----------
extern "C"
int __declspec(dllexport) SoundBts(char *Name)
{
    unsigned char SoundExTable[59] =
    //A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ / ] ^ _ '
     {0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2,0,0,0,0,0,0,
    //a b c d e f g h i j k l m n o p q r s t u v w x y z
      0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2};
    //This table is used to keep track of what number goes with what letter.
    Byte i,s,SO,x;
    int Multiple, Result;
    int len;
    if (Name != "")
    {
        Result = (short)UpCase(Name[0]);  //initialize to first character
        SO = 0;                     //initialize last char to 0
        Multiple = 26;    //26 letters in the alphabet
        len = (StrLen(Name));  //length of word
        for (i=0; i 64) && (s < 123)) //if it is a letter
            {
                x = SoundExTable[((int)s - 65)];  //get the SoundEx code
                if ((x > 0) && (x != SO))
                {
                    Result += (x * Multiple);
                    if (Multiple == (26 * 6 * 6 * 6))
                        break;                 //we have a sufficient code
                    Multiple *= 6;
                    SO = x;
                }
            }
        }
    }
    else return 0;
    return Result;           //Finished SoundEx code
}
-------------------------------------------------------------------
------------Delphi Function by John Midwinter--------------------------
library SoundBytes;

uses
  SysUtils;

function SoundBts(Name : PChar) : SmallInt; cdecl;
Const
{This table gives the SoundEX SCORE for all characters Upper and Lower Case
hence no need to convert. This is faster than doing an UpCase on the whole input string
The 5 NON Chars in middle are just given 0}

SoundExTable : Array[65..122] Of Byte
//A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ / ] ^ _ '
=(0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2,0,0,0,0,0,0,
//a b c d e f g h i j k l m n o p q r s t u v w x y z
  0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2);

Var
  i, l, s, SO, x : Byte;
  Multiple : Word;

begin
  If Name >  ''                           //do nothing if nothing is passed
  then begin
    Result := Ord(UpCase(Name[0]));       //initialise to first character
    SO := 0;                              //initialise last char as 0
    Multiple := 26;                       //initialise to 26 char of alphabet
    l := Pred(StrLen(Name));              //get into var to save repeating function
    For i := 1 to l do                    //for each char of input str
    begin
      s := Ord(name[i]);                  //*
      If (s > 64) and (s < 123)           //see notes * below
      then begin
        x := SoundExTable[s];             //get soundex value
        If (x > 0)                        //it is a scoring char
        AND (x <> SO)                     //is different from previous char
        then begin
          Result := Result + (x * Multiple); //avoid use of POW as it needs maths unit
          If (Multiple = 26 * 6 * 6)      //we have done enough (NB compiles to a const
           then break;                    //We have done, so leave loop
          Multiple := Multiple * 6;
          SO := x;                        //save for next round
        end;                              // of if a scoring char
      end;                                //of if in range of SoundEx table
    end;                                  //of for loop
  end else result := 0;
end;                                      //of function SoundBts

exports                                   
  SoundBts;
begin
end.
--------------------------------------------------------------------------
Putting the function into Interbase.
For this example, I have included an SQL script. The table we are updating is called
People.

---------soundex.sql--------
CONNECT MyDatabase.gdb USER SYSDBA PASSWORD mypass ;

DECLARE EXTERNAL FUNCTION SoundBytes Char(30)
RETURNS INTEGER BY VALUE              //function declaration
ENTRY_POINT "SoundBts"                //function name (DLL entry point)
MODULE_NAME "soundex.dll" ;           //name and path of the dll

CREATE DOMAIN SoundBytesDomain        //here we create a domain for the type
INTEGER                               //that the code column will hold
DEFAULT 0
NOT NULL;

ALTER TABLE PEOPLE                    //add a column to hold the SoundEx codes
ADD SOUNDBYTES INTEGER NOT NULL;

UPDATE PEOPLE                         //get the SoundEx codes into SOUNDBYTES
SET SOUNDBYTES = SoundBytes(NAME);    //from the NAME column

CREATE INDEX SoundByteIndex           //put an index on the table
ON PEOPLE (SOUNDBYTES);

SET TERM !! ;                         //create a trigger to update the table with
CREATE TRIGGER SoundBytes_Update FOR PEOPLE //new soundex codes
BEFORE UPDATE AS
BEGIN
  NEW.SOUNDBYTES = SoundBytes(NEW.NAME);
END !!
SET TERM ; !!

EXIT ;
--------------------------------------------------------------

--------Using the function in your application-----------
There are several ways to utilize this function. You would
use it in pseudo-SQL like this:

select * from MY_TABLE where MY_FIELD sounds like MY_QUERY

One of the ways is to have the server execute the SoundBts function using a select 
statement:
select NAME from PEOPLE where
  SOUNDBYTES = SoundBytes("John")

There can be problems with this method, however - it can GPF because of a 
DLL path discrepancy.

Another way is to generate the SoundEx code within your app by importing the function
from the DLL. This is quicker and you simply have to pass the number, rather than
executing the UDF on the server. 

The code might look like this:

//after importing the function, use it
SoundExCode = SoundBts("MySearchString");

//using a TQuery connected to a TDatabase attached to the Interbase database:
//the SQL property of the query looks like this:
Select Name from People where
 soundbytes = :SXCode

//then all you need to do is this:

Query1->Prepare();
Query1->ParamByName("SXCode")->AsInteger = SoundExCode;
Query1->Open();
----------------------------------------------------

This is the magic of SoundEx and Interbase UDFs. This is very simple, and the function
can be fine-tuned to get closer matches (say, if your table had a whole lot of similar-
sounding names).


----------------FIN-------------------

Reference:
None

2/23/99 1:59:49 PM
 

Last Modified: 01-SEP-99