Technical Information Database TI4301C.txt :How to do a "sounds like" query in Interbase Category :Database Issues Platform :Win95/NT Product :All-CBuilder, Description: Imagine you could run a query on an Interbase table that would return a set of values that _sound_ like your query. It is possible through the magic of UDFs (user defined functions) and an algotrithm called SoundEx. Using a very simple algorithm in a DLL, you can declare a function that assigns a numeric code based on the way a word "sounds". This code can be compared against your query string to return an accurate (based on the simplicity of the algorithm) set. ----How it works---- SoundEx classifies words for how they sound by asssigning a short numeric code based on a group of letters. Take the first letter. Then, for each of the subsequent letters, score a code for the group it belongs to. There are 6 groups : Group Letters 1 BFPV 2 CGJKQSXZ 3 DT 4 L 5 MN 6 R Ignore all other letters and characters (vowels, h, punctuation, etc.) If a code has already occurred, ignore it. Continue until you have 4 characters, if there are less, add zeroes until you have 4. Thus Smith becomes S5300 Smitt becomes S5300 Smythe becomes S5300 And they all sound the same or similar. ---------Implementation--------- In order to get Interbase to use the Soundex function we must do the following: Write the function in C++ or Delphi. Encapsulate the function in a DLL. Declare the function and DLL to interbase as a UDF. Create a column on our table to keep track of the SoundEx codes. Add a trigger to update the data. The Delphi function was originally written by John Midwinter and was implemented to return not the SoundEx 4 character code, but a number in a Byte (SmallInt). He did this by realising that, though the natural Soundex Code has 4 characters, it only has 26 * 6 * 6 * 6 (=5,616) possible values which can be held by a small int. And therefore chose to store it as a number rather than a character string. Below I have provided both Delphi and C++Builder source code. ---------C++ Function----------- extern "C" int __declspec(dllexport) SoundBts(char *Name) { unsigned char SoundExTable[59] = //A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ / ] ^ _ ' {0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2,0,0,0,0,0,0, //a b c d e f g h i j k l m n o p q r s t u v w x y z 0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2}; //This table is used to keep track of what number goes with what letter. Byte i,s,SO,x; int Multiple, Result; int len; if (Name != "") { Result = (short)UpCase(Name[0]); //initialize to first character SO = 0; //initialize last char to 0 Multiple = 26; //26 letters in the alphabet len = (StrLen(Name)); //length of word for (i=0; i64) && (s < 123)) //if it is a letter { x = SoundExTable[((int)s - 65)]; //get the SoundEx code if ((x > 0) && (x != SO)) { Result += (x * Multiple); if (Multiple == (26 * 6 * 6 * 6)) break; //we have a sufficient code Multiple *= 6; SO = x; } } } } else return 0; return Result; //Finished SoundEx code } ------------------------------------------------------------------- ------------Delphi Function by John Midwinter-------------------------- library SoundBytes; uses SysUtils; function SoundBts(Name : PChar) : SmallInt; cdecl; Const {This table gives the SoundEX SCORE for all characters Upper and Lower Case hence no need to convert. This is faster than doing an UpCase on the whole input string The 5 NON Chars in middle are just given 0} SoundExTable : Array[65..122] Of Byte //A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ / ] ^ _ ' =(0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2,0,0,0,0,0,0, //a b c d e f g h i j k l m n o p q r s t u v w x y z 0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2); Var i, l, s, SO, x : Byte; Multiple : Word; begin If Name > '' //do nothing if nothing is passed then begin Result := Ord(UpCase(Name[0])); //initialise to first character SO := 0; //initialise last char as 0 Multiple := 26; //initialise to 26 char of alphabet l := Pred(StrLen(Name)); //get into var to save repeating function For i := 1 to l do //for each char of input str begin s := Ord(name[i]); //* If (s > 64) and (s < 123) //see notes * below then begin x := SoundExTable[s]; //get soundex value If (x > 0) //it is a scoring char AND (x <> SO) //is different from previous char then begin Result := Result + (x * Multiple); //avoid use of POW as it needs maths unit If (Multiple = 26 * 6 * 6) //we have done enough (NB compiles to a const then break; //We have done, so leave loop Multiple := Multiple * 6; SO := x; //save for next round end; // of if a scoring char end; //of if in range of SoundEx table end; //of for loop end else result := 0; end; //of function SoundBts exports SoundBts; begin end. -------------------------------------------------------------------------- Putting the function into Interbase. For this example, I have included an SQL script. The table we are updating is called People. ---------soundex.sql-------- CONNECT MyDatabase.gdb USER SYSDBA PASSWORD mypass ; DECLARE EXTERNAL FUNCTION SoundBytes Char(30) RETURNS INTEGER BY VALUE //function declaration ENTRY_POINT "SoundBts" //function name (DLL entry point) MODULE_NAME "soundex.dll" ; //name and path of the dll CREATE DOMAIN SoundBytesDomain //here we create a domain for the type INTEGER //that the code column will hold DEFAULT 0 NOT NULL; ALTER TABLE PEOPLE //add a column to hold the SoundEx codes ADD SOUNDBYTES INTEGER NOT NULL; UPDATE PEOPLE //get the SoundEx codes into SOUNDBYTES SET SOUNDBYTES = SoundBytes(NAME); //from the NAME column CREATE INDEX SoundByteIndex //put an index on the table ON PEOPLE (SOUNDBYTES); SET TERM !! ; //create a trigger to update the table with CREATE TRIGGER SoundBytes_Update FOR PEOPLE //new soundex codes BEFORE UPDATE AS BEGIN NEW.SOUNDBYTES = SoundBytes(NEW.NAME); END !! SET TERM ; !! EXIT ; -------------------------------------------------------------- --------Using the function in your application----------- There are several ways to utilize this function. You would use it in pseudo-SQL like this: select * from MY_TABLE where MY_FIELD sounds like MY_QUERY One of the ways is to have the server execute the SoundBts function using a select statement: select NAME from PEOPLE where SOUNDBYTES = SoundBytes("John") There can be problems with this method, however - it can GPF because of a DLL path discrepancy. Another way is to generate the SoundEx code within your app by importing the function from the DLL. This is quicker and you simply have to pass the number, rather than executing the UDF on the server. The code might look like this: //after importing the function, use it SoundExCode = SoundBts("MySearchString"); //using a TQuery connected to a TDatabase attached to the Interbase database: //the SQL property of the query looks like this: Select Name from People where soundbytes = :SXCode //then all you need to do is this: Query1->Prepare(); Query1->ParamByName("SXCode")->AsInteger = SoundExCode; Query1->Open(); ---------------------------------------------------- This is the magic of SoundEx and Interbase UDFs. This is very simple, and the function can be fine-tuned to get closer matches (say, if your table had a whole lot of similar- sounding names). ----------------FIN------------------- Reference: None 2/23/99 1:59:49 PM
Last Modified: 01-SEP-99