How NameSearch® Works - Intelligent Search Keys
Retrieval of information from your database is achieved
by inserting search keys produced by the NameSearch® product.
At inquiry time, NameSearch® accepts an input string as a parameter and
returns key ranges. These ranges are used to find the records whose
search keys fall within them.
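
As a rough illustration of the inquiry step, the sketch below assumes the search keys are held in a sorted list and that the inquiry call has already returned a list of (low, high) ranges. The function name find_in_ranges, the sample keys and the sample range are hypothetical; the actual NameSearch® interface is not described in this document.

```python
from bisect import bisect_left, bisect_right

def find_in_ranges(sorted_keys, ranges):
    """Return every stored key that falls inside any (low, high) range, inclusive."""
    hits = []
    for low, high in ranges:
        start = bisect_left(sorted_keys, low)    # first key >= low
        end = bisect_right(sorted_keys, high)    # one past the last key <= high
        hits.extend(sorted_keys[start:end])
    return hits

# Hypothetical keys, kept sorted the way a database index would keep them.
index = sorted(["FRNK LE", "LE FRNK", "LE RA", "MCHL PL", "PL MCHL"])

# Hypothetical range covering every key that begins with "LE".
ranges = [("LE", "LE\uffff")]
matches = find_in_ranges(index, ranges)
```

Only the keys inside the range are touched; the rest of the index is never read, which is where the reduction in records processed comes from.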
Manufacturing Name-Keys
The goal of producing an intelligent search key
is to improve the quality of records being returned without sacrificing
performance. To improve performance we minimize the number of
records being processed. If we optimize only for performance, there
is a good chance that good candidates are missed. On the other
hand, performance degrades when the volume of records being
processed is increased to ensure quality.
NameSearch® makes the balancing act easy by
always finding the records of interest in the smallest set without
missing likely candidates. By finding only those records of interest
NameSearch® dramatically reduces I/O utilization.
Components of an Intelligent Key
To overcome input string variations caused by phonetics,
transcription, keyboarding errors, nicknames, short forms, missing
words, extra words, noise and sequencing differences, NameSearch® employs
four subfunctions to produce a key. These are: sanitization,
word pattern recognition, phonetic tokenization and key production.
These modules receive an input string from the calling program
and internally manipulate the data. At the conclusion of the
process, a key or a set of keys is returned.
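
The four subfunctions can be pictured as a small pipeline. The sketch below is only a toy under stated assumptions: the noise-word list, the nickname table and the vowel-dropping phonetic rule are all invented for illustration and are not the rules NameSearch® actually applies.

```python
import re

NOISE_WORDS = {"THE", "AND", "OF", "CO", "COMPANY", "INC"}   # assumed noise list
NICKNAMES = {"BOB": "ROBERT", "BILL": "WILLIAM"}              # assumed nickname table

def sanitize(text):
    """Upper-case the input and turn anything that is not A-Z or a space into a space."""
    return re.sub(r"[^A-Z ]", " ", text.upper())

def recognize_words(text):
    """Split into words, drop noise words and expand known nicknames."""
    words = [w for w in text.split() if w not in NOISE_WORDS]
    return [NICKNAMES.get(w, w) for w in words]

def phonetic_token(word):
    """Crude stand-in for a phonetic code: keep the first letter, skip later
    vowels (plus Y, H, W), and skip a letter equal to the last one kept."""
    token = word[0]
    for ch in word[1:]:
        if ch in "AEIOUYHW":
            continue
        if ch != token[-1]:
            token += ch
    return token

def build_key(name):
    """Key production: join the phonetic tokens in input order."""
    tokens = [phonetic_token(w) for w in recognize_words(sanitize(name))]
    return " ".join(tokens)

key_a = build_key("Bob Smith & Co.")
key_b = build_key("Robert Smyth")
```

With these toy rules, both inputs reduce to the same key, "RBRT SMT", which is the kind of variation (nickname, spelling, noise words, punctuation) the real subfunctions are meant to absorb.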
Search keys are built after sanitization, word pattern recognition
and phonetic tokenization. Every database record must contain or be indexed
by at least one NameSearch® key. A key loading utility must be written
to populate an index or database. The utility will sequentially read
records, pass the names to the NameSearch® key building function and store
the returned keys.
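
A key loading utility of this kind might look like the sketch below. It assumes an SQLite table as the index and uses a placeholder build_keys function in place of the real key building call; the table name, column names and sample records are hypothetical.

```python
import sqlite3

def build_keys(name):
    """Hypothetical stand-in for the NameSearch® key building function.
    Here it simply upper-cases the name and normalizes spacing."""
    return [" ".join(name.upper().split())]

def load_keys(conn, records):
    """Read records sequentially, build keys for each name and store them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS name_keys (search_key TEXT, record_id INTEGER)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_key ON name_keys (search_key)")
    for record_id, name in records:
        for key in build_keys(name):
            conn.execute("INSERT INTO name_keys VALUES (?, ?)", (key, record_id))
    conn.commit()

conn = sqlite3.connect(":memory:")
load_keys(conn, [(1, "Frank  Lee"), (2, "Kemper Insurance Company")])
rows = conn.execute(
    "SELECT search_key, record_id FROM name_keys ORDER BY record_id"
).fetchall()
```

Because a name may yield more than one key, the keys are stored in their own table that points back to the record rather than in a single column on the record itself.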
Many search problems are caused by sequence variations.
The inability to determine the order of words for a particular entity
occurs at both data entry and inquiry time. The name Frank Lee, for example,
could have been entered as Lee Frank. This problem is particularly pervasive in
company names. Names such as International Business Machines, Andersen
Consulting and Kemper Insurance Company are examples where the left-most
word is most significant. Conversely, Edward S. Gordan Real Estate Company
and Paul Mitchell hair products are examples where the left-most word
is less significant. Because the position of the most significant word
cannot be predicted, many searches fail.
Merging foreign database files causes other sequence variations.
This frequently occurs when external lists are purchased or companies
consolidate information. Inconsistent methodologies for data capture
make the standardization of name fields impossible. Aggravating
the sequence problem are those instances in which company names are
intermixed with personal names. All of these factors, in addition to human error,
contribute to identification problems caused by sequence variations.
NameSearch® provides a facility for handling these problems. A set of permuted
keys is returned by the call to the key building function. To solve
search problems caused by sequence variation, the permuted keys are
used to index your database.
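
To make the idea of permuted keys concrete, the sketch below generates one key per word ordering. How NameSearch® actually chooses its permutations is not stated here, so the full-permutation approach, the max_words cap and the function name permuted_keys are assumptions made for illustration.

```python
from itertools import permutations

def permuted_keys(tokens, max_words=3):
    """Return one key per ordering of the first max_words tokens,
    with duplicate keys removed and first-seen order preserved."""
    head = tokens[:max_words]
    keys = []
    for order in permutations(head):
        key = " ".join(order)
        if key not in keys:
            keys.append(key)
    return keys

two_word_keys = permuted_keys(["FRANK", "LEE"])
three_word_keys = permuted_keys(["FRANK", "LEE", "RAY"])
```

The cap matters because the number of orderings grows factorially: two words give two keys, three words give six, and longer names would quickly dominate the index without a limit.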
To understand how these keys are used, we will draw an analogy
between a telephone book and a database system. When we look for Frank
Lee we search the "L" section. If the name is not there, we
continue the search by looking in the "F" section. In order
to find Frank Lee we have to search two separate sections of the phone
book. Suppose we are looking for Frank Lee Ray. To ensure success we
must search all the permutations. This is an extremely arduous and
time-consuming process for both people and computers.
By listing Frank Lee in both the L and F sections, only one section
needs to be searched, regardless of the order in which the name is given.
The one disadvantage of storing multiple listings is the expense of storage.
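
The telephone book analogy can be sketched directly. The code below is a toy model, not part of the product: the book is a dictionary of sections keyed by first letter, and the names add_listing and look_up are hypothetical.

```python
from collections import defaultdict

def add_listing(book, name):
    """File the name once under the first letter of each of its words."""
    for word in name.upper().split():
        book[word[0]].add(name)

def look_up(book, query):
    """Search only the section for the query's first word and return
    listings made of the same words, in any order."""
    words = query.upper().split()
    if not words:
        return []
    section = book.get(words[0][0], set())
    wanted = sorted(words)
    return [n for n in sorted(section) if sorted(n.upper().split()) == wanted]

book = defaultdict(set)
add_listing(book, "Frank Lee")
add_listing(book, "Frank Lee Ray")

by_surname_first = look_up(book, "Lee Frank")    # probes only the L section
by_given_first = look_up(book, "Frank Lee")      # probes only the F section
total_listings = sum(len(section) for section in book.values())
```

Either word order finds the listing with a single section probe, while the two names occupy five listings in total, which is the storage cost the multiple-listing approach trades for faster lookups.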