Spelling Errors
Spelling and keyboard errors account for many of the variations in
a database. Through intelligent key building and advanced comparison
routines, NameSearch® overcomes spelling errors such as multiple typos,
letter transpositions and incomplete words.
Variations due to spelling only, as matched and scored by NameSearch®:
| Input: Richard Wagner |
| Rihcard Wagne | 097 |
| Ricard Waner | 097 |
| Richart Wagnar | 088 |
| Rickard Wackner | 085 |
| Ritchart Wagma | 080 |
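The scores above come from NameSearch®'s own comparison routines, which this page does not describe. As a rough illustration of how transpositions, typos and missing letters can be turned into a score, the sketch below uses the optimal string alignment distance (a restricted form of Damerau-Levenshtein) scaled to 0-100. The function names and the scaling are assumptions made for this example, and the numbers it produces will not match the table.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: counts insertions, deletions,
    substitutions and swaps of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def similarity_score(a: str, b: str) -> int:
    """Hypothetical 0-100 score; 100 means identical ignoring case."""
    a, b = a.upper(), b.upper()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - osa_distance(a, b) / longest))
```

With this scaling, "Rihcard Wagne" still scores higher against "Richard Wagner" than "Ritchart Wagma" does, which mirrors the ordering in the table even though the absolute values differ.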
Missing, extra, noise words
The rulebase is used to identify noise
words. Noise words are elements in a name that do not help in the
identification of a candidate. Examples of noise words are: Incorporated,
Corporation, Limited, Junior, Senior, Avenue and Street.
While processing the data, NameSearch® performs a step called
sanitization, which removes noise characters, extra spaces and
control characters, and converts lowercase letters to uppercase.
Examples of noise characters are: @, #, $, %, ^, &, *,
(, ), }, {, [, ]. Commas, hyphens and quotes are handled
separately because they have special meanings. Commas usually
mark a last name written ahead of the rest of the name, so
sanitization moves words followed by commas to the end of the
string. Quotes are deleted and the space between them is removed.
Hyphens are replaced by a space.
| Before Sanitization | After Sanitization |
| Scott Lions | SCOTT LIONS |
| Smith, John F. | JOHN F SMITH |
| Rose Stone-Shield | ROSE STONE SHIELD |
| James O'Tool | JAMES OTOOL |
| James O. Tool | JAMES OTOOL |
| Owen, Tool, James | JAMES OWEN TOOL |
| # Williams , $Richard | RICHARD WILLIAMS |
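The sanitization rules described above can be sketched in a few lines. This is only an approximation built from the prose and the table, not NameSearch®'s actual code. The function name `sanitize` is hypothetical, and treating periods as noise is an assumption inferred from the "Smith, John F." row. The sketch simply deletes quote characters, so it does not reproduce the "James O. Tool" row.

```python
import re

NOISE_CHARS = set("@#$%^&*(){}[]")
QUOTE_CHARS = set("'\"`")


def sanitize(raw: str) -> str:
    """Approximate the sanitization step described in the text."""
    text = raw.upper()
    # Control characters become spaces so neighbouring words stay apart.
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    # Noise characters and quotes are deleted; hyphens become spaces.
    text = "".join(ch for ch in text
                   if ch not in NOISE_CHARS and ch not in QUOTE_CHARS)
    text = text.replace("-", " ")
    # Assumption: periods are dropped as well ("John F." -> "JOHN F").
    text = text.replace(".", "")
    # Attach each comma to the word before it, then split on blanks.
    text = re.sub(r"\s*,\s*", ", ", text)
    kept, moved = [], []
    for token in text.split():
        if token.endswith(","):
            word = token.rstrip(",")
            if word:
                moved.append(word)   # words followed by a comma go last
        else:
            kept.append(token)
    return " ".join(kept + moved)
```

For instance, `sanitize("# Williams , $Richard")` returns `"RICHARD WILLIAMS"` and `sanitize("Owen, Tool, James")` returns `"JAMES OWEN TOOL"`, as in the table.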
The sanitization process also uses a small rulebase.
The rulebase is applied after all alphabetic characters have been
converted to uppercase and extra blanks have been removed. It
recognizes words that contain noise characters, and prefixes, that
could otherwise be affected by the sanitization process.
| Before Sanitization | After Sanitization | Sanitization (without rulebase expertise) |
| c\o | CARE OF | C O |
| Mc Donald, Old | OLD MCDONALD | MC OLD DONALD |
| % | CARE OF | |
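One way to picture this small rulebase is as a token-level pass that runs before noise characters and commas are processed. The sketch below is an assumption-laden illustration: the rule tables contain only the cases shown in the table above, and the names `TOKEN_RULES`, `JOINED_PREFIXES` and `pre_sanitize` are invented for the example.

```python
# Tokens that would be damaged by noise-character removal (from the table).
TOKEN_RULES = {"C\\O": "CARE OF", "%": "CARE OF"}
# Prefixes that should be glued to the following word (from the table).
JOINED_PREFIXES = {"MC"}


def pre_sanitize(raw: str) -> str:
    """Apply rulebase substitutions to an upper-cased, blank-collapsed string."""
    tokens = raw.upper().split()
    out = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in TOKEN_RULES:
            out.append(TOKEN_RULES[token])
        elif token in JOINED_PREFIXES and i + 1 < len(tokens):
            out.append(token + tokens[i + 1])  # "MC" + "DONALD," -> "MCDONALD,"
            i += 1
        else:
            out.append(token)
        i += 1
    return " ".join(out)
```

Here `pre_sanitize("Mc Donald, Old")` yields `"MCDONALD, OLD"`, so a later comma-handling pass moves the whole surname to the end and produces OLD MCDONALD rather than MC OLD DONALD.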
Rulebase expertise
The rulebase expert system is used to identify nicknames.
Names such as Bill, William, Bob and Robert are used interchangeably
to identify individuals. The rulebase is also used to identify noise
words. Noise words are elements in a name that do not help in the
identification of a candidate. Examples of noise words are: Incorporated,
Corporation, Limited, Junior, Senior, Avenue and Street.
Some elements in a name contribute to the identity but should be
treated as less important. In these cases, the rulebase does not
treat them as noise words but recognizes that they are less significant.
Some examples are: associate, board, international and services.
Other variations are caused by the use of common prefixes. Names like
McDonnell are confused with MacDonnell. Prefix recognition provides
the facility for handling these classes of problems.
The rulebase can also recognize diminutives. Names frequently end in a
diminutive such as "ie" or "y". In these cases, it is useful to
identify the root and then apply the rule. For example, you would
want Bill, Billie and Billy to find William or Willie.
| BILL YARA | WILLIAM YARA |
| BOBBY KENNEDY | ROBERT KENNEDY |
| JIM P PHILLIPS SR | JAMES P PHILLIPS |
| SMITH AND ASSOCIATES | SMITH |
| MCDONELL CORPORATION | MCDONELL |
| MR MATT J THOMAS | MATTHEW J THOMAS |
| MARINA DELSOLE | MARINA DEL SOLE |
| DR LEONARD MACCOY MD | LEONARD MCCOY |
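A toy version of such a rulebase can make the nickname, noise-word, prefix and diminutive rules concrete. Everything in the sketch below is an assumption made for illustration: the lookup tables hold only a handful of entries taken from the examples above, the titles and suffixes (MR, DR, MD, SR, JR) are inferred from the table rows, and the MAC-to-MC rule is far cruder than a production rulebase would be.

```python
NICKNAMES = {"BILL": "WILLIAM", "BOB": "ROBERT", "ROB": "ROBERT",
             "JIM": "JAMES", "MATT": "MATTHEW"}
NOISE_WORDS = {"INCORPORATED", "CORPORATION", "LIMITED", "JUNIOR", "SENIOR",
               "JR", "SR", "MR", "DR", "MD"}
DIMINUTIVE_ENDINGS = ("IE", "Y")


def resolve_nickname(word: str) -> str:
    """Map a nickname, or a diminutive of one, to its formal name."""
    if word in NICKNAMES:
        return NICKNAMES[word]
    for ending in DIMINUTIVE_ENDINGS:
        if word.endswith(ending):
            root = word[: -len(ending)]
            candidates = [root]
            # BOBBY -> BOBB -> BOB: also try the root without a doubled letter.
            if len(root) > 2 and root[-1] == root[-2]:
                candidates.append(root[:-1])
            for candidate in candidates:
                if candidate in NICKNAMES:
                    return NICKNAMES[candidate]
    return word


def normalize_prefix(word: str) -> str:
    """Crude prefix rule: treat a leading MAC like MC (MACCOY -> MCCOY)."""
    if word.startswith("MAC") and len(word) > 4:
        return "MC" + word[3:]
    return word


def apply_rulebase(name: str) -> str:
    words = [w for w in name.upper().split() if w not in NOISE_WORDS]
    return " ".join(normalize_prefix(resolve_nickname(w)) for w in words)
```

With these tables, Bill, Billie and Billy all resolve to WILLIAM, and `apply_rulebase("DR LEONARD MACCOY MD")` gives `"LEONARD MCCOY"`. Low-significance words such as "associate" would need a weighting mechanism rather than outright removal, which this sketch does not attempt.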
For example, the personal name rulebase helps match
Robert, Rob and Bobby. The street service gives less significance
to the words Road, Avenue and Blvd in the address. The company
names service ignores the words Corporation, Inc. and Corp:
| Name | Address | Company |
| Robert Wagner | 24 Milltown Road | Smith Corporation |
| Rob Wagner | 24 Milltown Avenue | Smith Inc. |
| Bobby Wagner | 24 Milltown Blvd | Smith Corp Inc. |
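Giving some words "less significance" can be pictured as a weighted word overlap. The sketch below is a guess at the general idea only: the weight of 0.2, the Jaccard-style formula and the name `weighted_overlap` are all assumptions, not values or methods taken from NameSearch®.

```python
# Hypothetical weights: street-type words count for much less than other words.
LOW_WEIGHT_WORDS = {"ROAD": 0.2, "AVENUE": 0.2, "BLVD": 0.2}


def weighted_overlap(a: str, b: str) -> float:
    """Weighted share of words common to both strings (0.0 to 1.0)."""
    words_a = set(a.upper().split())
    words_b = set(b.upper().split())

    def weight(word: str) -> float:
        return LOW_WEIGHT_WORDS.get(word, 1.0)

    total = sum(weight(w) for w in words_a | words_b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    shared = sum(weight(w) for w in words_a & words_b)
    return shared / total
```

Under these weights, "24 Milltown Road" and "24 Milltown Avenue" overlap far more strongly than "24 Milltown Road" and "24 Hilltown Road", because a mismatch in the street type costs little while a mismatch in the street name costs a full word.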
Phonetic Errors
Discrepancies caused by phonetic
errors account for 20-25% of all name variations.
Traditional solutions to name variations, such as Soundex
and NYSIIS, deal only with phonetic errors. These solutions
standardize easily confused sounds. For example, PH is treated
as F. Linguistic rules are used to phonetically tokenize a name,
and the tokenized words serve as the basis for name retrieval.
In some instances these rules helped find names that were hard
to spell. Unfortunately, the distribution pattern of common names
became even more skewed. For example, inquiries on John also
returned Joan, Jim, Jane, Jimmy, Jenn and other names that fell
in the "JAN" phonetic pattern. Aggravating the skew in the
distribution of names sacrificed both quality and performance.
NameSearch® addresses problems due to phonetics
by employing analysis routines that determine how much phonetic
tokenization to apply. This enables NameSearch® to overcome
phonetic problems without the negative consequences of the
traditional methods.
For example, the following variations are caught with the
help of phonetics:
| Name |
| Phillip Mac Affik |
| Filip Mkafic |
| Philip Mackaphik |
Examples of phonetic tokenization (taken directly from Robert L. Taft,
"Name Search Techniques", New York State Identification and
Intelligence System):
| 1) Translate first characters of name |
| | MAC => MCC |
| | PH => FF |
| | KN => NN |
| | K => C |
| | SCH => SSS |
| 2) Translate last characters of name |
| | EE => Y |
| | IE => Y |
| | DT,RT,RD,NT,ND => D |
| 3) First character of key = first character of name |
| 4) Translate remaining characters by following rules, incrementing by one character each time |
| | EV => AF else A,E,I,O,U => A |
| | Q => G |
| | Z => S |
| | M => N |
| | KN => N else K => C |
| | SCH => SSS |
| | PH => FF |
| | H => If previous or next is non vowel, previous |
| | W => If previous is vowel, previous |
| 5) Translate last characters of name |
| | If last character is S, remove it |
| | If last characters are AY, replace with Y |
| | If last character is A, remove it |
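The rules listed above translate fairly directly into code. The sketch below follows the table step by step. It adds one step that the table omits but that commonly published descriptions of NYSIIS include: a translated character is appended to the key only when it differs from the key's last character. The function name `nysiis_key` is an assumption, and no key-length truncation is applied.

```python
VOWELS = set("AEIOU")
PREFIX_RULES = (("MAC", "MCC"), ("KN", "NN"), ("K", "C"),
                ("PH", "FF"), ("SCH", "SSS"))
SUFFIX_RULES = (("EE", "Y"), ("IE", "Y"), ("DT", "D"), ("RT", "D"),
                ("RD", "D"), ("NT", "D"), ("ND", "D"))


def nysiis_key(name: str) -> str:
    s = "".join(ch for ch in name.upper() if ch.isalpha())
    if not s:
        return ""
    for old, new in PREFIX_RULES:            # step 1: first characters
        if s.startswith(old):
            s = new + s[len(old):]
            break
    for old, new in SUFFIX_RULES:            # step 2: last characters
        if s.endswith(old):
            s = s[: -len(old)] + new
            break
    key = s[0]                               # step 3: first character of key
    w = list(s)
    i = 1
    while i < len(w):                        # step 4: remaining characters
        rest = "".join(w[i:i + 3])
        prev = w[i - 1]
        nxt = w[i + 1] if i + 1 < len(w) else ""
        if rest.startswith("EV"):
            w[i:i + 2] = "AF"
        elif w[i] in VOWELS:
            w[i] = "A"
        elif w[i] == "Q":
            w[i] = "G"
        elif w[i] == "Z":
            w[i] = "S"
        elif w[i] == "M":
            w[i] = "N"
        elif rest.startswith("KN"):
            w[i:i + 2] = ["N"]
        elif w[i] == "K":
            w[i] = "C"
        elif rest.startswith("SCH"):
            w[i:i + 3] = "SSS"
        elif rest.startswith("PH"):
            w[i:i + 2] = "FF"
        elif w[i] == "H" and (prev not in VOWELS or nxt not in VOWELS):
            w[i] = prev
        elif w[i] == "W" and prev in VOWELS:
            w[i] = prev
        if w[i] != key[-1]:                  # skip repeats of the last key char
            key += w[i]
        i += 1
    if len(key) > 1 and key.endswith("S"):   # step 5: last characters of key
        key = key[:-1]
    if key.endswith("AY"):
        key = key[:-2] + "Y"
    if len(key) > 1 and key.endswith("A"):
        key = key[:-1]
    return key
```

Under this sketch, Phillip, Filip and Philip all produce the key FALAP, and Mac Affik, Mkafic and Mackaphik all produce MCAFAC. John, Joan, Jim, Jane and Jenn all collapse to JAN, which is the skew described earlier.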
Word Sequence Variations
Many search problems are caused by sequence variations. The order
of words for a particular entity is often uncertain, both at data
entry and at inquiry time. The name Frank Lee, for example,
could have been Lee Frank. This problem is particularly pervasive
in company names. Names such as International Business Machines,
Andersen Consulting and Kemper Insurance Company are examples
where the left-most word is most significant. Conversely, Edward
S. Gordon Real Estate Company and Paul Mitchell hair products
are examples where the left-most word is less significant. Because
the significant word cannot be predicted from its position,
many searches fail.
For example, the different permutations
of the words in the input are matched:
| Name | Address | Company |
| Ricky Scott Wagner | 24 West Jones Avenue | Jones and Smith Corporation |
| Scott Rick Wagner | 24 Jones Avenue West | Smith and Jones Corporation |
| Wagner Rick Scortt | 24 Avenue Jones West | Corporation Jones Smith |
Merging foreign database files causes other sequence
variations. This frequently occurs when external lists are
purchased or
companies consolidate information. Inconsistent methodologies
for data capture make the standardization of name fields impossible.
Aggravating the sequence problem are those instances in which
company names are intermixed with personal names. All of these
factors, in addition to human error, contribute to identification
problems caused by sequence variations. NameSearch® provides
a facility for handling these problems.
To understand this better, we will draw
an analogy between a telephone book and a database system.
When we look for Frank Lee, we search the "L" section. If the name
is not there, we continue the search by looking in the "F" section.
To find Frank Lee, we have to search two separate sections
of the phone book. Suppose we are looking for Frank Lee Ray.
To ensure success we must search all the permutations. This
is an extremely arduous and time-consuming process for both
people and computers. If Frank Lee is listed in both the L and
F sections, regardless of order, only one section needs
to be searched.
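The phone book analogy corresponds to indexing each name under every one of its words. The sketch below is a minimal illustration of that idea. The class `NameIndex` and its methods are hypothetical, and requiring an exact match on the set of words is a simplification chosen for the example.

```python
from collections import defaultdict


class NameIndex:
    """List every name under each of its words, like a phone book that
    prints Frank Lee in both the F and the L sections."""

    def __init__(self) -> None:
        self._sections = defaultdict(set)

    def add(self, name: str) -> None:
        for word in name.upper().split():
            self._sections[word].add(name)

    def find(self, query: str) -> set:
        words = query.upper().split()
        if not words:
            return set()
        wanted = set(words)
        # Only one section is read: the one for the query's first word.
        section = self._sections.get(words[0], set())
        return {name for name in section
                if set(name.upper().split()) == wanted}


index = NameIndex()
index.add("Frank Lee")
index.add("Frank Lee Ray")
```

After these two `add` calls, both `index.find("Lee Frank")` and `index.find("Frank Lee")` return `{"Frank Lee"}`, and any ordering of Ray, Lee and Frank returns `{"Frank Lee Ray"}`. The cost is extra storage at insert time in exchange for a single lookup at inquiry time.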
Acronym Recognition
Corporate name searching illustrates the practical
difficulty of developing solutions that find correct
information without missing likely candidates. People readily
understand the similarity between "Triple A towing" and
"AAA towing", yet a computerized system would need to employ
a knowledge-based algorithm to recognize the relationship
between Triple A and AAA.
For example, IST is recognized as an acronym for Intelligent
Search Technology:
| Company |
| Intelligent Search Technology |
| IST |
| IS Technology |
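Acronyms like the ones in this table can be recognized structurally, without a lexicon, by checking whether each token of the short form is either a full word or the initials of a run of consecutive words in the long form. The sketch below shows one such check. It is not NameSearch®'s algorithm, the function name `matches_acronym` is an assumption, and cases such as Triple A versus AAA would still need knowledge-based rules.

```python
def matches_acronym(short: str, full: str) -> bool:
    """True if every token of `short` is a word of `full`, or the initials
    of consecutive words of `full`, with all of `full` accounted for."""
    short_tokens = short.upper().split()
    full_tokens = full.upper().split()

    def walk(si: int, fi: int) -> bool:
        if si == len(short_tokens):
            return fi == len(full_tokens)
        token = short_tokens[si]
        # Case 1: the token is spelled out as the next full word.
        if (fi < len(full_tokens) and token == full_tokens[fi]
                and walk(si + 1, fi + 1)):
            return True
        # Case 2: the token is the initials of the next len(token) words.
        end = fi + len(token)
        if len(token) > 1 and end <= len(full_tokens):
            if all(ch == word[0]
                   for ch, word in zip(token, full_tokens[fi:end])):
                return walk(si + 1, end)
        return False

    return walk(0, 0)
```

Both `matches_acronym("IST", "Intelligent Search Technology")` and `matches_acronym("IS Technology", "Intelligent Search Technology")` return `True`, while an unrelated name such as "Kemper Insurance Company" does not match "IST".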
Deploying intelligence through knowledge-based
systems greatly benefits search and matching algorithms
by identifying nicknames, shortened forms,
noise words and other circumstances that require experience
in order to return a more comprehensive result set. However,
knowledge-based systems are limited by the breadth and depth of
their lexicon. Unlike well-known names such as IBM and
AT&T, the vast majority of acronyms lie outside the scope of knowledge
base processing. For example, our clients often used the IST acronym
interchangeably with Intelligent Search Technology, yet it would be
unreasonable to expect the inclusion of IST in a knowledge-based system.
The NameSearch® software, with its corporate search algorithms
and acronym recognition functionality, significantly advances the
ability to seek and match corporate name data.