A Common Problem
The data you have within the business may not be in a format that
is adequate for your needs. The data may have a mixed background coming
from historical systems, leads supplied by a business partner or information
drawn from other systems in the business.
A common problem is the name of customers or prospects in a single
field. Take a look at the sample list of contact names shown below:
- Dr Ian Lawrence Manning PhD
- Miss Jane Manning
- Dr De La Clusa
- John Smith
- Ms Rosie Parks PhD
- Ms Jane Rosie Johnson
- Miss Van Smith MD
- Dr Rosie Molly Van Percie
- Prof John Smith FRSC
- Smith
- Dr Paulo
- Rev Van Smith
- Doctor John O'Hare
- Baron Smith-Johnson
- Dr Ian Lawrence Tom Smith DPhil
To use this data, for example in a marketing campaign, you need to
split it into the constituent parts. This is required because the
address at the top of a letter may start with:
Dr Ian Manning
but the greeting should be:
Dear Dr Manning or Dear Ian
This splitting of the contact names is easily done by a human. It
is clear that the first few lines should be divided up as follows:
Title |
First Name |
Second Name |
Surname |
Post Nominal Letters |
Dr |
Ian |
Lawrence |
Manning |
PhD |
Miss |
Jane |
|
Manning |
|
Dr |
|
|
De La Clusa |
|
|
John |
|
Smith |
|
The logic may be obvious, but I will spend a few moments developing
an understanding of why we humans can do this so quickly. The steps
we take are as follows:
- See if the first field is a title. We compare with our mental
list of allowed values.
- See if the last field is a set of letters like MD, PhD, BA and
so on. These are actually called Post Nominal Letters. We compare
with a mental list of allowed values and also with some intelligent
guesses. For example PGeo is a valid Post Nominal Letter. You were
unlikely to know that, but you knew it would be a strange Surname
so would assume it is the former.
- Next split out the Surname for example recognising De La Clusa
as a surname. More accurately recognising De La as a valid start
to a multi-word surname.
- Decide that if two names remain that they will be first name and
last name and in that order.
We made a few assumptions but there is a very good chance we would
get most of the rows correctly split up.
If there are a lot of rows to process we can’t afford to do
it by hand so the big question is can we get a computer to do this
for us? I think the answer is yes. Furthermore we can examine the
resulting data and run a number of checks to reject rows with problems.
The human need then only review these few rows. In the next part of
this paper I will describe how to do this and go on to show some screens
shots from a working solution.
The Solution
The software will follow the same logic used by a human when faced
with this problem. The key features are:
- Provide a list of allowed Titles, Family Name Prefixes (e.g De
La, Van, De), and Post Nominal Letters. We also need to provide
a tool to maintain these lists.
- Provide a configuration tool to specify how many name fields and
which names fields are potentially in the data and if one or more
are missing, the order of priority they appear in.
- For each row extract any matching the allowed Titles.
- For each row extract any matching the allowed Post Nominal Letters.
- For each row extract any Family Name Prefixes.
- Divide the remaining fields into Last Name, First Name and Second
Name(s).
Having processed the data it would be sensible to run some checks
and identify any rows with potential problems. A number of tests can
be made:
- Check the Surname for format. Surnames start with capitals and
include only certain characters. The software can check for the
format allowing for example Manning and O'Hare, but not 3Maning
or OHare.
- First and Second Names obey similar, but different rules.
- Next you may want to reject rows with, for example, no Title.
You do not want to send a letter starting with Dear Smith.
Assessing the results
So does this approach work and how reliable is it? Our tests have
shown some real measure of success, but the level of success will
depend on the quality of the data. This approach relies on the fields
being in a consistent order. If the surname and first name are swapped
in the input stream it will be swapped in the output and the software
will not be able to identify this unless the subsequent tests identify
a problem.
We have a number of ideas on how to develop the logic, for example
the solving swapped order problem described above, but we need to
see more real data from real world problems. If you have a tricky
problem please contact us as we are very interested to test our approach
and enhance the approach.
The Author:
Ian Manning, a physicist, has spent a number of years working in
research, in particular building large experiments at CERN and Fermilab
handling very large volumes of data. Finding ways to make computers
match the superb analytical skills of the human mind has long been
an interest of his.
Baycastle Software Ltd uses the latest Microsoft® technology
to build 1st class software tools and utilities for real businesses.
Baycastle products are simple to use and give our customers access
to levels of productivity never seen before. DataSlave™ is
a market-leading tool for loading, checking, transforming and migrating
data.
www.baycastle.co.uk