databasedev.co.uk - database solutions and downloads for microsoft access

Resolving Contact Names into their Constituent Parts

A Common Problem

The data you have within the business may not be in a format that is adequate for your needs. The data may have a mixed background coming from historical systems, leads supplied by a business partner or information drawn from other systems in the business.

A common problem is the name of customers or prospects in a single field. Take a look at the sample list of contact names shown below:

  • Dr Ian Lawrence Manning PhD
  • Miss Jane Manning
  • Dr De La Clusa
  • John Smith
  • Ms Rosie Parks PhD
  • Ms Jane Rosie Johnson
  • Miss Van Smith MD
  • Dr Rosie Molly Van Percie
  • Prof John Smith FRSC
  • Smith
  • Dr Paulo
  • Rev Van Smith
  • Doctor John O'Hare
  • Baron Smith-Johnson
  • Dr Ian Lawrence Tom Smith DPhil

To use this data, for example in a marketing campaign, you need to split it into the constituent parts. This is required because the address at the top of a letter may start with:

Dr Ian Manning

but the greeting should be:

Dear Dr Manning or Dear Ian

This splitting of the contact names is easily done by a human. It is clear that the first few lines should be divided up as follows:

Title First Name Second Name Surname Post Nominal Letters
Dr Ian Lawrence Manning PhD
Miss Jane Manning
Dr De La Clusa
  John Smith

The logic may be obvious, but I will spend a few moments developing an understanding of why we humans can do this so quickly. The steps we take are as follows:

  • See if the first field is a title. We compare with our mental list of allowed values.
  • See if the last field is a set of letters like MD, PhD, BA and so on. These are actually called Post Nominal Letters. We compare with a mental list of allowed values and also with some intelligent guesses. For example PGeo is a valid Post Nominal Letter. You were unlikely to know that, but you knew it would be a strange Surname so would assume it is the former.
  • Next split out the Surname for example recognising De La Clusa as a surname. More accurately recognising De La as a valid start to a multi-word surname.
  • Decide that if two names remain that they will be first name and last name and in that order.

We made a few assumptions but there is a very good chance we would get most of the rows correctly split up.

If there are a lot of rows to process we can’t afford to do it by hand so the big question is can we get a computer to do this for us? I think the answer is yes. Furthermore we can examine the resulting data and run a number of checks to reject rows with problems. The human need then only review these few rows. In the next part of this paper I will describe how to do this and go on to show some screens shots from a working solution.

The Solution

The software will follow the same logic used by a human when faced with this problem. The key features are:

  • Provide a list of allowed Titles, Family Name Prefixes (e.g De La, Van, De), and Post Nominal Letters. We also need to provide a tool to maintain these lists.
  • Provide a configuration tool to specify how many name fields and which names fields are potentially in the data and if one or more are missing, the order of priority they appear in.
  • For each row extract any matching the allowed Titles.
  • For each row extract any matching the allowed Post Nominal Letters.
  • For each row extract any Family Name Prefixes.
  • Divide the remaining fields into Last Name, First Name and Second Name(s).

Having processed the data it would be sensible to run some checks and identify any rows with potential problems. A number of tests can be made:

  • Check the Surname for format. Surnames start with capitals and include only certain characters. The software can check for the format allowing for example Manning and O'Hare, but not 3Maning or OHare.
  • First and Second Names obey similar, but different rules.
  • Next you may want to reject rows with, for example, no Title. You do not want to send a letter starting with Dear Smith.

Assessing the results

So does this approach work and how reliable is it? Our tests have shown some real measure of success, but the level of success will depend on the quality of the data. This approach relies on the fields being in a consistent order. If the surname and first name are swapped in the input stream it will be swapped in the output and the software will not be able to identify this unless the subsequent tests identify a problem.

We have a number of ideas on how to develop the logic, for example the solving swapped order problem described above, but we need to see more real data from real world problems. If you have a tricky problem please contact us as we are very interested to test our approach and enhance the approach.

The Author:

Ian Manning, a physicist, has spent a number of years working in research, in particular building large experiments at CERN and Fermilab handling very large volumes of data. Finding ways to make computers match the superb analytical skills of the human mind has long been an interest of his.

Baycastle Software Ltd uses the latest Microsoft® technology to build 1st class software tools and utilities for real businesses. Baycastle products are simple to use and give our customers access to levels of productivity never seen before. DataSlave™ is a market-leading tool for loading, checking, transforming and migrating data.

www.baycastle.co.uk