The data held within your business may not be in a format that suits your needs. It may have mixed origins: historical systems, leads supplied by a business partner, or information drawn from other systems in the business.
A common problem is that the full name of a customer or prospect is held in a single field. Take a look at the sample list of contact names shown below:
To use this data, for example in a marketing campaign, you need to split it into the constituent parts. This is required because the address at the top of a letter may start with:
Dr Ian Manning
but the greeting should be:
Dear Dr Manning or Dear Ian
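To make the difference concrete, here is a minimal Python sketch that builds both lines from the separated parts. The function names `address_line` and `greeting`, the `formal` flag and the "Dear Sir or Madam" fallback are assumptions made for this illustration only.

```python
def address_line(title, first_name, surname):
    """Build the name line for the top of a letter, skipping empty parts."""
    return " ".join(p for p in (title, first_name, surname) if p)


def greeting(title, first_name, surname, formal=True):
    """Build the salutation; formal uses title and surname, informal the first name."""
    if formal and title and surname:
        return f"Dear {title} {surname}"
    if first_name:
        return f"Dear {first_name}"
    # Assumed fallback when there is not enough information to personalise.
    return "Dear Sir or Madam"


print(address_line("Dr", "Ian", "Manning"))           # Dr Ian Manning
print(greeting("Dr", "Ian", "Manning"))               # Dear Dr Manning
print(greeting("Dr", "Ian", "Manning", formal=False)) # Dear Ian
```

Neither function can work from a single combined field, which is why the name has to be split first.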
This splitting of the contact names is easily done by a human. It is clear that the first few lines should be divided up as follows:
|Title||First Name||Second Name||Surname||Post Nominal Letters|
|Dr||De La Clusa|
The logic may be obvious, but I will spend a few moments developing an understanding of why we humans can do this so quickly. The steps we take are as follows:
We make a few assumptions, but there is a very good chance that we would split most of the rows correctly.
If there are a lot of rows to process, we cannot afford to do it by hand, so the big question is: can we get a computer to do this for us? I think the answer is yes. Furthermore, we can examine the resulting data and run a number of checks to reject rows with problems. A human then needs to review only these few rows. In the next part of this paper I will describe how to do this and go on to show some screenshots from a working solution.
The software will follow the same logic used by a human when faced with this problem. The key features are:
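As a rough illustration, the Python sketch below applies that kind of human reasoning to a single name. The lookup lists (`TITLES`, `POST_NOMINALS`, `SURNAME_PREFIXES`), the function name `split_name` and the rules themselves are assumptions made for this example. They are not the working solution described in this paper, and real lists would need to be far longer.

```python
# Hypothetical lookup lists, kept short for illustration.
TITLES = {"mr", "mrs", "ms", "miss", "dr", "prof", "sir", "rev"}
POST_NOMINALS = {"phd", "mba", "bsc", "msc", "obe", "mbe"}
SURNAME_PREFIXES = {"de", "la", "van", "von", "der", "del", "di", "le"}


def split_name(raw):
    """Split one contact name, assuming the order:
    title, first name, second name(s), surname, post-nominal letters."""
    tokens = raw.replace(",", " ").split()
    parts = {"title": "", "first_name": "", "second_name": "",
             "surname": "", "post_nominals": ""}

    # A leading word found in the title list is taken as the title.
    if tokens and tokens[0].rstrip(".").lower() in TITLES:
        parts["title"] = tokens.pop(0).rstrip(".")

    # Trailing words found in the post-nominal list are peeled off the end.
    trailing = []
    while len(tokens) > 1 and tokens[-1].replace(".", "").lower() in POST_NOMINALS:
        trailing.insert(0, tokens.pop())
    parts["post_nominals"] = " ".join(trailing)

    if not tokens:
        return parts

    # The last remaining word is the surname; prefix words such as
    # "De La" immediately before it are attached to it.
    surname = [tokens.pop()]
    while tokens and tokens[-1].lower() in SURNAME_PREFIXES:
        surname.insert(0, tokens.pop())
    parts["surname"] = " ".join(surname)

    # Whatever is left is the first name followed by any second names.
    if tokens:
        parts["first_name"] = tokens.pop(0)
    parts["second_name"] = " ".join(tokens)
    return parts


print(split_name("Dr Ian Manning"))
print(split_name("Dr De La Clusa"))
```

With these assumed lists, "Dr Ian Manning" yields the title "Dr", first name "Ian" and surname "Manning", while "Dr De La Clusa" yields the title "Dr" and the surname "De La Clusa" with no first name. The sketch depends on the parts arriving in a consistent order.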
Having processed the data it would be sensible to run some checks and identify any rows with potential problems. A number of tests can be made:
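By way of example, a few such checks could be sketched in Python as follows. The function name `check_row`, the dictionary keys and the specific rules are assumptions chosen for illustration. The tests in a real solution would depend on the data being processed.

```python
def check_row(parts):
    """Return a list of reasons why a split name should be reviewed by a human.
    An empty list means no problem was detected by these assumed rules."""
    problems = []

    # Every contact is expected to have a surname.
    if not parts.get("surname"):
        problems.append("surname is empty")

    # A greeting needs either a title or a first name.
    if not parts.get("title") and not parts.get("first_name"):
        problems.append("neither title nor first name found")

    # Name fields should contain only letters and a few punctuation marks.
    for field in ("first_name", "second_name", "surname"):
        value = parts.get(field, "")
        if any(not (ch.isalpha() or ch in " .'-") for ch in value):
            problems.append(f"unexpected characters in {field}")

    # Many second names may indicate that several names were run together.
    if len(parts.get("second_name", "").split()) > 2:
        problems.append("unusually many second names")

    return problems


row = {"title": "Mr", "first_name": "J", "second_name": "", "surname": "Sm1th"}
print(check_row(row))  # ['unexpected characters in surname']
```

Rows that return an empty list pass straight through, and only the remainder need to be looked at by hand.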
So does this approach work, and how reliable is it? Our tests have shown a real measure of success, but the level of success depends on the quality of the data. The approach relies on the fields being in a consistent order. If the surname and first name are swapped in the input, they will also be swapped in the output, and the software will not detect this unless one of the subsequent tests identifies a problem.
We have a number of ideas on how to develop the logic, for example to solve the swapped-order problem described above, but we need to see more real data from real-world problems. If you have a tricky problem, please contact us, as we are very interested in testing and enhancing our approach.