X4DB data quality framework
"Typical big-business database has grown a hundredfold in size in the past five years"



Article published in DM Direct Newsletter
By Nancy Rybeck

As corporate IT departments across the planet struggle to implement customer relationship management (CRM), business intelligence (BI) and data warehousing (DW), trade magazines and convention presentations echo the same painful stories. Years and dollars later, the systems are in place but the results are not worth the effort. The most common underlying reason - poor data quality.

Quite frankly, most sane people don't find cleansing data anymore fun than cleaning the toilet. When given the time, analysts are happy to look at data to find patterns of error, programmers are happy to code validations when they are requested and users try to enter things correctly, but mistakes still happen. Only the most compulsive among us will take the time to research a data problem, determine if and where bad data elements might be in use and find a way to correct it. So, clean data moves in and out of the system smoothly, but dirty data hangs around like laundry on a teenager's floor.

System implementation projects are planned without allocating enough time to cleanse dirty data. Often the existence of dirty data is acknowledged, and cleanup may be attempted at some level. Most often though, most dirty data is converted, loading new systems with all the old data problems, soon to be joined by a whole crop of new data anomalies. New operational systems simply create bad data faster. Often, CRM systems sitting on top of them are, at best, no improvement over old systems and, at worst, they are dismal, expensive failures.

Industry experts and vendors alike were quick to see this trend, publicize it and tout solutions. Most articles and presentations focus on one or two aspects of data quality. I thought it might be helpful to attempt to pull together both the most common lessons learned, as well as some rules to use as a starting point for those just beginning or those beginning again. I have been deeply imbedded in these activities, and I've had lots of opportunity to talk with others doing similar things. Here's a summary of the most common observations.

Key to Success

The key to successful customer CRM, BI and data warehousing (DW) ventures is data quality. The most elegant system in the world will fail without it. The simplest systems can deliver amazing results if they operate on clean data.

Cleaning is Ongoing

To achieve a high level of data quality requires three efforts: initial cleansing, error prevention and ongoing cleansing. Of these, the most important is ongoing cleansing. Without it, your shiny new system will gradually corrode and fall into ruin.

Unfortunately, most efforts focus on initial cleanup and then die out. Imagine what your house would look like if you only cleaned it when you moved in.

Integrate Quality Efforts

Data quality must be integrated into business process. You can expect users to have the desire to maintain high quality, and they will have tolerance for some measure of additional effort to do so. How much tolerance will depend on your organization's ability to demonstrate the benefits of CRM, BI and/or DW to them, and your ability to provide online validations to prevent mistakes.

Unfortunately, you should not expect your IT department to embrace this. As deadlines approach, quality slips silently off the radar. If left to IT, data quality will lay there in the dust (along with documentation tasks) unless a special effort is made to rescue it. If asked, most IT folks believe this responsibility lies entirely with the user community and requires no effort from them.

Data Analysts and Administrators

Most people will agree that data quality is important; however, people who are truly passionate about data quality are rare. I usually hear them described in less complimentary terms, but I've chosen "rare" because I haven't met many of them. I am one who has been masquerading as a data warehouse architect for some time. The most successful data quality efforts are managed by one of these people.


To achieve a sustained high data quality level requires tools and time. The complexity of these tools and the amount of time required to keep data clean is consistently and significantly underestimated by everyone except those whose passion is data quality. If you are dealing with international data, the complexity is nearly incomprehensible for any one individual.

Manage Expectations

No matter what you do or how well you do it, perfection is not attainable. Managing this expectation across IT management, and to a lesser extent across the user community, is critical. Each pass made to the data results in successive approximations of perfection - if your dataset was static, you might achieve near perfection. In real life, if you work really hard and have a bit of luck, you can hope to maintain a state where you can clean data slightly faster than it gets screwed up.

Given this somewhat irreverent and admittedly dismal view of the data quality world, how would you start? There are lots of consultants and vendors who will be happy to help you. There may be project managers in your organization who have some ideas as well. But the reality is that your data is unique to your organization and the way you handle it should be as well. The deck is stacked against you - there is no "one size fits all" solution; if there were, you probably wouldn't have a data quality problem in the first place.

In addition to the observations I've made, I have come up with some rules we wish someone had given us before we started. For those of you who like numbers, I've attempted to quantify a few of these things. Much of my data quality experience has involved company and person names and addresses, and they make good examples since nearly everyone has customers. I considered calling these guidelines, and you may consider them that way, as long as you realize that there is a logarithmic relationship between breaking them and future pain points.


Establish standard data values. Whenever possible, use someone else's standards. Examples: ISOC, SIC, NAICS or already existing company values all work.

Standard data values must be pure. For the database architects and DBAs among you, apply the same principles to the data that you do to table attributes or fields in files. Don't use one field to hold multiple values. For example, if your operational system contains different types of entities such as customers, people, vendors, prospects, intercompany locations or competitors all in a single master database, you will probably have some standard code that is linked to a record to categorize it. Make sure that you design your database (or implement your purchased application) such that a single entity falling into more than one category does not require that you create a combination code or to duplicate data. If you sell to a company that also sells to you, your system must support the ability for two codes to be associated to the record for the company. Otherwise, you will either have to create a new code for those companies who are "customers and vendors" or you will have to create duplicate records, each with their own code. In either case, you have created a maintenance and data quality headache.

International Data

If you have data being entered from international locations, make sure your database supports all the character sets you will need and that users can view the data correctly.

Associated with this issue, you will also need to decide if your data will be stored in mixed case or all upper case. Be aware that if you chose to store it in all upper case for ease of processing and you intend to use it to create mailings or imbed data in e-mails, you will need to have a way to convert it back to mixed case. This is reasonably uncomplicated in English but not so in other languages. For example, in French upper case, accents can be left off by convention, but in lower case, leaving an accent off makes the word incorrect.

A Place for Everything

The database should contain fields for all the different types of data. For example, a state field is common, but it won't hold provinces, districts and counties for countries that need more than one of them.

Print formats will differ between countries, so stuffing multiple data elements into a single field so that your standard address print routines will produce a pretty address will limit the ability to segregate data by those values. For example, in Italy and many other countries, the postal code precedes the city name. If you enter these both in the City field so that standard North American address printing routines will format them correctly, you can no longer use City as a selection for reporting, and it cannot be validated by list.

If you try to put a combination of province and district in the state field, the field cannot be validated because all countries don't require a state or province, and you won't be able to easily standardize them. Doing this will introduce error and variances during data entry and will insure that you will have duplicate records over time when an address doesn't appear as expected. Be aware that while a two character abbreviation of the state or province is the rule in North America, not so in the rest of the world. For example, the field you use to hold provinces must be long enough to hold Xinjiang Uygur, if you have Chinese data in your database.

Make sure that you can validate multiple sets of data in a field. If you decide to put states and provinces in the same field, both the U.S. and Italy have a state abbreviation of MI. Make sure that the validation and descriptions are associated with the country - state/province combination, not just the state/province.

Data Owners

Data should be maintained by someone native to, or expert in, the language and standards of the country. People tend to try to force data into a familiar format. For addresses, this almost always results in address components being altered or dropped, misspellings and misrepresentations. This is the single largest reason for duplicated customer records in international databases. For example, STRASSE in Germany may be abbreviated as STR. But it is not valid to abbreviate it as ST, as in North America. In Quebec, Canada, the suite number can precede the street number, separated by a comma. Americans not aware of this will try to "fix" or translate these addresses. 24, 123 Rue Blanc becomes 24123 Rue Blanc or 24123 Blanc St. If it were to be translated to English, it should be 123 Blanc St, Ste 24.


Your systems must support merging of two records and appropriate movement of all associated data without negative impact to operations. If you can't eliminate duplicate records from your database, your data quality will decrease over time and you will not be able to realistically pull customer activity together.

If you are pulling together multiple data sources that will contain overlapping records, the total number of records in the database exceeds 25,000 and there are new records coming in on a regular basis, you will need an automated address cleansing tool to cleanse and standardize data. You will also need a tool to identify duplicates and potential duplicates and to apply changes to support merges. A common example is of a company with multiple divisions, each with its own customer master, all feeding data to a sales data warehouse. When the divisions have overlap in their customer base, by design there are duplicates when they are pulled together. Manual review to merge them would be a time-consuming, tedious, repetitive and error-prone process that will introduce more errors than it corrects unless a highly trained expert is making the changes.

Protect Your Data

Do not trust outside data sources. Always put a review process in place - preferably a combination of automated validations and manual review. Sooner or later, this one will bite you if you blindly import data without subjecting it to the same scrutiny you put your own data under.

Entry Point Validation

If you have more than one data entry location or more than 100 addresses being searched and potentially added each day, integrate an online address clean and validation product into your data entry process. Ideally, it will standardize the address for the country it is in, and then compare the result against the existing addresses, returning a list of likely matches. The comparison will probably require yet another tool because writing the fuzzy logic matching routines would be fairly complex. However, if you employ a data quality tool for ongoing data scrubbing, it will probably handle address comparisons, and you can probably use it in an online environment as well.

Aged Data

Even if you have a good, "mailable" address, if it has been in your database for a long time, the company or person associated with it may not live there anymore. In establishing your business processes, build in ways to inactivate or delete out of date information, or to mark it as suspect.

If you make the commitment to improving your data quality, it will pay off in your ability to use CRM, BI and DW effectively. It isn't cheap, it isn't easy and it isn't fun, but your competitors are probably doing it. Remaining in business in the next decade depends on knowing a lot more about your customers and your business than ever before. The ability to pull together all the data from disparate sources, and to combine and interpret it into information depends on having clean data. Recognizing this and planning for it will go a long way towards ensuring that your company will be as competitive tomorrow as it is today.

Source: DM Review

© 2015 IT Dimensions, Inc. All Rights Reserved
If your DQ initiative failed, call us: 1-718-777-3710 | Astoria | New York | USA
Data Quality