Automating Data Cleaning and Validation with DQS and SSIS, Part 1
I have been doing some work with SQL Server Data Quality Services (DQS) in preparation for SQL Saturday in Salt Lake City (https://www.sqlsaturday.com/349/eventhome.aspx ). Part of my code demo involves validating addresses using MelissaData Address Check.
Address data is one of the classic drivers behind many master data management or data quality projects. Sending a bill to the wrong address is, after all, a waste of paper and postage and guarantees that you will not get paid. It is difficult to imagine an organization that does not need to use a mailing address. Even pure internet or social media companies – Goggle, Twitter, etc. – need to send out bills and track employee addresses. Maintaining the quality of address data is an ongoing processes that every business needs to engage in, so I will cleaning and verifying addresses found in the Adventure Works sample databases.
DQS has several advantages for the task of cleaning address data. It is easily integrated with MelissaData using the Reference Data functionality. The DQS client is very user friendly so that the business rules can be maintained by the business itself. In the past, even the simplest maintenance tasks required IT to run some script. If someone misspelled “Hackensack, NJ” the business would need to put in a service request of some sort and some IT worker would need to create an update statement –a long process to correct a simple typo.
DQS is easily automated within SSIS. One simply creates a DQS transform within a data flow task and the latest rules, created by the business using the DQS client, are applied to the data.
To work with DQS you will need to have both a server and a client installed. This is a simple, wizard driven install. You can read the details at http://msdn.microsoft.com/en-us/library/gg492277.aspx. The server is installed on your database server. Upon completion of the server install you should have three new databases, DQS_MAIN, DQS_PROJECTS and DQS_STAGING_DATA.
Those with a love of minutia may want to open these up and delve into the individual objects. For the rest of us, just make sure these databases are on a regular backup and maintenance schedule. These databases store all the rules and settings for all of the DQS work done in your organization. A great deal of work and otherwise undocumented knowledge can be lost in the event of a disaster.
For address verification, you will need to sign up for an Azure account. Once you have this account go to the store and “purchase” MelissaData Address Check, but this is a free service if you process less than 1000 addresses per month.
Using DQS for Address Cleaning
For this sample I took 25 records from the Adventure Works sample database. In the past I never really inspected the quality of this data. It was simply bits I used to demonstrate or learn technologies. As a sample database I expected this to be very clean data. Some of the issues I found, like addresses that do not, in fact, exist, I should have expected given a moment’s thought. After all, most folks would not want their real addresses in a sample database so widely used. Typos and spelling mistakes for place names I did not expect, however, so this proved to be a better source for this demo than expected.
Taking a close look at this sample data, notice a few items. While most of the addresses are in the United States or Canada, I deliberately put in a German address. MelissaData address verification only handles US and Canadian addresses so this record will fail verification. Similarly, “Corporate Office, El Segundo, CA” is obviously incomplete. You should also notice incomplete Canadian postal codes and a misspelling for the town of Pont-Rouge, Quebec – the “o” and “n” are transposed to form “Pnot-Rouge”.
Set Up the Knowledge Base
A knowledge base in DQS is a set of rules that are applied to clean data. The basic element of a knowledge base is a “Domain”, a collection of rules that are applied to a specific data element. For our source table we would set up domains around the address1, address2, city, state or province, postal code and country fields.
It should be emphasized here that the DQS client is a tool the business, specifically the data stewards, should be using, not IT staff. IT knows technology; the business knows the data. The DQS client allows the business to create the rules to clean and standardize data.
When the data steward opens the DQS client they will see sections for Knowledge Base Management, Data Quality Projects and Administration. The first step is to create a knowledge base, a collection of rules, used to ensure data quality.
When we automate the DQS transformation we will need to encapsulate a set of rules to clean the data, store the cleansed data temporally, then send the data to MelissaData for validation is a separate step. The automated data validation step sends the data out in its current state without applying any transformations, such as correcting typos. Simply sending data out for verification we know to be incorrect would be a waste, so our first step will be to create a knowledge base to clean the data before sending it to MelissaData. Consider this pre-soaking the dirty laundry before the wash cycle. Create a new knowledge base and assign it a name; here I have named it “Clean for Melissa”. Create a domain for each of the relevant fields as shown below.
As noted early, we know we have a typo in the city field, so we can create a domain rule for the city domain to correct this typo. A data steward can do this several ways. First the steward may simply pull in a sample of the data to be cleansed and review it manually. Once an invalid entry is found, they flag it and put in the correct value in the “Domain Values” tab.
Alternatively, the steward can use the “Term-Based Relations” tab to set up a conversion.
The knowledge base created in this step will clean the data before it is validated by MelissaData. In order to automate the processes of both cleaning the data and validating the data, the data steward will need to create a second, separate knowledge base for validation. In the DQS transform step the data will be sent for validation as it exists in the source system without applying any corrections. Therefore, we need to clean the data first then validate it, encapsulating each set of rules in a separate knowledge base. Save and publish the resulting knowledge base.
The data steward now creates a separate knowledge base. Here I have named it “MelissaData Demo”. As you can see below, the data steward again creates domains for each of the entities, but also creates a composite domain. While a domain is a collection of rules, a composite domain can be thought of as a collection of collections. The composite domain supports rules that are applied across all domains. Here I have created the “AddressComposit” and added all the individual domains to the composite domain.
Once the domains have been added to the composite, we then set up MelissaData address check for validation using the reference data tab.
The data steward would then relate each of the individual domains to the MelissaData RDS schema, as shown below.
Again, save and publish the knowledge base.
Note that, up until this point there has been no involvement by IT. The data steward has created a set of rules that are to be applied to the data on an ongoing basis. Each time the data steward modifies one of these domains and publishes the results, the updated rules are automatically applied to any DQS project or DQS transformation task is SSIS. In the past, creating these rules and keeping them up to date was a never ending project. It was frustrating for the users, because they could not get even the simplest rule changes made without code changes and testing. The old process was, at best, expensive in terms of IT resources and really boring for IT workers.
With this new process the work is divided in a far more logical fashion. The people who know the data create the rules that the data to which the data should comply. Those are the folks who know that Pnot-Rouge is not a town in Quebec, while Pont-Rouge is. In part two, I will look at the technical work for automating this process – the work of IT professionals.
Thanks for this great article. I love this.
For purposes of this post, I’ll assume you, fellow traveler, are familiar with the basics of SSIS. DQS is used as a transformation within a data flow task. To start, set up your connections to your source and target databases and set up your data source. In this case I set up an OLE DB source and, for the first data flow, pointed it at a demo table I set up holding address data to clean and validate.