Best Practices Series: Rules for Data Validation
Why should you care about Data Validation? Simply put, your decisions can only be as good as the data they apply to. Consequently, by improving the quality of the data you apply your decisions to, you will improve the quality of your decisions. There is inherently a strong bond between data and decisions. Our previous post highlights the importance of data in Decision Management. In this post, we will focus on strategies to improve Data Validation.
As a matter of fact, there are many forms of bad data: incorrect, missing, fraudulent, etc. Data Validation needs to address all these forms of problems you might encounter. Let’s take a deeper look at those.
First of all, let’s address incorrect data. Incorrect values might be as silly as entering “California” when the system expects “CA”. Has that happened to you and your systems? Given modern, decoupled architecture, I have seen this issue more often than I would have expected. In some other cases, the data has the right format, but it is populated in a different field than the one you expected. There are several aspects to incorrect data.
This is, in my mind, the first and foremost case of data validation, because of how pervasive this problem is. However, as obvious as it might seem, it is not always the easiest to get rid of. It takes rigor and patience to chase these ‘bad data’ occurrences.
In this instance, my recommendation is to use business rules to validate data before you hit your actual business rules. Validate that the data is complete and correct before you make your business decision.
Many systems that I have worked with have implemented such data validation rules. Not only can you check allowed values, but you can also apply patterns to ensure that email addresses look right. You can validate date chronology, number ranges, or string length. You can even check that zipcodes and identification numbers match the given state. Data validation rules can cover any scenario you can think of.
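To make these checks concrete, here is a minimal sketch in Python of a few such validation rules: a pattern check, a string length check, a number range check, and a date chronology check. The field names (`email`, `last_name`, `age`, `policy_start`, `policy_end`) and the limits are assumptions made up for illustration, and the email pattern only checks the overall shape of an address.

```python
import re
from datetime import date

# Deliberately simple pattern: it only checks the overall shape of an address.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_application(app: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the data passed."""
    errors = []

    # Pattern rule: the email must look like an email address.
    if not EMAIL_PATTERN.match(str(app.get("email", ""))):
        errors.append("email does not look like an email address")

    # String length rule (hypothetical limit of 50 characters).
    if not 0 < len(str(app.get("last_name", ""))) <= 50:
        errors.append("last_name must be 1 to 50 characters long")

    # Number range rule (hypothetical bounds).
    age = app.get("age")
    if not isinstance(age, int) or not 18 <= age <= 120:
        errors.append("age must be a whole number between 18 and 120")

    # Date chronology rule: the start date must come before the end date.
    start, end = app.get("policy_start"), app.get("policy_end")
    if start is None or end is None or start >= end:
        errors.append("policy_start must be before policy_end")

    return errors


application = {
    "email": "jane@example.com",
    "last_name": "Doe",
    "age": 34,
    "policy_start": date(2024, 1, 1),
    "policy_end": date(2025, 1, 1),
}
print(validate_application(application))  # prints []
```

Returning a list of errors, rather than stopping at the first one, lets the calling system report every problem in a single pass before the business decision runs.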
Business rules present the advantage that they can do something about the ‘mistake’:
- At the simplest level, you can check that the provided state is one of the 50 US states. Of course, you can include territories and military acronyms. That way, the system will report “California” as incorrect prior to processing your business decision.
- Ultimately, you can automatically translate known issues. If the systems you work with send you spelled-out state names, you can map “California” to “CA”, and proceed to your business decision.
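The two levels above can be sketched in a few lines of Python. This is only an illustration: the tables are abbreviated on purpose, and the names `VALID_STATE_CODES`, `KNOWN_ALIASES`, and `normalize_state` are hypothetical.

```python
# Abbreviated on purpose: a real table would list all 50 states, plus any
# territories and military codes you choose to accept.
VALID_STATE_CODES = {"CA", "NY", "TX", "WA"}

# Known variants sent by upstream systems, mapped to the expected code.
KNOWN_ALIASES = {
    "california": "CA",
    "new york": "NY",
    "texas": "TX",
    "washington": "WA",
}


def normalize_state(raw: str) -> str:
    """Return the two-letter code, translating known aliases.

    Raises ValueError when the value is neither a valid code nor a known alias.
    """
    value = raw.strip()
    if value.upper() in VALID_STATE_CODES:
        return value.upper()
    alias = KNOWN_ALIASES.get(value.lower())
    if alias is not None:
        return alias
    raise ValueError(f"unknown state: {raw!r}")


print(normalize_state("California"))  # prints CA
```

A value that is neither valid nor a known alias, such as a misspelled state name, still fails loudly, so it can be reported before the business decision is processed.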
Missing data is a related, but slightly different, issue. I single it out here as it is often a consequence of reflexive logic. Your system might require some data only in some circumstances. For example, information about your spouse would only make sense if you indicated being married.
Business rules, unlike a hard-coded service, can check different paths very clearly. Since business rules are by definition atomic pieces of logic, you do not have to worry about the spaghetti code that would follow those various paths. You only have to specify that ‘when family status is married’, you want to check that ‘all spouse info has been provided’. At that point, adding an exclusion per state, if one state does not allow you to require this information, is a piece of cake.
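As a rough sketch of that rule in Python, assuming hypothetical field names and a placeholder exclusion list (the state code "ZZ" is not a real state and only stands in for whichever state would apply):

```python
# Hypothetical field names for the spouse section of an application.
SPOUSE_FIELDS = ("spouse_first_name", "spouse_last_name", "spouse_date_of_birth")

# Placeholder exclusion list: "ZZ" is not a real state code.
STATES_EXEMPT_FROM_SPOUSE_INFO = {"ZZ"}


def missing_spouse_fields(app: dict) -> list[str]:
    """Return the spouse fields that are required but not provided."""
    # The rule only applies when family status is married.
    if app.get("family_status") != "married":
        return []
    # Per-state exclusion: do not require spouse info in exempt states.
    if app.get("state") in STATES_EXEMPT_FROM_SPOUSE_INFO:
        return []
    return [field for field in SPOUSE_FIELDS if not app.get(field)]


application = {"family_status": "married", "state": "CA", "spouse_first_name": "Sam"}
print(missing_spouse_fields(application))
# prints ['spouse_last_name', 'spouse_date_of_birth']
```

The state exclusion is a single extra condition on the rule, and the other paths are left untouched.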
Complex forms with a high level of reflexivity can also push this data validation to capture time. Dynamic questionnaires show reflexive questions only when they apply, removing the burden from the applicant. When answering ‘single’ to the family status question, the spouse questions remain hidden. By reducing the complexity up-front, you can ensure a higher level of completion.
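One simple way to model this, sketched here in Python under my own assumptions (the question ids and the `show_if` convention are invented for illustration, not taken from any particular product), is to attach a display condition to each question:

```python
# Each question may depend on the answer to an earlier question.
# "show_if" is either None (always shown) or a (question_id, expected_answer) pair.
QUESTIONS = [
    {"id": "family_status", "show_if": None},
    {"id": "spouse_first_name", "show_if": ("family_status", "married")},
    {"id": "spouse_last_name", "show_if": ("family_status", "married")},
    {"id": "home_state", "show_if": None},
]


def visible_questions(answers: dict) -> list[str]:
    """Return the ids of the questions to display, given the answers so far."""
    visible = []
    for question in QUESTIONS:
        condition = question["show_if"]
        if condition is None or answers.get(condition[0]) == condition[1]:
            visible.append(question["id"])
    return visible


print(visible_questions({"family_status": "single"}))
# prints ['family_status', 'home_state']
```

Answering ‘married’ instead would bring the two spouse questions into the list.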
Dynamic questionnaires and data validation rules are complementary solutions. They work particularly well together when data could come from various systems: online app and batch entries from partners for example.
Confusing questionnaires and complex data integration can lead to incorrect and missing data. That is what we have addressed so far. However, addressing this complexity will never eliminate cases of fraudulent data. If the applicant has bad intentions, answers might be well-formed but untruthful. Other techniques need to be explored.
One obvious solution is to corroborate the applicant’s answers with established information. You can, for example, compare those answers with data sources such as credit bureaus, departments of motor vehicles, and other such institutions.
In another scenario, the data may not be available in that form. Think about a payment transaction. It is unlikely that a report will tell you if that payment transaction is legitimate. Analytics can help though. Ranging from profiling to predictions, analytics offer several strategies.
Using trend analysis, systems can calculate in real time how a transaction compares to ‘typical transactions’ for this person, this zipcode, or this demographic group. Based on the deviation, the system can decide to let the transaction go through, route it to a manual process, or stop it. As a credit card holder, you have probably faced this issue at some point when traveling outside of your home geography. God knows it happened to me a lot!
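As an illustration only, here is a minimal Python sketch of that idea. It swaps in a plain z-score (the distance from the person’s average amount, measured in standard deviations) for whatever profiling a production system would actually use. The function name and both thresholds are assumptions.

```python
from statistics import mean, stdev


def route_transaction(amount: float, history: list[float],
                      review_at: float = 3.0, block_at: float = 6.0) -> str:
    """Return 'approve', 'review', or 'block' based on deviation from history.

    The thresholds are hypothetical, expressed in standard deviations.
    """
    # Without enough history there is no profile to compare against.
    if len(history) < 2:
        return "review"

    typical = mean(history)
    spread = stdev(history)
    if spread == 0:
        # All past amounts are identical: anything different is unusual.
        return "approve" if amount == typical else "review"

    deviation = abs(amount - typical) / spread
    if deviation >= block_at:
        return "block"
    if deviation >= review_at:
        return "review"
    return "approve"


past_amounts = [20.0, 25.0, 30.0, 22.0, 28.0]
print(route_transaction(27.0, past_amounts))   # prints approve
print(route_transaction(40.0, past_amounts))   # prints review
print(route_transaction(500.0, past_amounts))  # prints block
```

The same comparison could be run against the history for a zipcode or a demographic group instead of a single person. Only the `history` passed in would change.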
I have also worked with customers like PayPal on fraud detection using predictive analytics. Unlike profiling, predictive models can learn the specifics of a cyber-attack. Within hours, the system can push business rules that target it, and make it harmless. Evidently, predictive models require more information, which may or may not be available. When you have loads of good data, this is heaven. Unfortunately, just as bad data will make a bad decision, bad data will make a terrible prediction. And there we are again: chasing better data!
In the end, data validation can take different forms. Many tools in this toolbox will address different causes of bad data. Business rules in the form of validation rules, dynamic questionnaires, profiling rules, and predictive models are powerful solutions. Keep in mind, bad data = bad decisions. It is worth addressing it.