In our last article on data quality, we wrote about the three essential components of a data quality strategy: accuracy, completeness and consistency. These are a good starting point for organizations that have just begun their data quality journey. For organizations that are a little further along with their data quality plan, this article covers a few other components that can be added to the strategy, and the importance of monitoring data quality through a scoring methodology.
The Five Components of Data Quality
As mentioned earlier, it is not uncommon for companies to start their data quality journey with the three most essential components: accuracy, completeness and consistency. However, as companies mature in that journey, it is important to remember that a comprehensive data quality plan measures the following five components.
- Accuracy (is the information correct in every detail?) – Ask whether the data reflects a real-world situation. For example, can the enrolment date of a subject in a trial be in the future, or before the Informed Consent Signed Date?
- Completeness (is the information complete?) – For Site Address data, for example, the Pin Code must be complete for the address to be considered complete.
- Reliability or Consistency (is the information the same as in other trusted sources?) – A piece of information should not contradict the same piece of information in another system. The First Dose date coming out of your IVR (Interactive Voice Response) system should be the same as that in your CTMS (Clinical Trial Management System) or your EDC (Electronic Data Capture) system.
- Relevance (do you really need the information?) – It is important to confirm that the asset you are measuring is essential for your business, particularly in today’s world of increasing data storage costs.
- Timeliness (how up to date is the information?) – It is important to know how frequently the data asset is updated, so that businesses are not making decisions based on old data. Often, eCRF data that has been entered is not available in the database in real time, but only after a refresh that runs at a certain frequency.
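Four of the five components above can be expressed as simple automated checks (relevance is a business judgement rather than something to compute). The following is a minimal sketch only: the function names are hypothetical, the six-digit Pin Code format is an assumption, and the acceptable refresh lag is something each business would set for itself.

```python
from datetime import date, datetime, timedelta
from typing import Dict, Optional


def is_accurate_enrolment(enrolment: date, consent_signed: date, today: date) -> bool:
    """Accuracy: enrolment cannot be in the future or before informed consent."""
    return consent_signed <= enrolment <= today


def is_complete_pin_code(pin_code: Optional[str]) -> bool:
    """Completeness: assumes a complete Pin Code is exactly six digits."""
    if pin_code is None:
        return False
    cleaned = pin_code.strip()
    return len(cleaned) == 6 and cleaned.isdigit()


def is_consistent(first_dose_by_system: Dict[str, date]) -> bool:
    """Consistency: every source system reports the same First Dose date."""
    return len(set(first_dose_by_system.values())) <= 1


def is_timely(last_refreshed: datetime, now: datetime, max_age: timedelta) -> bool:
    """Timeliness: the data was refreshed within the agreed window."""
    return now - last_refreshed <= max_age


# Example: the IVR and CTMS systems disagree on the First Dose date.
first_dose = {"IVR": date(2024, 3, 12), "CTMS": date(2024, 3, 13)}
print(is_consistent(first_dose))
```

Each check returns a pass or fail for a single record; counting the passes across all records is what later turns a rule into a percentage score.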
Questions to Ask Before Computing Your Data Quality Score
Before you compute your data quality score, decide and finalise how the score will be calculated, so that there is no confusion across businesses when they interpret the score and act on it. You may want to ask yourself the following questions:
- What data assets do you want to score? For example, do you want to score all assets at every level of the database > schema > table > column hierarchy, or only the assets that are critical for your decision making? It is important to start small, see the entire scoring process in action, and then apply it to your universe of data assets. Identify your critical and non-critical data elements, and start by scoring the data quality of the former.
- What are the different data quality rules that you are measuring? In general there are five essential components of data quality; however, depending on the stage a company has reached in its data quality journey, it may want to pick a few of the five based on the nature of its business.
- Are all the data quality rules equally important, or do you want to assign different weightage to different rules? For example, is the duplication rule more important than the completeness rule?
- Do you want to assign different weightage to your rules depending on the data asset? For example, for your critical data elements, is the duplication rule more important than the completeness rule?
It is essential to discuss these questions in detail to develop your standard operating procedures with respect to critical vs non critical data elements and your different business rules.
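Once these questions are answered, the decisions can be written down in a form that both business and technical teams can read. The sketch below is one hypothetical way to record rule weightage per class of data element; the tier names, rule names and weights are illustrative assumptions, not recommended values.

```python
# Hypothetical rule weights, keyed by asset criticality and then by rule name.
# Within each tier the weights are expected to sum to 1.
RULE_WEIGHTS = {
    "critical": {"completeness": 0.4, "duplication": 0.4, "accuracy": 0.2},
    "non_critical": {"completeness": 0.7, "duplication": 0.3},
}


def invalid_tiers(weights_by_tier, tolerance=1e-9):
    """Return the tiers whose rule weights do not sum to 1."""
    return [
        tier
        for tier, weights in weights_by_tier.items()
        if abs(sum(weights.values()) - 1.0) > tolerance
    ]


# An empty list means every tier has a valid set of weights.
print(invalid_tiers(RULE_WEIGHTS))
```

Validating the weights up front helps ensure that a score means the same thing to everyone who reads it.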
How to Set up Your Data Quality Score
A score is a percentage that represents the health of one aspect of an asset, based on a set of measures that you define. Data quality scores are computed from the quality dimensions for each individual column in the data set. A data quality rule is assigned to the asset at the lowest level of your hierarchy. For example, in the world of Clinical Development, the asset “Protocol Number” may have the completeness and duplication rules associated with it.
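As an illustration, the two rules on a column such as “Protocol Number” could be scored as percentages along the following lines. This is a sketch under assumptions: the function names and sample values are hypothetical, blank strings are treated as missing, and an empty column is scored as 0, which your own procedures may define differently.

```python
def _populated(values):
    """Keep only values that are neither None nor blank."""
    return [str(v).strip() for v in values if v is not None and str(v).strip()]


def completeness_score(values):
    """Percentage of values that are populated."""
    if not values:
        return 0.0  # assumption: an empty column scores 0
    return 100.0 * len(_populated(values)) / len(values)


def duplication_score(values):
    """Percentage of populated values that are distinct (100 means no duplicates)."""
    populated = _populated(values)
    if not populated:
        return 0.0  # assumption: nothing to compare scores 0
    return 100.0 * len(set(populated)) / len(populated)


# Hypothetical sample of a "Protocol Number" column.
protocol_numbers = ["P-001", "P-002", "P-002", "", None]
print(completeness_score(protocol_numbers))  # 3 of 5 values are populated
print(duplication_score(protocol_numbers))   # 2 distinct values among 3 populated
```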
A business term, on the other hand, may be associated with one or more data assets. You need to map each data asset to the required rules, score each rule, and then do a weighted aggregation (if different rules have different weightage) to calculate the data quality score for the business term.
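The weighted aggregation for a business term might look like the sketch below. It assumes each rule score is already a percentage and that weightage is defined per rule; the asset names, scores and weights are hypothetical, and a weighted average is only one reasonable way to aggregate.

```python
def business_term_score(rule_scores_by_asset, rule_weights):
    """Weighted average of every (asset, rule) score mapped to a business term."""
    weighted_total = 0.0
    weight_total = 0.0
    for rule_scores in rule_scores_by_asset.values():
        for rule, score in rule_scores.items():
            weight = rule_weights[rule]
            weighted_total += weight * score
            weight_total += weight
    if weight_total == 0:
        raise ValueError("no weighted rule scores to aggregate")
    return weighted_total / weight_total


# Hypothetical business term "Protocol", mapped to two data assets.
protocol_assets = {
    "protocol_number": {"completeness": 90.0, "duplication": 80.0},
    "protocol_title": {"completeness": 100.0},
}
# Assumption: completeness counts twice as much as duplication.
weights = {"completeness": 2.0, "duplication": 1.0}

# (2*90 + 1*80 + 2*100) / (2 + 1 + 2) = 460 / 5
print(business_term_score(protocol_assets, weights))  # 92.0
```

If all rules carry the same weight, the same function reduces to a plain average.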
From there you can establish relationships that allow you to aggregate your data quality score across all the data assets within a particular business process. Once the relationships are established, a composite score is calculated for the entire data set. The Total Data Quality Score is the sum of such composite scores from all your data sets.
This kind of scoring allows you to identify areas of improvement very quickly. It tells you where your major problem areas are and enables you to allocate your resources efficiently.
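To show how scores point to problem areas, the sketch below rolls column scores up to a composite per data set and lists the data sets that fall below a threshold, worst first. The data set names, the scores, the use of a simple average for the composite and the threshold of 90 are all assumptions made for illustration.

```python
from statistics import mean


def composite_scores(column_scores_by_data_set):
    """Composite per data set, taken here as the simple average of its column scores."""
    return {
        name: mean(list(column_scores.values()))
        for name, column_scores in column_scores_by_data_set.items()
    }


def problem_areas(composites, threshold=90.0):
    """Data sets scoring below the threshold, lowest score first."""
    below = [(name, score) for name, score in composites.items() if score < threshold]
    return sorted(below, key=lambda item: item[1])


# Hypothetical column scores for three data sets.
scores = {
    "site": {"site_address": 70.0, "site_id": 90.0},
    "subject": {"enrolment_date": 95.0, "consent_date": 99.0},
    "dosing": {"first_dose_date": 85.0},
}

for name, score in problem_areas(composite_scores(scores)):
    print(name, score)
```

A ranked list like this gives a team a concrete place to start when deciding where to spend remediation effort.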
At DefineRight we help companies do the groundwork for their data quality scoring exercise. As you have seen, it is not just about configuring the scores in the data quality system that you are using; a lot of work needs to happen before that to define the scoring process, so that the score is interpreted the same way even when people from different businesses look at it. In keeping with our discovery-focused delivery methodology, we work in partnership with business teams to do the heavy lifting of identifying the why and the what of the scoring exercise before getting to the how of implementation.