TSI Data Quality Principles

Contents

  1. Data Quality Perspectives
  2. Data Quality Values
  3. Data Quality Standards
  4. Data Quality Tests
  5. Data Quality Process
  6. References

About

This living document aims to capture our Data Quality values and standards to set the scene for better, cleaner data throughout the Telkom group. In short, we seek to ensure veracity and trustworthy data by setting a high standard for Data Quality.

Data Quality Levels/Perspectives

Data Quality can be seen from three perspectives: science, engineering and management. From a data science perspective one observes the statistical dimensions of data, from an engineering perspective the logical dimensions, and from a management perspective the physical dimensions. These are all equally important, but are observed at different parts of the data life-cycle and the observations have different lifespans.

Logical Level (The Engineering Perspective)

Logical dimensions exist to prevent irregularities at the lowest level by ensuring consistency between the data, the schema and the database. The standard for logical dimensions should be set at the highest level of the organisation to ensure consistency between all databases and datasets throughout the organisation. Dataset-specific standards can also be defined by the data owner, and these take precedence over the standards at the org level. Once the standard for a dataset has been developed, it can be applied in an automated fashion prior to ingestion into a warehouse environment, ensuring all the data in long-term storage conforms to the organisation's lowest-level standards before it rests. Logical dimensions are therefore the first to be checked, and as long as the checks are done on all data prior to ingestion, they only need to be done once-off, with additional periodic checks on data already in the warehouse.
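
As a rough illustration of such a pre-ingestion check, the sketch below validates a batch of records against an organisation-level schema before it is loaded. The schema format and the names `ORG_SCHEMA` and `validate_batch` are illustrative assumptions, not an existing TSI artefact.

```python
# Illustrative pre-ingestion logical check: every incoming record is validated
# against an organisation-level schema before it reaches long-term storage.
# ORG_SCHEMA and validate_batch are hypothetical names used for this sketch.
ORG_SCHEMA = {
    "customer_id": int,
    "msisdn": str,
    "activation_date": str,  # ISO-8601 date string
}

def validate_batch(records):
    """Return (record_index, problem) pairs for records that break the schema."""
    problems = []
    for i, record in enumerate(records):
        missing = set(ORG_SCHEMA) - set(record)
        if missing:
            problems.append((i, f"missing columns: {sorted(missing)}"))
        for column, expected_type in ORG_SCHEMA.items():
            if column in record and not isinstance(record[column], expected_type):
                problems.append((i, f"{column} is not of type {expected_type.__name__}"))
    return problems

batch = [
    {"customer_id": 1, "msisdn": "27820000001", "activation_date": "2020-01-01"},
    {"customer_id": "2", "msisdn": "27820000002"},  # wrong type and a missing column
]
for index, problem in validate_batch(batch):
    print(f"record {index}: {problem}")  # reject or quarantine before ingestion
```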

Physical Level (The Management Perspective)

Physical dimensions look at the values within the data and metadata to ensure compliance, at all times, with rules set by the organisation and the data owner (captured in the data dictionary). Metadata exists on a number of levels (field, record, table, dataset) and includes things such as ownership, access, integrity and compliance with validation rules. These validation rules and other checks can be set at an organisation level for data that spans multiple projects, but in most instances projects will have unique requirements that warrant their own set of rules and checks. All rules and checks should be constantly tracked to ensure compliance is never broken. This is usually done through jobs and security policies set in the data warehouse.
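
As a rough sketch of how such tracking might work, the job below tests a table's metadata against rules from the data dictionary; the metadata fields and rule names are illustrative assumptions rather than TSI's actual metadata model.

```python
# Illustrative physical-level check: table metadata is tested on a schedule
# against rules captured in the data dictionary. Fields and rules are
# hypothetical examples for this sketch.
from datetime import date

TABLE_METADATA = {
    "owner": "data_management",
    "access_policy": "restricted",
    "last_validated": date(2021, 1, 4),
}

METADATA_RULES = {
    "owner is assigned": lambda m: bool(m.get("owner")),
    "access policy is set": lambda m: m.get("access_policy") in {"open", "restricted"},
    "validated in the last 30 days": lambda m: (date.today() - m["last_validated"]).days <= 30,
}

def check_metadata(metadata, rules):
    """Return the names of the rules that the metadata currently fails."""
    return [name for name, rule in rules.items() if not rule(metadata)]

print(check_metadata(TABLE_METADATA, METADATA_RULES))  # report breaches to the data owner
```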

Statistical Level (The Science Perspective)

Statistical dimensions help data scientists understand the data and consist of the first things a data scientist would normally look at when starting to analyse a dataset. Some of the more basic statistical checks are useful to ensure that the data is indeed in a desirable state. Data scientists should be able to start using the dataset without having to do any cleaning, and if any errors or shortcomings are found, they should inform data management and data engineering so the problem can be fixed upstream, as close to source as possible.
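
That first statistical look often amounts to a handful of profiling calls. A minimal sketch, assuming a pandas environment and illustrative column names:

```python
# Illustrative first-pass profiling: summary statistics and null counts that
# should already look healthy if the upstream logical and physical checks held.
import pandas as pd

df = pd.DataFrame({
    "monthly_spend": [120.0, 135.5, None, 98.0, 101.2],
    "contract_months": [24, 12, 12, 24, 36],
})

print(df.describe())                          # count, mean, std, min/max per numeric column
print(df.isna().sum())                        # residual missing values to report upstream
print(df["contract_months"].value_counts())   # quick look at the distribution
```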

Data Quality Values

The main goal of Data Quality is to ensure that data is trustworthy. Without trust, users will create their own data stores and the single source of truth will be lost. By defining what we value in terms of Data Quality, we are able to ensure the data is trustworthy, and we can then use those values to create standards which ensure all our Data Quality values are upheld.

To cover all the bases, we researched a few articles on the various dimensions of Data Quality. Below are links to the articles along with a key used to represent them in the tables below:

After analysing these articles, we mapped all the different dimensions to the five values which we feel capture them best:

| TSI Data Quality Value ↓ | 6C's | DAMA UK's | 6D's | 7D's | DAMA NY's | 7C's |
| --- | --- | --- | --- | --- | --- | --- |
| Completeness | Complete | Completeness | Completeness & Integrity | Completeness & Precision | Completeness | Completeness and Comprehensiveness |
| Consistency | Consistent | Consistency | Consistency | Reliability | Semantic, Structure, Reasonability & Consistency | Reliability and Consistency |
| Cleanliness | Clean | Uniqueness & Accuracy | Accuracy | Accuracy | Accuracy & Identifiability | Accuracy and Precision & Granularity and Uniqueness |
| Compliance | Compliant & Collaborative | Validity | Conformity | Integrity & Confidentiality | Lineage | Availability and Accessibility & Legitimacy and Validity |
| Currency | Current | Timeliness | Timeliness | Timeliness | Timeliness & Currency | Timeliness and Relevance |

For each of these five values, we analyse Data Quality from all three perspectives to create a detailed picture:

1. Completeness

We want to ensure that data sets are complete by checking all dimensions against the source, and also by providing all data required for context and questions that may arise.

  • Logical Level

    • Structured Data:

      • Ensure tables contain all the correct columns
      • Ensure data sets contain all the correct tables
      • Ensure columns that require values are set as not nullable in DB
    • Unstructured Data:
      • Ensure maps contain all the correct fields
      • Ensure collections contain all the correct maps
      • Ensure data sets contain all the correct collections
  • Physical Level

    • All Data:
      • Ensure that the number of null values does not exceed allowed tolerances (a minimal check sketch follows this list)
    • Structured Data:
      • Ensure tables contain all the correct records
      • Track the number of fields that contain null values per column
    • Unstructured Data:
      • Ensure collections contain all the correct maps
      • Track the number of fields that contain null values per map
  • Statistical Level

    • Should no longer be concerned with basic completeness, but rather with the knowledge and wisdom within the data. Sometimes a raw data set is not enough on its own to answer our questions, in which case Data Science should point out the gaps to Data Management, who in turn can inform Data Engineering of where to locate additional data and how it should be integrated into the existing data set.
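
A minimal sketch of the null-tolerance check referenced above, assuming the tolerances live in the data dictionary; the column names and thresholds are illustrative:

```python
# Illustrative completeness check: the share of null values per column is
# compared with a tolerance recorded in the data dictionary (hypothetical values).
import pandas as pd

NULL_TOLERANCE = {        # maximum allowed fraction of nulls per column
    "customer_id": 0.0,   # not nullable
    "email": 0.10,        # nullable, but no more than 10% missing
}

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": [None, "a@example.com", None, "b@example.com"],
})

null_fraction = df.isna().mean()  # fraction of null fields per column
breaches = {
    column: float(null_fraction[column])
    for column, tolerance in NULL_TOLERANCE.items()
    if null_fraction[column] > tolerance
}
print(breaches)  # e.g. {'email': 0.5} -> report to Data Management
```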

2. Consistency

We want to ensure that data is consistent on all levels (field, record, table, dataset, metadata) by checking values across databases and against the original source. Additionally, we need to ensure business definitions have been applied and are conformed to.

  • Logical Level

    • Structured Data:
      • Ensure all columns are set to the correct type
      • Ensure all columns are named according to the standards
    • Unstructured Data:
      • Ensure all fields within a map are set to the correct type
      • Ensure all fields are named according to the standards
  • Physical Level

    • All Data:
      • Ensure that values remain consistent across different databases and datasets (a minimal check sketch follows this list)
    • Structured Data:
      • Ensure all values within a column conform to the assigned type
      • Ensure all values within a column conform to standards that apply to the column's type
    • Unstructured Data:
      • Ensure all values within a map conform to the correct type
      • Ensure all values within a map conform to standards that apply to that field's type
  • Statistical Level

    • Ensure definitions within the data conform to definitions in the standards
    • Ensure relationships within the data and between the tables remain intact
    • Ensure that the way data was captured did not change over time
    • Ensure that values in the data are consistent with expectations created by previous data
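
A minimal sketch of two of the checks above, assuming illustrative table and column names: values in a column conform to the column's assigned type/pattern, and a shared key stays consistent across two datasets.

```python
# Illustrative consistency checks across a column and across two datasets.
import re

import pandas as pd

billing = pd.DataFrame({"msisdn": ["27820000001", "27820000002"], "amount": [100, 250]})
crm = pd.DataFrame({"msisdn": ["27820000001", "27820000003"]})

# 1. Type/standard conformance: MSISDNs must be 11-digit strings starting with 27
#    (an assumed standard for this sketch).
msisdn_pattern = re.compile(r"^27\d{9}$")
bad_msisdns = [value for value in billing["msisdn"] if not msisdn_pattern.match(str(value))]

# 2. Cross-dataset consistency: the same keys should appear in both systems.
only_in_billing = set(billing["msisdn"]) - set(crm["msisdn"])
only_in_crm = set(crm["msisdn"]) - set(billing["msisdn"])

print(bad_msisdns, only_in_billing, only_in_crm)
```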

3. Cleanliness

We want to ensure that our data is clean and an accurate representation of the real world. If it is not, we want to know whether the data can be cleaned and to what extent it needs to be cleaned. If all the other DQ values are upheld, the data should be very close to clean.

  • Logical Level
    • Ensure each table has a unique primary key
    • Ensure that there is no duplication of columns within the table
    • Ensure that there is no duplication of tables within the dataset
    • Ensure that there is no duplication of this dataset elsewhere in the database
  • Physical Level
    • Ensure that there is no duplication of records within the table (a minimal check sketch follows this list)
    • Ensure cleanliness is maintained by running the data verification process for the dataset at least once per specified time period
    • Ensure the dataset's score in each of the DQ values does not fall below the specified threshold
  • Statistical Level
    • Ensure the data is objectively verifiable and measured/collected correctly
    • Ensure the level of detail matches the requirements; aggregation and transformation results should be stored separately
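
A minimal sketch of the duplication checks above, with an illustrative table and primary key:

```python
# Illustrative cleanliness checks: a unique primary key and no duplicated records.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["Thabo", "Aisha", "Aisha", "Sipho"],
})

duplicate_keys = customers["customer_id"][customers["customer_id"].duplicated()]
duplicate_records = customers[customers.duplicated()]  # exact full-row duplicates

print(duplicate_keys.tolist())  # [2] -> primary key is not unique
print(len(duplicate_records))   # 1  -> one duplicate record to remove upstream
```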

4. Compliance

  • Logical Level
  • Physical Level
    • Ensure column/key/collection/document names conform to the naming standards
    • Ensure each of the DQ values does not deteriorate below a certain threshold (a minimal check sketch follows this list)
    • Ensure the data conforms to all rules and regulations set at all of the following levels
      • Data Dictionary (The standards set in the data dictionary)
      • TSI (The standards set in this document)
      • Telkom (Data/Digital Privacy Policy)
      • South Africa (PoPI)
    • Ensure data values comply with all validations
    • Ensure proof of compliance is maintained
    • Ensure the DQ process is automated as much as possible to remove human error/bias
    • Ensure accessibility by promoting the dataset and allowing users to apply for access
  • Statistical Level
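
A minimal sketch of the threshold and proof-of-compliance checks above; the scores, thresholds and log format are illustrative assumptions:

```python
# Illustrative compliance check: each DQ value's score is compared with its
# threshold and the outcome is logged as auditable proof of compliance.
import json
from datetime import datetime, timezone

DQ_THRESHOLDS = {"completeness": 0.95, "consistency": 0.98, "cleanliness": 0.97,
                 "compliance": 1.00, "currency": 0.90}
latest_scores = {"completeness": 0.97, "consistency": 0.96, "cleanliness": 0.99,
                 "compliance": 1.00, "currency": 0.93}

breaches = {value: score for value, score in latest_scores.items()
            if score < DQ_THRESHOLDS[value]}

audit_entry = {
    "checked_at": datetime.now(timezone.utc).isoformat(),
    "scores": latest_scores,
    "breaches": breaches,  # {'consistency': 0.96} in this example
}
print(json.dumps(audit_entry, indent=2))
```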

5. Currency

  • Logical Level
    • Ensure datasets always remain true to source by making sure a connection to the source database is possible and an automated pipeline exists (a minimal freshness sketch follows this list)
  • Physical Level
    • Ensure time series data is an accurate representation of real world events
    • Ensure datasets remain highly available to all users at all times to enable timely analysis
  • Statistical Level
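
A minimal sketch of a currency/freshness check, assuming an agreed freshness window and an illustrative timestamp:

```python
# Illustrative currency check: the newest record must fall within the agreed
# freshness window, otherwise the pipeline from source has fallen behind.
from datetime import datetime, timedelta, timezone

MAX_LAG = timedelta(hours=24)  # agreed freshness window for this dataset (assumed)
latest_record_timestamp = datetime(2021, 1, 3, 6, 0, tzinfo=timezone.utc)

lag = datetime.now(timezone.utc) - latest_record_timestamp
if lag > MAX_LAG:
    print(f"Dataset is stale by {lag - MAX_LAG}; check the source connection and pipeline.")
else:
    print("Dataset is current.")
```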

Data Quality Standards

This section is still very much a work in progress. Standards should mostly be set at source or at an organisational level and fed into TSI. For now this is a guideline list of requirements from a predominantly engineering perspective and needs input from the other viewpoints. A sketch of how these standards could be captured in machine-readable form follows the completeness list below.

1. Completeness

  1. Logical Level
    1. Specify which columns/keys do not require values
    2. Specify tolerance percentage for missing values in nullable and not nullable columns/keys
    3. Columns/keys implement nullable and not nullable in source DB
  2. Physical Level
    1. Specify all the correct records/keys in a table/map
    2. Specify all the correct columns/maps in a table/collection
    3. Specify all the correct tables/collections in a data set
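
As a sketch of how these completeness standards could be captured in a machine-readable form that automated tests can read (the structure and field names are illustrative, not an agreed TSI format):

```python
# Hypothetical machine-readable form of the completeness standards above,
# keyed by the standard numbers referenced in the Data Quality Tests tables.
COMPLETENESS_STANDARD = {
    "1.1.1": {"nullable_columns": ["email", "middle_name"]},
    "1.1.2": {"null_tolerance": {"nullable": 0.10, "not_nullable": 0.0}},
    "1.1.3": {"enforce_not_null_in_source_db": True},
    "1.2.1": {"expected_record_range": [10000, 12000]},
    "1.2.2": {"expected_columns": ["customer_id", "email", "middle_name"]},
    "1.2.3": {"expected_tables": ["customers", "contracts"]},
}
print(COMPLETENESS_STANDARD["1.1.2"]["null_tolerance"])
```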

2. Consistency

Data Quality Tests

1. Completeness

| Level | Data Type | Test | Test Output | Standard | Test Condition Definer | Tester | Reporter | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logical | Structured | Required columns are not nullable in DB | List of misconfigured columns | 1.1.3 | Data Dictionary | DE | DE | DAMA UK: Completeness & DAMA NY: Completeness |
| Logical | Structured | Null field check/count | Percentage of null fields | 1.1.1 & 1.1.2 | Data Dictionary | DE | DM | DAMA UK: Completeness & DAMA NY: Completeness |
| Logical | Unstructured | Null value check/count | Percentage of null values | 1.1.1 & 1.1.2 | Data Dictionary | DE | DM | DAMA UK: Completeness & DAMA NY: Completeness |
| Physical | Structured | All columns present | List of missing columns | 1.2.2 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Structured | All records present | Range of missing records | 1.2.1 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Structured | All tables present | List of missing tables | 1.2.3 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Unstructured | All maps present | List of missing maps | 1.2.2 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Unstructured | All keys present | List of missing keys | 1.2.1 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Unstructured | All collections present | List of missing collections | 1.2.3 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | All Data | No irrelevant or confusing data is present | Possibly irrelevant or confusing data | - | Data Dictionary | DM | DM | 6C's: Complete |
| Statistical | All Data | Is the data comprehensive and sufficient enough? | Gaps in the data set | - | DS | DS | DM | 6D's: Completeness, 7D's: Completeness & 7C's: Completeness and Comprehensiveness |
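
As a rough illustration of the first test in the table ("Required columns are not nullable in DB"), the sketch below inspects column metadata in an in-memory SQLite table; the table, columns and required set are stand-ins for a real source database.

```python
# Illustrative implementation of the 1.1.3 test: required columns must be
# declared NOT NULL in the source database. SQLite is used as a stand-in.
import sqlite3

REQUIRED_NOT_NULL = {"customer_id", "msisdn"}  # from the data dictionary (assumed)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER NOT NULL, msisdn TEXT, email TEXT)")

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
not_null_columns = {row[1] for row in conn.execute("PRAGMA table_info(customers)") if row[3]}
misconfigured = sorted(REQUIRED_NOT_NULL - not_null_columns)
print(misconfigured)  # ['msisdn'] -> the list of misconfigured columns to report
```
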
2. Consistency

| Level | Test | Test Output | Standard | Test Condition Definer | Tester | Reporter | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Consistency |  |  |  |  |  |  |  |
| Consistency |  |  |  |  |  |  |  |
| Consistency |  |  |  |  |  |  |  |
| Consistency |  |  |  |  |  |  |  |
| Consistency |  |  |  |  |  |  |  |

Data Quality Process

References

# Example map: a key/value structure with string, numeric, list and nested map fields
json_map = {
    "field1": "value",
    "field2": 2,
    "field3": ["well", "thats", 2, "cool"],
    "field4": {"child_field1": "child_value"}
}