TSI Data Quality Principles
Contents
- Data Quality Perspectives
- Data Quality Values
- Data Quality Standards
- Data Quality Tests
- Data Quality Process
- References
About
This living document aims to capture our Data Quality values and standards to set the scene for better, cleaner data throughout the Telkom group. In short, we seek to ensure the veracity and trustworthiness of our data by setting a high standard for Data Quality.
Data Quality Levels/Perspectives
Data Quality can be seen from three perspectives: science, engineering and management. From a data science perspective one observes the statistical dimensions of data, from an engineering perspective the logical dimensions, and from a management perspective the physical dimensions. These are all equally important, but they are observed at different parts of the data life-cycle and the observations have different lifespans.
Logical Level (The Engineering Perspective)
Logical dimensions exist to prevent irregularities at the lowest level by ensuring consistency between the data, the schema and the database. The standard for logical dimensions should be set at the highest level of the organisation to ensure consistency between all databases and datasets throughout the organisation. Dataset-specific standards can also be defined by the data owner, and these take precedence over the standards set at the organisational level. Once the standard for a dataset has been developed, it can be applied in an automated fashion prior to ingestion into a warehouse environment, ensuring that all data in long-term storage conforms to the organisational standards at the lowest level before it rests. Logical dimensions are therefore the first to be checked, and as long as the checks are done on all data prior to ingestion, they only need to be done once-off, with additional checks run periodically on data already in the warehouse.
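As an illustration, an automated pre-ingestion check of this kind could look like the following minimal sketch. It assumes records arrive as plain Python dicts; the schema, column names and function are assumptions for illustration, not an existing TSI pipeline.

```python
# Illustrative only: validate a batch of records against the dataset's logical
# standard (expected columns, types and nullability) before ingestion.
expected_schema = {
    "customer_id": {"type": int, "nullable": False},
    "msisdn": {"type": str, "nullable": False},
    "activation_date": {"type": str, "nullable": True},
}

def logical_violations(records, schema=expected_schema):
    """Collect logical-level violations found in a batch before ingestion."""
    violations = []
    for i, record in enumerate(records):
        # Columns missing from, or not defined in, the schema break consistency
        # between the data, the schema and the database.
        for column in set(schema) - set(record):
            violations.append(f"record {i}: missing column '{column}'")
        for column in set(record) - set(schema):
            violations.append(f"record {i}: unexpected column '{column}'")
        for column, rules in schema.items():
            value = record.get(column)
            if value is None:
                if not rules["nullable"]:
                    violations.append(f"record {i}: '{column}' may not be null")
            elif not isinstance(value, rules["type"]):
                violations.append(f"record {i}: '{column}' has the wrong type")
    return violations
```

A pipeline could run such a check on every batch and quarantine any batch that returns violations, so nothing lands in long-term storage without passing the logical standard.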
Physical Level (The Management Perspective)
Physical dimensions look at the values within the data and metadata to ensure compliance, at all times, with the rules set by the organisation and the data owner (captured in the data dictionary). Metadata exists on a number of levels (field, record, table, dataset) and includes things such as ownership, access, integrity and compliance with validation rules. These validation rules and other checks can be set at an organisational level for data that spans multiple projects, but in most instances projects will have unique requirements that warrant their own set of rules and checks. All rules and checks should be tracked continuously to ensure compliance is never broken; this is usually done through jobs and security policies set in the data warehouse.
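A sketch only, of how such validation rules might be captured from the data dictionary and evaluated by a scheduled warehouse job; the rule structure and field names are assumptions for illustration.

```python
# Illustrative validation rules per field, evaluated on a schedule so that
# compliance can be tracked over time.
validation_rules = {
    "contract_months": lambda value: isinstance(value, int) and 0 <= value <= 36,
    "email": lambda value: value is None or (isinstance(value, str) and "@" in value),
}

def rule_failures(rows, rules=validation_rules):
    """Count validation-rule failures per field across a set of rows."""
    failures = {field: 0 for field in rules}
    for row in rows:
        for field, rule in rules.items():
            if not rule(row.get(field)):
                failures[field] += 1
    return failures
```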
Statistical Level (The Science Perspective)
Statistical dimensions help data scientists understand the data and consist of the first things a data scientist would normally look at when starting to analyse a dataset. Some of the more basic statistical checks are useful to ensure that the data is indeed in a desirable state. Data scientists should be able to start using the dataset without having to do any cleaning, and if any errors or shortcomings are found, they should inform data management and data engineering so the problem can be fixed upstream, as close to the source as possible.
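A few of these basic first-look checks, sketched with pandas (assumed to be available); the file name is purely illustrative.

```python
import pandas as pd

df = pd.read_csv("customer_accounts.csv")

print(df.describe(include="all"))  # per-column summary statistics
print(df.isna().mean())            # fraction of missing values per column
print(df.nunique())                # number of distinct values per column
```

If these first checks reveal surprises (unexpected ranges, heavy missingness, constant columns), the finding should go back to Data Management rather than being cleaned away in the analysis.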
Data Quality Values
The main goal of Data Quality is to ensure that data is trustworthy. Without trust, users will create their own data stores and the single source of truth will be lost. By defining what we value in terms of Data Quality, we are able to ensure the data is trustworthy, and we can then use those values to create standards which ensure all our Data Quality values are upheld.
To cover all the bases, we researched a number of articles on the various dimensions of Data Quality. Below are links to the articles, along with the key used to represent each of them in the tables that follow:
- 6C's → The Six Cs of Trusted Data
- DAMA UK → The six Primary dimensions for Data Quality assessment
- 6D's → 6 Dimensions of Data Quality
- 7D's → Dimensions of data quality
- DAMA NY → Data Quality Fundamentals
- 7C's → Seven Characteristics That Define Quality Data
After analysing these articles, we mapped all the different dimensions to the five values which we feel capture them best:
TSI Data Quality Value | 6C's | DAMA UK | 6D's | 7D's | DAMA NY | 7C's |
---|---|---|---|---|---|---|
Completeness | Complete | Completeness | Completeness & Integrity | Completeness & Precision | Completeness | Completeness and Comprehensiveness |
Consistency | Consistent | Consistency | Consistency | Reliability | Semantic, Structure, Reasonability & Consistency | Reliability and Consistency |
Cleanliness | Clean | Uniqueness & Accuracy | Accuracy | Accuracy | Accuracy & Identifiability | Accuracy and Precision & Granularity and Uniqueness |
Compliance | Compliant & Collaborative | Validity | Conformity | Integrity & Confidentiality | Lineage | Availability and Accessibility & Legitimacy and Validity |
Currency | Current | Timeliness | Timeliness | Timeliness | Timeliness & Currency | Timeliness and Relevance |
For each of these five values, we analyse Data Quality from all three perspectives to create a detailed picture:
1. Completeness
We want to ensure that data sets are complete by checking all dimensions against the source, and also by providing all data required for context and questions that may arise.
- Logical Level
  - Structured Data:
    - Ensure tables contain all the correct columns
    - Ensure data sets contain all the correct tables
    - Ensure columns that require values are set as `not nullable` in the DB
  - Unstructured Data:
    - Ensure maps contain all the correct fields
    - Ensure collections contain all the correct maps
    - Ensure data sets contain all the correct collections
- Physical Level
  - All Data:
    - Ensure that the number of `null` values does not exceed allowed tolerances (see the sketch after this list)
  - Structured Data:
    - Ensure tables contain all the correct records
    - Track the number of fields that contain `null` values per column
  - Unstructured Data:
    - Ensure collections contain all the correct maps
    - Track the number of fields that contain `null` values per map
- Statistical Level
  - Should no longer be concerned with basic completeness, but rather with the knowledge and wisdom within the data. Sometimes a raw data set is not enough on its own to answer our questions, in which case Data Science should point out the gaps to Data Management, who in turn can inform Data Engineering of where to locate additional data and how it should be integrated into the existing data set.
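A minimal sketch of the null-tolerance check above, assuming rows are plain dicts and the tolerances come from the data dictionary; the column names and thresholds are illustrative.

```python
# Flag columns whose share of null values exceeds the allowed tolerance.
null_tolerances = {"customer_id": 0.0, "activation_date": 0.05}

def columns_over_tolerance(rows, tolerances=null_tolerances):
    """Return the null fraction for every column that exceeds its allowed tolerance."""
    total = len(rows) or 1
    null_fractions = {
        column: sum(1 for row in rows if row.get(column) is None) / total
        for column in tolerances
    }
    return {
        column: fraction
        for column, fraction in null_fractions.items()
        if fraction > tolerances[column]
    }
```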
2. Consistency
We want to ensure that data is consistent on all levels (field, record, table, dataset, metadata) by checking values across databases and against the original source. Additionally, we need to ensure business definitions have been applied and are conformed to.
- Logical Level
  - Structured Data:
    - Ensure all columns are set to the correct type
    - Ensure all columns are named according to the standards
  - Unstructured Data:
    - Ensure all fields within a map are set to the correct type (see the sketch after this list)
    - Ensure all fields are named according to the standards
- Physical Level
  - All Data:
    - Ensure that values remain consistent across different databases and datasets
  - Structured Data:
    - Ensure all values within a column conform to the assigned type
    - Ensure all values within a column conform to standards that apply to the column's type
  - Unstructured Data:
    - Ensure all values within a map conform to the correct type
    - Ensure all values within a map conform to standards that apply to that field's type
- Statistical Level
  - Ensure definitions within the data conform to definitions in the standards
  - Ensure relationships within the data and between the tables remain intact
  - Ensure that the way data was captured did not change over time
  - Ensure that values in the data are consistent with expectations created by previous data
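A minimal sketch of the map-level type check, using a map shaped like the json_map example at the end of this document; the expected_types mapping is an assumption for illustration.

```python
# Report fields whose values do not conform to the type assigned in the standards.
expected_types = {"field1": str, "field2": int, "field3": list, "field4": dict}

def nonconforming_fields(json_map, expected=expected_types):
    """Return the fields whose values do not conform to the assigned type."""
    return [
        field
        for field, expected_type in expected.items()
        if field in json_map and not isinstance(json_map[field], expected_type)
    ]
```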
3. Cleanliness
We want to ensure that our data is clean and an accurate representation of the real world. If it is not, we want to know whether the data can be cleaned and to what extent it needs to be cleaned. If all the other DQ values are upheld, the data should be very close to clean.
- Logical Level
  - Ensure each table has a unique primary key
  - Ensure that there is no duplication of columns within the table
  - Ensure that there is no duplication of tables within the dataset
  - Ensure that there is no duplication of this dataset elsewhere in the database
- Physical Level
  - Ensure that there is no duplication of records within the table (see the sketch after this list)
  - Ensure cleanliness is maintained by running the data verification process for the dataset at least once per specified time period
  - Ensure the dataset's score in each of the DQ values does not fall below the specified threshold
- Statistical Level
  - Ensure the data is objectively verifiable and measured/collected correctly
  - Ensure the level of detail matches the requirements; aggregation and transformation results should be stored separately
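An illustrative duplicate-record check for the physical level; the primary key name is an assumption.

```python
from collections import Counter

def duplicated_keys(rows, key="customer_id"):
    """Return primary-key values that appear in more than one record."""
    counts = Counter(row[key] for row in rows if key in row)
    return [value for value, count in counts.items() if count > 1]
```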
4. Compliance
- Logical Level
- Physical Level
  - Ensure column/key/collection/document names conform to the naming standards
  - Ensure each of the DQ values does not deteriorate below a certain threshold
  - Ensure the data conforms to all rules and regulations set at all of the following levels:
    - Data Dictionary (the standards set in the data dictionary)
    - TSI (the standards set in this document)
    - Telkom (Data/Digital Privacy Policy)
    - South Africa (PoPI)
  - Ensure data values comply with all validations
  - Ensure proof of compliance is maintained (see the sketch after this list)
  - Ensure the DQ process is automated as much as possible to remove human error/bias
  - Ensure accessibility by promoting the dataset and allowing users to apply for access
- Statistical Level
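A sketch of maintaining proof of compliance: append every automated check result to an audit log so compliance can be demonstrated later. The log path and entry format are assumptions, not an agreed standard.

```python
import json
from datetime import datetime, timezone

def record_check(dataset, check_name, passed, log_path="dq_audit_log.jsonl"):
    """Append a timestamped DQ check result to the audit log."""
    entry = {
        "dataset": dataset,
        "check": check_name,
        "passed": passed,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as log_file:
        log_file.write(json.dumps(entry) + "\n")
```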
5. Currency
- Logical Level
  - Ensure datasets always remain true to source by making sure a connection to the source database is possible and an automated pipeline exists (see the sketch after this list)
- Physical Level
  - Ensure time series data is an accurate representation of real-world events
  - Ensure datasets remain highly available to all users at all times to enable timely analysis
- Statistical Level
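A hedged sketch of a currency check: flag a dataset whose latest successful pipeline load is older than the agreed freshness window. The 24-hour window is an assumption.

```python
from datetime import datetime, timedelta, timezone

def is_current(last_load_time, max_age=timedelta(hours=24)):
    """Return True if the most recent load is within the freshness window.

    last_load_time is expected to be a timezone-aware UTC datetime.
    """
    return datetime.now(timezone.utc) - last_load_time <= max_age
```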
Data Quality Standards
This section is still very much a work in progress. Standards should mostly be set at source or at an organisational level and fed into TSI. For now, this is a guideline list of requirements from a predominantly engineering perspective and needs input from the other viewpoints.
1. Completeness
- Logical Level
  - 1.1.1 Specify which columns/keys do not require values
  - 1.1.2 Specify the tolerance percentage for missing values in `nullable` and `not nullable` columns/keys (see the sketch after this list)
  - 1.1.3 Columns/keys implement `nullable` and `not nullable` in the source DB
- Physical Level
  - 1.2.1 Specify all the correct records/keys in a table/map
  - 1.2.2 Specify all the correct columns/maps in a table/collection
  - 1.2.3 Specify all the correct tables/collections in a data set
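An illustrative data dictionary fragment capturing standards 1.1.1 to 1.1.3 for one table; the structure and names are assumptions, not a defined TSI format.

```python
# One table's completeness standard: nullability plus the tolerated fraction of
# missing values per column.
completeness_standard = {
    "table": "customer_accounts",
    "columns": {
        "customer_id":     {"nullable": False, "missing_tolerance": 0.00},
        "activation_date": {"nullable": True,  "missing_tolerance": 0.05},
    },
}
```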
2. Consistency
Data Quality Tests
1. Completeness
Level | Data Type | Test | Test Output | Standard | Test Condition Definer | Tester | Reporter | Reference |
---|---|---|---|---|---|---|---|---|
Logical | Structured | Required columns are not nullable in DB |
List of misconfigured columns | 1.1.3 | Data Dictionary | DE | DE | DAMA UK: Completeness & DAMA NY: Completeness |
Logical | Structured | null Field check/count |
Percentage of null fields |
1.1.1 & 1.1.2 | Data Dictionary | DE | DM | DAMA UK: Completeness & DAMA NY: Completeness |
Logical | Unstructured | null Value check/count |
Percentage of null values |
1.1.1 & 1.1.2 | Data Dictionary | DE | DM | DAMA UK: Completeness & DAMA NY: Completeness |
Physical | Structured | All columns present | List of missing columns | 1.2.2 | Data Dictionary | DM | DM | 6C's: Complete |
Physical | Structured | All records present | Range of missing records | 1.2.1 | Data Dictionary | DM | DM | 6C's: Complete |
Physical | Structured | All tables present | List of missing tables | 1.2.3 | Data Dictionary | DM | DM | 6C's: Complete |
Physical | Unstructured | All maps present | List of missing maps | 1.2.2 | Data Dictionary | DM | DM | 6C's: Complete |
Physical | Unstructured | All keys present | List of missing keys | 1.2.1 | Data Dictionary | DM | DM | 6C's: Complete |
Physical | Unstructured | All collections present | List of missing collections | 1.2.3 | Data Dictionary | DM | DM | 6C's: Complete |
Physical | All Data | No irrelevant or confusing data is present | Possibly irrelevant or confusing data | - | Data Dictionary | DM | DM | 6C's: Complete |
Statistical | All Data | Is the data comprehensive and sufficient enough? | Gaps in the data set | - | DS | DS | DM | 6D's: Completeness, 7D's Completeness & 7C's Completeness and Comprehensiveness |
2. Consistency
Level | Test | Test Output | Standard | Test Condition Definer | Tester | Reporter | Reference |
---|---|---|---|---|---|---|---|
Consistency | | | | | | | |
Consistency | | | | | | | |
Consistency | | | | | | | |
Consistency | | | | | | | |
Consistency | | | | | | | |
Data Quality Process
References
- DAMA UK → The six Primary dimensions for Data Quality assessment
- ntwi.org → The Six Cs of Trusted Data
- trustinsights.ai → The 6C Data Quality Framework
- eccma.org → ISO 8000 – the international standard for data quality
```python
# Example of an unstructured "map": a collection of fields, each holding a value
# of a different type (string, number, list or a nested map).
json_map = {
    "field1": "value",
    "field2": 2,
    "field3": ["well", "thats", 2, "cool"],
    "field4": {"child_field1": "child_value"}
}
```