TSI Data Quality Principles

Contents

  1. Data Quality Perspectives
  2. Data Quality Values
  3. Data Quality Standards
  4. Data Quality Tests
  5. Data Quality Process
  6. References

About

This living document aims to capture our Data Quality values and standards to set the scene for better, cleaner data throughout the Telkom group. In short, we seek to ensure veracity and trustworthy data by setting a high standard for Data Quality.

Data Quality Levels/Perspectives

Data Quality can be seen from three perspectives: science, engineering and management. From a data science perspective one observes the statistical dimensions of data, from an engineering perspective the logical dimensions, and from a management perspective the physical dimensions. These are all equally important, but are observed at different parts of the data life-cycle and the observations have different lifespans.

Logical Level (The Engineering Perspective)

Logical dimensions exist to prevent irregularities at the lowest level by ensuring consistency between the data, the schema and the database. The standard for logical dimensions should be set at the highest level of the organisation to ensure consistency between all databases and datasets throughout the organisation. Dataset-specific standards can also be defined by the data owner, and these take precedence over the standards at the org level. Once the standard for a dataset has been developed, it can be applied in an automated fashion prior to ingestion into a warehouse environment, ensuring all the data in long-term storage conforms to the organisation's lowest-level standards before it rests. Logical dimensions are therefore the first to be checked, and as long as the checks are done on all data prior to ingestion, they only need to be done once-off, with additional periodic checks on data already in the warehouse.
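
As a rough illustration of such a pre-ingestion check, the sketch below validates a batch of records against an organisation-level schema before it is loaded. The schema format and the names `ORG_SCHEMA` and `validate_batch` are illustrative assumptions, not an existing TSI artefact.

```python
# Illustrative pre-ingestion logical check: every incoming record is validated
# against an organisation-level schema before it reaches long-term storage.
# ORG_SCHEMA and validate_batch are hypothetical names used for this sketch.
ORG_SCHEMA = {
    "customer_id": int,
    "msisdn": str,
    "activation_date": str,  # ISO-8601 date string
}

def validate_batch(records):
    """Return (record_index, problem) pairs for records that break the schema."""
    problems = []
    for i, record in enumerate(records):
        missing = set(ORG_SCHEMA) - set(record)
        if missing:
            problems.append((i, f"missing columns: {sorted(missing)}"))
        for column, expected_type in ORG_SCHEMA.items():
            if column in record and not isinstance(record[column], expected_type):
                problems.append((i, f"{column} is not of type {expected_type.__name__}"))
    return problems

batch = [
    {"customer_id": 1, "msisdn": "27820000001", "activation_date": "2020-01-01"},
    {"customer_id": "2", "msisdn": "27820000002"},  # wrong type and a missing column
]
for index, problem in validate_batch(batch):
    print(f"record {index}: {problem}")  # reject or quarantine before ingestion
```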

Physical Level (The Management Perspective)

Physical dimensions look at the values within the data and metadata to ensure compliance, at all times, with rules set by the organisation and the data owner (captured in the data dictionary). Metadata exists on a number of levels (field, record, table, dataset) and includes things such as ownership, access, integrity and compliance with validation rules. These validation rules and other checks can be set at an organisation level for data that spans multiple projects, but in most instances projects will have unique requirements that warrant their own set of rules and checks. All rules and checks should be constantly tracked to ensure compliance is never broken. This is usually done through jobs and security policies set in the data warehouse.
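
As a rough sketch of how such tracking might work, the job below tests a table's metadata against rules from the data dictionary; the metadata fields and rule names are illustrative assumptions rather than TSI's actual metadata model.

```python
# Illustrative physical-level check: table metadata is tested on a schedule
# against rules captured in the data dictionary. Fields and rules are
# hypothetical examples for this sketch.
from datetime import date

TABLE_METADATA = {
    "owner": "data_management",
    "access_policy": "restricted",
    "last_validated": date(2021, 1, 4),
}

METADATA_RULES = {
    "owner is assigned": lambda m: bool(m.get("owner")),
    "access policy is set": lambda m: m.get("access_policy") in {"open", "restricted"},
    "validated in the last 30 days": lambda m: (date.today() - m["last_validated"]).days <= 30,
}

def check_metadata(metadata, rules):
    """Return the names of the rules that the metadata currently fails."""
    return [name for name, rule in rules.items() if not rule(metadata)]

print(check_metadata(TABLE_METADATA, METADATA_RULES))  # report breaches to the data owner
```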

Statistical Level (The Science Perspective)

Statistical dimensions help data scientists understand the data and consist of the first things a data scientist would normally look at when starting to analyse a dataset. Some of the more basic statistical checks are useful to ensure that the data is indeed in a desirable state. Data scientists should be able to start using the dataset without having to do any cleaning, and if any errors or shortcomings are found, they should inform data management and data engineering so the problem can be fixed upstream, as close to source as possible.
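
That first statistical look often amounts to a handful of profiling calls. A minimal sketch, assuming a pandas environment and illustrative column names:

```python
# Illustrative first-pass profiling: summary statistics and null counts that
# should already look healthy if the upstream logical and physical checks held.
import pandas as pd

df = pd.DataFrame({
    "monthly_spend": [120.0, 135.5, None, 98.0, 101.2],
    "contract_months": [24, 12, 12, 24, 36],
})

print(df.describe())                          # count, mean, std, min/max per numeric column
print(df.isna().sum())                        # residual missing values to report upstream
print(df["contract_months"].value_counts())   # quick look at the distribution
```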

Data Quality Values

The main goal of Data Quality is to ensure that data is trustworthy. Without trust, users will create their own data stores and the single source of truth will be lost. By defining what we value in terms of Data Quality, we are able to ensure the data is trustworthy, and we can then use those values to create standards which ensure all our Data Quality values are upheld.

To cover all the bases, we researched a few articles on the various dimensions of Data Quality. Below are links to the articles along with a key used to represent them in the tables below:

After analysing these articles, we mapped all the different dimensions to the five values which we feel capture them best:

| TSI Data Quality Value ↓ | 6C's | DAMA UK's | 6D's | 7D's | DAMA NY's | 7C's |
| --- | --- | --- | --- | --- | --- | --- |
| Completeness | Complete | Completeness | Completeness & Integrity | Completeness & Precision | Completeness | Completeness and Comprehensiveness |
| Consistency | Consistent | Consistency | Consistency | Reliability | Semantic, Structure, Reasonability & Consistency | Reliability and Consistency |
| Cleanliness | Clean | Uniqueness & Accuracy | Accuracy | Accuracy | Accuracy & Identifiability | Accuracy and Precision & Granularity and Uniqueness |
| Compliance | Compliant & Collaborative | Validity | Conformity | Integrity & Confidentiality | Lineage | Availability and Accessibility & Legitimacy and Validity |
| Currency | Current | Timeliness | Timeliness | Timeliness | Timeliness & Currency | Timeliness and Relevance |

For each of these five values, we analyse Data Quality from all three perspectives to create a detailed picture:

1. Completeness

We want to ensure that data sets are complete by checking all dimensions against the source, and also by providing all data required for context and questions that may arise.

  • Logical Level

    • Structured Data:

      • Ensure tables contain all the correct columns
      • Ensure data sets contain all the correct tables
      • Ensure columns that require values are set as not nullable in DB
    • Unstructured Data:
      • Ensure maps contain all the correct fields
      • Ensure collections contain all the correct maps
      • Ensure data sets contain all the correct collections
  • Physical Level

    • All Data:
      • Ensure that the number of null values does not exceed allowed tolerances (a minimal check sketch follows this list)
    • Structured Data:
      • Ensure tables contain all the correct records
      • Track the number of fields that contain null values per column
    • Unstructured Data:
      • Ensure collections contain all the correct maps
      • Track the number of fields that contain null values per map
  • Statistical Level

    • Should no longer be concerned with basic completeness, but rather with the knowledge and wisdom within the data. Sometimes a raw data set is not enough on its own to answer our questions, in which case Data Science should point out the gaps to Data Management, who in turn can inform Data Engineering of where to locate additional data and how it should be integrated into the existing data set.
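
A minimal sketch of the null-tolerance check referenced above, assuming the tolerances live in the data dictionary; the column names and thresholds are illustrative:

```python
# Illustrative completeness check: the share of null values per column is
# compared with a tolerance recorded in the data dictionary (hypothetical values).
import pandas as pd

NULL_TOLERANCE = {        # maximum allowed fraction of nulls per column
    "customer_id": 0.0,   # not nullable
    "email": 0.10,        # nullable, but no more than 10% missing
}

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": [None, "a@example.com", None, "b@example.com"],
})

null_fraction = df.isna().mean()  # fraction of null fields per column
breaches = {
    column: float(null_fraction[column])
    for column, tolerance in NULL_TOLERANCE.items()
    if null_fraction[column] > tolerance
}
print(breaches)  # e.g. {'email': 0.5} -> report to Data Management
```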

2. Consistency

We want to ensure that data is consistent on all levels (field, record, table, dataset, metadata) by checking values across databases and against the original source. Additionally, we need to ensure business definitions have been applied and are conformed to.

  • Logical Level

    • Structured Data:
      • Ensure all columns are set to the correct type
      • Ensure all columns are named according to the standards
    • Unstructured Data:
      • Ensure all fields within a map are set to the correct type
      • Ensure all fields are named according to the standards
  • Physical Level

    • All Data:
      • Ensure that values remain consistent across different databases and datasets (a minimal check sketch follows this list)
    • Structured Data:
      • Ensure all values within a column conform to the assigned type
      • Ensure all values within a column conform to standards that apply to the column's type
    • Unstructured Data:
      • Ensure all values within a map conform to the correct type
      • Ensure all values within a map conform to standards that apply to that field's type
  • Statistical Level

    • Ensure definitions within the data conform to definitions in the standards
    • Ensure relationships within the data and between the tables remain intact
    • Ensure that the way data was captured did not change over time
    • Ensure that values in the data are consistent with expectations created by previous data
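
A minimal sketch of two of the checks above, assuming illustrative table and column names: values in a column conform to the column's assigned type/pattern, and a shared key stays consistent across two datasets.

```python
# Illustrative consistency checks across a column and across two datasets.
import re

import pandas as pd

billing = pd.DataFrame({"msisdn": ["27820000001", "27820000002"], "amount": [100, 250]})
crm = pd.DataFrame({"msisdn": ["27820000001", "27820000003"]})

# 1. Type/standard conformance: MSISDNs must be 11-digit strings starting with 27
#    (an assumed standard for this sketch).
msisdn_pattern = re.compile(r"^27\d{9}$")
bad_msisdns = [value for value in billing["msisdn"] if not msisdn_pattern.match(str(value))]

# 2. Cross-dataset consistency: the same keys should appear in both systems.
only_in_billing = set(billing["msisdn"]) - set(crm["msisdn"])
only_in_crm = set(crm["msisdn"]) - set(billing["msisdn"])

print(bad_msisdns, only_in_billing, only_in_crm)
```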

3. Cleanliness

We want to ensure that our data is clean and an accurate representation of the real world. If it is not, we want to know whether the data can be cleaned and to what extent it needs to be cleaned. If all the other DQ values are upheld, the data should be very close to clean.

  • Logical Level
    • Ensure each table has a unique primary key
    • Ensure that there is no duplication of columns within the table
    • Ensure that there is no duplication of tables within the dataset
    • Ensure that there is no duplication of this dataset elsewhere in the database
  • Physical Level
    • Ensure that there is no duplication of records within the table (a minimal check sketch follows this list)
    • Ensure cleanliness is maintained by running the data verification process for the dataset at least once per specified time period
    • Ensure the dataset's score in each of the DQ values does not fall below the specified threshold
  • Statistical Level
    • Ensure the data is objectively verifiable and measured/collected correctly
    • Ensure the level of detail matches the requirements; aggregation and transformation results should be stored separately
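
A minimal sketch of the duplication checks above, with an illustrative table and primary key:

```python
# Illustrative cleanliness checks: a unique primary key and no duplicated records.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["Thabo", "Aisha", "Aisha", "Sipho"],
})

duplicate_keys = customers["customer_id"][customers["customer_id"].duplicated()]
duplicate_records = customers[customers.duplicated()]  # exact full-row duplicates

print(duplicate_keys.tolist())  # [2] -> primary key is not unique
print(len(duplicate_records))   # 1  -> one duplicate record to remove upstream
```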

4. Compliance

  • Logical Level
  • Physical Level
    • Ensure column/key/collection/document names conform to the naming standards
    • Ensure each of the DQ values does not deteriorate below a certain threshold (a minimal check sketch follows this list)
    • Ensure the data conforms to all rules and regulations set at all of the following levels
      • Data Dictionary (The standards set in the data dictionary)
      • TSI (The standards set in this document)
      • Telkom (Data/Digital Privacy Policy)
      • South Africa (PoPI)
    • Ensure data values comply with all validations
    • Ensure proof of compliance is maintained
    • Ensure the DQ process is automated as much as possible to remove human error/bias
    • Ensure accessibility by promoting the dataset and allowing users to apply for access
  • Statistical Level
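
A minimal sketch of the threshold and proof-of-compliance checks above; the scores, thresholds and log format are illustrative assumptions:

```python
# Illustrative compliance check: each DQ value's score is compared with its
# threshold and the outcome is logged as auditable proof of compliance.
import json
from datetime import datetime, timezone

DQ_THRESHOLDS = {"completeness": 0.95, "consistency": 0.98, "cleanliness": 0.97,
                 "compliance": 1.00, "currency": 0.90}
latest_scores = {"completeness": 0.97, "consistency": 0.96, "cleanliness": 0.99,
                 "compliance": 1.00, "currency": 0.93}

breaches = {value: score for value, score in latest_scores.items()
            if score < DQ_THRESHOLDS[value]}

audit_entry = {
    "checked_at": datetime.now(timezone.utc).isoformat(),
    "scores": latest_scores,
    "breaches": breaches,  # {'consistency': 0.96} in this example
}
print(json.dumps(audit_entry, indent=2))
```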

5. Currency

  • Logical Level
    • Ensure datasets always remain true to source by making sure a connection to the source database is possible and an automated pipeline exists (a minimal freshness sketch follows this list)
  • Physical Level
    • Ensure time series data is an accurate representation of real world events
    • Ensure datasets remain highly available to all users at all times to enable timely analysis
  • Statistical Level
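
A minimal sketch of a currency/freshness check, assuming an agreed freshness window and an illustrative timestamp:

```python
# Illustrative currency check: the newest record must fall within the agreed
# freshness window, otherwise the pipeline from source has fallen behind.
from datetime import datetime, timedelta, timezone

MAX_LAG = timedelta(hours=24)  # agreed freshness window for this dataset (assumed)
latest_record_timestamp = datetime(2021, 1, 3, 6, 0, tzinfo=timezone.utc)

lag = datetime.now(timezone.utc) - latest_record_timestamp
if lag > MAX_LAG:
    print(f"Dataset is stale by {lag - MAX_LAG}; check the source connection and pipeline.")
else:
    print("Dataset is current.")
```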

Data Quality Standards

This section is still very much a work in progress. Standards should mostly be set at source or at an organisational level and fed into TSI. For now this is a guideline list of requirements from a predominantly engineering perspective and needs input from the other viewpoints. A sketch of how these standards could be captured in machine-readable form follows the completeness list below.

1. Completeness

  1. Logical Level
    1. Specify which columns/keys do not require values
    2. Specify tolerance percentage for missing values in nullable and not nullable columns/keys
    3. Columns/keys implement nullable and not nullable in source DB
  2. Physical Level
    1. Specify all the correct records/keys in a table/map
    2. Specify all the correct columns/maps in a table/collection
    3. Specify all the correct tables/collections in a data set
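
As a sketch of how these completeness standards could be captured in a machine-readable form that automated tests can read (the structure and field names are illustrative, not an agreed TSI format):

```python
# Hypothetical machine-readable form of the completeness standards above,
# keyed by the standard numbers referenced in the Data Quality Tests tables.
COMPLETENESS_STANDARD = {
    "1.1.1": {"nullable_columns": ["email", "middle_name"]},
    "1.1.2": {"null_tolerance": {"nullable": 0.10, "not_nullable": 0.0}},
    "1.1.3": {"enforce_not_null_in_source_db": True},
    "1.2.1": {"expected_record_range": [10000, 12000]},
    "1.2.2": {"expected_columns": ["customer_id", "email", "middle_name"]},
    "1.2.3": {"expected_tables": ["customers", "contracts"]},
}
print(COMPLETENESS_STANDARD["1.1.2"]["null_tolerance"])
```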

2. Consistency

Data Quality Tests

1. Completeness

| Level | Data Type | Test | Test Output | Standard | Test Condition Definer | Tester | Reporter | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Logical | Structured | Required columns are not nullable in DB | List of misconfigured columns | 1.1.3 | Data Dictionary | DE | DE | DAMA UK: Completeness & DAMA NY: Completeness |
| Logical | Structured | Null field check/count | Percentage of null fields | 1.1.1 & 1.1.2 | Data Dictionary | DE | DM | DAMA UK: Completeness & DAMA NY: Completeness |
| Logical | Unstructured | Null value check/count | Percentage of null values | 1.1.1 & 1.1.2 | Data Dictionary | DE | DM | DAMA UK: Completeness & DAMA NY: Completeness |
| Physical | Structured | All columns present | List of missing columns | 1.2.2 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Structured | All records present | Range of missing records | 1.2.1 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Structured | All tables present | List of missing tables | 1.2.3 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Unstructured | All maps present | List of missing maps | 1.2.2 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Unstructured | All keys present | List of missing keys | 1.2.1 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | Unstructured | All collections present | List of missing collections | 1.2.3 | Data Dictionary | DM | DM | 6C's: Complete |
| Physical | All Data | No irrelevant or confusing data is present | Possibly irrelevant or confusing data | - | Data Dictionary | DM | DM | 6C's: Complete |
| Statistical | All Data | Is the data comprehensive and sufficient enough? | Gaps in the data set | - | DS | DS | DM | 6D's: Completeness, 7D's: Completeness & 7C's: Completeness and Comprehensiveness |
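
As a rough illustration of the first test in the table ("Required columns are not nullable in DB"), the sketch below inspects column metadata in an in-memory SQLite table; the table, columns and required set are stand-ins for a real source database.

```python
# Illustrative implementation of the 1.1.3 test: required columns must be
# declared NOT NULL in the source database. SQLite is used as a stand-in.
import sqlite3

REQUIRED_NOT_NULL = {"customer_id", "msisdn"}  # from the data dictionary (assumed)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER NOT NULL, msisdn TEXT, email TEXT)")

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
not_null_columns = {row[1] for row in conn.execute("PRAGMA table_info(customers)") if row[3]}
misconfigured = sorted(REQUIRED_NOT_NULL - not_null_columns)
print(misconfigured)  # ['msisdn'] -> the list of misconfigured columns to report
```
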
2. Consistency

| Level | Test | Test Output | Standard | Test Condition Definer | Tester | Reporter | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Consistency |  |  |  |  |  |  |  |
| Consistency |  |  |  |  |  |  |  |
| Consistency |  |  |  |  |  |  |  |
| Consistency |  |  |  |  |  |  |  |
| Consistency |  |  |  |  |  |  |  |

Data Quality Process

References

# Example map: a key/value structure with string, numeric, list and nested map fields
json_map = {
    "field1": "value",
    "field2": 2,
    "field3": ["well", "thats", 2, "cool"],
    "field4": {"child_field1": "child_value"}
}