Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
- Upper and lower bounds tests, interquartile range (IQR) checks, and standard-deviation checks (several of these are sketched in the code after this list).
- Aggregate-level checks: after manipulating data, you should still be able to explain how the data aggregates back to the previous data set.
- Tracking the percentage of nulls and the number of dropped columns (define what counts as an acceptable amount).
- Data type checks (these should be done earlier, at the application level), as well as data value constraints (e.g., WA is a valid state abbreviation while KZ is not).
- Tracking data inserts.
-
Wherever data comes from, whether flat files, IPs, users, etc., its provenance should be tracked, especially when it comes from specific files. If your team finds out that the data from a particular file was inaccurate, it will want to remove that data, and if you have tracked which file each record came from, doing so is easy.
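The following is a minimal pandas sketch of several of the checks above (IQR bounds, null percentage, value constraints, aggregate reconciliation, and source-file tracking). All file names, column names, and thresholds are made-up examples.

```python
import pandas as pd

# Hypothetical input; file path, column names, and thresholds are illustrative.
df = pd.read_csv("orders.csv")
df["source_file"] = "orders.csv"  # track which file every record came from

# Upper/lower bounds via the interquartile range (IQR).
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["amount"] < lower) | (df["amount"] > upper)]
print(f"{len(outliers)} rows fall outside [{lower:.2f}, {upper:.2f}]")

# Percentage of nulls per column; 5% is an arbitrary example threshold.
null_pct = df.isna().mean() * 100
bad_columns = null_pct[null_pct > 5]
assert bad_columns.empty, f"columns above the null threshold:\n{bad_columns}"

# Data value constraints: every state code must come from a known set.
valid_states = {"WA", "OR", "CA"}  # illustrative subset
bad_states = set(df["state"].dropna()) - valid_states
assert not bad_states, f"unexpected state codes: {bad_states}"

# Aggregate-level check: totals should reconcile after a transformation.
cleaned = df[df["amount"].between(lower, upper)]
dropped_total = df["amount"].sum() - cleaned["amount"].sum()
print(f"cleaning removed {dropped_total:.2f} of the total amount")

# Because the source file is tracked, removing a bad batch later is trivial:
#     df = df[df["source_file"] != "bad_batch.csv"]
```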
Data Validation Tools
voluptuous
voluptuous is a Python data validation library.
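For example, a minimal sketch of a voluptuous schema (the field names and constraints below are hypothetical):

```python
from voluptuous import All, Length, MultipleInvalid, Range, Required, Schema

# Hypothetical schema: a two-letter state code and a non-negative integer age.
schema = Schema({
    Required("state"): All(str, Length(min=2, max=2)),
    Required("age"): All(int, Range(min=0)),
})

try:
    schema({"state": "WA", "age": 42})     # passes
    schema({"state": "KZ123", "age": -1})  # fails: bad length and range
except MultipleInvalid as error:
    print(error)
```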
Useful Libraries
- pandas-profiling is a tool for profiling pandas DataFrames. One possible way to work with large data is to do simple manual profiling on the full DataFrame, then draw a relatively small sample and profile it with pandas-profiling (sketched below, after this list).
- great_expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling (a sketch follows this list).
- deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets (a pydeequ sketch follows this list).
- Apache Griffin supports data profiling but seems to be heavy and limited.
- A GUI tool for data cleaning, profiling, etc.
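A minimal sketch of the sampling approach mentioned in the pandas-profiling item above; the file name, sample size, and report title are made up, and the import path may differ in newer releases (the package was later renamed ydata-profiling).

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("transactions.csv")  # hypothetical large file

# Profile a modest sample instead of the full DataFrame.
sample = df.sample(n=min(len(df), 100_000), random_state=0)
profile = ProfileReport(sample, title="Transactions sample profile")
profile.to_file("transactions_profile.html")
```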
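A sketch of attaching a couple of expectations with great_expectations. The API has changed substantially across releases, so this assumes the older pandas-backed interface; the column names are hypothetical.

```python
import great_expectations as ge
import pandas as pd

raw = pd.DataFrame({"user_id": [1, 2, None], "age": [25, 42, 200]})
df = ge.from_pandas(raw)  # wrap the DataFrame so expectations can be attached

df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=130)

# Run all attached expectations and inspect the validation result.
results = df.validate()
print(results)
```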
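deequ itself is a Scala/Spark library; below is a sketch of driving it from Python via the pydeequ wrapper. The dataset path, column names, and constraints are illustrative, and the Spark/pydeequ setup details vary by version.

```python
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

# Spark session with the deequ jar on the classpath.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://bucket/orders/")  # hypothetical dataset

# "Unit tests for data": completeness, uniqueness, and non-negativity checks.
check = Check(spark, CheckLevel.Error, "orders checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check.isComplete("order_id")
                         .isUnique("order_id")
                         .isNonNegative("amount"))
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show()
```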
Commercial Solutions
Books
Python Business Intelligence Cookbook
References
https://towardsdatascience.com/introducing-pydqc-7f23d04076b3
https://medium.com/@SeattleDataGuy/good-data-quality-is-key-for-great-data-science-and-analytics-ccfa18d0fff8
https://dzone.com/articles/java-amp-apache-spark-for-data-quality-amp-validat