Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Data Validation Tools

voluptuous

voluptuous is a Python data validation library.

Useful Libraries

  1. pandas-profiling

    pandas-profiling is a tool for profiling pandas DataFrames. One possible way to work with large data is to compute simple summary statistics on the full DataFrame, then draw a relatively small sample and profile that sample with pandas-profiling.

  2. great_expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling.

  3. deequ is a library built on top of Apache Spark for defining “unit tests for data”, which measure data quality in large datasets.

  4. Optimus

  5. dbsanity

  6. Apache Griffin

    Apache Griffin supports data profiling but seems heavyweight and limited in functionality.

  7. DataCleaner

    A GUI tool for data cleaning, profiling, etc.

  8. Open Studio for Data Quality
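The "summarize everything cheaply, then profile a sample" approach mentioned under pandas-profiling could be sketched roughly as follows. The helper names `cheap_summary` and `sample_for_profiling`, the row limit, and the toy DataFrame are assumptions made for illustration; only pandas is required to run it, and the pandas-profiling call is left as a comment.

```python
import pandas as pd


def cheap_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Inexpensive per-column statistics that can run on the full DataFrame."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it is small enough, otherwise a random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


# Toy stand-in for a large DataFrame (hypothetical data).
big = pd.DataFrame({
    "id": range(100_000),
    "group": ["a", "b", None, "d"] * 25_000,
})

summary = cheap_summary(big)                        # runs on all rows
small = sample_for_profiling(big, max_rows=1_000)   # detailed profiling target

# With pandas-profiling installed, the sample could then be profiled:
# from pandas_profiling import ProfileReport
# ProfileReport(small, title="Sample profile").to_file("sample_profile.html")
```

A uniform random sample can miss rare values, so the full-data summary is worth keeping alongside the sampled report.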

Commercial Solutions

  1. Trifacta

Books

Python Business Intelligence Cookbook

References

https://towardsdatascience.com/introducing-pydqc-7f23d04076b3

https://medium.com/SeattleDataGuy/good-data-quality-is-key-for-great-data-science-and-analytics-ccfa18d0fff8

https://dzone.com/articles/java-amp-apache-spark-for-data-quality-amp-validat