Ben Chuanlong Du's Blog

It is never too late to learn.

Data Profiling Tools

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. ydata-profiling

    ydata-profiling (successor to pandas-profiling) is a tool for profiling pandas and Spark DataFrames. One possible way to work with large data is to do simple profiling on the full DataFrame, then draw a relatively small sample and profile it in detail with ydata-profiling.

  2. great_expectations

    great_expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling.

  3. deequ

  4. Optimus

    Optimus is the closest to what I want to achieve so far. It looks promising.

  5. Apache Griffin

    Apache Griffin supports data profiling but seems to be heavyweight and limited.
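
The sample-then-profile idea mentioned under ydata-profiling can be sketched roughly as below. This is only a minimal illustration, not a recommended recipe: the helper names (`cheap_profile`, `sample_for_profiling`, `detailed_report`), the toy data, and the 1,000-row sample size are all made up for the example, and pandas is used for both steps to keep it self-contained. With a Spark DataFrame, the cheap step would presumably run on Spark and only the sample would be converted to pandas.

```python
import pandas as pd


def cheap_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary that is cheap enough to compute on the full data."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "non_null": df.notna().sum(),
            "null_fraction": df.isna().mean(),
            "distinct": df.nunique(),  # missing values are not counted
        }
    )


def sample_for_profiling(
    df: pd.DataFrame, max_rows: int = 10_000, seed: int = 0
) -> pd.DataFrame:
    """Return df itself if it is small, otherwise a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def detailed_report(sample: pd.DataFrame, path: str = "profile.html") -> None:
    """Write a ydata-profiling HTML report for the (small) sample."""
    # Imported lazily so the rest of the sketch runs without ydata-profiling.
    from ydata_profiling import ProfileReport

    report = ProfileReport(sample, title="Sample profile", minimal=True)
    report.to_file(path)


# Hypothetical toy data standing in for a large table.
df = pd.DataFrame(
    {"id": range(100_000), "group": ["a", "b", None, "a"] * 25_000}
)

summary = cheap_profile(df)                          # step 1: full data, simple stats
sample = sample_for_profiling(df, max_rows=1_000)    # step 2: small sample
# detailed_report(sample)  # step 3: uncomment once ydata-profiling is installed
```

A uniform random sample can miss rare values, so the cheap full-data summary is still worth keeping next to the detailed report on the sample.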

Other Ad Hoc Examples

https://towardsdatascience.com/profiling-big-data-in-distributed-environment-using-spark-a-pyspark-data-primer-for-machine-78c52d0ce45

http://www.bigdatareflections.net/blog/?p=111
