The notes on this page are fragmentary, immature thoughts of the author. Please read with your own judgement!
-
ydata-profiling (the successor to pandas-profiling) is a tool for profiling pandas and Spark DataFrames. One possible way to work with large data is to do simple profiling on the full DataFrame, then draw a relatively small sample and profile it with ydata-profiling.
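A minimal sketch of that profile-then-sample workflow, using plain pandas for the cheap whole-table pass; the DataFrame here is synthetic, and the `ProfileReport` call is shown commented out on the assumption that ydata-profiling is installed:

```python
import numpy as np
import pandas as pd

# Hypothetical large DataFrame standing in for the real data.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": rng.normal(size=1_000_000),
})

# Step 1: cheap, simple profiling over the whole table with plain pandas.
summary = big.describe()

# Step 2: draw a relatively small random sample for the expensive report.
sample = big.sample(n=10_000, random_state=42)

# Step 3: run the full ydata-profiling report on the sample only
# (assumes the ydata-profiling package is installed; not executed here):
# from ydata_profiling import ProfileReport
# ProfileReport(sample, title="Sample profile").to_file("report.html")
```

The sampling step keeps the heavyweight report bounded regardless of the size of the original table; `random_state` is set only to make the sketch reproducible.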
-
great_expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
-
Optimus is the closest so far to what I want to achieve. Looks promising.
-
Apache Griffin supports data profiling but seems heavyweight and limited.
Other Ad Hoc Examples
https://towardsdatascience.com/profiling-big-data-in-distributed-environment-using-spark-a-pyspark-data-primer-for-machine-78c52d0ce45
http://www.bigdatareflections.net/blog/?p=111