Ben Chuanlong Du's Blog

It is never too late to learn.

Data Profiling Tools

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. ydata-profiling

    ydata-profiling (successor to pandas-profiling) is a tool for profiling pandas and Spark DataFrames. One possible way to work with large data is to do simple profiling on the large DataFrame and …

Profiling Data Using ydata-profiling

Tips and Traps

  1. Use multiprocessing (e.g., pool_size=8) to speed up data profiling. Note: it seems to me that multiprocessing currently only works when minimal=True.

  2. minimal=True helps reduce memory consumption.

     from ydata_profiling import ProfileReport

     # df is an existing pandas DataFrame.
     profile = ProfileReport(
         df, title="Data Profiling Report",
         explorative=True, minimal=True, pool_size=8
     )

Data Quality

  • Upper and lower bound tests, interquartile range (IQR) checks, and standard deviation checks

  • Aggregate level checks (after manipulating data, there should still be the ability to explain how the data …
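The bound checks listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the helper names (iqr_bounds, std_bounds, out_of_bounds), the multipliers k=1.5 and n_std=3.0, and the sample readings are all hypothetical, and the standard library is used so the snippet runs without pandas.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is a common convention, assumed here for illustration.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def std_bounds(values, n_std=3.0):
    """Return (lower, upper) as mean -/+ n_std sample standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return mean - n_std * std, mean + n_std * std


def out_of_bounds(values, lower, upper):
    """Return the values that fall outside the closed range [lower, upper]."""
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample data with one obviously suspicious reading.
readings = [10, 12, 11, 13, 12, 11, 10, 95]

iqr_lower, iqr_upper = iqr_bounds(readings)
print("IQR fences:", iqr_lower, iqr_upper)
print("Flagged by IQR check:", out_of_bounds(readings, iqr_lower, iqr_upper))

std_lower, std_upper = std_bounds(readings)
print("Flagged by std check:", out_of_bounds(readings, std_lower, std_upper))
```

On small samples a single extreme value can inflate the standard deviation enough to hide itself, so it may be worth running both checks rather than relying on one.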