Tips and Traps
It is suggested that you use multiprocessing (e.g., `pool_size=8`) to speed up data profiling. Note that multiprocessing currently seems to work only when `minimal=True`. Setting `minimal=True` also helps reduce memory consumption.

```python
profile = ProfileReport(
    df,
    title="Data Profiling Report",
    explorative=True,
    minimal=True,
    pool_size=8,
)
```
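The speedup from `pool_size` comes from ordinary Python multiprocessing. A minimal stdlib sketch of the same idea (the `profile_column` function is a hypothetical stand-in for CPU-bound per-column profiling work, not part of the profiling library):

```python
from multiprocessing import Pool

def profile_column(n):
    # Hypothetical stand-in for a CPU-bound per-column computation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Analogous to pool_size=8: spread the work over 8 worker processes.
    with Pool(processes=8) as pool:
        results = pool.map(profile_column, [10_000] * 16)
    print(len(results))  # 16
```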
Convert Pandas DataFrame to Other Format
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
DataFrame Implementations in Python
Tips and Traps
Alternatives to pandas for Small Data
- Polars is a blazingly fast DataFrame library implemented in Rust, using Apache Arrow as its memory model. It is the best replacement for pandas for small data at this time. Note that Polars supports multithreading and lazy computation, but it cannot handle larger-than-memory data at this time.
Data Types in Different Programming Languages
Data Type | C | C++ | Rust | Java | Python | numpy | pyarrow | Spark SQL | SQL |
---|---|---|---|---|---|---|---|---|---|
8-bit integer | int8_t | int8_t | i8 | byte | int (arbitrary precision) | int8 | int8 | TinyInt | … |
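The contrast between Python's arbitrary-precision `int` and the fixed-width 8-bit types in the table can be seen with numpy:

```python
import numpy as np

# numpy's int8 is a fixed-width 8-bit signed integer.
info = np.iinfo(np.int8)
print(info.min, info.max)  # -128 127

# Python's built-in int has arbitrary precision, so it never overflows:
print(2 ** 100)
```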
Koalas is the pandas API on PySpark. Note that Koalas has since been merged into PySpark itself as `pyspark.pandas` (Spark 3.2+).