Tips and Traps
It is suggested that you use multiprocessing (e.g., `pool_size=8`) to speed up data profiling. Note that multiprocessing currently seems to work only when `minimal=True`. Setting `minimal=True` also helps reduce memory consumption.

```python
profile = ProfileReport(
    df,
    title="Data Profiling Report",
    explorative=True,
    minimal=True,
    pool_size=8,
)
```
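The speedup from `pool_size` comes from ordinary Python multiprocessing. A minimal stdlib sketch of the same idea (the `profile_column` function is a hypothetical stand-in for CPU-bound per-column profiling work, not part of the profiling library):

```python
from multiprocessing import Pool

def profile_column(n):
    # Hypothetical stand-in for a CPU-bound per-column computation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Analogous to pool_size=8: spread the work over 8 worker processes.
    with Pool(processes=8) as pool:
        results = pool.map(profile_column, [10_000] * 16)
    print(len(results))  # 16
```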
Convert Pandas DataFrame to Other Format
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
DataFrame Implementations in Python
Tips and Traps
Alternatives to pandas for Small Data
- Polars is a blazingly fast DataFrame library implemented in Rust, using Apache Arrow as its memory model. It is the best replacement for pandas for small data at this time. Note that Polars supports multithreading and lazy computation, but it cannot handle larger-than-memory data at this time.
Data Types in Different Programming Languages
Data Type | C | C++ | Rust | Java | Python | numpy | pyarrow | Spark SQL | SQL |
---|---|---|---|---|---|---|---|---|---|
8-bit integer | int8_t | int8_t | i8 | byte | int (arbitrary precision) | int8 | int8 | TinyInt | … |
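The contrast between Python's arbitrary-precision `int` and the fixed-width 8-bit types in the table can be seen with numpy:

```python
import numpy as np

# numpy's int8 is a fixed-width 8-bit signed integer.
info = np.iinfo(np.int8)
print(info.min, info.max)  # -128 127

# Python's built-in int has arbitrary precision, so it never overflows:
print(2 ** 100)
```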
Koalas is the pandas API on PySpark. Note that Koalas has since been merged into PySpark itself as `pyspark.pandas` (Spark 3.2+).