Ben Chuanlong Du's Blog

It is never too late to learn.

DataFrame Implementations in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Tips and Traps

Alternatives to pandas for Small Data

  1. Polars is a blazingly fast DataFrame library implemented in Rust that uses Apache Arrow as its memory model. It is the best replacement for pandas on small data at this time. Note that Polars supports multithreading and lazy computation, but it cannot handle data larger than memory at this time.

Working with Big Data

  1. There are multiple ways to handle big data in Python, among which PySpark and vaex are the best. Dask is NO GOOD compared to PySpark and vaex.

  2. Benchmarking Python Distributed AI Backends with Wordbatch has a detailed comparison of Dask, Ray and PySpark. Dask is no good. Both Ray and PySpark scale well, with Ray having a slight performance advantage over PySpark. Note that modin is a project that aims to scale pandas workflows by changing one line of code, and it can use Ray as its backend. It will probably provide better performance than Dask if you work with data frames.

  3. If you have relatively large memory, say more than 20 GB, on your (single) machine, you can handle (filter, sort, merge, etc.) millions (order of 1E6) of rows in a pandas DataFrame without any pressure. When you have more than 10 million rows or the memory on your (single) machine is limited, you should consider using big data tools such as PySpark and vaex.

  4. Do NOT use the Jupyter/Lab plugin jupyterlab-lsp if you work on big data in Jupyter/Lab. The plugin jupyterlab-lsp has issues with large DataFrames (both with pandas and PySpark DataFrames) and can easily crash your Jupyter/Lab server even if you have enough memory.
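As a rough way to judge where a dataset sits relative to the row counts in item 3, one can measure the in-memory size of a sample pandas DataFrame and extrapolate linearly. The sketch below is only illustrative: the column names, dtypes, and row count are assumptions made for the example, and real data (especially string columns) can be much larger per row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1 million rows with two 8-byte numeric columns.
n_rows = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "id": np.arange(n_rows, dtype="int64"),
        "value": rng.random(n_rows),
    }
)

# deep=True also counts the contents of object (e.g. string) columns.
bytes_used = int(df.memory_usage(deep=True).sum())
bytes_per_row = bytes_used / n_rows

# Linear extrapolation to 10 million rows (an assumption: rows are similar in size).
estimated_gib = bytes_per_row * 10_000_000 / 2**30

print(f"{bytes_used / 2**20:.1f} MiB for {n_rows:,} rows")
print(f"about {estimated_gib:.2f} GiB estimated for 10,000,000 rows")
```

Keep in mind that operations such as sorting and merging typically need several times the size of the data as working memory, so the estimate is a lower bound on what the machine should have available.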

Pandas DataFrame

Pandas DataFrame is the most popular DataFrame implementation in Python.

polars

Polars is a blazingly fast DataFrame library implemented in Rust that uses Apache Arrow as its memory model.

  1. It is the best replacement for pandas on small data at this time.
  2. Polars supports multithreading and lazy computation.
  3. Polars cannot handle data larger than memory at this time (even though this might change in the future).

PySpark DataFrame

PySpark DataFrame is a good option if you have to work on relatively large data on a single machine, especially if you have some Spark knowledge.

vaex

It seems to me that vaex misses some critical features. I do not suggest users try it at this time.

Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid for more than a billion (10^9) samples/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).

cylon

Cylon uses technologies similar to those of vaex (C++, Apache Arrow, etc.). However, it doesn't seem to be as mature as vaex. A few advantages of Cylon over vaex are:

  • cylon supports different language APIs (C++, Python, Java, etc.)
  • cylon is distributed while vaex is single machine only

Cylon is a fast, scalable distributed memory data parallel library for processing structured data. Cylon implements a set of relational operators to process data. While "Core Cylon" is implemented using system level C/C++, multiple language interfaces (Python and Java) are provided to seamlessly integrate with existing applications, enabling both data and AI/ML engineers to invoke data processing operators in a familiar programming language. By default it works with MPI for distributing the applications. Internally Cylon uses Apache Arrow to represent the data in a column format.

cudf

cuDF (developed by RAPIDS) is a GPU DataFrame library, built on the Apache Arrow columnar memory format, for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

modin

Modin can use Ray as a backend. With both installed, you might see a significant benefit by changing just a single line (import pandas as pd to import modin.pandas as pd). Unlike the other tools, Modin aims to reach full compatibility with pandas.

Modin: a drop-in replacement for Pandas, powered by either Dask or Ray.
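The one-line change can be sketched as below. The try/except fallback to plain pandas is my own addition (not something Modin requires) so that the snippet still runs where Modin is not installed; the toy data is hypothetical.

```python
# Assumed pattern: prefer Modin when it is installed, else fall back to pandas.
try:
    import modin.pandas as pd  # the one-line swap for "import pandas as pd"
except ImportError:
    import pandas as pd

# Everything below is ordinary pandas-style code and does not change.
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
totals = df.groupby("key")["value"].sum()
print(totals)
```

Because the rest of the code is untouched, it is easy to time the same workflow under both imports and check whether the swap actually helps for a given workload.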

dask.DataFrame

dask.DataFrame is not as good as other DataFrame implementations presented here. I'd suggest you try other alternatives (vaex or PySpark DataFrame).

Dask is a low-level scheduler and a high-level partial Pandas replacement, geared toward running code on compute clusters. Dask provides dask.dataframe, a higher-level, Pandas-like library that can help you deal with out-of-core datasets.
