Spark Issue: RuntimeError: Arrow Legacy IPC Format Is Not Supported

Apr 03, 2022

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Symptoms

RuntimeError: Arrow legacy IPC format is not supported in PySpark, please unset ARROW_PRE_0_15_IPC_FORMAT

Possible Causes

You are using PySpark 3.0+ with one (or both) of the following options.

--conf …
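A minimal sketch of the fix the error message asks for, assuming the variable is set in the driver's own environment: check for it and remove it before the Spark session is created. The helper name `unset_legacy_arrow_flag` is hypothetical. It only touches the local process, so if the variable is also set for executors through Spark configuration, that setting has to be removed separately.

```python
import os

# Name of the environment variable mentioned in the error message.
LEGACY_FLAG = "ARROW_PRE_0_15_IPC_FORMAT"


def unset_legacy_arrow_flag(env=os.environ):
    """Remove the legacy Arrow flag from env (hypothetical helper).

    Returns the value the flag had, or None if it was not set.
    """
    return env.pop(LEGACY_FLAG, None)


# Run this before creating the SparkSession in the driver process.
previous = unset_legacy_arrow_flag()
if previous is not None:
    print(f"Removed {LEGACY_FLAG}={previous} from the driver environment")
```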

Spark Issue: TypeError WithReplacement

Dec 17, 2021

Symptoms

TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [].

Causes

An integer number (e.g., 1) is passed to the fraction parameter …
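One way to avoid this error is to coerce the fraction to a float before calling `DataFrame.sample`. The wrapper below is a sketch under that assumption; `sample_rows` is a hypothetical name, and `df` is expected to be a PySpark DataFrame (anything with a compatible `sample` method works).

```python
def sample_rows(df, fraction, seed=None):
    """Sample rows without replacement, coercing fraction to float.

    Hypothetical helper: passing fraction=1 (an int) directly to
    df.sample triggers the TypeError above, while 1.0 is accepted.
    """
    return df.sample(withReplacement=False, fraction=float(fraction), seed=seed)
```

With a real DataFrame, `sample_rows(df, 1)` then behaves like `df.sample(False, 1.0)`.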

Rust and Spark

Oct 10, 2021

The simplest and best way is to leverage pandas_udf in PySpark. In the pandas UDF, you can call subprocess.run to run any shell command and capture its output.

from pathlib …
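A minimal sketch of that approach, not the author's original code: `run_command` wraps `subprocess.run` and returns the captured standard output, and `make_udf` builds a pandas UDF that calls a binary once per value. Both function names and the `binary` argument are hypothetical, and the sketch assumes the binary is available at the same path on every executor. PySpark and pandas are imported only inside `make_udf`, so `run_command` can be tried locally on its own.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments and return its stdout."""
    proc = subprocess.run(args, capture_output=True, text=True, check=True)
    return proc.stdout.strip()


def make_udf(binary):
    """Build a pandas UDF that calls `binary` with each value as its argument."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def call_binary(values: pd.Series) -> pd.Series:
        # One process per value: simple, but costly for large partitions.
        return values.map(lambda v: run_command([binary, str(v)]))

    return call_binary


# Local check that does not need Spark: run the Python interpreter itself.
print(run_command([sys.executable, "-c", "print('hello from a subprocess')"]))
```

Spawning one process per row is the simplest design. If the binary can take many inputs at once, passing a whole batch per call would reduce process startup overhead.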

Packaging Python Dependencies for PySpark Using Pex

Feb 07, 2020

python-build-standalone is a better alternative to conda-pack for managing Python dependencies for PySpark. Please refer to Packaging Python Dependencies for PySpark Using python-build-standalone for a tutorial on how to use python-build-standalone to …

Logging in PySpark

Jun 15, 2020

Excessive logging is better than no logging! This is generally true in distributed big data applications.
Use loguru if it is available. If you have to use the logging module, be …
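The preference for loguru with a fallback to the standard library can be sketched as follows. This is an illustrative setup, not the author's configuration: the logger name `pyspark_app`, the log format, and the `log_step` helper are assumptions.

```python
import logging

try:
    # Preferred: loguru, if it is installed in the environment.
    from loguru import logger
except ImportError:
    # Fallback: the standard logging module with an explicit format.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    )
    logger = logging.getLogger("pyspark_app")


def log_step(step, n_rows):
    """Log one pipeline step; the f-string works with either logger."""
    message = f"step={step} rows={n_rows}"
    logger.info(message)
    return message


log_step("load", 1000)
```

Formatting the message with an f-string sidesteps the different placeholder styles of loguru (`{}`) and logging (`%s`).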

Subtle Differences Between Spark DataFrame and PySpark DataFrame

Feb 19, 2020

Besides using the col function to reference a column, Spark/Scala DataFrame supports using $"col_name" (based on implicit conversion; requires import spark.implicits._), while PySpark DataFrame supports using …

Ben Chuanlong Du's Blog

It is never too late to learn.
