Ben Chuanlong Du's Blog

It is never too late to learn.

Spark Issue: RuntimeError: Arrow Legacy IPC Format Is Not Supported

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Symptoms

RuntimeError: Arrow legacy IPC format is not supported in PySpark, please unset ARROW_PRE_0_15_IPC_FORMAT

Possible Causes

You are using PySpark 3.0+ with one (or both) of the following options.

--conf …
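The error message asks for ARROW_PRE_0_15_IPC_FORMAT to be unset. A minimal sketch of doing that from Python, assuming the variable has leaked into the environment of the process that launches Spark. The helper name unset_arrow_legacy_flag is hypothetical, and this does not touch values passed through --conf options, which would have to be dropped from the submit command instead.

```python
import os

# Variable name taken from the error message.
ARROW_LEGACY_VAR = "ARROW_PRE_0_15_IPC_FORMAT"


def unset_arrow_legacy_flag(env=None):
    """Remove the legacy Arrow flag from env; return True if it was set.

    env defaults to os.environ, i.e. the current process only.
    """
    env = os.environ if env is None else env
    return env.pop(ARROW_LEGACY_VAR, None) is not None


if __name__ == "__main__":
    # Call this before the Spark session is created.
    if unset_arrow_legacy_flag():
        print(f"Removed {ARROW_LEGACY_VAR} from the current environment")
```
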

Spark Issue: TypeError WithReplacement

Symptoms

TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<class 'int'>].

Causes

An integer number (e.g., 1) is passed to the fraction parameter …
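One way to avoid the error is to convert the value to a float before it reaches sample. The sketch below is only an illustration: as_fraction is a hypothetical helper, and the [0.0, 1.0] range check assumes sampling without replacement.

```python
def as_fraction(value):
    """Return value as a float in [0.0, 1.0] for use as a sampling fraction."""
    # bool is a subclass of int; reject it so it is not mistaken for a fraction.
    if isinstance(value, bool):
        raise TypeError("fraction must be a number, not a bool")
    fraction = float(value)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"fraction must be within [0.0, 1.0], got {fraction}")
    return fraction


# Usage with a hypothetical PySpark DataFrame named df:
#   df.sample(fraction=as_fraction(1))  # passes 1.0 instead of the integer 1
```
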

Rust and Spark

The simplest and best way to use Rust with Spark is to leverage pandas_udf in PySpark. Inside the pandas UDF, you can call subprocess.run to run any shell command (such as a Rust executable) and capture its output.

from pathlib …
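A minimal sketch of the subprocess side of this approach, not the author's code. It assumes a command-line tool that takes one argument and prints one line; a small Python one-liner stands in for the Rust binary so the example runs anywhere. The names COMMAND, run_command and run_on_batch are hypothetical.

```python
import subprocess
import sys

# Stand-in for a compiled Rust binary (assumption): upper-cases its argument.
COMMAND = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]


def run_command(value, command=COMMAND):
    """Run command with value as its last argument and return stripped stdout."""
    result = subprocess.run(
        [*command, value],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


def run_on_batch(values):
    """Apply run_command to every value in a batch, preserving order."""
    return [run_command(v) for v in values]


# Wiring into PySpark (not executed here; needs pyspark and pandas):
#   import pandas as pd
#   from pyspark.sql.functions import pandas_udf
#
#   @pandas_udf("string")
#   def run_udf(col: pd.Series) -> pd.Series:
#       return pd.Series(run_on_batch(col))
```

Starting one process per value is slow. If the tool can read many lines from stdin, passing the whole batch through a single subprocess.run call with input=... is likely a better fit.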

Logging in PySpark

  1. Excessive logging is better than no logging! This is generally true in distributed big data applications.

  2. Use loguru if it is available. If you have to use the logging module, be …
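The two points above can be sketched as follows: prefer loguru when it is installed and fall back to the standard logging module otherwise. The logger name pyspark_job, the log format and the count_rows function are illustrative assumptions.

```python
import logging
import sys

try:
    from loguru import logger  # third-party; preferred when it is installed
except ImportError:
    # Fallback: configure the standard logging module once.
    logging.basicConfig(
        stream=sys.stderr,
        level=logging.INFO,
        format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    )
    logger = logging.getLogger("pyspark_job")  # hypothetical logger name


def count_rows(rows):
    """Count the rows of one partition and log the result."""
    n = sum(1 for _ in rows)
    # An f-string keeps the call identical for loguru and logging.
    logger.info(f"partition processed: {n} rows")
    return n
```

With an RDD, this could be applied as rdd.mapPartitions(lambda rows: [count_rows(rows)]). Messages emitted inside such a function are written on the executors, so they show up in the executor logs rather than in the driver output.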

Subtle Differences Between Spark DataFrame and PySpark DataFrame

  1. Besides using the col function to reference a column, Spark/Scala DataFrame supports using $"col_name" (based on implicit conversion; requires import spark.implicits._) while PySpark DataFrame supports using …