Ben Chuanlong Du's Blog

It is never too late to learn.

Rust and Spark

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

The simplest and most robust approach is to leverage pandas_udf in PySpark. Inside the pandas UDF, you can call subprocess.run to execute the compiled Rust binary (or any shell command) and capture its output.

from pathlib …
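The idea can be sketched as follows (a minimal sketch: the `echo` command stands in for your compiled Rust binary, and the column and function names are illustrative). The UDF body receives a pandas Series, shells out once per value, and returns the captured stdout:

```python
import subprocess

import pandas as pd


def run_cmd(paths: pd.Series) -> pd.Series:
    """Run a shell command for each value in the Series and capture its stdout."""
    return paths.apply(
        lambda p: subprocess.run(
            ["echo", p],  # placeholder: replace with your Rust binary and args
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    )


# In PySpark, wrap the function as a pandas UDF and apply it to a column:
# from pyspark.sql.functions import pandas_udf
# run_cmd_udf = pandas_udf(run_cmd, returnType="string")
# df.withColumn("out", run_cmd_udf("path"))
```

Keeping the logic in a plain pandas function makes it easy to unit test locally before wrapping it with pandas_udf.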

Yarn for Spark


  1. List all Spark applications.

    yarn application -list
    
  2. Show status of a Spark application.

    yarn application -status application_1459542433815_0002
    
  3. View logs of a Spark application.

    yarn logs -applicationId application_1459542433815_0002
    
  4. Kill a Spark application …
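When scripting against these commands, the application IDs can be pulled out of the `-list` output with a small helper. This is a sketch; `extract_app_ids` is a hypothetical helper, and the sample output below is an assumed, simplified version of what `yarn application -list` prints:

```python
import re


def extract_app_ids(yarn_list_output: str) -> list[str]:
    """Extract YARN application IDs from `yarn application -list` output."""
    return re.findall(r"application_\d+_\d+", yarn_list_output)


# Assumed, simplified sample of `yarn application -list` output.
sample = """Total number of applications:2
Application-Id                    Application-Name    State
application_1459542433815_0002    my_spark_job        RUNNING
application_1459542433815_0003    etl_job             FINISHED
"""
ids = extract_app_ids(sample)
```

In practice you would feed the helper the captured stdout of the command, e.g. via `subprocess.run(["yarn", "application", "-list"], capture_output=True, text=True)`.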

Spark Issue Libc Not Found


Symptom

/lib64/libc.so.6: version `GLIBC_2.18' not found (required by ...)

Cause

The version of GLIBC required by the binary executable is not available on the Spark nodes.

Solution

Recompile your …
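Before shipping a binary to the cluster, you can check the glibc version available on a node with a quick sketch like the one below (assumes Linux, where `os.confstr` exposes `CS_GNU_LIBC_VERSION`; the required version `(2, 18)` is taken from the error message above):

```python
import os

# The glibc version available on this node (Linux-only call), e.g. "glibc 2.31".
node_glibc = os.confstr("CS_GNU_LIBC_VERSION")


def meets_requirement(node: str, required: tuple[int, int]) -> bool:
    """Check whether the node's glibc (e.g. 'glibc 2.31') is at least `required`."""
    major, minor = (int(x) for x in node.split()[1].split(".")[:2])
    return (major, minor) >= required


ok = meets_requirement(node_glibc, (2, 18))
```

Running such a check inside a trivial Spark job tells you what the executors actually have, which may differ from the driver or edge node.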

String Functions in Spark

Tips and Traps

  1. You can use the split function to split a delimited string into an array. It is suggested that you remove trailing separators before applying the split function; otherwise the resulting array contains a trailing empty string. Please refer to the split section below for a more detailed discussion.

  2. Some string functions (e.g., right) are available in the Spark SQL API but not as Spark DataFrame APIs.

The Case Statement and the when Function in Spark

Tips and Traps

  1. Watch out for NaNs: comparisons involving NaN might not behave as you expect, because Spark SQL treats NaN as equal to NaN and larger than any other numeric value.

  2. None can be used in otherwise and yields null in the DataFrame.

Column aliases and positional columns can be used in GROUP BY in Spark SQL!

Notice that the function when behaves like if-else.