Ben Chuanlong Du's Blog

It is never too late to learn.

Rust and Spark

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

The simplest and arguably best way to call Rust code from Spark is to use pandas_udf in PySpark. Inside the pandas UDF, you can call subprocess.run to run any shell command (such as a compiled binary) and capture its output.

from pathlib …
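As a rough illustration of the approach described above, here is a minimal sketch, not the post's own code. The names `COMMAND`, `run_command` and `make_udf` are hypothetical, and the command is a stand-in that upper-cases its input; in practice it would be replaced by the path to your own compiled binary. It assumes the binary reads from stdin and writes to stdout, and that every row is a non-null string.

```python
import subprocess
import sys

# Placeholder command (an assumption for illustration): upper-cases stdin.
# Replace it with the path to your own executable on the worker nodes.
COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def run_command(text: str, command=COMMAND) -> str:
    """Feed `text` to the command on stdin and return what it prints."""
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,  # collect stdout and stderr
        text=True,            # work with str instead of bytes
        check=True,           # raise if the command exits with an error
    )
    return result.stdout


def make_udf():
    """Wrap run_command in a pandas UDF.

    pandas and pyspark are imported here so that run_command itself
    can be tried out without a Spark installation.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def run_command_udf(values: pd.Series) -> pd.Series:
        # One subprocess per row, executed on the Spark worker.
        return values.map(run_command)

    return run_command_udf
```

With a DataFrame `df` that has a string column named `text` (both names hypothetical), this could be applied as `df.withColumn("out", make_udf()("text"))`. Starting one process per row is slow; passing a whole batch to the binary in a single call is likely a better design for large data.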

Yarn for Spark

  1. List all Spark applications.

    yarn application -list
    
  2. Show status of a Spark application.

    yarn application -status application_1459542433815_0002
    
  3. View logs of a Spark application.

    yarn logs -applicationId application_1459542433815_0002
    
  4. Kill a Spark application …

Spark Issue Libc Not Found

Symptom

/lib64/libc.so.6: version `GLIBC_2.18' not found (required by ...)

Cause

The version of GLIBC required by the binary executable is not available on the Spark nodes.
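To confirm this cause, one option is to compare the glibc version on a node with the version named in the error message. The sketch below uses Python's standard `platform.libc_ver()`; the helper names `parse_version` and `glibc_satisfies` are hypothetical, and it assumes plain dotted version strings such as `2.18`.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.18' into a tuple (2, 18)."""
    return tuple(int(part) for part in text.split("."))


def glibc_satisfies(required, available=None):
    """Return True if the available glibc is at least the required version.

    If `available` is not given, it is read from the machine running this
    code, so run it on a Spark node rather than on your laptop.
    """
    if available is None:
        lib, available = platform.libc_ver()
        if lib != "glibc" or not available:
            # Not a glibc system, or the version could not be detected.
            return False
    return parse_version(available) >= parse_version(required)


if __name__ == "__main__":
    print(platform.libc_ver(), glibc_satisfies("2.18"))
```

Running this inside a Spark task (for example from a UDF) would report the version the executors actually see, which may differ from the driver's.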

Solution

Recompile your …

Data Quality

  • Upper and lower bound tests, interquartile range (IQR) checks, and standard deviation checks

  • Aggregate-level checks (after manipulating data, it should still be possible to explain how the data …
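The bound and IQR checks in the first bullet could be sketched as follows. This is only an illustration: the function names are hypothetical, the multiplier 1.5 is the common convention rather than something stated in the post, and quartiles are computed with the default ("exclusive") method of `statistics.quantiles`.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds as Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def out_of_bounds(values, k=1.5):
    """Return the values that fall outside the IQR-based bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


if __name__ == "__main__":
    sample = [10, 12, 11, 13, 12, 11, 100]
    print(iqr_bounds(sample))     # bounds derived from the sample itself
    print(out_of_bounds(sample))  # values flagged as suspicious
```

A standard-deviation check works the same way with `statistics.mean` and `statistics.stdev`, but a single extreme value inflates the standard deviation, so the IQR version tends to be more robust on small samples.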