Ben Chuanlong Du's Blog

It is never too late to learn.

Hadoop Filesystem Tips

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Tips and Traps

  1. Never use the -skipTrash option unless you are absolutely sure of what you are doing. I made mistakes a couple of times in …

Spark Issue: Pure Python Code Errors

This post collects some typical pure Python errors in PySpark applications.

Symptom 1

object has no attribute

Solution 1

Fix the attribute name.
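
As a minimal sketch of Symptom 1 in plain Python (the Record class and its user_name attribute are hypothetical, made up for illustration), the snippet below triggers the error with a misspelled attribute and then lists the object's public attributes to find the correct name:

```python
class Record:
    """Hypothetical class used only to illustrate the error."""

    def __init__(self, user_name: str) -> None:
        self.user_name = user_name


rec = Record("alice")

try:
    rec.username  # typo: the attribute is actually called user_name
except AttributeError as err:
    # The message contains "object has no attribute".
    message = str(err)

# Listing the public attributes helps to spot the correct spelling.
public_attrs = [name for name in dir(rec) if not name.startswith("_")]
print(message)
print(public_attrs)
```

The same check works on objects used inside a PySpark job, as long as it runs where the object actually lives (driver or executor).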

Symptom 2

No such file or directory

Solution …

Spark SQL

Spark SQL Guide

  1. Since a Spark DataFrame is immutable, you cannot update or delete records in a physical table (e.g., a Hive table) directly using the Spark DataFrame/SQL API. However …

Process Big Data Using Spark

General Tips

  1. Please refer to Spark SQL for tips specific to Spark SQL.

  2. It is almost always a good idea to filter out null values in the joining columns before joining …

Rust and Spark

The simplest and best way to use Rust code from Spark is to leverage pandas_udf in PySpark. In the pandas UDF, you can call subprocess.run to run any shell command and capture its output.

from pathlib …
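
The idea of calling subprocess.run from a pandas UDF could be sketched roughly as follows. This is not the code from the original post. The names run_command, make_udf and call_binary are hypothetical, and the binary path passed to make_udf is assumed to point to a compiled executable (for example a Rust program) that exists on every executor and takes one argument.

```python
import subprocess
import sys


def run_command(args: list[str]) -> str:
    """Run a command and return its standard output with surrounding whitespace removed."""
    # check=True raises CalledProcessError when the command exits with a non-zero status.
    proc = subprocess.run(args, capture_output=True, text=True, check=True)
    return proc.stdout.strip()


def make_udf(binary: str):
    """Build a pandas UDF that calls `binary` once per value (hypothetical helper)."""
    # Imported lazily so that run_command can be used without Spark installed.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def call_binary(values: pd.Series) -> pd.Series:
        # Each value is converted to a string and passed as a single argument.
        return values.map(lambda v: run_command([binary, str(v)]))

    return call_binary


# Quick local check of the helper, using the Python interpreter as a stand-in command.
print(run_command([sys.executable, "-c", "print('hello')"]))
```

Starting one process per row is expensive. If the executable can read many values from standard input, passing a whole batch to a single subprocess.run call is likely to be much faster.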