Hadoop Filesystem Tips

Jan 21, 2014

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Tips and Traps

Never use the -skipTrash option unless you are absolutely sure of what you are doing. I made mistakes a couple of times in …
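To make the risk concrete: when the trash feature is enabled, a plain `hdfs dfs -rm` moves files into the trash where they can still be recovered, while `-skipTrash` deletes them immediately. Below is a minimal sketch of a helper that makes skipping the trash an explicit opt-in. The function name `build_hdfs_rm_command` and the example path are hypothetical, not part of any Hadoop API; the helper only builds the argument list and does not run anything.

```python
from typing import List


def build_hdfs_rm_command(path: str, skip_trash: bool = False) -> List[str]:
    """Build the argument list for a recursive HDFS delete.

    -skipTrash is added only when the caller asks for it explicitly.
    """
    # Guard against the most destructive mistakes: an empty path or the root.
    if path.strip() in ("", "/"):
        raise ValueError("refusing to build a delete command for an empty or root path")
    cmd = ["hdfs", "dfs", "-rm", "-r"]
    if skip_trash:
        cmd.append("-skipTrash")
    cmd.append(path)
    return cmd


# Default: the deleted files go to the trash (if trash is enabled on the cluster).
print(build_hdfs_rm_command("/tmp/example_dir"))
```

The resulting list can be passed to `subprocess.run` once you have double-checked the path.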

Spark Issue: Pure Python Code Errors

Mar 22, 2021

This post collects some typical pure Python errors in PySpark applications.

Symptom 1

object has no attribute

Solution 1

Fix the attribute name.
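A small pure-Python illustration of this symptom and its fix. The class `JobConfig` and its attribute are hypothetical, chosen only to reproduce the error message.

```python
class JobConfig:
    """Hypothetical settings object, used only to reproduce the error."""

    def __init__(self, app_name: str):
        self.app_name = app_name


config = JobConfig("etl")

# Symptom: the attribute name is misspelled (appname instead of app_name).
try:
    config.appname
except AttributeError as err:
    message = str(err)
    print(message)

# Solution: use the attribute name that the class actually defines.
fixed = config.app_name
print(fixed)
```

Running `dir(config)` is a quick way to list the attribute names that actually exist on the object.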

Symptom 2

No such file or directory

Solution …

Spark SQL

Feb 20, 2019

Spark SQL Guide

Since a Spark DataFrame is immutable, you cannot update or delete records in a physical table (e.g., a Hive table) directly using the Spark DataFrame/SQL API. However …

Process Big Data Using Spark

Jan 05, 2017

General Tips

Please refer to Spark SQL for tips specific to Spark SQL.
It is almost always a good idea to filter out null values in the joining columns before joining …

A Comprehensive List of Common Issues in Spark Applications

Aug 22, 2020

List of Common Issues

Please refer to http://www.legendu.net/misc/tag/spark-issue.html for a comprehensive list of Spark issues and their (possible) causes and solutions.

Debugging Tips

Spark/Hadoop …

Rust and Spark

Oct 10, 2021

The simplest and best way is to leverage pandas_udf in PySpark. In the pandas UDF, you can call subprocess.run to run any shell command and capture its output.

from pathlib …
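Since the original snippet is cut off here, the following is a minimal sketch of the idea rather than the author's code. It assumes Spark 3.x with pandas and PyArrow installed on the workers. The helper names `run_and_capture` and `make_run_tool_udf` are hypothetical, and the Python interpreter stands in for a compiled Rust binary so that the example runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def run_and_capture(args: Sequence[str]) -> str:
    """Run a command and return its standard output without surrounding whitespace."""
    # check=True raises CalledProcessError if the command exits with a non-zero status.
    result = subprocess.run(list(args), capture_output=True, text=True, check=True)
    return result.stdout.strip()


def make_run_tool_udf():
    """Build the pandas UDF lazily, so this module imports even without PySpark."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def run_tool(values: pd.Series) -> pd.Series:
        # Stand-in command (assumption): replace it with the path to your own
        # executable, which must be available on every worker node.
        script = "import sys; print(sys.argv[1].upper())"
        return values.map(
            lambda v: run_and_capture([sys.executable, "-c", script, str(v)])
        )

    return run_tool


# The helper can be checked locally without a Spark session.
print(run_and_capture([sys.executable, "-c", "print('hello')"]))
```

With a DataFrame `df` that has a string column `name` (both hypothetical), the UDF would be applied as `df.withColumn("out", make_run_tool_udf()(df["name"]))`. This sketch spawns one process per value; if the external program can read many records at once, launching it once per batch is cheaper.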

Ben Chuanlong Du's Blog

It is never too late to learn.
