Ben Chuanlong Du's Blog

It is never too late to learn.

Spark Issue: Could Not Execute Broadcast in 300S

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Symptoms

Caused by: org.apache.spark.SparkException: Could not execute broadcast in 600 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting …

How Much to Push for Functional Programming and Immutability

Most new programming languages (such as Rust, Go, Kotlin and Scala) support a functional programming style and clearly distinguish mutable from immutable variables. So, is functional programming superior to imperative …

Partition and Bucketing in Spark

Tips and Traps

  1. Bucketed columns are only supported in Hive tables at this time.

  2. A Hive table can have both partition and bucket columns.

  3. Suppose `t1` and `t2` are 2 bucketed tables with `b1` and `b2` buckets, respectively. For the bucket optimization to kick in when joining them:

     - The 2 tables must be bucketed on the same keys/columns.
     - The join must be on the bucket keys/columns.
     - `b1` is a multiple of `b2` or `b2` is a multiple of `b1`.
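
The three conditions above can be written down as a small check. This is only a sketch in plain Python that restates the list; it is not Spark's planner logic. The function name `bucket_join_applies` and its parameters are hypothetical, and comparing the keys as ordered tuples with an exact match on the join keys is an assumption made for illustration.

```python
def bucket_join_applies(keys1, b1, keys2, b2, join_keys):
    """Return True if all three conditions from the list above hold.

    keys1, keys2: bucket columns of the two tables (hypothetical inputs).
    b1, b2: number of buckets of the two tables.
    join_keys: columns used in the join condition.
    """
    if b1 <= 0 or b2 <= 0:
        raise ValueError("the number of buckets must be positive")
    # Condition 1: both tables are bucketed on the same keys/columns.
    # Assumption: the column order has to match as well.
    same_keys = tuple(keys1) == tuple(keys2)
    # Condition 2: the join is on the bucket keys/columns.
    # Assumption: the join keys are exactly the bucket keys.
    joins_on_keys = tuple(join_keys) == tuple(keys1)
    # Condition 3: one bucket count is a multiple of the other.
    compatible_counts = b1 % b2 == 0 or b2 % b1 == 0
    return same_keys and joins_on_keys and compatible_counts
```

For example, two tables bucketed on `id` into 8 and 4 buckets and joined on `id` pass the check, while 8 and 3 buckets do not.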
    
    

Sum Type in Rust

An enum is the preferred way to construct a sum type over several types that do not implement a common trait.

The Rust crate either provides an enum Either (with variants Left …

Spark Issue: Pure Python Code Errors

This post collects some typical pure Python errors in PySpark applications.

Symptom 1

object has no attribute

Solution 1

Fix the attribute name.
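
A minimal sketch of how this error typically arises, using a made-up `Record` class (not from any real PySpark job) and a deliberately misspelled attribute:

```python
class Record:
    """Hypothetical class used only to reproduce the error."""

    def __init__(self, name):
        self.name = name


record = Record("spark")

try:
    record.nmae  # typo: the attribute is called "name"
except AttributeError as err:
    # The message has the form "'Record' object has no attribute 'nmae'".
    message = str(err)

# hasattr is a quick way to confirm the correct spelling before use.
has_name = hasattr(record, "name")
has_typo = hasattr(record, "nmae")
```

Printing `dir(record)` is another quick way to see which attribute names actually exist on the object.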

Symptom 2

No such file or directory

Solution …