Ben Chuanlong Du's Blog

It is never too late to learn.

Spark Issue: Could Not Execute Broadcast in 300S

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Symptoms

Caused by: org.apache.spark.SparkException: Could not execute broadcast in 600 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting …
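As the message suggests, one mitigation is to raise the broadcast timeout, or to disable automatic broadcast joins entirely. A minimal sketch of the relevant settings (the values are illustrative) in `spark-defaults.conf` or via `--conf`:

```
spark.sql.broadcastTimeout              1200
spark.sql.autoBroadcastJoinThreshold    -1
```

Setting `spark.sql.autoBroadcastJoinThreshold` to `-1` disables automatic broadcast joins, at the cost of falling back to shuffle-based joins.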

Partition and Bucketing in Spark

Tips and Traps

  1. Bucketed columns are only supported in Hive tables at this time.

  2. A Hive table can have both partition and bucket columns.

  3. Suppose t1 and t2 are two bucketed tables with b1 and b2 buckets respectively. For bucket join optimization to kick in when joining them:

     - The 2 tables must be bucketed on the same keys/columns.
     - The join must be on the bucket keys/columns.
     - `b1` is a multiple of `b2` or `b2` is a multiple of `b1`.
    
    
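As a concrete illustration of points 1 and 2, here is a hypothetical Spark SQL DDL (table and column names are made up) for a Hive table that has both a partition column (`ds`) and a bucket column (`user_id`):

```sql
-- Hypothetical example: partitioned by ds, bucketed by user_id.
CREATE TABLE sales (
    user_id BIGINT,
    amount  DOUBLE,
    ds      STRING
) USING PARQUET
PARTITIONED BY (ds)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;
```

Joining two such tables on `user_id`, with bucket counts of, say, 32 and 64, would satisfy the multiple-of condition above.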

Spark Issue: Pure Python Code Errors


This post collects some typical pure Python errors in PySpark applications.

Symptom 1

object has no attribute

Solution 1

Fix the attribute name.
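A minimal, hypothetical reproduction in plain Python (the class and attribute names are made up): misspelling an attribute raises `AttributeError: ... object has no attribute ...`, and the fix is simply to correct the name.

```python
class Record:
    """Hypothetical stand-in for an object used in a PySpark job."""

    def __init__(self, name):
        self.name = name


r = Record("spark")

# Misspelled attribute -> AttributeError:
# 'Record' object has no attribute 'nmae'
try:
    r.nmae
except AttributeError as error:
    print(error)

# Fix: use the correct attribute name.
print(r.name)
```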

Symptom 2

No such file or directory

Solution …

Spark SQL


Spark SQL Guide

  1. Since a Spark DataFrame is immutable, you cannot update or delete records from a physical table (e.g., a Hive table) directly using the Spark DataFrame/SQL API. However …
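Since rows cannot be deleted in place, a common workaround (the table and column names below are hypothetical) is to rewrite the surviving rows into a new table:

```sql
-- Keep only the rows that should survive the "delete".
CREATE TABLE db.users_cleaned AS
SELECT *
FROM db.users
WHERE status <> 'inactive';
```

The cleaned table can then be swapped in for the original once it has been verified.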

Spark Issue: TypeError WithReplacement


Symptoms

TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [].

Causes

An integer number (e.g., 1) is passed to the fraction parameter …
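The cause can be illustrated in plain Python. The following is a simplified, hypothetical reconstruction of the argument validation, not PySpark's actual code: passing the integer `1` instead of the float `1.0` fails an `isinstance`-style check.

```python
def check_fraction(fraction):
    """Simplified, hypothetical version of the sample() argument check."""
    if not isinstance(fraction, float):
        raise TypeError(
            "fraction (required) should be a float; got %r" % type(fraction)
        )
    return fraction


check_fraction(0.5)   # OK: a float fraction
# check_fraction(1)   # would raise TypeError; pass 1.0 instead
```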