Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Symptom
org.apache.spark.shuffle.FetchFailedException: Too large frame: 2200180718
Caused by: java.lang.IllegalArgumentException: Too large frame: 2200289525
	at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
Reason
There is data skew in the join column(s): a few key values are far more frequent than the rest, so the shuffle partition holding them grows past the maximum frame size a single shuffle fetch can transfer (Int.MaxValue, about 2 GB; note that the frame sizes reported above exceed 2147483647).
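To confirm the skew, count rows per join-key value and look for a few values that dwarf the rest. A minimal sketch, assuming a hypothetical helper name and that the join columns are known:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Hypothetical diagnostic: show the most frequent values of the join columns.
// A handful of counts far above the rest confirms skew on those keys.
def showKeyDistribution(df: DataFrame, joinCols: Seq[String]): Unit = {
  df.groupBy(joinCols.map(df.col): _*)
    .count()
    .orderBy(desc("count"))
    .show(20, truncate = false)
}
```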
Solution
- split and broadcast: filter the skewed key values out of the big table, join that slice against the small table with a broadcast join, and union the result with a regular join of the remaining keys (see the second sketch below)
- add another random (salt) column to the join keys to spread the skewed values across partitions, as the pseudocode and first sketch below show
def joinWithoutSkew(df1: DataFrame, df2: DataFrame, joinCols: Array[Column], duplicationNum: Int): DataFrame = { ... }
- df1: the big table; append a random number to the join columns, keeping the row count unchanged
- df2: the small table; duplicate it duplicationNum times to get df2Duplicate (df2Duplicate.count = df2.count * duplicationNum), then append the matching random numbers to the join columns
- joinCols: the columns to join on
- duplicationNum: the number of times df2 is duplicated
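A fuller sketch of that salted join, assuming an equi-join on named columns (Seq[String] instead of the Array[Column] in the signature above, for simplicity) and a hypothetical salt column name:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, explode, lit, rand}

// Sketch only: the salt column name and the Seq[String] joinCols are
// assumptions, not the original API.
def joinWithoutSkew(df1: DataFrame, df2: DataFrame,
                    joinCols: Seq[String], duplicationNum: Int): DataFrame = {
  val saltCol = "__salt" // hypothetical name for the extra random column

  // df1 (big table): append a random salt in [0, duplicationNum);
  // the row count does not change.
  val df1Salted = df1.withColumn(saltCol, (rand() * duplicationNum).cast("int"))

  // df2 (small table): duplicate it duplicationNum times, one copy per salt
  // value, so df2Duplicate.count == df2.count * duplicationNum.
  val df2Duplicate = df2.withColumn(
    saltCol, explode(array((0 until duplicationNum).map(lit): _*)))

  // Join on the original columns plus the salt, then drop the salt.
  df1Salted.join(df2Duplicate, joinCols :+ saltCol).drop(saltCol)
}
```

Each skewed key in df1 is now spread over up to duplicationNum shuffle partitions, at the price of shuffling duplicationNum copies of df2.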
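For the split-and-broadcast option, a minimal sketch, assuming the skewed key values are already known (splitAndBroadcastJoin, keyCol, and skewedValues are illustrative names):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Sketch only: handle the skewed keys with a broadcast join and the
// remaining keys with a regular shuffle join, then union the two results.
def splitAndBroadcastJoin(big: DataFrame, small: DataFrame,
                          keyCol: String, skewedValues: Seq[Any]): DataFrame = {
  val bigSkewed = big.filter(big(keyCol).isin(skewedValues: _*))
  val bigRest   = big.filter(!big(keyCol).isin(skewedValues: _*))

  // Skewed keys: broadcast the small table so no shuffle partition explodes.
  val joinedSkewed = bigSkewed.join(broadcast(small), Seq(keyCol))

  // Remaining keys: an ordinary shuffle join is fine.
  val joinedRest = bigRest.join(small, Seq(keyCol))

  joinedSkewed.unionByName(joinedRest)
}
```

This avoids duplicating the small table, but it requires knowing (or first computing) which key values are skewed.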
References
https://docs.databricks.com/delta/join-performance/skew-join.html
https://dataengi.com/2019/02/06/spark-data-skew-problem/
https://itnext.io/handling-data-skew-in-apache-spark-9f56343e58e8