Ben Chuanlong Du's Blog

It is never too late to learn.

New Features in Spark 3

AQE (Adaptive Query Execution)

To enable AQE, set spark.sql.adaptive.enabled to true, either by passing --conf spark.sql.adaptive.enabled=true to spark-submit or by calling `spark.conf.set("spark.sql.adaptive.enabled", "true")` in Spark/PySpark code.
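The two ways of enabling AQE can be sketched as follows (the application file name is a placeholder):

```shell
# Enable Adaptive Query Execution at submit time (Spark 3+):
spark-submit --conf spark.sql.adaptive.enabled=true my_app.py

# Or, inside PySpark code, on an existing SparkSession named `spark`:
#   spark.conf.set("spark.sql.adaptive.enabled", "true")
```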

Pandas UDFs

Pandas UDFs are user-defined functions that Spark executes by using Apache Arrow to transfer data and pandas to operate on it, which enables vectorized operations. A Pandas UDF is defined using pandas_udf.
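A minimal sketch of the idea: the vectorized logic below is plain pandas, and the pandas_udf registration (shown commented out) assumes a running SparkSession with pyspark >= 3.0 and pyarrow installed.

```python
import pandas as pd

def multiply_by_two(s: pd.Series) -> pd.Series:
    # Operates on a whole pandas Series at once (vectorized),
    # instead of one row at a time like a plain Python UDF.
    return s * 2

# With pyspark available, wrap it as a Pandas UDF and use it on a DataFrame:
# from pyspark.sql.functions import pandas_udf
# multiply_udf = pandas_udf(multiply_by_two, returnType="long")
# df.select(multiply_udf("x")).show()
```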

Logging in PySpark

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. Excessive logging is better than no logging! This is generally true in distributed big data applications.

  2. Use loguru if it is available. If you have to use the logging module, be …

Use LightGBM With Spark

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md

MMLSpark seems to be the best option for training models with LightGBM on a Spark cluster. Note that MMLSpark requires …

Build Spark from Source

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

You can download prebuilt Spark binaries at https://spark.apache.org/downloads.html. This is where you should start, and it will likely satisfy your needs most of the time …
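If you do need a custom build, a sketch of the standard commands from the official build documentation follows; the profiles chosen here are illustrative.

```shell
# Get the source and build with the bundled Maven, skipping tests:
git clone https://github.com/apache/spark.git
cd spark
./build/mvn -DskipTests clean package

# Or produce a distributable tarball:
./dev/make-distribution.sh --name custom-spark --tgz -Phive -Phive-thriftserver -Pyarn
```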

Subtle Differences Between Spark DataFrame and PySpark DataFrame

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. Besides using the col function to reference a column, the Spark/Scala DataFrame supports using $"col_name" (based on implicit conversion; it requires import spark.implicits._), while the PySpark DataFrame supports using …