Comments¶
Use the higher-level, standard Column-based functions with Dataset operators whenever possible before resorting to your own custom UDFs, since UDFs are a black box for Spark and it does not even try to optimize them.
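A minimal sketch contrasting the two approaches, assuming a SparkSession named `spark` and a hypothetical string column `name`: the built-in `upper` is a Column expression the Catalyst optimizer understands, while the equivalent Python UDF is opaque to it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Preferred: built-in Column function, visible to the optimizer.
df.select(F.upper(F.col("name")).alias("name_upper")).show()

# Avoid when a built-in exists: a Python UDF is a black box to Spark
# and adds serialization overhead between the JVM and Python.
to_upper = F.udf(lambda s: s.upper() if s is not None else None, StringType())
df.select(to_upper(F.col("name")).alias("name_upper")).show()
```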
Rename and Drop Columns in Spark DataFrames
Comment¶
You can use withColumnRenamed
to rename a column in a DataFrame.
You can also do renaming using alias
when selecting columns.
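A minimal sketch of both renaming styles, plus dropping a column (the column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a", 10), (2, "b", 20)], ["id", "code", "val"])

# Rename a single column with withColumnRenamed.
df1 = df.withColumnRenamed("val", "value")

# Rename using alias while selecting columns.
df2 = df.select(F.col("id"), F.col("code").alias("category"), F.col("val"))

# Drop a column.
df3 = df.drop("code")
```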
Sort DataFrame in Spark
Comments¶
- After a global sort, rows in a DataFrame are ordered according to partition ID (partitions with smaller IDs hold smaller sort keys), and within each partition the rows are sorted. This property can be leveraged to implement a global ranking of rows, as shown in the sketch below. For more details, please refer to Computing global rank of a row in a DataFrame with Spark SQL. However, notice that multi-layer ranking is often more efficient than a global ranking in big data applications.
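A minimal sketch of the idea, assuming a hypothetical numeric column `score`: after `orderBy`, combine the partition ID with a within-partition row number and a per-partition offset to get a global rank, instead of ranking over a single global window.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(v,) for v in [7, 3, 9, 1, 5]], ["score"])

# After orderBy, data is range-partitioned: smaller partition IDs hold
# smaller scores, and rows within each partition are sorted.
sdf = df.orderBy("score").withColumn("pid", F.spark_partition_id())

# Row number within each partition.
w = Window.partitionBy("pid").orderBy("score")
sdf = sdf.withColumn("local_rank", F.row_number().over(w))

# Per-partition row counts turned into cumulative offsets.
# The counts table is tiny (one row per partition), so a global window is cheap here.
counts = sdf.groupBy("pid").count()
w_cum = Window.orderBy("pid").rowsBetween(Window.unboundedPreceding, -1)
offsets = counts.withColumn(
    "offset", F.coalesce(F.sum("count").over(w_cum), F.lit(0))
)

# Global rank = offset of the partition + rank within the partition.
ranked = sdf.join(offsets.select("pid", "offset"), "pid").withColumn(
    "global_rank", F.col("offset") + F.col("local_rank")
)
ranked.orderBy("global_rank").show()
```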
New Features in Spark 3
AQE (Adaptive Query Execution)¶
To enable AQE,
you have to set spark.sql.adaptive.enabled
to true
(using --conf spark.sql.adaptive.enabled=true
in spark-submit
or using `spark.conf.set("spark.sql.adaptive.enabled", "true")` in Spark/PySpark code).
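A minimal sketch of enabling AQE from PySpark (the application name is hypothetical); the same setting can be passed on the command line with `--conf spark.sql.adaptive.enabled=true`:

```python
from pyspark.sql import SparkSession

# Set the option when building the session ...
spark = (
    SparkSession.builder.appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# ... or on an existing session; it takes effect for subsequent queries.
spark.conf.set("spark.sql.adaptive.enabled", "true")
print(spark.conf.get("spark.sql.adaptive.enabled"))
```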
Pandas UDFs¶
Pandas UDFs are user defined functions
that are executed by Spark using Arrow
to transfer data and pandas to work with the data,
which allows vectorized operations.
A Pandas UDF is defined using pandas_udf as a decorator or to wrap a regular Python function.
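A minimal sketch of a Series-to-Series Pandas UDF (the column name `x` is hypothetical): Arrow transfers each batch to pandas, so the function operates on whole pandas Series at once rather than row by row.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

@pandas_udf("double")
def plus_one(s: pd.Series) -> pd.Series:
    # Vectorized pandas operation applied to the whole batch.
    return s + 1

df.select(plus_one(F.col("x")).alias("x_plus_one")).show()
```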