Ben Chuanlong Du's Blog

It is never too late to learn.

Use TableSample in SQL

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

The limit clause (or the method DataFrame.limit if you are using Spark) is a better alternative to TABLESAMPLE if randomness is not critical.



Improve the Performance of Spark

Plan Your Work

  1. Having a clear idea of what you want to do is very important, especially when you are working on an exploratory project. It often saves you time to …

Spark SQL

Spark SQL Guide

  1. Since a Spark DataFrame is immutable, you cannot update or delete records in a physical table (e.g., a Hive table) directly using the Spark DataFrame/SQL API. However …

String Functions in Spark

Tips and Traps

  1. You can use the split function to split a delimited string into an array. It is suggested that you remove trailing separators before you apply the split function. Please refer to the split section for a more detailed discussion.

  2. Some string functions (e.g., right) are available in the Spark SQL API but not in the Spark DataFrame API.
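The tip about trailing separators can be illustrated without a Spark session. The sketch below uses Python's built-in str.split as a stand-in, which is an assumption: Spark's split takes a regular expression rather than a literal separator, so details may differ there. The helper name split_delimited is hypothetical and only serves the illustration.

```python
def split_delimited(text: str, sep: str = ",") -> list:
    """Split text on a literal separator after dropping trailing separators."""
    if not sep:
        raise ValueError("sep must be a non-empty string")
    # Remove every trailing occurrence of the separator first,
    # so that no empty strings end up at the tail of the result.
    while text.endswith(sep):
        text = text[: -len(sep)]
    return text.split(sep)


raw = "a,b,c,,"

# Splitting directly keeps one empty string per trailing separator.
naive = raw.split(",")

# Splitting after the cleanup yields only the real fields.
cleaned = split_delimited(raw)

print(naive)
print(cleaned)
```

Whether trailing empty strings are a problem depends on what is done with the array afterwards (e.g., counting its elements), so it is probably worth checking the output on a small sample first.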

The Case Statement and the when Function in Spark

Tips and Traps

  1. Watch out for NaNs ...; the behavior might not be what you expect ...

  2. None can be passed to otherwise, and it yields null in the DataFrame.

Column aliases and positional columns can be used in group by in Spark SQL!
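Here is a minimal sketch of grouping by a column alias and by a column position. It runs against an in-memory SQLite database from Python's standard library as a stand-in for a Spark session, which is an assumption made to keep the example self-contained. The table sales and its columns are hypothetical. The query text only uses upper, sum, an alias and an ordinal, so it is expected to carry over to Spark SQL, but that has not been verified here.

```python
import sqlite3

# In-memory database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("nyc", 10), ("NYC", 5), ("sf", 7)],
)

# Group by the alias of a computed column.
by_alias = conn.execute(
    """
    SELECT upper(city) AS city_uc, sum(amount) AS total
    FROM sales
    GROUP BY city_uc
    ORDER BY city_uc
    """
).fetchall()

# Group by the position of the same column in the select list.
by_position = conn.execute(
    """
    SELECT upper(city) AS city_uc, sum(amount) AS total
    FROM sales
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()

print(by_alias)
print(by_position)
conn.close()
```

The alias is deliberately named differently from the underlying column (city_uc rather than city), so that there is no ambiguity about whether the group by refers to the raw column or to the computed one.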

Notice that the function when behaves like if-else.
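The if-else reading can be sketched with a plain SQL CASE expression next to an equivalent Python function. As an assumption made to keep the example runnable without Spark, the query is executed on an in-memory SQLite database. The table scores and the function grade are hypothetical. The CASE expression has no ELSE branch, which plays the role of leaving out otherwise (or passing None to it): rows that match no branch get null.

```python
import sqlite3


def grade(score):
    """Plain if/elif chain that mirrors the CASE expression below."""
    if score >= 90:
        return "A"
    elif score >= 70:
        return "B"
    # No final else branch: fall through and return None (null).
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(95,), (72,), (40,)])

# Branches are tested in order and the first match wins,
# so 95 is graded 'A' even though it also satisfies score >= 70.
rows = conn.execute(
    """
    SELECT score,
           CASE WHEN score >= 90 THEN 'A'
                WHEN score >= 70 THEN 'B'
           END AS grade
    FROM scores
    ORDER BY score DESC
    """
).fetchall()

print(rows)
conn.close()
```

The order of the branches matters in both versions: swapping the two conditions would grade every score of 70 or more as 'B'.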