Tips and Traps
You can use the split function to split a delimited string into an array. It is suggested that you remove trailing separators before applying the split function. Please refer to the section on split earlier for a more detailed discussion.
Some string functions (e.g., right) are available in the Spark SQL API but not as Spark DataFrame APIs.
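A minimal PySpark sketch of both tips (the column name s and the sample data are made up for illustration):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("string_funcs").getOrCreate()
df = spark.createDataFrame([("a,b,c,",)], ["s"])

# A trailing separator yields an empty trailing element in the resulting array.
df.select(F.split("s", ",").alias("arr")).show(truncate=False)  # [a, b, c, ]

# Strip trailing separators first, then split.
df.select(
    F.split(F.regexp_replace("s", ",+$", ""), ",").alias("arr")
).show(truncate=False)  # [a, b, c]

# right is a Spark SQL function; call it from the DataFrame API via expr.
df.select(F.expr("right(s, 3)").alias("last3")).show()
```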
The Case Statement and the when Function in Spark
Tips and Traps
Watch out for NaNs: they might not behave the way you expect. Spark SQL treats NaN as equal to NaN and as larger than any other numeric value, so comparisons involving NaN do not follow the usual IEEE floating-point semantics.
None can be passed to otherwise, which yields null in the DataFrame.
Column aliases and positional columns can be used in GROUP BY in Spark SQL!
Notice that the when function behaves like if-else (see the sketch below).
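A minimal PySpark sketch of these behaviors (the sample data, column names, and the temp view t are made up for illustration):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("when_example").getOrCreate()
df = spark.createDataFrame([(1.0,), (-1.0,), (float("nan"),), (None,)], ["x"])

# when/otherwise works like if-else; None in otherwise produces null.
# Note the NaN trap: Spark treats NaN as larger than any other numeric value,
# so NaN > 0 evaluates to true and the NaN row gets labeled "positive".
df.select(
    "x",
    F.when(F.col("x") > 0, "positive").otherwise(None).alias("label"),
).show()

# Column aliases and positional columns in GROUP BY (Spark SQL).
df2 = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "val"])
df2.createOrReplaceTempView("t")
spark.sql("SELECT upper(key) AS k, sum(val) AS total FROM t GROUP BY k").show()
spark.sql("SELECT upper(key) AS k, sum(val) AS total FROM t GROUP BY 1").show()
```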
Arithmetic Functions and Operations in Spark
Collection Functions in Spark
Tips and Traps
If you use PySpark instead of Spark/Scala, pandas UDFs are a great alternative to the (complicated) collection functions discussed here. Leveraging pandas UDFs, each partition of a Spark DataFrame is handed to you as pandas DataFrames (exchanged efficiently via Apache Arrow); you can then transform them with plain pandas, and the results are converted back into partitions of a Spark DataFrame.
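A minimal sketch using mapInPandas, one of several pandas UDF flavors (the sample data and the summarize function are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas_udf_example").getOrCreate()
df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5])], ["id", "nums"])

def summarize(batches):
    # mapInPandas hands each partition over as an iterator of pandas
    # DataFrames (Arrow batches); plain pandas replaces Spark collection
    # functions such as size/aggregate here.
    for pdf in batches:
        yield pdf.assign(
            n=pdf["nums"].apply(len),
            total=pdf["nums"].apply(sum),
        )[["id", "n", "total"]]

df.mapInPandas(summarize, schema="id long, n long, total long").show()
```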
Rounding Functions in Spark
Functions in Bash
By default, variables defined in a function are global, i.e., they are visible outside the function too.
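A small Bash sketch of this behavior (the function and variable names are made up for illustration); use local to confine a variable to the function:

```bash
#!/usr/bin/env bash

greet() {
    name="world"        # global by default: still visible after the function returns
    local greeting="hi" # local: confined to the function body
    echo "$greeting, $name"
}

greet
echo "name=$name"         # prints name=world
echo "greeting=$greeting" # prints greeting= (empty: greeting was local)
```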