Collection Functions in Spark
Tips and Traps¶
If you use PySpark instead of Spark/Scala, pandas UDFs are a great alternative to the (complicated) collection functions discussed here. With pandas UDFs, each partition of a Spark DataFrame is converted to a pandas DataFrame (via Apache Arrow, without copying the underlying data); you can then transform the pandas DataFrames, and the results are converted back into partitions of a Spark DataFrame.
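As a concrete sketch of this pattern, the function below is plain pandas and can be applied partition-by-partition with `DataFrame.mapInPandas` (Spark >= 3.0). The column name `price` and the sample data are made up for illustration:

```python
import pandas as pd


def double_price(pdf: pd.DataFrame) -> pd.DataFrame:
    """Plain pandas transform applied to one partition at a time."""
    pdf = pdf.copy()
    pdf["price"] = pdf["price"] * 2  # any pandas logic works here
    return pdf


# With a SparkSession available, the same function is applied to each
# partition of a Spark DataFrame via mapInPandas:
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "price"])
#   df.mapInPandas(lambda it: (double_price(pdf) for pdf in it), df.schema).show()
```

Because the transform itself is ordinary pandas code, it can be unit-tested without a Spark cluster.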
Rounding Functions in Spark
Statistical Functions in Spark
Window Functions in Spark
Window with orderBy¶
It is tricky!!!
If you provide an ORDER BY clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
https://stackoverflow.com/questions/52273186/pyspark-spark-window-function-first-last-issue
Avoid using last; use first with a descending orderBy instead. This gives fewer surprises. Do NOT use orderBy if it is not necessary: it changes the default frame and introduces unnecessary ...
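The pitfall is easiest to see concretely. Below is a plain-Python sketch of the default frame semantics (no Spark required; the values are made up): with ORDER BY, each row's frame ends at the current row, so last() returns the current row's own value rather than the partition's last value.

```python
# One partition, already sorted by the ORDER BY key.
values = [10, 20, 30]

# last(value) under the default frame (RANGE BETWEEN UNBOUNDED PRECEDING
# AND CURRENT ROW): the frame for row i is values[: i + 1], so "last"
# is always the current row -- the surprising behavior.
last_with_orderby = [values[: i + 1][-1] for i in range(len(values))]

# first(value) under the same frame is stable across the partition.
first_with_orderby = [values[: i + 1][0] for i in range(len(values))]

print(last_with_orderby)   # each row sees itself as "last"
print(first_with_orderby)  # every row sees the true first value

# The recommended fix in PySpark: sort descending and take first, e.g.
#   from pyspark.sql import Window, functions as F
#   w = Window.partitionBy("grp").orderBy(F.desc("ts"))
#   df.withColumn("latest", F.first("value").over(w))
```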
Tips on Delta Lake
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Delta Table
convert to delta [db_name.]table_name [partitioned by ...] [vacuum [retain number hours]]
vacuum
describe history db_name.table_name
You can select from a historical snapshot, and you can also roll back to a historical snapshot. rollback …
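The commands above can be sketched end-to-end as follows. This is a hedged example, not a complete reference: the table name `mydb.events` and the version/timestamp values are hypothetical, and the exact syntax (especially CONVERT TO DELTA and RESTORE) varies across Delta Lake and Databricks versions:

```sql
-- Convert an existing Parquet table to Delta.
CONVERT TO DELTA mydb.events;

-- Inspect the table's change history (version, timestamp, operation, ...).
DESCRIBE HISTORY mydb.events;

-- Time travel: query a historical snapshot by version or by timestamp.
SELECT * FROM mydb.events VERSION AS OF 1;
SELECT * FROM mydb.events TIMESTAMP AS OF '2021-01-01';

-- Roll back the table to a historical snapshot.
RESTORE TABLE mydb.events TO VERSION AS OF 1;

-- Garbage-collect files no longer referenced by the table;
-- 168 hours = 7 days, the default retention period.
VACUUM mydb.events RETAIN 168 HOURS;
```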