Ben Chuanlong Du's Blog

It is never too late to learn.

Collection Functions in Spark

Tips and Traps

  1. If you use PySpark instead of Spark/Scala, pandas UDFs are a great alternative to the (complicated) collection functions discussed here. With pandas UDFs, each partition of a Spark DataFrame is converted to a pandas DataFrame (efficiently, via Apache Arrow); you can then transform the pandas DataFrames, and the results are converted back into partitions of a Spark DataFrame.
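The pattern above can be sketched as follows. The transform itself is a plain pandas function; the column names (`price`, `price_usd`) and the conversion rate are made-up examples, and the `mapInPandas` usage is shown in comments since it needs a running Spark session.

```python
import pandas as pd


def add_usd_price(pdf: pd.DataFrame) -> pd.DataFrame:
    """Transform one partition (received as a pandas DataFrame) at a time."""
    pdf = pdf.copy()
    pdf["price_usd"] = pdf["price"] * 0.13  # example conversion rate
    return pdf


# On a Spark cluster you would apply it partition-by-partition, e.g.:
#
#   def transform(batches):
#       for pdf in batches:  # each batch is a pandas DataFrame
#           yield add_usd_price(pdf)
#
#   df.mapInPandas(transform, schema="price double, price_usd double")
```

Because the per-partition function is plain pandas code, it can be unit tested locally without a Spark cluster.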

Tips on Delta Lake

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Delta Lake

Delta Table

convert to delta [db_name.]table_name [partitioned by ...] converts an existing Parquet table to a Delta table in place.

vacuum [db_name.]table_name [retain num hours] removes data files that are no longer referenced by the Delta table; by default, files from the last 7 days (168 hours) are retained.

describe history [db_name.]table_name shows the commit history of the table (version, timestamp, operation, etc.).

You can select from a historical snapshot of a Delta table (time travel), and you can also roll back the table to a historical snapshot (e.g., with restore table ... to version as of ...).
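From PySpark, these commands can be issued with `spark.sql(...)`. Below is a minimal sketch that only builds the SQL strings (the helper functions and the table name are hypothetical); executing them requires a Spark session with Delta Lake configured.

```python
def history_query(table: str) -> str:
    """Build a query listing the commit history of a Delta table."""
    return f"DESCRIBE HISTORY {table}"


def time_travel_query(table: str, version: int) -> str:
    """Build a time-travel query reading a historical snapshot of a Delta table."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


def vacuum_command(table: str, retain_hours: int = 168) -> str:
    """Build a VACUUM command; Delta's default retention is 168 hours (7 days)."""
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"


# On a real cluster you would run these with, e.g.:
#   spark.sql(history_query("some_db.some_table")).show()
#   spark.sql(time_travel_query("some_db.some_table", 3)).show()
#   spark.sql(vacuum_command("some_db.some_table"))
```

Note that VACUUM with a retention shorter than the default is refused unless the retention-duration safety check is disabled, since it can break time travel to older snapshots.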