Collection Functions in Spark
Tips and Traps¶
If you use PySpark instead of Spark/Scala, pandas UDFs are a great alternative to the (complicated) collection functions discussed here. With pandas UDFs, each partition of a Spark DataFrame is converted to a pandas DataFrame (via Apache Arrow, without copying the underlying data); you can then transform the pandas DataFrames, and the results are converted back into partitions of a Spark DataFrame.
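As a concrete sketch of this pattern, the function below is plain pandas and can be applied partition-by-partition with `DataFrame.mapInPandas` (Spark >= 3.0). The column name `price` and the sample data are made up for illustration:

```python
import pandas as pd


def double_price(pdf: pd.DataFrame) -> pd.DataFrame:
    """Plain pandas transform applied to one partition at a time."""
    pdf = pdf.copy()
    pdf["price"] = pdf["price"] * 2  # any pandas logic works here
    return pdf


# With a SparkSession available, the same function is applied to each
# partition of a Spark DataFrame via mapInPandas:
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "price"])
#   df.mapInPandas(lambda it: (double_price(pdf) for pdf in it), df.schema).show()
```

Because the transform itself is ordinary pandas code, it can be unit-tested without a Spark cluster.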
Rounding Functions in Spark
Statistical Functions in Spark
Window Functions in Spark
Window with orderBy¶
It is tricky!!!
If you provide an ORDER BY clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
https://stackoverflow.com/questions/52273186/pyspark-spark-window-function-first-last-issue
Avoid using last; use first with a descending orderBy instead. This gives fewer surprises. Do NOT use orderBy if it is not necessary: it changes the default frame and introduces unnecessary ...
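The pitfall is easiest to see concretely. Below is a plain-Python sketch of the default frame semantics (no Spark required; the values are made up): with ORDER BY, each row's frame ends at the current row, so last() returns the current row's own value rather than the partition's last value.

```python
# One partition, already sorted by the ORDER BY key.
values = [10, 20, 30]

# last(value) under the default frame (RANGE BETWEEN UNBOUNDED PRECEDING
# AND CURRENT ROW): the frame for row i is values[: i + 1], so "last"
# is always the current row -- the surprising behavior.
last_with_orderby = [values[: i + 1][-1] for i in range(len(values))]

# first(value) under the same frame is stable across the partition.
first_with_orderby = [values[: i + 1][0] for i in range(len(values))]

print(last_with_orderby)   # each row sees itself as "last"
print(first_with_orderby)  # every row sees the true first value

# The recommended fix in PySpark: sort descending and take first, e.g.
#   from pyspark.sql import Window, functions as F
#   w = Window.partitionBy("grp").orderBy(F.desc("ts"))
#   df.withColumn("latest", F.first("value").over(w))
```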
Tips on Delta Lake
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Delta Table
convert to delta [db_name.]table_name [partitioned by ...] [vacuum [retain number hours]]
vacuum
describe history db_name.table_name
You can select from a historical snapshot, and you can also roll back to a historical snapshot. rollback …
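The commands above can be sketched end-to-end as follows. This is a hedged example, not a complete reference: the table name `mydb.events` and the version/timestamp values are hypothetical, and the exact syntax (especially CONVERT TO DELTA and RESTORE) varies across Delta Lake and Databricks versions:

```sql
-- Convert an existing Parquet table to Delta.
CONVERT TO DELTA mydb.events;

-- Inspect the table's change history (version, timestamp, operation, ...).
DESCRIBE HISTORY mydb.events;

-- Time travel: query a historical snapshot by version or by timestamp.
SELECT * FROM mydb.events VERSION AS OF 1;
SELECT * FROM mydb.events TIMESTAMP AS OF '2021-01-01';

-- Roll back the table to a historical snapshot.
RESTORE TABLE mydb.events TO VERSION AS OF 1;

-- Garbage-collect files no longer referenced by the table;
-- 168 hours = 7 days, the default retention period.
VACUUM mydb.events RETAIN 168 HOURS;
```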