Ben Chuanlong Du's Blog

It is never too late to learn.

DataFrame Implementations in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Tips and Traps

Alternatives to pandas for Small Data

  1. Polars is a blazingly fast DataFrame library implemented in Rust that uses Apache Arrow as its memory model. It is the best replacement for pandas on small data at this time. Note that Polars supports multithreading and lazy computation, but it cannot handle data larger than memory at this time.

Tips on Datafusion

Tips on Delta Lake

Delta Lake

Delta Table

convert to delta [db_name.]table_name [partitioned by ...] [vacuum [retain number hours]]

vacuum

describe history db_name.table_name

You can select from a historical snapshot. You can also roll back to a historical snapshot: rollback …

Hive SQL

  1. Hive is case-insensitive for both keywords and function names.

  2. You can use both double and single quotes for strings.

  3. Use = rather than == for equality comparison, although it seems that == also works.

  4. use % rather …