Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Tips and Traps
Alternatives to pandas for Small Data
- Polars is a blazingly fast DataFrame library implemented in Rust that uses Apache Arrow as its memory model. It is the best replacement for pandas on small data at this time. Note that Polars supports multithreading and lazy computation, but it cannot handle data larger than memory at this time.
Tips on Datafusion
Spark Issue: _pickle.PicklingError: args[0] from __newobj__ args has the wrong class
Please refer to Spark Issue: Task Not Serializable for a similar serialization issue in Spark/Scala.
Symptom
Cause
For example, if you have the following import
from nltk.corpus import stopwords …
Window Functions in Spark
Window with orderBy
It is tricky!!!
If you provide an ORDER BY clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
https://stackoverflow.com/questions/52273186/pyspark-spark-window-function-first-last-issue
Avoid using last and use first with descending order by instead. This gives fewer surprises.
Do NOT use order by if not necessary. It introduces unnecessary ...
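To see why last is surprising under the default frame, here is a minimal sketch in plain Python (not the Spark API). It assumes a single partition with distinct ordering keys, and the helper `window_values` plus the `day`/`price` columns are hypothetical names made up for illustration. The frame for each row is emulated as "everything from the start of the partition up to the current row".

```python
from operator import itemgetter


def window_values(rows, order_key, value_key, descending=False):
    """Yield (row, frame_values) for each row of one partition.

    The frame emulates RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
    i.e. all rows that sort at or before the current row.
    """
    ordered = sorted(rows, key=itemgetter(order_key), reverse=descending)
    for row in ordered:
        if descending:
            frame = [r for r in ordered if r[order_key] >= row[order_key]]
        else:
            frame = [r for r in ordered if r[order_key] <= row[order_key]]
        yield row, [r[value_key] for r in frame]


# Hypothetical data: one partition, three rows.
rows = [
    {"day": 1, "price": 10},
    {"day": 2, "price": 20},
    {"day": 3, "price": 30},
]

# "last" with ascending order: the frame ends at the current row,
# so last just returns the current row's own value.
last_asc = [frame[-1] for _, frame in window_values(rows, "day", "price")]

# "first" with descending order: the frame starts at the row with the
# largest key, so every row sees the value of the latest day.
first_desc = [
    frame[0] for _, frame in window_values(rows, "day", "price", descending=True)
]

print(last_asc)    # [10, 20, 30]
print(first_desc)  # [30, 30, 30]
```

With ascending order, last never looks past the current row, which is rarely what is wanted. With descending order, first always sees the row that sorts on top, so it returns the same value for the whole partition.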
Tips on Delta Lake
Delta Table
convert to delta [db_name.]table_name [partitioned by ...] [vacuum [retain number hours]]
vacuum
describe history db_name.table_name
You can select from a historical snapshot. You can also roll back to a historical snapshot.
rollback …
Hive SQL
- Hive is case-insensitive for both keywords and functions.
- You can use both double and single quotes for strings.
- Use = rather than == for equality comparison, although it seems that == also works.
- Use % rather …