Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
https://build-system.fman.io/
https://github.com/mherrmann/fbs-tutorial
To enable AQE,
you have to set spark.sql.adaptive.enabled
to true,
either using `--conf spark.sql.adaptive.enabled=true`
in spark-submit
or using `spark.conf.set("spark.sql.adaptive.enabled", "true")` in Spark/PySpark code.
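As a sketch of the in-code route (the app name is illustrative; this needs a working PySpark installation), the flag can be set when the session is built or on an existing session:

```python
# Enabling AQE when constructing a SparkSession (app name is illustrative).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# Or, on an already-running session:
# spark.conf.set("spark.sql.adaptive.enabled", "true")
```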
Pandas UDFs are user-defined functions
that Spark executes using Arrow
to transfer data to pandas,
which allows vectorized operations.
A Pandas UDF is defined using `pandas_udf`.
https://github.com/apache/arrow
https://stackoverflow.com/questions/54582073/sharing-objects-across-workers-using-pyarrow
https://github.com/pytorch/pytorch/issues/13039
https://issues.apache.org/jira/browse/ARROW-5130
https://uwekorn.com/2019/09/15/how-we-build-apache-arrows-manylinux-wheels.html
You cannot use a class in default parameter values of the class's own __init__
method: default values are evaluated when the method is defined, while the class
body is still executing, so the class name is not yet bound at that point.
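A minimal illustration (the `Node` class is hypothetical) of why this fails and the usual sentinel workaround:

```python
class Node:
    # `def __init__(self, parent=Node())` would raise NameError here:
    # the name `Node` is only bound after the whole class body finishes
    # executing, but defaults are evaluated at method-definition time.
    # The common workaround is a None sentinel resolved inside the body:
    def __init__(self, parent=None):
        self.parent = parent if parent is not None else self

root = Node()       # no parent given: the node points at itself
child = Node(root)  # explicit parent
```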
BroadcastHashJoin, i.e., a map-side join, is fast. Use BroadcastHashJoin if possible. Notice that Spark will automatically use BroadcastHashJoin if one side of an inner join has a size below the configured broadcast threshold (`spark.sql.autoBroadcastJoinThreshold`).
Notice that Spark cannot broadcast the preserved (outer) side of a join. In particular, a full outer join never uses BroadcastHashJoin, even if you explicitly call `broadcast()` on a DataFrame.
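As a sketch (the table contents and column names are invented for illustration, and a local Spark installation is required), the `broadcast()` hint marks the small side of the join explicitly:

```python
# Hinting a broadcast (map-side) join in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").appName("bhj-demo").getOrCreate()

large = spark.createDataFrame([(i, i % 3) for i in range(100)], ["id", "key"])
small = spark.createDataFrame([(0, "a"), (1, "b"), (2, "c")], ["key", "label"])

# Explicitly mark the small side for broadcast; Spark would also pick
# BroadcastHashJoin automatically if `small` is under the size threshold.
joined = large.join(broadcast(small), on="key", how="inner")
joined.explain()  # the physical plan should show BroadcastHashJoin
```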