Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
https://build-system.fman.io/
https://github.com/mherrmann/fbs-tutorial
To enable AQE,
you have to set spark.sql.adaptive.enabled
to true,
either using `--conf spark.sql.adaptive.enabled=true`
in spark-submit
or using `spark.conf.set("spark.sql.adaptive.enabled", "true")` in Spark/PySpark code.
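As a sketch of the in-code route (the app name is illustrative; this needs a working PySpark installation), the flag can be set when the session is built or on an existing session:

```python
# Enabling AQE when constructing a SparkSession (app name is illustrative).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# Or, on an already-running session:
# spark.conf.set("spark.sql.adaptive.enabled", "true")
```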
Pandas UDFs are user-defined functions
that Spark executes using Arrow
to transfer data to pandas,
which allows vectorized operations.
A Pandas UDF is defined using `pandas_udf`.
https://github.com/apache/arrow
https://stackoverflow.com/questions/54582073/sharing-objects-across-workers-using-pyarrow
https://github.com/pytorch/pytorch/issues/13039
https://issues.apache.org/jira/browse/ARROW-5130
https://uwekorn.com/2019/09/15/how-we-build-apache-arrows-manylinux-wheels.html
You cannot use a class in default parameter values of the class's own __init__
method: default values are evaluated when the method is defined, while the class
body is still executing, so the class name is not yet bound at that point.
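A minimal illustration (the `Node` class is hypothetical) of why this fails and the usual sentinel workaround:

```python
class Node:
    # `def __init__(self, parent=Node())` would raise NameError here:
    # the name `Node` is only bound after the whole class body finishes
    # executing, but defaults are evaluated at method-definition time.
    # The common workaround is a None sentinel resolved inside the body:
    def __init__(self, parent=None):
        self.parent = parent if parent is not None else self

root = Node()       # no parent given: the node points at itself
child = Node(root)  # explicit parent
```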
BroadcastHashJoin, i.e., a map-side join, is fast. Use BroadcastHashJoin if possible. Notice that Spark will automatically use BroadcastHashJoin if one side of an inner join has a size below the configured broadcast threshold (`spark.sql.autoBroadcastJoinThreshold`).
Notice that Spark cannot broadcast the preserved (outer) side of a join. In particular, a full outer join never uses BroadcastHashJoin, even if you explicitly call `broadcast()` on a DataFrame.
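As a sketch (the table contents and column names are invented for illustration, and a local Spark installation is required), the `broadcast()` hint marks the small side of the join explicitly:

```python
# Hinting a broadcast (map-side) join in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").appName("bhj-demo").getOrCreate()

large = spark.createDataFrame([(i, i % 3) for i in range(100)], ["id", "key"])
small = spark.createDataFrame([(0, "a"), (1, "b"), (2, "c")], ["key", "label"])

# Explicitly mark the small side for broadcast; Spark would also pick
# BroadcastHashJoin automatically if `small` is under the size threshold.
joined = large.join(broadcast(small), on="key", how="inner")
joined.explain()  # the physical plan should show BroadcastHashJoin
```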