Comments¶
- After sorting, rows in a DataFrame are ordered by partition ID, and rows within each partition are sorted. This property can be leveraged to implement a global ranking of rows. For more details, please refer to Computing global rank of a row in a DataFrame with Spark SQL. However, notice that multi-layer ranking is often more efficient than a global ranking in big data applications.
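The idea can be illustrated without Spark: once partitions are ordered and each partition is sorted internally, a row's global rank is its local rank plus the total number of rows in all earlier partitions. A minimal pure-Python sketch (the partition layout is hypothetical, for illustration only):

```python
from itertools import accumulate

# Partitions as produced by a sort: partition IDs are ordered,
# and rows within each partition are sorted.
partitions = [[1, 3, 5], [7, 8], [9, 12, 15]]

# Offset of each partition = number of rows in all earlier partitions.
offsets = [0] + list(accumulate(len(p) for p in partitions))[:-1]

# Global rank = partition offset + local (1-based) rank within the partition.
global_ranks = {
    value: offset + local_rank
    for partition, offset in zip(partitions, offsets)
    for local_rank, value in enumerate(partition, start=1)
}

print(global_ranks[7])  # → 4
```

In Spark this is the same trick: collect the per-partition row counts once, broadcast the cumulative offsets, and add each row's within-partition rank to its partition's offset.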
Monitoring and Alerting Tools
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Grafana seems like another great choice: https://grafana.com/grafana/download?platform=docker
Prometheus sounds like a good one!
ELKI sounds like a possible tool for monitoring and alerting.
Argus is another one …
New Features in Spark 3
AQE (Adaptive Query Execution)¶
To enable AQE,
you have to set spark.sql.adaptive.enabled to true
(using --conf spark.sql.adaptive.enabled=true in spark-submit,
or spark.conf.set("spark.sql.adaptive.enabled", "true") in Spark/PySpark code).
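A configuration sketch showing both ways of setting the flag (the app name is hypothetical; note that in Spark 3.2+ spark.sql.adaptive.enabled defaults to true):

```python
from pyspark.sql import SparkSession

# Enable AQE when building the session.
spark = (
    SparkSession.builder
    .appName("aqe-demo")  # hypothetical app name
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# Or toggle it at runtime on an existing session.
spark.conf.set("spark.sql.adaptive.enabled", "true")
```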
Pandas UDFs¶
Pandas UDFs are user-defined functions
that Spark executes using Apache Arrow
to transfer data to pandas,
which enables vectorized operations on the data.
A Pandas UDF is defined using the pandas_udf decorator.
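A minimal sketch of the idea: the core function operates on whole pandas Series (vectorized), and the commented lines show how the same function would be registered as a Pandas UDF in PySpark with the Spark 3 type-hint style.

```python
import pandas as pd

def plus_one(s: pd.Series) -> pd.Series:
    """Vectorized: operates on a whole pandas Series at once,
    not element by element as a plain Python UDF would."""
    return s + 1

# In PySpark, the same function becomes a Pandas UDF:
#
#   from pyspark.sql.functions import pandas_udf
#
#   @pandas_udf("long")
#   def plus_one(s: pd.Series) -> pd.Series:
#       return s + 1
#
#   df.select(plus_one(df.x))  # Arrow ships df.x to pandas in Series batches

print(plus_one(pd.Series([1, 2, 3])).tolist())  # → [2, 3, 4]
```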
Query Pandas Data Frames Using SQL
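One lightweight way to do this (an assumption on my part; the note does not name a specific tool) is to copy the DataFrame into an in-memory SQLite database and query it with SQL:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [10, 30, 20]})

# Load the DataFrame into an in-memory SQLite database ...
con = sqlite3.connect(":memory:")
df.to_sql("scores", con, index=False)

# ... then query it with plain SQL via pandas.read_sql.
top = pd.read_sql(
    "SELECT name, score FROM scores ORDER BY score DESC LIMIT 1", con
)
print(top.iloc[0]["name"])  # → b
con.close()
```

This round-trips the data through SQLite, so it suits small to medium DataFrames; for larger data a columnar engine would avoid the copy.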
Use pyarrow to Share Data in Memory in Python
References¶
https://github.com/apache/arrow
https://stackoverflow.com/questions/54582073/sharing-objects-across-workers-using-pyarrow
https://github.com/pytorch/pytorch/issues/13039
https://issues.apache.org/jira/browse/ARROW-5130
https://uwekorn.com/2019/09/15/how-we-build-apache-arrows-manylinux-wheels.html