Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
https://build-system.fman.io/
https://github.com/mherrmann/fbs-tutorial
To enable AQE,
you have to set spark.sql.adaptive.enabled
to true
(using `--conf spark.sql.adaptive.enabled=true`
with `spark-submit`,
or using `spark.conf.set("spark.sql.adaptive.enabled", "true")` in Spark/PySpark code).
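Below is a minimal sketch of both approaches, assuming Spark 3.x and PySpark; the application name is illustrative.

```python
# A minimal sketch: enabling AQE when building a SparkSession (assumes Spark 3.x).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")  # hypothetical application name
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# Or toggle it on an existing session at runtime:
spark.conf.set("spark.sql.adaptive.enabled", "true")
```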
Pandas UDFs are user-defined functions
that Spark executes using Apache Arrow
to transfer data and pandas to work with the data,
which allows vectorized operations.
A Pandas UDF is defined using the `pandas_udf` decorator.
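Below is a minimal sketch of a Series-to-Series Pandas UDF, assuming Spark 3.x (which supports type hints); the column name `x` and the function `plus_one` are illustrative.

```python
# A minimal sketch of a Series-to-Series Pandas UDF (assumes Spark 3.x and pyarrow installed).
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()


@pandas_udf(DoubleType())
def plus_one(s: pd.Series) -> pd.Series:
    # The whole batch arrives as a pandas Series, so the operation is vectorized.
    return s + 1.0


df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])
df.select(plus_one("x").alias("x_plus_one")).show()
```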
https://github.com/apache/arrow
https://stackoverflow.com/questions/54582073/sharing-objects-across-workers-using-pyarrow
https://github.com/pytorch/pytorch/issues/13039
https://issues.apache.org/jira/browse/ARROW-5130
https://uwekorn.com/2019/09/15/how-we-build-apache-arrows-manylinux-wheels.html