Ben Chuanlong Du's Blog

It is never too late to learn.

High Performance Computing in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Computing Frames

Apache Ray

A fast and simple framework for building and running distributed applications.

Ray does not handle large data well (as of 2018/05/28). Please refer to the …

Broadcast Join in Spark

Tips and Traps

  1. BroadcastHashJoin, i.e., a map-side join, is fast. Use BroadcastHashJoin if possible. Notice that Spark will automatically use BroadcastHashJoin if a table in an inner join has a size less than the configured broadcast threshold (spark.sql.autoBroadcastJoinThreshold).

  2. Notice that BroadcastJoin only works for inner joins. If you have an outer join, a BroadcastJoin won't happen even if you explicitly broadcast a DataFrame.

Conversion Between PySpark DataFrames and pandas DataFrames

Comments

  1. A PySpark DataFrame can be converted to a pandas DataFrame by calling the method DataFrame.toPandas, and a pandas DataFrame can be converted to a PySpark DataFrame by calling SparkSession.createDataFrame. Notice that when you call DataFrame.toPandas to convert a Spark DataFrame to a pandas DataFrame, the whole Spark DataFrame is collected to the driver machine! This means that you should only call the method DataFrame.toPandas when the DataFrame is small enough to fit into the driver's memory.

Hands on the Python module numba

Installation

numba can be installed using the following command.

:::bash
pip3 install numba

If you need CUDA support, you also have to install the CUDA toolkit and drivers.

:::bash
sudo apt-get install cuda-10-1

Instead of going through the hassle of configuring numba for GPU, a better way is to run numba in an NVIDIA Docker container. The Docker image nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 has the CUDA runtime installed, so it is as easy as installing numba on top of it and you are ready to go. For more detailed instructions, please refer to …

Hands on the Python module Multiprocessing

Comments

  1. multiprocess is a fork of the Python standard library multiprocessing. multiprocess extends multiprocessing to provide enhanced serialization, using dill. multiprocess leverages multiprocessing to support the spawning of processes using the API of the Python standard library's threading module.

  2. multiprocessing.Pool.map does not work with lambda functions due to the fact that lambda functions cannot be pickled. There are multiple approaches to avoid the issue: you can define a named function at module level, or use functools.partial.
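The functools.partial workaround can be sketched as follows (the function power is made up for illustration):

```python
import functools
import multiprocessing


def power(base, exponent):
    # A module-level named function can be pickled, unlike a lambda.
    return base ** exponent


if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        # functools.partial fixes the exponent so that Pool.map
        # only needs to supply a single argument (the base).
        squares = pool.map(functools.partial(power, exponent=2), [1, 2, 3, 4])
    print(squares)  # [1, 4, 9, 16]
```

A partial object is picklable as long as the wrapped function is, which is why this sidesteps the lambda restriction.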