Ben Chuanlong Du's Blog

It is never too late to learn.

High Performance Computing in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Computing Frames

Apache Ray

A fast and simple framework for building and running distributed applications.

Ray does not handle large data well (as of 2018/05/28). Please refer to the …
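As a quick orientation to Ray's task API, here is a minimal sketch; the function name square and the inputs are made up for illustration.

:::python
import ray

ray.init()  # start a local Ray runtime

@ray.remote
def square(x):
    """A task that Ray can schedule on any worker in the cluster."""
    return x * x

# Launch tasks asynchronously, then block until all results are ready.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]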

Free High Performance Computing Resources

Cloud Platforms

Some cloud platforms offer free trials or credits.

Using GPUs

A Beginners Guide to Basic GPU Application for Integer Calculations https://saturncloud.io/blog/a-beginners-guide-to-basic-gpu-application-for-integer-calculations/#:~:text …

Broadcast Join in Spark

Tips and Traps

  1. BroadcastHashJoin, i.e., map-side join, is fast. Use BroadcastHashJoin if possible. Notice that Spark automatically uses BroadcastHashJoin in an inner join if one of the tables is smaller than the configured broadcast threshold (spark.sql.autoBroadcastJoinThreshold).

  2. Notice that BroadcastJoin only works for inner joins. If you have an outer join, BroadcastJoin won't happen even if you explicitly broadcast a DataFrame. A minimal sketch of an explicit broadcast join is given after this list.
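The sketch below shows how to request a broadcast join explicitly in PySpark; the table contents and column names are made up for illustration.

:::python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast_join_demo").getOrCreate()

# A larger fact table and a small dimension table (toy data).
orders = spark.createDataFrame(
    [(1, "apple", 3), (2, "banana", 5), (3, "apple", 1)],
    ["order_id", "item", "qty"],
)
prices = spark.createDataFrame(
    [("apple", 0.5), ("banana", 0.25)],
    ["item", "price"],
)

# Hint Spark to broadcast the small table so the inner join
# is executed as a map-side BroadcastHashJoin.
joined = orders.join(broadcast(prices), on="item", how="inner")
joined.explain()  # the physical plan should contain BroadcastHashJoin
joined.show()

If you want full manual control, setting spark.sql.autoBroadcastJoinThreshold to -1 disables automatic broadcasting.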

Conversion Between PySpark DataFrames and pandas DataFrames

Comments

  1. A PySpark DataFrame can be converted to a pandas DataFrame by calling the method DataFrame.toPandas, and a pandas DataFrame can be converted to a PySpark DataFrame by calling SparkSession.createDataFrame. Notice that when you call DataFrame.toPandas, the whole Spark DataFrame is collected to the driver machine! This means that you should only call DataFrame.toPandas when the Spark DataFrame is small enough to fit into the memory of the driver.
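A minimal sketch of both conversion directions is given below; the DataFrame contents are made up for illustration.

:::python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas_conversion_demo").getOrCreate()

# pandas -> PySpark
pdf = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
sdf = spark.createDataFrame(pdf)

# PySpark -> pandas: collects the whole DataFrame to the driver,
# so only do this when the data comfortably fits in driver memory.
pdf2 = sdf.toPandas()
print(pdf2)

In recent versions of Spark, enabling Apache Arrow (spark.sql.execution.arrow.pyspark.enabled) usually speeds up both conversions.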

Hands on the Python module numba

Installation

numba can be installed using the following command.

:::bash
pip3 install numba

If you need CUDA support, you have to install CUDA drivers.

:::bash
sudo apt-get install cuda-10-1

Instead of going through the hassle of configuring numba for GPU, a better way is to run numba in an NVIDIA Docker container. The Docker image nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 has the CUDA runtime installed, so it is as easy as installing numba on top of it and you are ready to go. For more detailed instructions, please refer to …
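As a quick sanity check that numba works, here is a minimal CPU example; the function name and the input array are arbitrary.

:::python
import numpy as np
from numba import njit

@njit  # compile to native machine code on the first call
def sum_of_squares(arr):
    total = 0.0
    for x in arr:
        total += x * x
    return total

arr = np.random.rand(1_000_000)
print(sum_of_squares(arr))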

Parallel Computing in Java

The following are a few tips for multithreaded parallel computing in Java.

  1. Instance fields, static fields and elements of arrays are stored in heap memory and thus can be shared between …