Ben Chuanlong Du's Blog

It is never too late to learn.

Shell in Docker

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Configure the Shell for the RUN Command

https://docs.docker.com/engine/reference/builder/#shell

Configure the Default Shell for Terminals in Docker Containers

Just set the SHELL environment variable in …

Broadcast Join in Spark

Tips and Traps

  1. BroadcastHashJoin, i.e., a map-side join, is fast, so use it whenever possible. Notice that Spark automatically uses BroadcastHashJoin if a table in an inner join is smaller than the configured broadcast threshold (spark.sql.autoBroadcastJoinThreshold).

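  2. Notice that BroadcastJoin does not work for every join type. A full outer join cannot use BroadcastHashJoin, and in a left (right) outer join only the right (left) table can be broadcast. In those cases, BroadcastHashJoin won't happen even if you explicitly broadcast a DataFrame.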
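
To see why a map-side join is fast, here is a minimal pure-Python sketch of the idea behind it. It is not Spark code: the function `broadcast_hash_join` and the sample tables `orders` and `users` are hypothetical names made up for illustration. The small table is hashed once (the copy Spark would broadcast to every executor), and each partition of the large table is then joined locally, so the large table never has to be shuffled.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join every partition of a large table against a small table."""
    # Build phase: hash the small table once. In Spark, this is the
    # copy that would be broadcast to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: each partition is joined locally against the lookup
    # table, so rows of the large table never move between partitions.
    joined = []
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical data: a "large" table split into two partitions
# and a "small" table that fits in memory.
orders = [
    [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}],
    [{"user_id": 1, "amount": 7}, {"user_id": 3, "amount": 1}],
]
users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]

result = broadcast_hash_join(orders, users, "user_id")
```

In PySpark itself, the usual way to request this behavior explicitly is to wrap the small DataFrame with `pyspark.sql.functions.broadcast` before joining. Whether Spark honors the hint still depends on the join type.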

Conversion Between PySpark DataFrames and pandas DataFrames

Comments

  1. A PySpark DataFrame can be converted to a pandas DataFrame by calling the method DataFrame.toPandas, and a pandas DataFrame can be converted to a PySpark DataFrame by calling SparkSession.createDataFrame. Notice that when you call DataFrame.toPandas to convert a Spark DataFrame to a pandas DataFrame, the whole Spark DataFrame is collected to the driver machine! This means that you should only call the method DataFrame.toPandas