Tips and Traps
BroadcastHashJoin, i.e., a map-side join, is fast, so use it whenever possible. Spark automatically uses BroadcastHashJoin when a table in the join is estimated to be smaller than the limit configured by spark.sql.autoBroadcastJoinThreshold (10 MB by default).
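Notice that BroadcastHashJoin is not available for every join. In an outer join the preserved side (the left table of a left outer join, the right table of a right outer join) cannot be broadcast, and a full outer join cannot use BroadcastHashJoin at all. In those cases the broadcast won't happen even if you explicitly broadcast a DataFrame.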
In [1]:
import pandas as pd
import findspark
findspark.init("/opt/spark-3.0.0-bin-hadoop3.2/")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import broadcast
from pyspark.sql.types import StructType
spark = SparkSession.builder.appName("PySpark_Union").enableHiveSupport().getOrCreate()
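With a session available, the automatic broadcast limit mentioned above can be changed at runtime. The following is a minimal sketch rather than a recommended setup: the helper name set_auto_broadcast_limit is made up for illustration, and it assumes the spark session created in the cell above.

```python
# Spark SQL setting that holds the automatic broadcast limit, in bytes.
THRESHOLD_KEY = "spark.sql.autoBroadcastJoinThreshold"


def set_auto_broadcast_limit(spark, limit_bytes: int) -> str:
    """Set the limit on `spark` and return the previous value as a string.

    Pass -1 to turn automatic broadcasting off.
    """
    previous = spark.conf.get(THRESHOLD_KEY)
    spark.conf.set(THRESHOLD_KEY, str(limit_bytes))
    return previous


# Usage with the session created above (hypothetical values):
# set_auto_broadcast_limit(spark, 50 * 1024 * 1024)  # raise the limit to 50 MB
# set_auto_broadcast_limit(spark, -1)  # disable automatic broadcasting
```

Raising the limit trades executor memory for fewer shuffles, so it is worth checking the resulting plan with explain() before relying on it.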
In [5]:
df1 = spark.createDataFrame(
    pd.DataFrame(
        data=[
            ["Ben", 2],
            ["Dan", 4],
            ["Will", 1],
        ],
        columns=["name", "id"],
    )
)
df1.show()
In [6]:
df2 = spark.createDataFrame(
    pd.DataFrame(
        data=[
            ["Ben", 30],
            ["Dan", 25],
            ["Will", 26],
        ],
        columns=["name", "age"],
    )
)
df2.show()
In [8]:
df1.join(df2, ["name"]).explain()
Notice that BroadcastHashJoin is used in the following execution plan.
In [9]:
df1.join(broadcast(df2), ["name"]).explain()
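Reading plans by eye gets tedious when several joins need to be compared. As a rough sketch, the printed plan can be captured and searched for the operator name. The helper names plan_text and uses_broadcast_hash_join are hypothetical, and the approach assumes that explain() prints the plan to standard output from Python.

```python
import contextlib
import io


def plan_text(df) -> str:
    """Return whatever df.explain() prints, as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def uses_broadcast_hash_join(df) -> bool:
    """Return True if the printed plan mentions a BroadcastHashJoin operator."""
    return "BroadcastHashJoin" in plan_text(df)


# Usage with the DataFrames defined above (requires a running SparkSession):
# uses_broadcast_hash_join(df1.join(broadcast(df2), ["name"]))
```

A plain substring check is crude, but it is enough to tell BroadcastHashJoin apart from SortMergeJoin in the plans shown here.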
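Notice that BroadcastHashJoin is not used in the following right outer join, even though df2 is explicitly broadcast: the right table is the preserved side, which cannot be broadcast.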
In [11]:
df1.join(broadcast(df2), ["name"], "right_outer").explain()
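Which side may be broadcast depends on the join type. The sketch below writes that dependency down as a small lookup table for equi-joins. The names BROADCASTABLE_SIDES and can_broadcast are made up for illustration and are not part of PySpark; the table reflects my understanding of Spark's join planning and should be checked against explain() on your own Spark version.

```python
# Hypothetical helper, not part of PySpark: which side of an equi-join
# BroadcastHashJoin can use as the broadcast side, per join type.
BROADCASTABLE_SIDES = {
    "inner": {"left", "right"},
    "left_outer": {"right"},
    "left_semi": {"right"},
    "left_anti": {"right"},
    "right_outer": {"left"},
    "full_outer": set(),
}


def can_broadcast(join_type: str, side: str) -> bool:
    """Return True if `side` ("left" or "right") may be broadcast for `join_type`."""
    return side in BROADCASTABLE_SIDES[join_type]


# The hint in the cell above targets the right side of a right outer join.
print(can_broadcast("right_outer", "right"))  # False
print(can_broadcast("right_outer", "left"))  # True
```

Under this rule, moving the hint to the left table, as in broadcast(df1).join(df2, ["name"], "right_outer").explain(), should produce a plan with BroadcastHashJoin.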