Tips and Traps¶
BroadcastHashJoin, i.e., map-side join is fast. Use BroadcastHashJoin if possible. Notice that Spark will automatically use BroacastHashJoin if a table in inner join has a size less then the configured BroadcastHashJoin limit.
Notice that BroadcastJoin only works for inner joins. If you have a outer join, BroadcastJoin won't happend even if you explicitly Broadcast a DataFrame.
Conversion Between PySpark DataFrames and pandas DataFrames
Comments¶
A PySpark DataFrame can be converted to a pandas DataFrame by calling the method
DataFrame.toPandas, and a pandas DataFrame can be converted to a PySpark DataFrame by callingSparkSession.createDataFrame. Notice that when you callDataFrame.toPandasto convert a Spark DataFrame to a pandas DataFrame, the whole Spark DataFrame is collected to the driver machine! This means that you should only call the methodDataFrame.toPandas