Tips and Traps
BroadcastHashJoin, i.e., a map-side join, is fast. Use BroadcastHashJoin if possible. Notice that Spark will automatically use BroadcastHashJoin if a table in an inner join has a size less than the configured BroadcastHashJoin limit (spark.sql.autoBroadcastJoinThreshold, 10 MB by default).
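Notice that BroadcastHashJoin does not work for every join type. It cannot be used for a full outer join, and in a left (right) outer join only the right (left) table can be broadcast. In those unsupported cases a broadcast join won't happen even if you explicitly broadcast a DataFrame.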
Null Values in Inner Join of Spark Dataframes
Merge/Join pandas DataFrames
Reference
https://pandas.pydata.org/pandas-docs/stable/merging.html
http://stackoverflow.com/questions/22676081/pandas-the-difference-between-join-and-merge
Comment
You are able to specify (via left_on and right_on) which columns to join on in each data frame. Columns that appear in both data frames but are not used in joining are distinguished using suffixes.
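As a small illustration of joining on differently named key columns in pandas, the sketch below uses two made-up data frames (left and right) that share a non-key column named score. The data and the suffix strings are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data: the key is called emp_id on the left and id on the right.
left = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
    "score": [90, 80, 70],
})
right = pd.DataFrame({
    "id": [1, 2, 4],
    "dept": ["ops", "dev", "hr"],
    "score": [9, 8, 7],
})

# left_on/right_on name the join column in each data frame.
# score exists in both frames but is not a join key, so pandas
# appends the suffixes to tell the two columns apart.
merged = pd.merge(
    left,
    right,
    left_on="emp_id",
    right_on="id",
    how="inner",
    suffixes=("_left", "_right"),
)

print(merged)
```

Both key columns (emp_id and id) are kept in the result because they have different names; drop one of them afterwards if it is redundant.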