Environment Variables¶
export
unset
Tips and Traps¶
explainshell.com is a great place for learning shell.
Bash-it/bash-it is a great community driven Bash framework.
It is suggested that you avoid writing complicated Bash scripts. IPython is a much better alternative.
Do NOT use
;to delimit paths passed to a shell command because;
Broadcast Join in Spark
Tips and Traps¶
BroadcastHashJoin, i.e., map-side join is fast. Use BroadcastHashJoin if possible. Notice that Spark will automatically use BroacastHashJoin if a table in inner join has a size less then the configured BroadcastHashJoin limit.
Notice that BroadcastJoin only works for inner joins. If you have a outer join, BroadcastJoin won't happend even if you explicitly Broadcast a DataFrame.
Conversion Between PySpark DataFrames and pandas DataFrames
Comments¶
A PySpark DataFrame can be converted to a pandas DataFrame by calling the method
DataFrame.toPandas, and a pandas DataFrame can be converted to a PySpark DataFrame by callingSparkSession.createDataFrame. Notice that when you callDataFrame.toPandasto convert a Spark DataFrame to a pandas DataFrame, the whole Spark DataFrame is collected to the driver machine! This means that you should only call the methodDataFrame.toPandas