A PySpark DataFrame can be converted to a pandas DataFrame by calling the method `DataFrame.toPandas`, and a pandas DataFrame can be converted to a PySpark DataFrame by calling `SparkSession.createDataFrame`. Notice that when you call `DataFrame.toPandas` to convert a Spark DataFrame to a pandas DataFrame, the whole Spark DataFrame is collected to the driver machine! This means that you should only call `DataFrame.toPandas` when the Spark DataFrame is small enough to fit into the driver's memory.
Logging in PySpark
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
- Excessive logging is better than no logging! This is generally true in distributed big data applications.
- Use loguru if it is available. If you have to use the logging module, be …
Read/Write CSV in PySpark
Load Data in CSV Format
`.load` is a general method for reading data in different formats. You have to specify the format of the data via the method `.format`, of course. `.csv` (for both CSV and TSV), `.json` and `.parquet` are specializations of `.load`. `.format` is optional if you use a specific loading function (csv, json, etc.).
Using Temporary Columns in Spark
Subtle Differences Between Spark DataFrame and PySpark DataFrame
- Besides using the `col` function to reference a column, the Spark/Scala DataFrame supports using `$"col_name"` (based on an implicit conversion; you must have `import spark.implicits._`), while the PySpark DataFrame supports using …