A PySpark DataFrame can be converted to a pandas DataFrame by calling the method DataFrame.toPandas, and a pandas DataFrame can be converted to a PySpark DataFrame by calling SparkSession.createDataFrame. Notice that when you call DataFrame.toPandas to convert a Spark DataFrame to a pandas DataFrame, the whole Spark DataFrame is collected to the driver machine! This means that you should only call DataFrame.toPandas when the DataFrame is small enough to fit into the driver's memory.
Logging in PySpark
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
- Excessive logging is better than no logging! This is generally true in distributed big data applications.
- Use loguru if it is available. If you have to use the logging module, be …
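Whichever backend you end up with, a simple fallback pattern keeps the calling code identical (a sketch, not from the original notes; f-string messages work with both loguru and the stdlib logging module):

```python
import logging

# Prefer loguru when it is installed; otherwise fall back to stdlib logging.
try:
    from loguru import logger
except ImportError:
    logging.basicConfig(
        format="%(asctime)s | %(levelname)s | %(name)s - %(message)s",
        level=logging.INFO,
    )
    logger = logging.getLogger(__name__)

n_rows = 1000  # illustrative value
logger.info(f"processed {n_rows} rows")
```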
Read/Write CSV in PySpark
Load Data in CSV Format
.load is a general method for reading data in different formats. You have to specify the format of the data via the method .format, of course. .csv (both for CSV and TSV), .json and .parquet are specializations of .load. .format is optional if you use a specific loading function (csv, json, etc.).
Using Temporary Columns in Spark
Subtle Differences Between Spark DataFrames and PySpark DataFrames
- Besides using the col function to reference a column, the Spark/Scala DataFrame supports using $"col_name" (based on an implicit conversion, which requires import spark.implicits._), while the PySpark DataFrame supports using …