Conversion Between PySpark DataFrames and pandas DataFrames

Jun 18, 2020

Comments

A PySpark DataFrame can be converted to a pandas DataFrame by calling DataFrame.toPandas, and a pandas DataFrame can be converted to a PySpark DataFrame by calling SparkSession.createDataFrame. Note that DataFrame.toPandas collects the whole Spark DataFrame to the driver machine! This means that you should only call the method DataFrame.toPandas …
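Since toPandas pulls every row to the driver, one cautious pattern is to cap the row count before converting. The following is only a sketch: the helper names `to_pandas_capped` and `roundtrip_demo` and the `max_rows` parameter are made up for illustration, and the demo assumes pyspark and pandas are installed with a local Spark available.

```python
def to_pandas_capped(sdf, max_rows=10_000):
    """Convert a Spark DataFrame to pandas, collecting at most max_rows rows.

    Hypothetical helper for this sketch; not part of the PySpark API.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is applied by Spark before the collect that toPandas() triggers,
    # so at most max_rows rows are brought to the driver.
    return sdf.limit(max_rows).toPandas()


def roundtrip_demo():
    """pandas -> Spark -> pandas on a tiny local session (assumes pyspark is installed)."""
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("conv-demo").getOrCreate()
    pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    sdf = spark.createDataFrame(pdf)           # pandas -> Spark
    back = to_pandas_capped(sdf, max_rows=2)   # Spark -> pandas, capped at 2 rows
    spark.stop()
    return back
```

Calling `roundtrip_demo()` starts a local session, so it is kept in a function rather than run at import time.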

Jun 15, 2020

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Excessive logging is better than no logging! This is generally true in distributed big data applications.
Use loguru if it is available. If you have to use the logging module, be …
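A rough sketch of the "loguru if available, otherwise logging" idea follows. The logger name `spark_job`, the format string, and the helper `log_step` are assumptions made for this example, not something the notes prescribe.

```python
import logging

try:
    # Preferred when the package is installed.
    from loguru import logger
except ImportError:
    # Fallback: standard library logging with timestamps and levels.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    logger = logging.getLogger("spark_job")  # hypothetical logger name


def log_step(step, rows):
    """Log one pipeline step with its row count and return the message."""
    msg = f"step={step} rows={rows}"
    # Both loguru's logger and logging.Logger expose .info(message).
    logger.info(msg)
    return msg
```

Logging a line like this after each stage of a job errs on the side of too much information, in the spirit of the note above.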

May 22, 2020

.load is a general method for reading data in different formats. You specify the format of the data via the method .format. .csv (for both CSV and TSV), .json and .parquet are specializations of .load, so .format is not needed when you use one of these format-specific loading functions (csv, json, etc.).
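The two styles can be sketched side by side. The function names `load_generic` and `load_tsv` are hypothetical wrappers for illustration; `spark` is assumed to be an existing SparkSession, and the TSV file is assumed to have a header row.

```python
def load_generic(spark, path, fmt, **options):
    """Read `path` with the general reader: .format(fmt), then .load(path)."""
    return spark.read.format(fmt).options(**options).load(path)


def load_tsv(spark, path):
    """Read a TSV file with the csv shortcut; no .format call is needed."""
    # sep="\t" switches the CSV reader to tab-separated input.
    return spark.read.csv(path, sep="\t", header=True)
```

For example, `load_generic(spark, "data.parquet", "parquet")` goes through the same reader as `spark.read.parquet("data.parquet")`.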

May 21, 2020

Feb 19, 2020

Besides using the col function to reference a column, a Spark/Scala DataFrame supports $"col_name" (based on implicit conversion; this requires import spark.implicits._), while a PySpark DataFrame supports using …

Dec 19, 2019
