Tips and Traps¶
If you use PySpark instead of Spark/Scala, pandas UDFs are a great alternative to the (complicated) collection functions discussed here. With a pandas UDF, each partition of a Spark DataFrame is converted to a pandas DataFrame without copying the underlying data; you can then transform the pandas DataFrames with plain pandas code, and the results are converted back into partitions of a Spark DataFrame.
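As a concrete illustration, below is a minimal sketch using `mapInPandas` (available since Spark 3.0), which is one of several pandas UDF mechanisms; the DataFrame, column names, and the doubling transform are made up for the example:

```python
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 21.0), (2, 30.0)], ["id", "age"])

def double_age(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each element of `batches` is a pandas DataFrame holding one Arrow
    # batch of a partition; transform it with ordinary pandas operations.
    for pdf in batches:
        pdf["age"] = pdf["age"] * 2
        yield pdf

df.mapInPandas(double_age, schema="id long, age double").show()
```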
An element in a pandas DataFrame can be of any (complicated) Python type. To save a pandas DataFrame with arbitrary (complicated) types as is, you have to use the pickle module. The method pandas.DataFrame.to_pickle (which is essentially a wrapper over pickle.dump) serializes the DataFrame to a pickle file, while the function pandas.read_pickle reads the pickled DataFrame back.
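For instance, the following sketch (with a made-up DataFrame and file path) pickles a DataFrame whose cells hold mixed Python objects and reads it back unchanged:

```python
import pandas as pd

# A DataFrame whose cells hold arbitrary Python objects (here, a set and
# a tuple) that formats such as CSV or Parquet cannot round-trip as is.
df = pd.DataFrame({
    "id": [1, 2],
    "payload": [{"a", "b"}, ("x", 1, None)],
})

df.to_pickle("/tmp/df.pkl")          # serializes via pickle.dump under the hood
df2 = pd.read_pickle("/tmp/df.pkl")  # deserializes via (essentially) pickle.load
assert df.equals(df2)
```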