
Collection Functions in Spark

Tips and Traps

  1. If you use PySpark instead of Spark/Scala, pandas UDFs are a great alternative to all of the (complicated) collection functions discussed here. Leveraging pandas UDFs, each partition of a Spark DataFrame can be converted to a pandas DataFrame without copying the underlying data; you can then transform the pandas DataFrames, which are converted back into partitions of a Spark DataFrame. A sketch of this workflow is shown below.
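
For example, here is a minimal sketch of the partition-wise workflow using `DataFrame.mapInPandas` (available since Spark 3.0). The DataFrame, column names, and the doubling transform are made up for illustration.

```python
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "tag"])


def double_id(pdfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each element of the iterator is one partition as a pandas DataFrame.
    for pdf in pdfs:
        pdf["id"] = pdf["id"] * 2
        yield pdf


# mapInPandas applies the function partition by partition and stitches
# the resulting pandas DataFrames back into a Spark DataFrame.
df.mapInPandas(double_id, schema=df.schema).show()
```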

Handling Complicated Data Types in Python and PySpark

Tips and Traps

  1. An element in a pandas DataFrame can be of any (complicated) Python type. To save a pandas DataFrame with arbitrary (complicated) types as is, you have to use the pickle module. The method pandas.DataFrame.to_pickle (which is essentially a wrapper over pickle.dump) serializes the DataFrame to a pickle file, while the function pandas.read_pickle deserializes a pickled DataFrame from a file, as shown in the sketch below.
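
For example, a minimal sketch of the round trip; the file name `df.pickle` and the object-valued "payload" column are made up for illustration.

```python
import pandas as pd

# A DataFrame whose "payload" column holds arbitrary Python objects,
# which formats like CSV or Parquet cannot store as is.
df = pd.DataFrame({
    "id": [1, 2],
    "payload": [{"a": 1}, [3.14, "text"]],
})

df.to_pickle("df.pickle")          # serialize the DataFrame, objects included
df2 = pd.read_pickle("df.pickle")  # deserialize the DataFrame back
assert df2["payload"][0] == {"a": 1}
```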