Tips and Traps¶
An element in a pandas DataFrame can be any (complicated) Python type. To save a pandas DataFrame with arbitrary (complicated) types as-is, you have to use the pickle module. The method pandas.DataFrame.to_pickle (which is simply a wrapper over pickle.dump) serializes the DataFrame to a pickle file, while the function pandas.read_pickle (a wrapper over pickle.load) reads a pickle file back into a DataFrame.
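A minimal sketch of the round trip; the file name df.pickle and the example data are made up for illustration:

```python
import pandas as pd

# A DataFrame whose object column holds arbitrary Python types.
df = pd.DataFrame({"x": [1, 2], "objs": [[1, 2], {"a": 1}]})

# Serialize the DataFrame, preserving the Python objects as-is.
df.to_pickle("df.pickle")

# Deserialize it back; the complicated types round-trip intact.
df2 = pd.read_pickle("df.pickle")
```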
Boolean Column Operators and Functions in Spark
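A quick sketch with made-up data: Spark Column objects support the boolean operators & (and), | (or), and ~ (not), plus boolean-returning functions such as isin. Each comparison must be parenthesized because these operators bind more tightly than comparisons.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (5, "b")], ["x", "s"])

# Combine boolean Column expressions with & (and), | (or), ~ (not).
df.filter((df.x > 2) & (df.s == "b")).show()

# Column functions such as isin also return boolean Columns.
df.filter(~(df.x > 2) | df.s.isin("a", "c")).show()
```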
Select All Columns Except a Few from a Table
There is no (direct) way to select all columns except a few from a table using standard SQL. However, this is easily doable with DataFrame APIs (pandas, Spark/PySpark, etc.), as the sketch below shows.
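A minimal sketch with made-up data, dropping column b and keeping everything else:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# pandas: drop the unwanted columns and keep the rest.
df.drop(columns=["b"])

# PySpark: DataFrame.drop accepts one or more column names.
spark.createDataFrame(df).drop("b").show()
```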
Koalas Is the pandas API on Apache Spark
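A minimal sketch of the idea, assuming the databricks-koalas package is installed (Koalas has since been merged into Spark 3.2+ as pyspark.pandas):

```python
import databricks.koalas as ks

# Koalas mirrors the pandas API while executing on Spark.
kdf = ks.DataFrame({"x": [1, 2, 3]})
kdf["y"] = kdf["x"] * 2
print(kdf.head())
```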
References¶
https://github.com/databricks/koalas
https://databricks.com/blog/2020/08/11/interoperability-between-koalas-and-apache-spark.html
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
New Features in Spark 3
AQE (Adaptive Query Execution)¶
To enable AQE, you have to set spark.sql.adaptive.enabled to true, either using --conf spark.sql.adaptive.enabled=true with spark-submit or using spark.conf.set("spark.sql.adaptive.enabled", "true") in Spark/PySpark code.
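A minimal sketch of both ways in PySpark (the application name is made up):

```python
from pyspark.sql import SparkSession

# Enable AQE when building the session ...
spark = (
    SparkSession.builder.appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# ... or toggle it on an existing session at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
```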
Pandas UDFs¶
Pandas UDFs are user-defined functions that Spark executes using Arrow to transfer data to pandas for processing, which allows vectorized operations. A Pandas UDF is defined using the pandas_udf decorator.
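A minimal sketch of a Series-to-Series Pandas UDF in the Spark 3 type-hint style:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    # The whole column arrives as a pandas Series, so the
    # addition is vectorized rather than row-by-row.
    return s + 1

df = spark.range(3)
df.select(plus_one(df.id).alias("id_plus_one")).show()
```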
Broadcast Join in Spark
Tips and Traps¶
BroadcastHashJoin, i.e., map-side join, is fast. Use BroadcastHashJoin if possible. Notice that Spark will automatically use BroadcastHashJoin if a table in an inner join has an estimated size below the configured broadcast limit (spark.sql.autoBroadcastJoinThreshold).
Notice that broadcast joins are not supported for every join type. A full outer join never uses a broadcast join, and for a left outer join only the right-hand table can be broadcast (and vice versa); explicitly broadcasting a DataFrame on an unsupported side won't make it happen.
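A minimal sketch (with made-up data) of explicitly hinting a broadcast via pyspark.sql.functions.broadcast:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
large = spark.createDataFrame([(1, "x"), (2, "y")], ["key", "v1"])
small = spark.createDataFrame([(1, "a")], ["key", "v2"])

# Hint Spark to broadcast the small side of the join.
large.join(broadcast(small), on="key", how="inner").show()
```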