Tips and Traps¶
An element in a pandas DataFrame can be any (complicated) Python type. To save a pandas DataFrame with arbitrary (complicated) types as-is, you have to use the pickle module. The method pandas.DataFrame.to_pickle (which is simply a wrapper over pickle.dump) serializes the DataFrame to a pickle file, while the function pandas.read_pickle (a wrapper over pickle.load) reads a pickle file back into a DataFrame.
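A minimal sketch of the round trip; the file name df.pickle and the example data are made up for illustration:

```python
import pandas as pd

# A DataFrame whose object column holds arbitrary Python types.
df = pd.DataFrame({"x": [1, 2], "objs": [[1, 2], {"a": 1}]})

# Serialize the DataFrame, preserving the Python objects as-is.
df.to_pickle("df.pickle")

# Deserialize it back; the complicated types round-trip intact.
df2 = pd.read_pickle("df.pickle")
```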
Boolean Column Operators and Functions in Spark
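A quick sketch with made-up data: Spark Column objects support the boolean operators & (and), | (or), and ~ (not), plus boolean-returning functions such as isin. Each comparison must be parenthesized because these operators bind more tightly than comparisons.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (5, "b")], ["x", "s"])

# Combine boolean Column expressions with & (and), | (or), ~ (not).
df.filter((df.x > 2) & (df.s == "b")).show()

# Column functions such as isin also return boolean Columns.
df.filter(~(df.x > 2) | df.s.isin("a", "c")).show()
```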
Select All Columns Except a Few from a Table
There is no (direct) way to select all columns except a few from a table using standard SQL. However, this is easily doable with DataFrame APIs (pandas, Spark/PySpark, etc.), as the sketch below shows.
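A minimal sketch with made-up data, dropping column b and keeping everything else:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# pandas: drop the unwanted columns and keep the rest.
df.drop(columns=["b"])

# PySpark: DataFrame.drop accepts one or more column names.
spark.createDataFrame(df).drop("b").show()
```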
Koalas Is the pandas API on Apache Spark
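A minimal sketch of the idea, assuming the databricks-koalas package is installed (Koalas has since been merged into Spark 3.2+ as pyspark.pandas):

```python
import databricks.koalas as ks

# Koalas mirrors the pandas API while executing on Spark.
kdf = ks.DataFrame({"x": [1, 2, 3]})
kdf["y"] = kdf["x"] * 2
print(kdf.head())
```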
References¶
https://github.com/databricks/koalas
https://databricks.com/blog/2020/08/11/interoperability-between-koalas-and-apache-spark.html
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
New Features in Spark 3
AQE (Adaptive Query Execution)¶
To enable AQE, you have to set spark.sql.adaptive.enabled to true, either using --conf spark.sql.adaptive.enabled=true with spark-submit or using spark.conf.set("spark.sql.adaptive.enabled", "true") in Spark/PySpark code.
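A minimal sketch of both ways in PySpark (the application name is made up):

```python
from pyspark.sql import SparkSession

# Enable AQE when building the session ...
spark = (
    SparkSession.builder.appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# ... or toggle it on an existing session at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
```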
Pandas UDFs¶
Pandas UDFs are user-defined functions that Spark executes using Arrow to transfer data to pandas for processing, which allows vectorized operations. A Pandas UDF is defined using the pandas_udf decorator.
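A minimal sketch of a Series-to-Series Pandas UDF in the Spark 3 type-hint style:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    # The whole column arrives as a pandas Series, so the
    # addition is vectorized rather than row-by-row.
    return s + 1

df = spark.range(3)
df.select(plus_one(df.id).alias("id_plus_one")).show()
```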
Broadcast Join in Spark
Tips and Traps¶
BroadcastHashJoin, i.e., map-side join, is fast. Use BroadcastHashJoin if possible. Notice that Spark will automatically use BroadcastHashJoin if a table in an inner join has an estimated size below the configured broadcast limit (spark.sql.autoBroadcastJoinThreshold).
Notice that broadcast joins are not supported for every join type. A full outer join never uses a broadcast join, and for a left outer join only the right-hand table can be broadcast (and vice versa); explicitly broadcasting a DataFrame on an unsupported side won't make it happen.
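A minimal sketch (with made-up data) of explicitly hinting a broadcast via pyspark.sql.functions.broadcast:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
large = spark.createDataFrame([(1, "x"), (2, "y")], ["key", "v1"])
small = spark.createDataFrame([(1, "a")], ["key", "v2"])

# Hint Spark to broadcast the small side of the join.
large.join(broadcast(small), on="key", how="inner").show()
```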