Tips and Traps¶
If you use PySpark instead of Spark/Scala, pandas UDFs are a great alternative to the (complicated) collection functions discussed here. With pandas UDFs, each partition of a Spark DataFrame is converted to a pandas DataFrame without copying the underlying data; you can then transform the pandas DataFrames, and the results are converted back into partitions of a Spark DataFrame.
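As a sketch of this idea (the column name `value` and the doubling transform below are made up for illustration), the per-partition logic can be written as a plain pandas function and then plugged into Spark, for example via `DataFrame.mapInPandas`:

```python
import pandas as pd


def double_value(pdf: pd.DataFrame) -> pd.DataFrame:
    """Transform one partition (as a pandas DataFrame): add a doubled column."""
    pdf = pdf.copy()
    pdf["value_doubled"] = pdf["value"] * 2
    return pdf


# With PySpark (not run here), the same function is applied partition by
# partition; mapInPandas passes each partition as an iterator of pandas
# DataFrames:
#
#   def double_partitions(batches):
#       for pdf in batches:  # each batch is a pandas DataFrame
#           yield double_value(pdf)
#
#   spark_df.mapInPandas(double_partitions,
#                        schema="value long, value_doubled long")

# The pandas logic itself can be tested without a Spark cluster:
result = double_value(pd.DataFrame({"value": [1, 2, 3]}))
print(result["value_doubled"].tolist())  # [2, 4, 6]
```

Keeping the transform as a standalone pandas function makes it easy to unit-test locally before wiring it into a Spark job.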
Get Size of Tables on HDFS
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
The HDFS Way¶
You can use the command `hdfs dfs -du /path/to/table`
or `hdfs dfs -count -q -v -h /path/to/table`
to get the size of an HDFS path (or table).
However,
this works only if the cluster supports HDFS.
If a Spark cluster exposes only JDBC/ODBC APIs,
this method does not work.
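The output of `hdfs dfs -du` is one line per child path and is easy to post-process. A minimal parsing sketch, assuming the three-column format (size, disk space consumed with replication, path) of recent Hadoop versions; the exact columns vary by Hadoop version, and the sample line below is illustrative, not real cluster output:

```python
def parse_du_line(line: str) -> tuple[int, int, str]:
    """Parse one line of `hdfs dfs -du` output into
    (size_bytes, disk_space_consumed_bytes, path).

    Assumes the 3-column format of recent Hadoop versions;
    older versions print only 2 columns (size, path).
    """
    size, consumed, path = line.split(None, 2)
    return int(size), int(consumed), path


# Illustrative line in the 3-column format (hypothetical path):
sample = "3614  10842  /user/hive/warehouse/db1.db/table1"
print(parse_du_line(sample))  # (3614, 10842, '/user/hive/warehouse/db1.db/table1')
```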
Access Control in Spark SQL
Grant Permission to Users
GRANT
priv_type [, priv_type ] ...
ON database_table_or_view_name
TO principal_specification [, principal_specification] ...
[WITH GRANT OPTION];
Examples:
GRANT SELECT ON table1 TO USER user1;
GRANT SELECT ON DATABASE db1 TO USER user1 …