Ben Chuanlong Du's Blog

It is never too late to learn.

Collection Functions in Spark

Tips and Traps

  1. If you use PySpark instead of Spark/Scala, pandas udf is a great alternative to all those (complicated) collections functions discussed here. Leveraging pandas udf, each partition of a Spark DataFrame can be converted to a pandas DataFrame without copying the underlying data, you can then do transforms on pandas DataFrames which will be converted back to partitons of a Spark DataFrame.

Get Size of Tables on HDFS

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

The HDFS Way

You can use the hdfs dfs -du /path/to/table command or hdfs dfs -count -q -v -h /path/to/table to get the size of an HDFS path (or table). However, this only works if the cluster supports HDFS. If a Spark cluster exposes only JDBC/ODBC APIs, this method does not work.

Access Control in Spark SQL

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Grant Permission to Users

GRANT
    priv_type [, priv_type ] ...
    ON database_table_or_view_name
    TO principal_specification [, principal_specification] ...
    [WITH GRANT OPTION];

Examples:

GRANT SELECT ON table1 TO USER user1;
GRANT SELECT ON DATABASE db1 TO USER user1 …