Tips and Traps¶
If you use PySpark instead of Spark/Scala, pandas UDFs are a great alternative to the (complicated) collection functions discussed here. With pandas UDFs, each partition of a Spark DataFrame is converted to a pandas DataFrame without copying the underlying data; you can then transform the pandas DataFrames, and the results are converted back into partitions of a Spark DataFrame.
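As a sketch of this idea (the column name `value` and the doubling transform below are made up for illustration), the per-partition logic can be written as a plain pandas function and then plugged into Spark, for example via `DataFrame.mapInPandas`:

```python
import pandas as pd


def double_value(pdf: pd.DataFrame) -> pd.DataFrame:
    """Transform one partition (as a pandas DataFrame): add a doubled column."""
    pdf = pdf.copy()
    pdf["value_doubled"] = pdf["value"] * 2
    return pdf


# With PySpark (not run here), the same function is applied partition by
# partition; mapInPandas passes each partition as an iterator of pandas
# DataFrames:
#
#   def double_partitions(batches):
#       for pdf in batches:  # each batch is a pandas DataFrame
#           yield double_value(pdf)
#
#   spark_df.mapInPandas(double_partitions,
#                        schema="value long, value_doubled long")

# The pandas logic itself can be tested without a Spark cluster:
result = double_value(pd.DataFrame({"value": [1, 2, 3]}))
print(result["value_doubled"].tolist())  # [2, 4, 6]
```

Keeping the transform as a standalone pandas function makes it easy to unit-test locally before wiring it into a Spark job.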
Get Size of Tables on HDFS
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
The HDFS Way¶
You can use the command `hdfs dfs -du /path/to/table`
or `hdfs dfs -count -q -v -h /path/to/table`
to get the size of an HDFS path (or table).
However,
this works only if the cluster supports HDFS.
If a Spark cluster exposes only JDBC/ODBC APIs,
this method does not work.
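The output of `hdfs dfs -du` is one line per child path and is easy to post-process. A minimal parsing sketch, assuming the three-column format (size, disk space consumed with replication, path) of recent Hadoop versions; the exact columns vary by Hadoop version, and the sample line below is illustrative, not real cluster output:

```python
def parse_du_line(line: str) -> tuple[int, int, str]:
    """Parse one line of `hdfs dfs -du` output into
    (size_bytes, disk_space_consumed_bytes, path).

    Assumes the 3-column format of recent Hadoop versions;
    older versions print only 2 columns (size, path).
    """
    size, consumed, path = line.split(None, 2)
    return int(size), int(consumed), path


# Illustrative line in the 3-column format (hypothetical path):
sample = "3614  10842  /user/hive/warehouse/db1.db/table1"
print(parse_du_line(sample))  # (3614, 10842, '/user/hive/warehouse/db1.db/table1')
```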
Access Control in Spark SQL
Grant Permission to Users
GRANT
priv_type [, priv_type ] ...
ON database_table_or_view_name
TO principal_specification [, principal_specification] ...
[WITH GRANT OPTION];
Examples:
GRANT SELECT ON table1 TO USER user1;
GRANT SELECT ON DATABASE db1 TO USER user1 …