Ben Chuanlong Du's Blog

It is never too late to learn.

Spark Issue: Unable to Find Encoder Type

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Issue: "Unable to find encoder for type stored in a Dataset"

Solution …
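
The solution above is truncated here. In Scala, this error typically means that no implicit Encoder for the element type is in scope. A minimal sketch of one common fix, assuming Spark 2.x or later (the case class and object names are made up for illustration):

    import org.apache.spark.sql.SparkSession

    // Case classes used in Datasets should be defined at the top level,
    // not inside the method that creates the Dataset; otherwise Spark
    // cannot derive an Encoder for them.
    case class Person(name: String, age: Int)

    object EncoderExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("encoder-example").getOrCreate()
        // Bring implicit Encoders for common types (case classes, tuples,
        // primitives) into scope. Omitting this import is the usual cause
        // of "Unable to find encoder for type stored in a Dataset".
        import spark.implicits._

        val ds = Seq(Person("Ben", 30), Person("Lian", 25)).toDS()
        ds.show()
        spark.stop()
      }
    }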

Select All Columns Except a Few from a Table

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Comments

There is no direct way to select all columns except a few from a table using SQL. However, this is easy to do with DataFrame APIs (pandas, Spark/PySpark, etc.).
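
For example, a minimal sketch with the Spark DataFrame API in Scala (the DataFrame and column names are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("drop-example").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a", 0.5), (2, "b", 0.7)).toDF("id", "name", "score")
    // Keep every column except the ones passed to drop.
    val selected = df.drop("score")
    selected.show()

pandas offers the analogous DataFrame.drop(columns=[...]).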

Get Size of Tables on HDFS

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

The HDFS Way

You can use the command hdfs dfs -du /path/to/table or hdfs dfs -count -q -v -h /path/to/table to get the size of an HDFS path (and hence of a table stored on HDFS). However, this works only if the cluster uses HDFS and you can access it directly; if a Spark cluster exposes only JDBC/ODBC APIs, this method does not work. A programmatic alternative is sketched below.
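
If you can run code on the cluster (e.g., via spark-shell) but not the hdfs command, the Hadoop FileSystem API reports the same numbers. A sketch, assuming an active SparkSession named spark and a hypothetical path:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Reuse the Hadoop configuration of the active SparkSession.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val summary = fs.getContentSummary(new Path("/path/to/table"))
    println(s"raw size: ${summary.getLength} bytes, " +
      s"with replication: ${summary.getSpaceConsumed} bytes")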

Access Control in Spark SQL

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Grant Permission to Users

GRANT
    priv_type [, priv_type ] ...
    ON database_table_or_view_name
    TO principal_specification [, principal_specification] ...
    [WITH GRANT OPTION];

Examples:

GRANT SELECT ON table1 TO USER user1;
GRANT SELECT ON DATABASE db1 TO USER user1 …

Read/Write Files/Tables in Spark

References

DataFrameReader APIs

DataFrameWriter APIs

https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources

Comments

  1. It is suggested that you specify a schema when reading text files. If you do not, check the inferred column types, as schema inference can guess wrong (see the first sketch after this list).

  2. Do NOT read data from and write data to the same path in Spark! Due to Spark's lazy evaluation, the path will likely be cleared before the data is actually read, which throws IO exceptions. Worse, the data on HDFS is removed and is NOT recoverable (see the second sketch after this list).
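
A minimal sketch of point 1 in Scala, reading a CSV file with an explicit schema (assumes an active SparkSession named spark; the path and column names are made up for illustration):

    import org.apache.spark.sql.types.{StructType, StructField, LongType, StringType, DoubleType}

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("score", DoubleType, nullable = true)
    ))
    val df = spark.read
      .option("header", "true")
      .schema(schema) // skip inference and fail fast on type surprises
      .csv("/path/to/data.csv")
    df.printSchema()

And a sketch of one safer pattern for point 2 (the paths are hypothetical): write the result to a temporary location first, and replace the original only after the write has succeeded.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.functions.col

    val inputPath = "/data/my_table"   // hypothetical
    val tmpPath = "/data/my_table_tmp" // hypothetical

    val transformed = spark.read.parquet(inputPath).filter(col("score") > 0)
    transformed.write.mode("overwrite").parquet(tmpPath)

    // Swap the paths only after the write above has finished successfully.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.delete(new Path(inputPath), true)
    fs.rename(new Path(tmpPath), new Path(inputPath))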