Ben Chuanlong Du's Blog

It is never too late to learn.

Spark Issue: Unable to Find Encoder Type

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Issue: "Unable to find encoder for type stored in a Dataset"

Solution …
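
The solution above is truncated here. In Scala, this error typically means that no implicit Encoder for the element type is in scope. A minimal sketch of one common fix, assuming Spark 2.x or later (the case class and object names are made up for illustration):

    import org.apache.spark.sql.SparkSession

    // Case classes used in Datasets should be defined at the top level,
    // not inside the method that creates the Dataset; otherwise Spark
    // cannot derive an Encoder for them.
    case class Person(name: String, age: Int)

    object EncoderExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("encoder-example").getOrCreate()
        // Bring implicit Encoders for common types (case classes, tuples,
        // primitives) into scope. Omitting this import is the usual cause
        // of "Unable to find encoder for type stored in a Dataset".
        import spark.implicits._

        val ds = Seq(Person("Ben", 30), Person("Lian", 25)).toDS()
        ds.show()
        spark.stop()
      }
    }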

Select All Columns Except a Few from a Table

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Comments

There is no direct way to select all columns except a few from a table using SQL. However, this is easy to do with DataFrame APIs (pandas, Spark/PySpark, etc.).
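
For example, a minimal sketch with the Spark DataFrame API in Scala (the DataFrame and column names are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("drop-example").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a", 0.5), (2, "b", 0.7)).toDF("id", "name", "score")
    // Keep every column except the ones passed to drop.
    val selected = df.drop("score")
    selected.show()

pandas offers the analogous DataFrame.drop(columns=[...]).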

Get Size of Tables on HDFS

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

The HDFS Way

You can use the command hdfs dfs -du /path/to/table or hdfs dfs -count -q -v -h /path/to/table to get the size of an HDFS path (and hence of a table stored on HDFS). However, this works only if the cluster uses HDFS and you can access it directly; if a Spark cluster exposes only JDBC/ODBC APIs, this method does not work. A programmatic alternative is sketched below.
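
If you can run code on the cluster (e.g., via spark-shell) but not the hdfs command, the Hadoop FileSystem API reports the same numbers. A sketch, assuming an active SparkSession named spark and a hypothetical path:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Reuse the Hadoop configuration of the active SparkSession.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val summary = fs.getContentSummary(new Path("/path/to/table"))
    println(s"raw size: ${summary.getLength} bytes, " +
      s"with replication: ${summary.getSpaceConsumed} bytes")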

Access Control in Spark SQL

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Grant Permission to Users

GRANT
    priv_type [, priv_type ] ...
    ON database_table_or_view_name
    TO principal_specification [, principal_specification] ...
    [WITH GRANT OPTION];

Examples:

GRANT SELECT ON table1 TO USER user1;
GRANT SELECT ON DATABASE db1 TO USER user1 …

Read/Write Files/Tables in Spark

References

DataFrameReader APIs

DataFrameWriter APIs

https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources

Comments

  1. It is suggested that you specify a schema when reading text files. If you do not, check the inferred column types, as schema inference can guess wrong (see the first sketch after this list).

  2. Do NOT read data from and write data to the same path in Spark! Due to Spark's lazy evaluation, the path will likely be cleared before the data is actually read, which throws IO exceptions. Worse, the data on HDFS is removed and is NOT recoverable (see the second sketch after this list).
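
A minimal sketch of point 1 in Scala, reading a CSV file with an explicit schema (assumes an active SparkSession named spark; the path and column names are made up for illustration):

    import org.apache.spark.sql.types.{StructType, StructField, LongType, StringType, DoubleType}

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("score", DoubleType, nullable = true)
    ))
    val df = spark.read
      .option("header", "true")
      .schema(schema) // skip inference and fail fast on type surprises
      .csv("/path/to/data.csv")
    df.printSchema()

And a sketch of one safer pattern for point 2 (the paths are hypothetical): write the result to a temporary location first, and replace the original only after the write has succeeded.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.functions.col

    val inputPath = "/data/my_table"   // hypothetical
    val tmpPath = "/data/my_table_tmp" // hypothetical

    val transformed = spark.read.parquet(inputPath).filter(col("score") > 0)
    transformed.write.mode("overwrite").parquet(tmpPath)

    // Swap the paths only after the write above has finished successfully.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.delete(new Path(inputPath), true)
    fs.rename(new Path(tmpPath), new Path(inputPath))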