Ben Chuanlong Du's Blog

It is never too late to learn.

Spark Issue: ArrowTypeError: Expected a Type but Got a Different Type

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Symptom

pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64

Possible Causes

The pandas_udf decorator specifies a return type of string, but the corresponding pandas UDF returns a different …
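
A minimal sketch of how this mismatch typically arises, assuming Spark 3.x with pyarrow installed; the column and function names below are made up for illustration.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)  # column "id" of type bigint

@pandas_udf("string")            # declared return type: string
def doubled(s: pd.Series) -> pd.Series:
    return s * 2                 # actual dtype: int64 -> ArrowTypeError

# df.select(doubled("id")).show()  # raises pyarrow.lib.ArrowTypeError

@pandas_udf("string")
def doubled_ok(s: pd.Series) -> pd.Series:
    return (s * 2).astype(str)   # cast so the dtype matches the declaration

df.select(doubled_ok("id")).show()
```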

Spark Issue: RuntimeError: Arrow Legacy IPC Format Is Not Supported

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Symptoms

RuntimeError: Arrow legacy IPC format is not supported in PySpark, please unset ARROW_PRE_0_15_IPC_FORMAT

Possible Causes

You are using PySpark 3.0+ with one (or both) of the following options:

--conf …
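
A hedged sketch of the cleanup, assuming the flag was set as the old pyarrow >= 0.15 workaround for Spark 2.x; on PySpark 3.0+ it must be removed everywhere, both in the driver environment and in any spark-submit confs.

```python
import os

# Unset the legacy-IPC flag in the driver environment if it is present.
# On PySpark 3.0+ this Spark 2.x-era workaround triggers the RuntimeError.
os.environ.pop("ARROW_PRE_0_15_IPC_FORMAT", None)

# Also make sure it is not forwarded to executors via spark-submit, e.g.:
#   --conf spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1
#   --conf spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT=1
```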

Spark Issue: AnalysisException: Found Duplicated Columns

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Symptoms

pyspark.sql.utils.AnalysisException: Found duplicate column(s) when inserting into ...

Possible Causes

As the error message says, the query whose result you are inserting produces duplicate column names.

Possible Solutions

Fix …
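
A toy reproduction under assumed table and column names: a join keeps both sides' id columns, so writing the result out fails; joining on the column name is one common fix.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, "a")], ["id", "x"])
right = spark.createDataFrame([(1, "b")], ["id", "y"])

dup = left.join(right, left["id"] == right["id"])  # result has two "id" columns
# dup.write.mode("overwrite").saveAsTable("some_table")  # AnalysisException

fixed = left.join(right, "id")  # a single "id" column survives the join
fixed.show()
```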

Spark Issue: GetQuotaUsage

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Symptom I

py4j.protocol.Py4JJavaError: An error occurred while calling o156.getQuotaUsage.

Symptom II

org.apache.hadoop.ipc.RemoteException(java.io.IOException): The quota system is disabled in Router.

Possible Causes …
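
A hedged sketch of where such a call typically originates, assuming code that reaches the Hadoop FileSystem API through py4j; the path is hypothetical, and whether getQuotaUsage succeeds depends on the cluster (behind an HDFS Router with the quota system disabled it raises the RemoteException above).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark._jvm

# Hypothetical HDFS path; adjust for your cluster.
path = jvm.org.apache.hadoop.fs.Path("/user/some_user")
fs = path.getFileSystem(spark._jsc.hadoopConfiguration())

# On plain HDFS this returns a QuotaUsage object; behind an HDFS Router
# with the quota system disabled it raises the RemoteException shown above.
quota = fs.getQuotaUsage(path)
print(quota.getQuota(), quota.getSpaceQuota())
```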

Spark Issue: Pure Python Code Errors

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

This post collects some typical pure Python errors in PySpark applications.

Symptom 1

object has no attribute

Solution 1

Fix the attribute name.
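
A toy illustration with a made-up UDF: the misspelled attribute only blows up when the task actually runs on the executors.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("spark",)], ["word"])

broken = udf(lambda w: w.upperr())  # typo: 'str' object has no attribute 'upperr'
# df.select(broken("word")).show()  # AttributeError raised inside the task

fixed = udf(lambda w: w.upper())    # corrected attribute name
df.select(fixed("word")).show()
```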

Symptom 2

No such file or directory

Solution …
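
One hedged fix sketch, assuming the error comes from a task opening a file that exists on the driver but not on the executors; lookup.txt is a hypothetical file shipped to every node with SparkContext.addFile.

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sc.addFile("lookup.txt")  # hypothetical local file, shipped to executors

def read_on_executor(_):
    # SparkFiles.get resolves the file on whichever node runs the task.
    with open(SparkFiles.get("lookup.txt")) as fin:
        return fin.readline().strip()

print(sc.parallelize([0], 1).map(read_on_executor).collect())
```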

Configure Log4J for Spark

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Show Error Messages Only

When you run Spark or PySpark in a Jupyter or JupyterLab notebook, it is recommended that you show ERROR messages only; otherwise, excessive logging can pollute your notebook. You can set Spark's log level to ERROR using the following line of code.
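
```python
# Assuming `spark` is your existing SparkSession:
spark.sparkContext.setLogLevel("ERROR")
```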