Symptom
Block rdd_123_456 could not be removed as it was not found on disk or in memory.
Cause
- The execution plan of a DataFrame is too complicated.
- Not enough memory to persist DataFrames (even if you used the default persist option `StorageLevel.MEMORY_AND_DISK`).
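A quick way to confirm the second cause is to check which storage level is actually in effect. Below is a minimal PySpark sketch (the app name and the `spark.range` DataFrame are placeholders) that persists a DataFrame with the default option and prints its storage level:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-level-check").getOrCreate()

# A throwaway DataFrame, just for illustration.
df = spark.range(1000000)

# persist() with no argument uses StorageLevel.MEMORY_AND_DISK, so blocks
# that do not fit in memory spill to local disk and may later be dropped.
df.persist()
print(df.storageLevel)

spark.stop()
```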
Possible Solutions
- Try triggering an eager DataFrame persist (by calling `DataFrame.count` after `DataFrame.persist`). If you can stand some loss of performance, try using `DataFrame.checkpoint` instead of `DataFrame.persist` (see the persist/checkpoint sketch after this list). For more discussion on DataFrame persist vs checkpoint, please refer to Persist and Checkpoint DataFrames in Spark.
- Increase executor memory (`--executor-memory`). If you persist DataFrames using the option `OFF_HEAP`, increase memory overhead as well (see the executor configuration sketch after this list).
- Use a storage level which consumes less memory. For example, if you have been using the default storage level `StorageLevel.MEMORY_AND_DISK` (in PySpark 2), you can try `StorageLevel.MEMORY_AND_DISK_SER` or `StorageLevel.DISK_ONLY` (a disk-only variant appears in the caching sketch after this list).
- Do not persist DataFrames (at the cost of lower performance). Notice that even if you persist DataFrames to disk only, you might still encounter this issue due to lack of disk space for caching.
- Increase the number of partitions, which makes each partition smaller (also shown in the caching sketch after this list).
- Increase the number of executors (`--num-executors`), which increases the total disk space available for caching.
- Reduce the number of cores per executor (`--executor-cores`).
- Ask your Hadoop/Spark admin to increase the local disk space available for caching.
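Persist/checkpoint sketch. A minimal PySpark illustration of the eager persist and checkpoint ideas above; the DataFrame, app name, and checkpoint directory are placeholders, and `DataFrame.checkpoint` requires Spark 2.1+:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("persist-vs-checkpoint").getOrCreate()

df = spark.range(1000000).selectExpr("id", "id % 7 AS bucket")

# persist() is lazy; an action such as count() right after it forces the
# cached blocks to be materialized eagerly.
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()

# checkpoint() writes the data out and truncates the execution plan, which
# helps when the plan itself is too complicated. A checkpoint directory
# (placeholder path below) must be set first.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
df_checkpointed = df.checkpoint(eager=True)

spark.stop()
```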
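Caching sketch. A sketch of the lighter-caching suggestions (a disk-only storage level plus more, smaller partitions); the DataFrame and the partition count are arbitrary placeholders to tune for your data:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("lighter-caching").getOrCreate()

df = spark.range(1000000).selectExpr("id", "id % 7 AS bucket")

# More (and therefore smaller) partitions reduce the size of each cached
# block; 800 is an arbitrary illustrative number.
df = df.repartition(800)

# Cache on disk only instead of the default MEMORY_AND_DISK level.
df = df.persist(StorageLevel.DISK_ONLY)
df.count()

spark.stop()
```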
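Executor configuration sketch. The resource-related options can be passed on the spark-submit command line as mentioned above, or set when building the session as sketched below. Every value here is a placeholder to tune for your cluster, and on older YARN deployments the overhead key is `spark.yarn.executor.memoryOverhead` instead of `spark.executor.memoryOverhead`:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("resized-executors")
    # Equivalent to --executor-memory on spark-submit.
    .config("spark.executor.memory", "8g")
    # Extra off-heap headroom; relevant when persisting with OFF_HEAP.
    .config("spark.executor.memoryOverhead", "2g")
    # Equivalent to --num-executors: more executors, more total disk for caching.
    .config("spark.executor.instances", "20")
    # Equivalent to --executor-cores: fewer concurrent tasks per executor.
    .config("spark.executor.cores", "2")
    .getOrCreate()
)
```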