Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Symptoms
Symptom 1
Container killed by YARN for exceeding memory limits. 22.0 GB of 19 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Symptom 2
Job aborted due to stage failure: Task 110 in stage 68.0 failed 1 times, most recent failure: Lost task 110.0 in stage 68.0: ExecutorLostFailure (executor 35 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 40.6 GB of 40 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Symptom 3
16/04/22 04:27:18 WARN yarn.YarnAllocator: Container marked as failed: container_1459803563374_223497_02_000067 on host. Exit status: 143. Diagnostics: Container [pid=30502,containerID=container_1459803563374_223497_02_000067] is running beyond physical memory limits. Current usage: 13.8 GB of 13.8 GB physical memory used; 14.6 GB of 28.9 GB virtual memory used. Killing container. Dump of the process-tree for container_1459803563374_223497_02_000067 : - PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE - 30502 18022 30502 30502 (bash) 0 0 22773760 347 /bin/bash -c LD_LIBRARY_PATH=/apache/hadoop/lib/native:/apache/hadoop/lib/native/Linux-amd64-64: /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' ... container_1459803563374_223497_02_000067/stderr ... Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143
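For context, the container limit quoted in these messages is roughly the executor heap plus spark.yarn.executor.memoryOverhead, which in Spark 2 defaults to max(384 MB, 10% of the executor memory). A minimal sketch of that arithmetic (the 17g executor size is a hypothetical example, not taken from the logs above):

```shell
# Sketch of how the YARN container limit is derived (Spark 2.x defaults).
# spark.yarn.executor.memoryOverhead defaults to max(384 MB, 10% of executor memory).
executor_memory_mb=$((17 * 1024))          # hypothetical --executor-memory 17g
overhead_mb=$((executor_memory_mb / 10))   # 10% of executor memory
[ "$overhead_mb" -lt 384 ] && overhead_mb=384
container_mb=$((executor_memory_mb + overhead_mb))
echo "YARN container limit: ${container_mb} MB"
```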
Possible Causes
The posts "Spark on Yarn and virtual memory error" and "Container killed by YARN for exceeding memory limits" have good discussions of solutions to this issue, including some low-level explanation of it.
- A bug (YARN-4714) in YARN.
- Too much usage of off-heap memory. Spark Tungsten leverages off-heap memory heavily to boost performance, and some Java operations (especially I/O-related ones) also leverage off-heap memory. In Spark 2, the off-heap allowance is controlled by spark.yarn.executor.memoryOverhead (this has been changed in Spark 3). Generally speaking, it is hard to control the usage of off-heap memory unless the corresponding Java operations provide such options. The JVM option MaxDirectMemorySize specifies the maximum total size of java.nio (New I/O package) direct buffer allocations (off-heap memory), which are used in network data transfer and serialization activity.
- Data skew (e.g., a big table that is not partitioned).
- Some tables in joins are too large.
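Before tuning memory, it can help to confirm whether data skew is the real culprit: if a few join-key values hold most of the rows, the tasks handling those keys will blow past the container limit. A hypothetical check with the spark-sql CLI (my_table and join_key are placeholders, not names from this page):

```shell
# Hypothetical skew check: my_table and join_key are placeholders.
# A top key holding a large fraction of all rows indicates skew.
spark-sql -e "
  SELECT join_key, COUNT(*) AS cnt
  FROM my_table
  GROUP BY join_key
  ORDER BY cnt DESC
  LIMIT 20"
```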
Possible Solutions
- Increase memory overhead. For example, the configuration below sets memory overhead to 8 GB: --conf spark.yarn.executor.memoryOverhead=8G
- Reduce the number of executor cores (which helps reduce memory consumption). For example, change --executor-cores=4 to --executor-cores=2.
- Increase the number of partitions (which makes each task smaller and helps reduce memory consumption): --conf spark.sql.shuffle.partitions=2000
- Configure the JVM option MaxDirectMemorySize if your Spark application involves reading Parquet files and/or encoding/decoding BASE64 strings, etc. By default, MaxDirectMemorySize is close to the heap size, so if MaxDirectMemorySize is not set, Spark containers might use too much off-heap memory: --conf spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=8G
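Putting these knobs together, a spark-submit invocation might look like the following sketch (the application name and all sizes are illustrative, not recommendations; tune them to your workload):

```shell
# Illustrative only: sizes and app name are hypothetical.
spark-submit \
  --master yarn \
  --executor-memory 16G \
  --executor-cores 2 \
  --conf spark.yarn.executor.memoryOverhead=8G \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=8G \
  my_app.py
```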
References
Solving “Container Killed by Yarn For Exceeding Memory Limits” Exception in Apache Spark
https://www.youtube.com/watch?v=t97VJtPAL2s
Turn the virtual memory check to be off by default
[Java 8] Over usage of virtual memory
https://stackoverflow.com/questions/37505638/understanding-spark-physical-plan
https://community.hortonworks.com/questions/36266/spark-physical-plan-doubts-tungstenaggregate-tungs.html
Apache Spark and off-heap memory
Decoding Memory in Spark — Parameters that are often confused
http://stackoverflow.com/questions/29850784/what-are-the-likely-causes-of-org-apache-spark-shuffle-metadatafetchfailedexcept
http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
https://issues.apache.org/jira/browse/SPARK-4516?focusedCommentId=14220157&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14220157