Spark application exits with status 143

I have a large Spark application that keeps retrying, and the only useful log I could find through the UI is this from stdout:

2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=6321165190495215882, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.56; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=7370805066606093965, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.56; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=8523609779541081889, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.56; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=8861954111730219182, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.56; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=5535068542584258152, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.562; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:35:34 ERROR YarnClusterScheduler:70 - Lost executor 205 on compute006: Container container_e434_1628615141783_154721_01_000245 on host: compute006 was preempted.
2021-08-16 17:35:59 ERROR YarnClusterScheduler:70 - Lost executor 203 on compute007: Container container_e434_1628615141783_154721_01_000242 on host: compute007 was preempted.
2021-08-16 17:38:50 ERROR YarnClusterScheduler:70 - Lost executor 209 on data267: Container container_e434_1628615141783_154721_01_000241 on host: data267 was preempted.
2021-08-16 17:40:56 ERROR YarnClusterScheduler:70 - Lost executor 211 on data133: Container container_e434_1628615141783_154721_01_000248 on host: data133 was preempted.
2021-08-16 17:44:01 ERROR YarnClusterScheduler:70 - Lost executor 157 on data034: Container container_e434_1628615141783_154721_01_000185 on host: data034 was preempted.
2021-08-16 17:44:26 ERROR YarnClusterScheduler:70 - Lost executor 202 on data234: Container container_e434_1628615141783_154721_01_000244 on host: data234 was preempted.
2021-08-16 18:05:34 ERROR YarnClusterScheduler:70 - Lost executor 225 on data001: Container container_e434_1628615141783_154721_01_000262 on host: data001 was preempted.
2021-08-16 18:05:49 ERROR YarnClusterScheduler:70 - Lost executor 227 on data244: Container container_e434_1628615141783_154721_01_000264 on host: data244 was preempted.
2021-08-16 18:06:16 ERROR YarnClusterScheduler:70 - Lost executor 214 on data027: Container container_e434_1628615141783_154721_01_000251 on host: data027 was preempted.
2021-08-16 18:06:23 ERROR ApplicationMaster:43 - RECEIVED SIGNAL TERM
2021-08-16 18:06:23 ERROR ApplicationMaster:70 - User application exited with status 143
2021-08-16 18:06:23 ERROR FileFormatWriter:91 - Aborting job ea540d12-ad13-4e88-95fb-d8ac7f250503.
org.apache.spark.SparkException: Job 127 cancelled because SparkContext was shut down

Spark应用成功运行的概率大于没有错误的概率。听起来143是典型的OOM错误,但我的内存配置相当高:

'executor_memory': '10G',
'driver_memory': '12G',
'spark.executor.memoryOverhead': '5G',
'spark.driver.memoryOverhead': '4G',
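
For reference, these settings are applied at session creation roughly like this (a minimal sketch; the builder-style submission and the app name are assumptions, not my actual submit script). Together they request 10G + 5G = 15G from YARN for each executor container and 12G + 4G = 16G for the driver container.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large-app")  # hypothetical app name
    .config("spark.executor.memory", "10g")
    .config("spark.driver.memory", "12g")
    .config("spark.executor.memoryOverhead", "5g")
    .config("spark.driver.memoryOverhead", "4g")
    .getOrCreate()
)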
What is the best way to get to the bottom of this?

You should use the information in the Spark UI to get a better picture of what is happening across the application. Pay attention to spill, shuffle read sizes, and skew across the shuffle read sizes. That should give you a good indication of what is going on and how to fix or tune the application appropriately; for example, you may need to increase spark.sql.shuffle.partitions, as in the sketch below.
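
If the UI shows large or skewed shuffle reads, one common first step is to raise spark.sql.shuffle.partitions and check how evenly the grouping or join keys are distributed. A minimal PySpark sketch, where the input path and the key column user_id are hypothetical stand-ins for your own job:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# More shuffle partitions means smaller partitions per task,
# which lowers per-task memory pressure and spill (default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "800")

df = spark.read.parquet("/path/to/input")  # hypothetical input path

# Rough skew check: if the top keys dwarf the rest, the shuffle
# read sizes in the UI will be skewed the same way.
(df.groupBy("user_id")  # hypothetical key column
   .count()
   .orderBy(F.desc("count"))
   .show(20))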
