Spark / PySpark - Cannot get return value from Spark cluster on Docker



I have two Docker containers running:

172.22.0.3 spark-worker-1
172.22.0.2 spark-master

Inside the master container, I run a simple Python script to test functionality. In the Spark Master UI I can see the worker connected and ALIVE, and I can see the application appear under "Running Applications" and then move to "Completed Applications" with state "FINISHED".

Problem: I never receive the value from df.show(); instead I get the logs below. The code runs fine when I do not connect to the master in the SparkSession. The containers have plenty of CPU/memory (and the job is tiny). I have run out of ideas; does anyone have a suggestion?

Python code:

import pyspark
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dev_sparkTest")
    .master("spark://172.22.0.2:7077")
    .getOrCreate()
)

pdf = pd.DataFrame(
    {
        "col1": [np.random.randint(10) for x in range(10)],
        "col2": [np.random.randint(100) for x in range(10)],
    }
)

df = spark.createDataFrame(pdf)
df.show()
Python script output:
/ # python3 sparkTest.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/30 13:08:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/05/30 13:08:14 ERROR TaskSchedulerImpl: Lost executor 0 on 172.22.0.3: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/30 13:08:14 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (172.22.0.3 executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/30 13:08:16 ERROR TaskSchedulerImpl: Lost executor 1 on 172.22.0.3: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/30 13:08:16 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (172.22.0.3 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/30 13:08:19 ERROR TaskSchedulerImpl: Lost executor 2 on 172.22.0.3: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/30 13:08:19 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) (172.22.0.3 executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/30 13:08:21 ERROR TaskSchedulerImpl: Lost executor 3 on 172.22.0.3: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/30 13:08:21 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) (172.22.0.3 executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/30 13:08:21 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
Traceback (most recent call last):
File "sparkTest.py", line 19, in <module>
df.show()
File "/usr/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 494, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/python3.7/site-packages/py4j/java_gateway.py", line 1322, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/python3.7/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/usr/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o46.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (172.22.0.3 executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:476)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:429)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)

spark-worker-1 container logs:

22/05/30 13:08:10 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=41589" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@0f53b6eaef4e:41589" "--executor-id" "0" "--hostname" "172.22.0.3" "--cores" "16" "--app-id" "app-20220530130810-0005" "--worker-url" "spark://Worker@172.22.0.3:36935"
22/05/30 13:08:14 INFO Worker: Executor app-20220530130810-0005/0 finished with state EXITED message Command exited with code 50 exitStatus 50
22/05/30 13:08:14 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 0
22/05/30 13:08:14 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220530130810-0005, execId=0)
22/05/30 13:08:14 INFO Worker: Asked to launch executor app-20220530130810-0005/1 for dev_sparkTest
22/05/30 13:08:14 INFO SecurityManager: Changing view acls to: root
22/05/30 13:08:14 INFO SecurityManager: Changing modify acls to: root
22/05/30 13:08:14 INFO SecurityManager: Changing view acls groups to: 
22/05/30 13:08:14 INFO SecurityManager: Changing modify acls groups to: 
22/05/30 13:08:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
22/05/30 13:08:14 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=41589" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@0f53b6eaef4e:41589" "--executor-id" "1" "--hostname" "172.22.0.3" "--cores" "16" "--app-id" "app-20220530130810-0005" "--worker-url" "spark://Worker@172.22.0.3:36935"
22/05/30 13:08:16 INFO Worker: Executor app-20220530130810-0005/1 finished with state EXITED message Command exited with code 50 exitStatus 50
22/05/30 13:08:16 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 1
22/05/30 13:08:16 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220530130810-0005, execId=1)
22/05/30 13:08:17 INFO Worker: Asked to launch executor app-20220530130810-0005/2 for dev_sparkTest
22/05/30 13:08:17 INFO SecurityManager: Changing view acls to: root
22/05/30 13:08:17 INFO SecurityManager: Changing modify acls to: root
22/05/30 13:08:17 INFO SecurityManager: Changing view acls groups to: 
22/05/30 13:08:17 INFO SecurityManager: Changing modify acls groups to: 
22/05/30 13:08:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
22/05/30 13:08:17 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=41589" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@0f53b6eaef4e:41589" "--executor-id" "2" "--hostname" "172.22.0.3" "--cores" "16" "--app-id" "app-20220530130810-0005" "--worker-url" "spark://Worker@172.22.0.3:36935"
22/05/30 13:08:19 INFO Worker: Executor app-20220530130810-0005/2 finished with state EXITED message Command exited with code 50 exitStatus 50
22/05/30 13:08:19 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 2
22/05/30 13:08:19 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220530130810-0005, execId=2)
22/05/30 13:08:19 INFO Worker: Asked to launch executor app-20220530130810-0005/3 for dev_sparkTest
22/05/30 13:08:19 INFO SecurityManager: Changing view acls to: root
22/05/30 13:08:19 INFO SecurityManager: Changing modify acls to: root
22/05/30 13:08:19 INFO SecurityManager: Changing view acls groups to: 
22/05/30 13:08:19 INFO SecurityManager: Changing modify acls groups to: 
22/05/30 13:08:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
22/05/30 13:08:19 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=41589" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@0f53b6eaef4e:41589" "--executor-id" "3" "--hostname" "172.22.0.3" "--cores" "16" "--app-id" "app-20220530130810-0005" "--worker-url" "spark://Worker@172.22.0.3:36935"
22/05/30 13:08:21 INFO Worker: Executor app-20220530130810-0005/3 finished with state EXITED message Command exited with code 50 exitStatus 50
22/05/30 13:08:21 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 3
22/05/30 13:08:21 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220530130810-0005, execId=3)
22/05/30 13:08:21 INFO Worker: Asked to launch executor app-20220530130810-0005/4 for dev_sparkTest
22/05/30 13:08:21 INFO SecurityManager: Changing view acls to: root
22/05/30 13:08:21 INFO SecurityManager: Changing modify acls to: root
22/05/30 13:08:21 INFO SecurityManager: Changing view acls groups to: 
22/05/30 13:08:21 INFO SecurityManager: Changing modify acls groups to: 
22/05/30 13:08:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
22/05/30 13:08:21 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=41589" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@0f53b6eaef4e:41589" "--executor-id" "4" "--hostname" "172.22.0.3" "--cores" "16" "--app-id" "app-20220530130810-0005" "--worker-url" "spark://Worker@172.22.0.3:36935"
22/05/30 13:08:21 INFO Worker: Asked to kill executor app-20220530130810-0005/4
22/05/30 13:08:21 INFO ExecutorRunner: Runner thread for executor app-20220530130810-0005/4 interrupted
22/05/30 13:08:21 INFO ExecutorRunner: Killing process!
22/05/30 13:08:21 INFO Worker: Executor app-20220530130810-0005/4 finished with state KILLED exitStatus 143
22/05/30 13:08:21 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 4
22/05/30 13:08:21 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220530130810-0005, execId=4)
22/05/30 13:08:21 INFO Worker: Cleaning up local directories for application app-20220530130810-0005
22/05/30 13:08:21 INFO ExternalShuffleBlockResolver: Application app-20220530130810-0005 removed, cleanupLocalDirs = true

spark-master container logs:

22/05/30 13:08:10 INFO Master: Registering app dev_sparkTest
22/05/30 13:08:10 INFO Master: Registered app dev_sparkTest with ID app-20220530130810-0005
22/05/30 13:08:10 INFO Master: Launching executor app-20220530130810-0005/0 on worker worker-20220530130245-172.22.0.3-36935
22/05/30 13:08:14 INFO Master: Removing executor app-20220530130810-0005/0 because it is EXITED
22/05/30 13:08:14 INFO Master: Launching executor app-20220530130810-0005/1 on worker worker-20220530130245-172.22.0.3-36935
22/05/30 13:08:16 INFO Master: Removing executor app-20220530130810-0005/1 because it is EXITED
22/05/30 13:08:16 INFO Master: Launching executor app-20220530130810-0005/2 on worker worker-20220530130245-172.22.0.3-36935
22/05/30 13:08:19 INFO Master: Removing executor app-20220530130810-0005/2 because it is EXITED
22/05/30 13:08:19 INFO Master: Launching executor app-20220530130810-0005/3 on worker worker-20220530130245-172.22.0.3-36935
22/05/30 13:08:21 INFO Master: Removing executor app-20220530130810-0005/3 because it is EXITED
22/05/30 13:08:21 INFO Master: Launching executor app-20220530130810-0005/4 on worker worker-20220530130245-172.22.0.3-36935
22/05/30 13:08:21 INFO Master: Received unregister request from application app-20220530130810-0005
22/05/30 13:08:21 INFO Master: Removing app app-20220530130810-0005
22/05/30 13:08:21 WARN Master: Got status update for unknown executor app-20220530130810-0005/4
22/05/30 13:08:21 INFO Master: 172.22.0.2:56308 got disassociated, removing it.
22/05/30 13:08:21 INFO Master: 0f53b6eaef4e:41589 got disassociated, removing it.

Have you checked whether the containers are on the same Docker network?

If you are running Spark via docker-compose, try using the container names instead of IP addresses.
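With docker-compose, services on the default network resolve each other by service name, so both the master URL and the driver become reachable by name. A minimal sketch, assuming a compose setup with the service names from the question (the bitnami/spark image and its SPARK_MODE/SPARK_MASTER_URL environment variables are one common choice, not taken from the question):

```yaml
version: "3"
services:
  spark-master:
    image: bitnami/spark:3.2.1   # example image; substitute your own
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # master RPC port
  spark-worker-1:
    image: bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
```

The script would then use .master("spark://spark-master:7077"). Note also that the worker log above launches executors with --driver-url spark://CoarseGrainedScheduler@0f53b6eaef4e:41589, i.e. the driver container's auto-generated hostname; if the worker cannot resolve that name, executors die exactly as shown (exit code 50, RPC disassociation), so setting spark.driver.host to a name the worker can resolve may also be needed.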