I am working on a cluster of 5 nodes, each with 7 cores and 25 GB of memory. My current job runs on only 1-2 GB of input data. Can anyone tell me why I am getting the error below? I am using PySpark DataFrames (Spark 1.6.2).
[Stage 9487:===================================================>(198 + 2) / 200]16/08/13 16:43:18 ERROR TaskSchedulerImpl: Lost executor 3 on server05: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:=================================================>(198 + -49) / 200]16/08/13 16:43:19 ERROR TaskSchedulerImpl: Lost executor 1 on server04: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:=========> (24 + 15) / 125]16/08/13 16:46:38 ERROR TaskSchedulerImpl: Lost executor 2 on server01: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:==========================================> (148 + 30) / 178]16/08/13 16:51:36 ERROR TaskSchedulerImpl: Lost executor 0 on server03: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:=============================> (50 + 12) / 91]16/08/13 16:55:32 ERROR TaskSchedulerImpl: Lost executor 4 on server02: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:============================> (50 + -39) / 91]Traceback (most recent call last):
File "/home/example.py", line 397, in <module>
File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 269, in count
File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9162.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 9487 (count at null:-1) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 577
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutpu
How do I change the groupByKey below to reduceByKey? Python 3.4+, Spark 1.6.2.
def testFunction(key, values):
    # some statistical process applied to each group;
    # each group has n (300K to 1M) rows
    ...
resRDD = (df.select(["key1", "key2", "key3", "val1", "val2"])
            .map(lambda r: (Row(key1=r.key1, key2=r.key2, key3=r.key3), r))
            .groupByKey()
            .flatMap(lambda KeyValue: testFunction(KeyValue[0], list(KeyValue[1]))))
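For reference, a minimal sketch of what a reduceByKey version could look like, assuming the per-group statistic can be rebuilt from associative partial aggregates (here count, sum, and sum of squares of val1; the helpers to_partial, merge_partials, and finalize are hypothetical names, not the original testFunction):

from pyspark.sql import Row

def to_partial(r):
    # one small partial aggregate per row: (count, sum, sum of squares) of val1
    return (1, r.val1, r.val1 * r.val1)

def merge_partials(a, b):
    # associative and commutative merge, so it is safe to use with reduceByKey
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(key, agg):
    # turn the merged aggregate into the final per-group statistics
    n, s, sq = agg
    mean = s / n
    return [Row(key1=key.key1, key2=key.key2, key3=key.key3,
                mean=mean, variance=sq / n - mean * mean)]

resRDD = (df.select(["key1", "key2", "key3", "val1", "val2"])
            .map(lambda r: (Row(key1=r.key1, key2=r.key2, key3=r.key3), to_partial(r)))
            .reduceByKey(merge_partials)
            .flatMap(lambda kv: finalize(kv[0], kv[1])))

This only shuffles small tuples per key instead of the full 300K-1M rows of each group. If the statistic genuinely needs every row of a group at once (for example a median or a model fit), reduceByKey does not apply directly; the usual alternatives are aggregateByKey with a mergeable summary, or keeping groupByKey and raising the number of shuffle partitions.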
I solved this by reducing spark.executor.memory. There were probably other applications running on my cluster besides Spark, and eating up all of that memory slowed my workers down, which led to the communication failures.
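For illustration, a minimal sketch of where that setting could be lowered when building the context; the 16g figure and the app name are placeholders, not the exact values used, and the memoryOverhead line assumes a YARN deployment:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("example")                               # placeholder app name
        .set("spark.executor.memory", "16g")                 # leave headroom below the 25 GB per node
        .set("spark.yarn.executor.memoryOverhead", "2048"))  # extra off-heap headroom in MB (YARN only)

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

Leaving a few GB per node unclaimed by the executors gives the OS and any co-located applications room to run, which helps avoid executors becoming unresponsive and being dropped as in the log above.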