I am working on a cluster of 5 nodes, each with 7 cores and 25 GB of memory. My current job runs on only 1-2 GB of input data. Can anyone tell me why I am getting the error below? I am using PySpark DataFrames (Spark 1.6.2).
[Stage 9487:===================================================>(198 + 2) / 200]16/08/13 16:43:18 ERROR TaskSchedulerImpl: Lost executor 3 on server05: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:=================================================>(198 + -49) / 200]16/08/13 16:43:19 ERROR TaskSchedulerImpl: Lost executor 1 on server04: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:=========> (24 + 15) / 125]16/08/13 16:46:38 ERROR TaskSchedulerImpl: Lost executor 2 on server01: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:==========================================> (148 + 30) / 178]16/08/13 16:51:36 ERROR TaskSchedulerImpl: Lost executor 0 on server03: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:=============================> (50 + 12) / 91]16/08/13 16:55:32 ERROR TaskSchedulerImpl: Lost executor 4 on server02: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:============================> (50 + -39) / 91]Traceback (most recent call last):
File "/home/example.py", line 397, in <module>
File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 269, in count
File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9162.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 9487 (count at null:-1) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 577
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutpu
How do I change the groupByKey below to reduceByKey? Python 3.4+, Spark 1.6.2.
def testFunction(key, values):
    # some statistical process applied to each group;
    # each group has n (300K to 1M) rows
    ...
resRDD = (df.select(["key1", "key2", "key3", "val1", "val2"])
            .map(lambda r: (Row(key1=r.key1, key2=r.key2, key3=r.key3), r))
            .groupByKey()
            .flatMap(lambda KeyValue: testFunction(KeyValue[0], list(KeyValue[1]))))
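For reference, a minimal sketch of what a reduceByKey version could look like, assuming the per-group statistic can be rebuilt from associative partial aggregates (here count, sum, and sum of squares of val1; the helpers to_partial, merge_partials, and finalize are hypothetical names, not the original testFunction):

from pyspark.sql import Row

def to_partial(r):
    # one small partial aggregate per row: (count, sum, sum of squares) of val1
    return (1, r.val1, r.val1 * r.val1)

def merge_partials(a, b):
    # associative and commutative merge, so it is safe to use with reduceByKey
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(key, agg):
    # turn the merged aggregate into the final per-group statistics
    n, s, sq = agg
    mean = s / n
    return [Row(key1=key.key1, key2=key.key2, key3=key.key3,
                mean=mean, variance=sq / n - mean * mean)]

resRDD = (df.select(["key1", "key2", "key3", "val1", "val2"])
            .map(lambda r: (Row(key1=r.key1, key2=r.key2, key3=r.key3), to_partial(r)))
            .reduceByKey(merge_partials)
            .flatMap(lambda kv: finalize(kv[0], kv[1])))

This only shuffles small tuples per key instead of the full 300K-1M rows of each group. If the statistic genuinely needs every row of a group at once (for example a median or a model fit), reduceByKey does not apply directly; the usual alternatives are aggregateByKey with a mergeable summary, or keeping groupByKey and raising the number of shuffle partitions.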
I solved this by reducing spark.executor.memory. There were probably other applications running on my cluster besides Spark, and eating up all of that memory slowed my workers down, which led to the communication failures.
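For illustration, a minimal sketch of where that setting could be lowered when building the context; the 16g figure and the app name are placeholders, not the exact values used, and the memoryOverhead line assumes a YARN deployment:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("example")                               # placeholder app name
        .set("spark.executor.memory", "16g")                 # leave headroom below the 25 GB per node
        .set("spark.yarn.executor.memoryOverhead", "2048"))  # extra off-heap headroom in MB (YARN only)

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

Leaving a few GB per node unclaimed by the executors gives the OS and any co-located applications room to run, which helps avoid executors becoming unresponsive and being dropped as in the log above.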