I'm trying to submit PySpark code that uses a pandas UDF (with fbprophet…). It runs fine when submitted locally, but fails on a yarn cluster submit:
Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 41, ip-172-31-11-94.ap-northeast-2.compute.internal, executor 2): java.io.IOException: Cannot run program
"/mnt/yarn/usercache/hadoop/appcache/application_1620263926111_0229/container_1620263926111_0229_01_000001/environment/bin/python": error=2, No such file or directory
My spark-submit command:
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip \
--archives ./environment.tar.gz#environment \
--files config.ini \
$1
I built environment.tar.gz with conda-pack; dependencies.zip holds my local packages, and config.ini is loaded for settings.
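For reference, the packing step looked roughly like this (the environment name prophet_env and the exact package list are illustrative):

# create and populate the conda environment (names/packages illustrative)
conda create -y -n prophet_env python=3.7
conda activate prophet_env
pip install fbprophet pandas pyarrow
# bundle the whole environment, including bin/python, into a relocatable tarball
conda pack -n prophet_env -o environment.tar.gz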
Is there a way to fix this?
You can't use a local path here:
--archives ./environment.tar.gz#environment
Publish environment.tar.gz to HDFS:
venv-pack -o environment.tar.gz
# or: conda pack -o environment.tar.gz
hdfs dfs -put -f environment.tar.gz /spark/app_name/
hdfs dfs -chmod 0664 /spark/app_name/environment.tar.gz
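Before submitting, it can also help to confirm the tarball actually contains bin/python (the exact file the executor failed to find) and that the upload landed; a quick sanity check, assuming the paths above:

# the executor error was about environment/bin/python, so make sure it's in the archive
tar -tzf environment.tar.gz | grep 'bin/python'
# confirm the archive is on HDFS where --archives will point
hdfs dfs -ls /spark/app_name/environment.tar.gz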
and change the spark-submit argument:
--archives hdfs:///spark/app_name/environment.tar.gz#environment
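Put together, the full command would look roughly like this (same flags as in the question, with only the archive path changed; an untested sketch):

PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip \
--archives hdfs:///spark/app_name/environment.tar.gz#environment \
--files config.ini \
$1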
More info: PySpark on YARN in self-contained environments