I'm trying to submit PySpark code that uses a pandas UDF (with fbprophet…). It runs fine when submitted locally, but fails on a yarn cluster submit:
Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 41, ip-172-31-11-94.ap-northeast-2.compute.internal, executor 2): java.io.IOException: Cannot run program
"/mnt/yarn/usercache/hadoop/appcache/application_1620263926111_0229/container_1620263926111_0229_01_000001/environment/bin/python": error=2, No such file or directory
My spark-submit command:
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip \
--archives ./environment.tar.gz#environment \
--files config.ini \
$1
I built environment.tar.gz with conda-pack; dependencies.zip holds my local packages, and config.ini is loaded for settings.
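For reference, the packing step looked roughly like this (the environment name prophet_env and the exact package list are illustrative):

# create and populate the conda environment (names/packages illustrative)
conda create -y -n prophet_env python=3.7
conda activate prophet_env
pip install fbprophet pandas pyarrow
# bundle the whole environment, including bin/python, into a relocatable tarball
conda pack -n prophet_env -o environment.tar.gz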
Is there a way to fix this?
You can't use a local path here:
--archives ./environment.tar.gz#environment
Publish environment.tar.gz to HDFS:
venv-pack -o environment.tar.gz
# or: conda pack -o environment.tar.gz
hdfs dfs -put -f environment.tar.gz /spark/app_name/
hdfs dfs -chmod 0664 /spark/app_name/environment.tar.gz
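Before submitting, it can also help to confirm the tarball actually contains bin/python (the exact file the executor failed to find) and that the upload landed; a quick sanity check, assuming the paths above:

# the executor error was about environment/bin/python, so make sure it's in the archive
tar -tzf environment.tar.gz | grep 'bin/python'
# confirm the archive is on HDFS where --archives will point
hdfs dfs -ls /spark/app_name/environment.tar.gz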
and change the spark-submit argument:
--archives hdfs:///spark/app_name/environment.tar.gz#environment
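Put together, the full command would look roughly like this (same flags as in the question, with only the archive path changed; an untested sketch):

PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip \
--archives hdfs:///spark/app_name/environment.tar.gz#environment \
--files config.ini \
$1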
More info: PySpark on YARN in self-contained environments