PySpark with spark-submit on Kubernetes: Ivy cache file not found error when installing packages



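I have been fighting this all day. I can install and use a package (graphframes) with spark-shell or a connected Jupyter notebook, but I want to move it to a Kubernetes-based Spark environment with spark-submit. My Spark version: 3.0.1. I downloaded the latest available .jar file (graphframes-0.8.1-spark3.0-s_2.12.jar) from spark-packages and put it in the jars folder. I build my image with a variation of the standard Spark Dockerfile. My spark-submit command looks like this: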

$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=$2 \
--conf spark.kubernetes.container.image=myimage.io/repositorypath \
--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
--jars "local:///opt/spark/jars/graphframes-0.8.1-spark3.0-s_2.12.jar" \
path/to/my/script/script.py

But it ends with an error:

Ivy Default Cache set to: /opt/spark/.ivy2/cache
The jars for the packages stored in: /opt/spark/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5;1.0
confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5-1.0.xml (No such file or directory)

Here are my full logs, for reference:

(base) konstantinigin@Konstantins-MBP spark-3.0.1-bin-hadoop3.2 % kubectl logs scalableapp-py-7669dd784bd59f67-driver
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' 3 == 2 ']'
+ '[' 3 == 3 ']'
++ python3 -V
+ pyv3='Python 3.7.3'
+ export PYTHON_VERSION=3.7.3
+ PYTHON_VERSION=3.7.3
+ export PYSPARK_PYTHON=python3
+ PYSPARK_PYTHON=python3
+ export PYSPARK_DRIVER_PYTHON=python3
+ PYSPARK_DRIVER_PYTHON=python3
+ '[' -n '' ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.1.2.145 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///opt/spark/data/ScalableApp.py --number_of_executors 2 --dataset USAir --links 100
Ivy Default Cache set to: /opt/spark/.ivy2/cache
The jars for the packages stored in: /opt/spark/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5;1.0
confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5-1.0.xml (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62)
at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563)
at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176)
at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245)
at org.apache.ivy.Ivy.resolve(Ivy.java:523)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1387)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Has anyone run into something similar? Maybe you know what I am doing wrong here?

Adding this configuration to spark-submit worked for me:

spark-submit \
--conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" 

This appears to be a known Spark issue that is being addressed:

https://github.com/apache/spark/pull/32397
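
For anyone who drives spark-submit from a script, here is a minimal sketch of how that option could be assembled in Python. The helper names `ivy_java_options` and `build_submit_args` and their parameters are hypothetical and only for illustration; the `spark.driver.extraJavaOptions` setting and the two `-Divy.*` properties are the ones from the answer above, and `/tmp` is assumed to be writable inside the driver container.

```python
import shlex


def ivy_java_options(ivy_dir="/tmp"):
    """Return the JVM options that point Ivy's home and cache at ivy_dir."""
    return f"-Divy.cache.dir={ivy_dir} -Divy.home={ivy_dir}"


def build_submit_args(script, master, packages, ivy_dir="/tmp"):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", "cluster",
        # Assumption: ivy_dir is writable by the user running the driver.
        "--conf", f"spark.driver.extraJavaOptions={ivy_java_options(ivy_dir)}",
        "--packages", ",".join(packages),
        script,
    ]


args = build_submit_args(
    script="path/to/my/script/script.py",
    master="k8s://https://kubernetes.docker.internal:6443",
    packages=["graphframes:graphframes:0.8.1-spark3.0-s_2.12"],
)
# Print a copy-pasteable command line; shlex.join quotes the value with spaces.
print(shlex.join(args))
```

Keeping the arguments in a list (rather than one long string) avoids quoting mistakes around the option value, which contains a space.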

OK, I solved my problem. I am not sure whether it works for other packages, but it lets me run graphframes in the setup described above:

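  1. Download the latest .jar file from spark-packages.
  2. Remove the version part of its name, keeping only the package name. In my case:
mv ./graphframes-0.8.1-spark3.0-s_2.12.jar ./graphframes.jar
  3. Unpack it with the jar command:
# Extract jar contents
jar xf graphframes.jar

Now for the first point. I keep all the packages I use in one dependencies folder, which I then submit to Kubernetes in zipped form. The logic behind this folder is explained in another question of mine, which I again answered myself. See here.

  4. Copy the graphframes folder from the content extracted in the previous step into your dependencies folder:
cp -r ./graphframes $SPARK_HOME/path/to/your/dependencies
  5. Add the original .jar file to the jars folder in $SPARK_HOME.
  6. Add --jars to the spark-submit command, pointing to the new .jar file:
$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=$2 \
--conf spark.kubernetes.container.image=docker.io/path/to/your/image \
--jars "local:///opt/spark/jars/graphframes.jar"  ...
  7. Include the dependencies as described here.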
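
The rename, unpack and copy steps above can also be scripted. This is only a sketch under a few assumptions: a .jar file is a zip archive, so Python's `zipfile` stands in for `jar xf`; the function names `strip_version` and `extract_package_dir` and the `dependencies` folder name are hypothetical; and the jar is copied rather than moved, so the original file stays available.

```python
import shutil
import zipfile
from pathlib import Path


def strip_version(jar_path, package_name):
    """Copy the jar to <package_name>.jar next to the original file."""
    src = Path(jar_path)
    target = src.with_name(f"{package_name}.jar")
    shutil.copyfile(src, target)
    return target


def extract_package_dir(jar_path, package_dir, dependencies_dir):
    """Extract one top-level folder of a jar into a dependencies folder."""
    dest = Path(dependencies_dir)
    dest.mkdir(parents=True, exist_ok=True)
    prefix = package_dir.rstrip("/") + "/"
    with zipfile.ZipFile(jar_path) as jar:
        # Keep only the entries that live under the requested folder.
        members = [n for n in jar.namelist() if n.startswith(prefix)]
        if not members:
            raise FileNotFoundError(f"{prefix} not found in {jar_path}")
        jar.extractall(dest, members=members)
    return dest / package_dir


if __name__ == "__main__":
    jar = Path("graphframes-0.8.1-spark3.0-s_2.12.jar")
    if jar.exists():  # only runs when the downloaded jar is present
        renamed = strip_version(jar, "graphframes")
        print(extract_package_dir(renamed, "graphframes", "dependencies"))
```

Extracting only the package folder keeps the rest of the jar contents (for example META-INF) out of the dependencies folder.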

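I am in a hurry right now, but in the near future I will edit this post and add a link to a short Medium article about handling dependencies in PySpark. Hope it is useful to someone :)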

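I managed to solve a similar problem, where the hadoop-azure jar could not be downloaded with the --packages flag. It is definitely a workaround, but it works.

I modified the PySpark Docker container, changing the entrypoint to: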

ENTRYPOINT [ "/opt/entrypoint.sh" ]

Now I can run the container without it exiting immediately:

docker run -td <docker_image_id>

and open a shell inside it:

docker exec -it <docker_container_id> /bin/bash

At this point I can submit a Spark job with the --packages flag from inside the container:

$SPARK_HOME/bin/spark-submit \
--master local[*] \
--deploy-mode client \
--name spark-python \
--packages org.apache.hadoop:hadoop-azure:3.2.0 \
--conf spark.hadoop.fs.azure.account.auth.type.user.dfs.core.windows.net=SharedKey \
--conf spark.hadoop.fs.azure.account.key.user.dfs.core.windows.net=xxx \
--files "abfss://data@user.dfs.core.windows.net/config.yml" \
--py-files "abfss://data@user.dfs.core.windows.net/jobs.zip" \
"abfss://data@user.dfs.core.windows.net/main.py"

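Spark then downloaded the required dependencies, saved them under /root/.ivy2 in the container, and ran the job successfully.

I copied the whole folder from the container to the host: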

sudo docker cp <docker_container_id>:/root/.ivy2/ /opt/spark/.ivy2/

and modified the Dockerfile again to copy the folder into the image:

COPY .ivy2 /root/.ivy2

Finally, I could submit the job to Kubernetes with this newly built image, and everything ran as expected.
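
Before rebuilding the image, it can help to check that the copied .ivy2 folder actually contains the jars you expect. The sketch below is not part of the original workaround; the function names `list_cached_jars` and `missing_artifacts` are hypothetical, and it makes no assumption about the directory layout under .ivy2 because it simply searches recursively for .jar files.

```python
from pathlib import Path


def list_cached_jars(ivy_home):
    """Return the sorted names of jar files found anywhere under ivy_home."""
    root = Path(ivy_home)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.rglob("*.jar"))


def missing_artifacts(ivy_home, artifacts):
    """Return the artifact names that match no jar file under ivy_home."""
    jars = list_cached_jars(ivy_home)
    return [a for a in artifacts if not any(a in name for name in jars)]


if __name__ == "__main__":
    # Assumption: run from the folder that holds the copied .ivy2 directory.
    print(missing_artifacts(".ivy2", ["hadoop-azure"]))
```

An empty list means every named artifact has at least one matching jar in the folder that the Dockerfile will copy.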
