This is my first project using Kedro together with PySpark, and I have run into a problem. I am on a new Mac (M1). When I run spark-shell
in the terminal, Spark is installed correctly and I get the expected output (the welcome banner for Spark version 3.2.1). However, when I try to run Spark from the Kedro project, I hit the trouble below. I have searched through the Stack Overflow discussions but have not found a solution related to this.
Versions:
- Python: 3.8
- Java: openjdk version "18" 2022-03-22
- PySpark: 3.2.1
Spark conf:
spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
In my Kedro project context:
import os
from pathlib import Path
from typing import Any, Dict, Union

from kedro.framework.context import KedroContext
from pyspark import SparkConf
from pyspark.sql import SparkSession


class ProjectContext(KedroContext):
    """A subclass of KedroContext to add Spark initialisation for the pipeline."""

    def __init__(
        self,
        package_name: str,
        project_path: Union[Path, str],
        env: str = None,
        extra_params: Dict[str, Any] = None,
    ):
        super().__init__(package_name, project_path, env, extra_params)
        if not os.getenv("DISABLE_SPARK"):
            self.init_spark_session()

    def init_spark_session(self) -> None:
        """Initialises a SparkSession using the config
        defined in the project's conf folder.
        """
        parameters = self.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the Spark session
        spark_session_conf = (
            SparkSession.builder.appName(self.package_name)
            .enableHiveSupport()
            .config(conf=spark_conf)
            .master("local[*]")
        )
        _spark_session = spark_session_conf.getOrCreate()
When I run it, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3c60b7e7) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x3c60b7e7
at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:213)
at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:110)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:348)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:287)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:336)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:191)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:460)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
In my terminal, I adjusted the commands to match my Python path:
export HOMEBREW_OPT="/opt/homebrew/opt"
export JAVA_HOME="$HOMEBREW_OPT/openjdk/"
export SPARK_HOME="$HOMEBREW_OPT/apache-spark/libexec"
export PATH="$JAVA_HOME:$SPARK_HOME:$PATH"
export SPARK_LOCAL_IP=localhost
Thanks in advance for your help.
Hi @Mathilde Roblot, thanks for the detailed report.
The specific error "cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module" stands out to me.
Googling suggests that you may have picked up the wrong Java version (not the Java 8 that Spark requires); see the sketch after the links below:
- https://stackoverflow.com/a/49453770/2010808
- https://stackoverflow.com/a/69851663/2010808
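For example, a minimal sketch of switching to a supported JDK with Homebrew (assuming openjdk@11 is acceptable for you; the /opt/homebrew paths are an Apple Silicon guess and may differ on your machine):

brew install openjdk@11
# Point JAVA_HOME at the openjdk@11 keg instead of the default openjdk (18)
export JAVA_HOME="/opt/homebrew/opt/openjdk@11"
export PATH="$JAVA_HOME/bin:$PATH"
java -version   # should now report an OpenJDK 11 build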
Alternatively, you can use some SparkConf settings to pass the required --add-opens flags to the JVM; see https://stackoverflow.com/a/71855571/13547620.
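For reference, here is a minimal sketch of that workaround applied to the init_spark_session method above, right after spark_conf is built; the --add-opens value targets the exact module named in your trace, but treat it as an untested suggestion for your setup rather than a confirmed fix:

# Sketch: on Java 17+/18, explicitly open sun.nio.ch to Spark's unnamed
# module to avoid the IllegalAccessError above. These must be set before
# the session (and its JVM) is created, for both driver and executors.
java_opts = "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
spark_conf.set("spark.driver.extraJavaOptions", java_opts)
spark_conf.set("spark.executor.extraJavaOptions", java_opts)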
This can also happen when your Spark environment libraries are not picked up by Kedro, or when Kedro cannot find Spark in your environment.
Quick question: are you using an IDE like PyCharm? If so, you may need to go to Preferences and set your environment variables there. I have run into the same problem, and setting the env variables in the project preferences fixed it for me.
Hope this helps!