Sagemaker Studio Pyspark示例失败



当我尝试运行Sagemaker提供的示例与Sagemaker Studio中的PySpark

import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import sagemaker
from sagemaker import get_execution_role
import sagemaker_pyspark
role = get_execution_role()
# Configure Spark to use the SageMaker Spark dependency jars
jars = sagemaker_pyspark.classpath_jars()
classpath = ":".join(sagemaker_pyspark.classpath_jars())
# See the SageMaker Spark Github repo under sagemaker-pyspark-sdk
# to learn how to connect to a remote EMR cluster running Spark from a Notebook Instance.
spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath)
.master("local[*]").getOrCreate()

我得到以下异常:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-6-c8f6fff0daaf> in <module>
19 # to learn how to connect to a remote EMR cluster running Spark from a Notebook Instance.
20 spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath)
---> 21     .master("local[*]").getOrCreate()
/opt/conda/lib/python3.6/site-packages/pyspark/sql/session.py in getOrCreate(self)
171                     for key, value in self._options.items():
172                         sparkConf.set(key, value)
--> 173                     sc = SparkContext.getOrCreate(sparkConf)
174                     # This SparkContext may be an existing one.
175                     for key, value in self._options.items():
/opt/conda/lib/python3.6/site-packages/pyspark/context.py in getOrCreate(cls, conf)
361         with SparkContext._lock:
362             if SparkContext._active_spark_context is None:
--> 363                 SparkContext(conf=conf or SparkConf())
364             return SparkContext._active_spark_context
365 
/opt/conda/lib/python3.6/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
127                     " note this option will be removed in Spark 3.0")
128 
--> 129         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
130         try:
131             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/opt/conda/lib/python3.6/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
310         with SparkContext._lock:
311             if not SparkContext._gateway:
--> 312                 SparkContext._gateway = gateway or launch_gateway(conf)
313                 SparkContext._jvm = SparkContext._gateway.jvm
314 
/opt/conda/lib/python3.6/site-packages/pyspark/java_gateway.py in launch_gateway(conf)
44     :return: a JVM gateway
45     """
---> 46     return _launch_gateway(conf)
47 
48 
/opt/conda/lib/python3.6/site-packages/pyspark/java_gateway.py in _launch_gateway(conf, insecure)
106 
107             if not os.path.isfile(conn_info_file):
--> 108                 raise Exception("Java gateway process exited before sending its port number")
109 
110             with open(conn_info_file, "rb") as info:
Exception: Java gateway process exited before sending its port number

在运行示例之前,我使用笔记本中的pip安装了pyspark和sagemaker_pyspark。我还使用了SageMaker内核库中的SparkMagic内核。

也许,您遇到这个问题是因为这个笔记本被设计为在您有EMR集群时运行。我建议你在Sagemaker上安装conda_python3内核,而不是SparkMagic内核。您需要使用pip安装pysparksagemaker_pyspark,但它应该与您发布的代码一起工作。

您也可以使用SparkMagic内核,它在Studio中默认是可用的。这个内核包含了使用sparkmagic连接EMR集群和提交spark代码或运行SQL查询的所有库。

请参阅下面的博客文章如何使用SparkMagic内核与EMR:https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-studio-notebooks-backed-by-spark-in-amazon-emr/

最新更新