createDataFrame (pyspark) is generating a strange error (py4j error)

I wrote these four simple lines of code:

import pyspark
from pyspark.sql import SparkSession
spa = SparkSession.builder.getOrCreate()
spa.createDataFrame([(1,2,3)], ["count"])

But the createDataFrame call is producing this huge error:

Py4JError                                 Traceback (most recent call last)
      3 spa = SparkSession.builder.getOrCreate()
      4
----> 5 spa.createDataFrame([(1,2,3)], ["count"])

C:\users\hp\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    690         else:
    691             rdd, schema = self._createFromLocal(map(prepare, data), schema)
--> 692         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    693         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    694         df = DataFrame(jdf, self._wrapped)

C:\users\hp\appdata\local\programs\python\python37\lib\site-packages\pyspark\rdd.py in _to_java_object_rdd(self)
   2294         """
   2295         rdd = self._pickled()
-> 2296         return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)
   2297
   2298     def countApprox(self, timeout, confidence=0.95):

C:\users\hp\appdata\local\programs\python\python37\lib\site-packages\pyspark\rdd.py in _jrdd(self)
   2472                                       self._jrdd_deserializer, profiler)
   2473         python_rdd = self.ctx._jvm.PythonRDD(self._prev_jrdd.rdd(), wrapped_func,
-> 2474                                              self.preservesPartitioning)
   2475         self._jrdd_val = python_rdd.asJavaRDD()
   2476

C:\users\hp\appdata\local\programs\python\python37\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1523         answer = self._gateway_client.send_command(command)
   1524         return_value = get_return_value(
-> 1525             answer, self._gateway_client, None, self._fqn)
   1526
   1527         for temp_arg in temp_args:

C:\users\hp\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

C:\users\hp\appdata\local\programs\python\python37\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    330                 raise Py4JError(
    331                     "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
--> 332                     format(target_id, ".", name, value))
    333             else:
    334                 raise Py4JError(

Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonRDD. Trace:
py4j.Py4JException: Constructor org.apache.spark.api.python.PythonRDD([class org.apache.spark.rdd.ParallelCollectionRDD, class org.apache.spark.api.python.PythonFunction, class java.lang.Boolean]) does not exist
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
    at py4j.Gateway.invoke(Gateway.java:237)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Why is this happening? The code is practically identical to other tutorials, where it works fine...

Try this; it works. When initializing, put a comma after each value: createDataFrame expects every element of the data list to be a row (a tuple), and (1) is just the integer 1, while (1,) is a one-element tuple.
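
You can see the difference the comma makes in a plain Python session:

type((1))    # <class 'int'>  -- parentheses alone don't make a tuple
type((1,))   # <class 'tuple'>  -- the trailing comma does

With each row written as a one-element tuple, the DataFrame builds correctly: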

import pyspark
from pyspark.sql import SparkSession

spa = SparkSession.builder.getOrCreate()
sc = spa.sparkContext  # parallelize is a SparkContext method
df = spa.createDataFrame(sc.parallelize([(1,), (2,), (3,)]), ["count"])
df.show()

Output:

+-----+
|count|
+-----+
|    1|
|    2|
|    3|
+-----+
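
For what it's worth, the sc.parallelize step isn't required: createDataFrame also accepts a plain Python list of row tuples. A minimal sketch of the same fix, assuming the session above:

import pyspark
from pyspark.sql import SparkSession

spa = SparkSession.builder.getOrCreate()

# Same data, passed directly as a list of one-element row tuples.
df = spa.createDataFrame([(1,), (2,), (3,)], ["count"])
df.show()

This prints the same table as above.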

Hope this helps!
