Python pandas_udf Spark error



I started playing around with Spark locally and found this strange issue

(pip install pyspark==2.3.1)

pyspark>

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType, udf

df = pd.DataFrame({'x': [1, 2, 3], 'y': [1.0, 2.0, 3.0]})
sp_df = spark.createDataFrame(df)

@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1

sp_df.withColumn('v2', pandas_plus_one(sp_df.x)).show()

Taking this example from https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

Any idea why I keep getting this error?

py4j.protocol.Py4JJavaError: An error occurred while calling o108.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 8, localhost, executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:333)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:322)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:177)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:158)
    ... 27 more

I ran into the same problem. I found it to be a version incompatibility between pandas and numpy.

For me, the following combination works (a quick way to check your own installed versions is sketched after the lists):

numpy==1.14.5
pandas==0.23.4
pyarrow==0.10.0

Previously I had this non-working combination:

numpy==1.15.1
pandas==0.23.4
pyarrow==0.10.0
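
As a quick sanity check before changing anything, you can print the versions actually installed in the environment Spark is using; this is a minimal sketch, assuming plain pip-installed packages:

import numpy
import pandas
import pyarrow

# Compare these against a known-good combination such as the one above
print("numpy  ", numpy.__version__)
print("pandas ", pandas.__version__)
print("pyarrow", pyarrow.__version__)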

I found the problem to be just an incompatible version of pyarrow: Spark 2.4.0 was built against pyarrow 0.10.0 (https://issues.apache.org/jira/browse/SPARK-23874).

I reverted the pyarrow package to 0.10.0 (the current release is 0.15.x) and it runs perfectly.
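
If it helps, pinning the package back is a one-liner with pip (assuming a pip-managed environment; adjust for conda or similar):

pip install pyarrow==0.10.0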

The configuration that works for me is:

numpy==1.14.3
pandas==0.23.0
pyarrow==0.10.0
