Python pandas_udf Spark error



I started playing around with Spark locally and found this strange issue

(pip install pyspark==2.3.1)

pyspark>

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType, udf

df = pd.DataFrame({'x': [1, 2, 3], 'y': [1.0, 2.0, 3.0]})
sp_df = spark.createDataFrame(df)

@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1

sp_df.withColumn('v2', pandas_plus_one(sp_df.x)).show()

Taking this example from https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

Any idea why I keep getting this error?

py4j.protocol.Py4JJavaError: An error occurred while calling o108.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 8, localhost, executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:333)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:322)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:177)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:158)
    ... 27 more

I ran into the same problem. I found it to be a version incompatibility between pandas and numpy.

For me, the following combination works (a quick way to check your own installed versions is sketched after the lists):

numpy==1.14.5
pandas==0.23.4
pyarrow==0.10.0

Previously I had this non-working combination:

numpy==1.15.1
pandas==0.23.4
pyarrow==0.10.0
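
As a quick sanity check before changing anything, you can print the versions actually installed in the environment Spark is using; this is a minimal sketch, assuming plain pip-installed packages:

import numpy
import pandas
import pyarrow

# Compare these against a known-good combination such as the one above
print("numpy  ", numpy.__version__)
print("pandas ", pandas.__version__)
print("pyarrow", pyarrow.__version__)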

I found the problem to be just an incompatible version of pyarrow: Spark 2.4.0 was built against pyarrow 0.10.0 (https://issues.apache.org/jira/browse/SPARK-23874).

I reverted the pyarrow package to 0.10.0 (the current release is 0.15.x) and it runs perfectly.
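
If it helps, pinning the package back is a one-liner with pip (assuming a pip-managed environment; adjust for conda or similar):

pip install pyarrow==0.10.0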

The configuration that works for me is:

numpy==1.14.3
pandas==0.23.0
pyarrow==0.10.0
