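I have a web application that runs long-running tasks in PySpark. I am using Django and Celery to run the tasks asynchronously.
I have a piece of code that works fine when I execute it in the console, but I get a lot of errors when I run it through a Celery task. First, for some reason my UDF does not work. I put it in a try-except block, and it always ends up in the except block.
try:
    func = udf(lambda x: parse(x), DateType())
    spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
except:
    raise ValueError("No valid date format found.")
Error: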
[2018-04-05 07:47:37,223: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[afbda586-0929-4d51-87f1-d612cbdb4c5e] raised unexpected: Py4JError('An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:\npy4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)\n\tat py4j.Gateway.invoke(Gateway.java:235)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:748)\n\n',)
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 68, in outlier_algorithm
spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 179, in wrapper
return self(*args)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 157, in __call__
judf = self._judf
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 141, in _judf
self._judf_placeholder = self._create_judf()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 153, in _create_judf
self._name, wrapped_func, jdt, self.evalType, self.deterministic)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 324, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
In addition, I am using toPandas() to convert the DataFrame and run some pandas functions on it, but that raises the following error:
[2018-04-05 07:46:29,701: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[ec267a9b-b482-492d-8404-70b489fbbfe7] raised unexpected: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 146, in outlier_algorithm
data_frame_new = data_frame_1.toPandas()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/dataframe.py", line 1937, in toPandas
if self.sql_ctx.getConf("spark.sql.execution.pandas.respectSessionTimeZone").lower()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/context.py", line 142, in getConf
return self.sparkSession.conf.get(key, defaultValue)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/conf.py", line 46, in get
return self._jconf.get(key)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: ('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
[2018-04-05 07:46:29,706: ERROR/MainProcess] Task handler raised error: <MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can't pickle <class 'py4j.protocol.Py4JJavaError'>: it's not the same object as py4j.protocol.Py4JJavaError",)''.>
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/pool.py", line 362, in workloop
put((READY, (job, i, result, inqW_fd)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/queues.py", line 366, in put
self.send_payload(ForkingPickler.dumps(obj))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/reduction.py", line 56, in dumps
cls(buf, protocol).dump(obj)
billiard.pool.MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can't pickle <class 'py4j.protocol.Py4JJavaError'>: it's not the same object as py4j.protocol.Py4JJavaError",)''.
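I ran into this problem and had a hard time pinning it down. It turns out that this error can occur when the version of Spark you are running does not match the version of PySpark you are executing it with. In my case, I was running Spark 2.2.3.4 and trying to use PySpark 2.4.4. After I downgraded PySpark to 2.2.3, the problem went away. I then hit another problem, caused by code that used PySpark functionality added after 2.2.3, but that is a separate issue.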
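A quick way to check for this kind of mismatch is to compare the version of the installed PySpark package with the version reported by the running Spark session. The sketch below is only an illustration: the helper name `same_minor_version` is made up, and it assumes that agreement on the major and minor numbers is enough, which may not hold for every release.

```python
def same_minor_version(pyspark_version, spark_version):
    """Return True when both version strings share the same major.minor prefix."""
    return pyspark_version.split(".")[:2] == spark_version.split(".")[:2]


# Plain strings are used here so the check runs without a Spark installation.
print(same_minor_version("2.4.4", "2.2.3"))  # False
print(same_minor_version("2.4.4", "2.4.0"))  # True

# In a real session (assumes pyspark is installed and `spark` is a SparkSession):
#   import pyspark
#   same_minor_version(pyspark.__version__, spark.version)
```

Running the check inside the Celery task as well as in the console would show whether the worker picks up a different PySpark installation than the one used interactively.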
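This simply won't work. Spark relies on complex state, including JVM state, which cannot simply be serialized and sent to a worker. If you want to run the code asynchronously, use a thread pool to submit the jobs.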
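A minimal sketch of the thread-pool approach, using the standard library's `concurrent.futures`. The names `submit_job` and `example_job` are hypothetical, and `example_job` is a stand-in for real Spark work so that the snippet runs without a cluster. The assumption is that the SparkSession lives in the same process as the pool, so nothing Spark-related has to be pickled.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared pool in the process that also owns the SparkSession (assumption).
_executor = ThreadPoolExecutor(max_workers=2)


def submit_job(job, *args, **kwargs):
    """Schedule job(*args, **kwargs) on the pool and return a Future right away."""
    return _executor.submit(job, *args, **kwargs)


def example_job(n):
    # Stand-in for Spark work; in a real application this could be something
    # like spark.range(n).count(), using the session created in this process.
    return sum(range(n))


future = submit_job(example_job, 10)
print(future.result())  # blocks until the job finishes, prints 45
```

The caller can keep the returned Future and poll `future.done()` instead of blocking on `result()`, which is roughly what a web view would do while a long job is running.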
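I am answering my own question. This is probably a PySpark 2.3 bug. I was using PySpark 2.3.0, and for some reason it did not work properly with Python 3.5. I downgraded to PySpark 2.1.2 and everything worked fine.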