How to use .dot in PySpark (AttributeError: 'DataFrame' object has no attribute 'dot')



In pandas, df1.dot(df2.T) computes the dot product, but when I run the same call in PySpark I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-2219b97587ee> in <module>
----> 1 df1.dot(df2.T)
/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/lib/spark/python/pyspark/sql/dataframe.py in __getattr__(self, name)
1302         if name not in self.columns:
1303             raise AttributeError(
-> 1304                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
1305         jc = self._jdf.apply(name)
1306         return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'dot'

Have you tried pandas-on-Spark?

import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)
df = ps.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
s = ps.Series([1, 1, 2, 1])
print(df @ s)
0   -4
1    5
dtype: int64
Notes

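  • pyspark.pandas requires pyspark >= 3.2.
  • The dot product is only supported between a DataFrame and a Series, so you cannot write df @ df.
  • To convert a Spark DataFrame (not an RDD) to pandas-on-Spark, use pdf = df.to_pandas_on_spark(); newer releases also provide df.pandas_api(). To convert pandas-on-Spark back to a Spark DataFrame, use pdf.to_spark().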
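
Since dot only accepts a Series on the right-hand side, one way to get the equivalent of df1.dot(df2.T) is to build the result one column at a time, with one DataFrame.dot(Series) call per row of df2. The sketch below shows that decomposition in plain pandas so it can be run without a cluster. The helper name dot_via_series and the sample df2 are made up for illustration, and whether the same loop runs unchanged (or efficiently) under pyspark.pandas is an assumption I have not verified.

```python
import pandas as pd

# Sample data: df1 is the frame from the answer above, df2 is hypothetical.
df1 = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
df2 = pd.DataFrame([[1, 1, 2, 1], [2, 0, 1, 3]])


def dot_via_series(left, right):
    """Compute left.dot(right.T) using only DataFrame.dot(Series) calls.

    Hypothetical helper: each row of `right` becomes one column of the result.
    The column labels of `left` and `right` must match.
    """
    columns = {}
    for label, row in right.iterrows():
        # `row` is a Series indexed by right's column labels,
        # so left.dot(row) is an ordinary DataFrame-by-Series dot product.
        columns[label] = left.dot(row)
    return pd.DataFrame(columns)


result = dot_via_series(df1, df2)
print(result)
```

On this sample the result has the same values as df1.dot(df2.T) in pandas. The loop issues one dot call per row of df2, so it only makes sense when df2 has few rows.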
