In pandas we know that df1.dot(df2.T)
computes the dot product, but when I run the same thing in PySpark I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-7-2219b97587ee> in <module>
----> 1 df1.dot(df2.T)
/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/lib/spark/python/pyspark/sql/dataframe.py in __getattr__(self, name)
1302 if name not in self.columns:
1303 raise AttributeError(
-> 1304 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
1305 jc = self._jdf.apply(name)
1306 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'dot'
Have you tried pandas-on-spark?
import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)
df = ps.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
s = ps.Series([1, 1, 2, 1])
print(df @ s)
0 -4
1 5
dtype: int64
Note that pyspark.pandas
requires pyspark >= 3.2.
- The dot product is only defined between a DataFrame and a Series, so you cannot write:
df @ df
To convert a Spark DataFrame to pandas-on-spark,
use: pdf = df.to_pandas_on_spark()
To go from pandas-on-spark
back to a Spark DataFrame,
use: pdf.to_spark()