How to get the correlation matrix of a PySpark DataFrame? (New, 2020)

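I have the same problem as this question:

How to get the correlation matrix of a PySpark DataFrame?

"I have a big PySpark DataFrame and I want to get its correlation matrix. I know how to get it with a pandas DataFrame, but my data is too big to convert to pandas, so I need to get the result with the PySpark DataFrame. I searched other similar questions, but the answers don't work for me. Can anybody help me? Thanks!"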

df4 is my dataset. It has 9 columns, all of them integers:

reference__YM_unix:integer
tenure_band:integer
cei_global_band:integer
x_band:integer
y_band:integer
limit_band:integer
spend_band:integer
transactions_band:integer
spend_total:integer

I started with this step:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# convert to vector column first
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=df4.columns, outputCol=vector_col)
df_vector = assembler.transform(df4).select(vector_col)
# get correlation matrix (a one-row DataFrame holding a DenseMatrix)
matrix = Correlation.corr(df_vector, vector_col)

and got the following output:

(matrix.collect()[0]["pearson({})".format(vector_col)].values)
Out[33]: array([ 1.        ,  0.0760092 ,  0.09051543,  0.07550633, -0.08058203,
-0.24106848,  0.08229602, -0.02975856, -0.03108094,  0.0760092 ,
1.        ,  0.14792512, -0.10744735,  0.29481762, -0.04490072,
-0.27454922,  0.23242408,  0.32051685,  0.09051543,  0.14792512,
1.        , -0.03708623,  0.13719527, -0.01135489,  0.08706559,
0.24713638,  0.37453265,  0.07550633, -0.10744735, -0.03708623,
1.        , -0.49640664,  0.01885793,  0.25877516, -0.05019079,
-0.13878844, -0.08058203,  0.29481762,  0.13719527, -0.49640664,
1.        ,  0.01080777, -0.42319841,  0.01229877,  0.16440178,
-0.24106848, -0.04490072, -0.01135489,  0.01885793,  0.01080777,
1.        ,  0.00523737,  0.01244241,  0.01811365,  0.08229602,
-0.27454922,  0.08706559,  0.25877516, -0.42319841,  0.00523737,
1.        ,  0.32888075,  0.21416322, -0.02975856,  0.23242408,
0.24713638, -0.05019079,  0.01229877,  0.01244241,  0.32888075,
1.        ,  0.53310864, -0.03108094,  0.32051685,  0.37453265,
-0.13878844,  0.16440178,  0.01811365,  0.21416322,  0.53310864,
1.        ])

I tried to put this result into an array or an Excel file, without success. I did:

matrix2 = (matrix.collect()[0]["pearson({})".format(vector_col)])

Then I got the following error when trying to display it:

display(matrix2)
Exception: ML model display does not yet support model type <class 'pyspark.ml.linalg.DenseMatrix'>.

I also wanted to attach the column names from df4, but did not manage to. I read that I need to use df4.columns, but I don't know how it works.

Finally, I would like to produce the following plot, which I saw in a Medium article:

https://medium.com/towards-artificial-intelligence/feature-selection-and-dimensionality-reduction-using-covariance-matrix-plot-b4c7498abd07

But that did not work either:

from sklearn.preprocessing import StandardScaler 
stdsc = StandardScaler() 
X_std = stdsc.fit_transform(df4.iloc[:,range(0,7)].values)
cov_mat =np.cov(X_std.T)
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
hm = sns.heatmap(cov_mat,
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 12},
cmap='coolwarm',                 
yticklabels=cols,
xticklabels=cols)
plt.title('Covariance matrix showing correlation coefficients', size = 18)
plt.tight_layout()
plt.show()

AttributeError: 'DataFrame' object has no attribute 'iloc'

I tried replacing df4 with matrix2, but that did not work either.

You can get the correlation matrix in a usable form with the following:

# matrix2 is the DenseMatrix collected from the Correlation.corr result above
matrix = matrix2.toArray().tolist()

From there you can convert it to a DataFrame with pd.DataFrame(matrix), which lets you plot a heatmap, save it to Excel, and so on.
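To attach the column names, one option is to pass them as both the index and the columns of the pandas DataFrame. The sketch below is only an illustration: it assumes the correlation matrix has already been collected to the driver, and it uses a hard-coded 3x3 array (the first three columns of the output shown in the question) in place of the real 9x9 result. The names `columns`, `corr_array`, and `corr_df` are made up for this example; in the question they would come from `df4.columns` and the collected DenseMatrix's `toArray()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: in the question these would be df4.columns and
# the NumPy array returned by the collected DenseMatrix's toArray().
# Only three of the nine columns are used here to keep the example short.
columns = ["reference__YM_unix", "tenure_band", "cei_global_band"]
corr_array = np.array([
    [1.0,        0.0760092,  0.09051543],
    [0.0760092,  1.0,        0.14792512],
    [0.09051543, 0.14792512, 1.0],
])

# Label both axes with the column names so each cell can be looked up by name.
corr_df = pd.DataFrame(corr_array, index=columns, columns=columns)

print(corr_df.round(3))
```

A labelled DataFrame like `corr_df` can be passed directly to `sns.heatmap(corr_df, annot=True)`, which picks up the index and column labels for the tick marks, or written out with `corr_df.to_csv(...)`. The correlation matrix is only 9x9, so collecting it to the driver is cheap even when df4 itself is too large for pandas.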
