Implementing the numpy covariance matrix from scratch



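I am trying to mimic the np.cov function by implementing the covariance matrix from scratch. However, my code does not seem to give the same output as np.cov.

Code: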

import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/User/Downloads/Admission_Predict.csv')
X = df.values
N, M = X.shape
means = np.zeros(M)  # M many of them
stdevs = np.zeros(M)
Xcoeff = np.zeros((M, M))
# Mean
for i in range(M):
    means[i] = np.sum(X[:, i]) / N
    stdevs[i] = math.sqrt(sum(pow(x-means[i], 2) for x in X[:, i]) / (N-1))
# Covariance matrix
for j in range(M):
    mat0 = mat[i][j] - [means][0].reshape(M, -1)
    covariance = (mat0 * mat0.T) / (N-1)

Expected matrix values:

print(np.cov(df))
> [[14128.00654107 13533.16488393 13222.07435357 ... 13831.92691786
>   13050.78170893 13961.07189821]
>  [13533.16488393 12968.32105536 12670.19783929 ... 13249.25808929
>   12505.84390893 13372.93946964]
>  [13222.07435357 12670.19783929 12379.07033571 ... 12944.65915
>   12218.34000357 13065.526925  ]
>  ...
>  [13831.92691786 13249.25808929 12944.65915    ... 13542.10545
>   12777.00158214 13668.54191786]
>  [13050.78170893 12505.84390893 12218.34000357 ... 12777.00158214
>   12060.0142125  12896.28555179]
>  [13961.07189821 13372.93946964 13065.526925   ... 13668.54191786
>   12896.28555179 13796.19808393]]

My output matrix values:

print(covariance)
> [ 3.47493270e+02  1.17319616e+02  2.64636910e+00  2.98987496e+00
>   3.04758394e+00  8.70463072e+00 -1.45646482e-01  4.87503509e-02]

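See the code below. Note that you need to set rowvar=False in np.cov to calculate the covariance between the columns of the data frame. By default np.cov treats each row as a variable and each column as an observation, which is why the np.cov(df) call in the question returns one row and column per data row instead of an M x M matrix.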

import pandas as pd
import numpy as np
# Load the data
df = pd.read_csv('Admission_Predict.csv')
# Extract the data
X = df.values
# Extract the number of rows and columns
N, M = X.shape
# Calculate the covariance matrix
cov = np.zeros((M, M))
for i in range(M):
    # Mean of column "i"
    mean_i = np.sum(X[:, i]) / N
    for j in range(M):
        # Mean of column "j"
        mean_j = np.sum(X[:, j]) / N
        # Covariance between column "i" and column "j"
        cov[i, j] = np.sum((X[:, i] - mean_i) * (X[:, j] - mean_j)) / (N - 1)
# Compare with numpy covariance matrix
np_cov = np.cov(df.values, rowvar=False)
np_cov_diff = np.sum(np.abs(cov - np_cov))
print('Difference from numpy cov. mat.: {:.12f}'.format(np_cov_diff))
# Difference from numpy cov. mat.: 0.000000000000
# Compare with pandas covariance matrix
pd_cov = df.cov().values
pd_cov_diff = np.sum(np.abs(cov - pd_cov))
print('Difference from pandas cov. mat.: {:.12f}'.format(pd_cov_diff))
# Difference from pandas cov. mat.: 0.000000000000
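
For reference, the same computation can probably be written without the double loop. The sketch below is only an illustration and not part of the original answer: it uses a small synthetic array in place of Admission_Predict.csv (the shape and the random seed are assumptions), centers each column, and forms the matrix product of the centered data with itself. It also prints the output shapes of np.cov with and without rowvar=False to show the effect of that flag.

```python
import numpy as np

# Synthetic stand-in for the CSV data (assumption: 6 rows, 3 numeric columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
N, M = X.shape

# Center each column, then sum the pairwise products with one matrix product.
centered = X - X.mean(axis=0)
cov_vec = centered.T @ centered / (N - 1)

# rowvar=False treats columns as variables, so the results should agree.
assert np.allclose(cov_vec, np.cov(X, rowvar=False))

# The default (rowvar=True) treats each row as a variable instead.
print(np.cov(X).shape)                # (6, 6)
print(np.cov(X, rowvar=False).shape)  # (3, 3)
```

The explicit loops in the answer are easier to follow line by line; the matrix-product form should be faster when there are many columns, since the work happens inside numpy.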
