Output difference when using pandas.read_csv vs numpy.loadtxt



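I'm currently working through Andrew Ng's ML course in Python. I searched Google to check whether my output matched other people's, and realized that my final output changed precisely because I used numpy instead of pandas.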

Expected: 2105448288.6292474
My output: 2064911681.618526

Is this kind of difference commonly observed when using different modules?

Reference code:

import numpy as np
import pandas as pd

data = pd.read_csv('ex1data2.txt', sep = ',', header = None)
X = data.iloc[:,0:2] # read first two columns into X
y = data.iloc[:,2] # read the third column into y
m = len(y) # no. of training samples
data.head()

# normalize the features, then prepend a column of ones for the intercept
X = (X - np.mean(X))/np.std(X)
ones = np.ones((m,1))
X = np.hstack((ones, X))

alpha = 0.01
num_iters = 400
theta = np.zeros((3,1))
# convert to a NumPy array before adding an axis
# (newer pandas versions reject y[:,np.newaxis] on a Series)
y = y.to_numpy()[:,np.newaxis]

def computeCostMulti(X, y, theta):
    temp = np.dot(X, theta) - y
    return np.sum(np.power(temp, 2)) / (2*m)

J = computeCostMulti(X, y, theta)

def gradientDescentMulti(X, y, theta, alpha, iterations):
    m = len(y)
    for _ in range(iterations):
        temp = np.dot(X, theta) - y
        temp = np.dot(X.T, temp)
        theta = theta - (alpha/m) * temp
    return theta

theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
J = computeCostMulti(X, y, theta)
print(J)

My code:

import numpy as np
from matplotlib import pyplot as plt

data = np.loadtxt("ex1data2.txt", delimiter = ",", dtype = 'int')
x = data[:, 0:2]
y = data[:, 2]
m = len(y)
x = x.reshape(m,2)
y = y.reshape(m,1)
one = np.ones((m,1))
X = np.matrix(x)
X = (X - np.mean(X))/np.std(X)
X = np.concatenate((one,X), axis = 1)
theta = np.zeros((3, 1))

def cc(theta, X, y):
    A = np.dot(X,theta)-y
    return float(((1/(2*m)) * np.dot(A.T, A)))

def gd(theta, X, y, alpha, iterations):
    for i in range(iterations):
        h = np.dot(X, theta)-y
        h = np.dot(X.T, h)
        theta = theta - (alpha/m) * h
    return theta

theta = (gd(theta, X, y, 0.01, 400))
print(cc(theta, X, y))

The difference lies in how reductions such as np.mean and np.std are applied to a pandas DataFrame versus a numpy matrix.

For example, with a pandas DataFrame:

data = pd.read_csv('ex1data2.txt', header=None)
X_df = data.iloc[:, 0:2]
np.mean(X_df)
# 0    2000.680851
# 1       3.170213
# dtype: float64

versus a numpy matrix:

data = np.loadtxt('ex1data2.txt', delimiter=',', dtype='int')
x = data[:, 0:2]
X_mat = np.matrix(x)
np.mean(X_mat)
# 1001.9255319148937

Here you need axis=0 to replicate the pandas behavior:

np.mean(X_mat, axis=0)
# matrix([[2000.68085106,    3.17021277]])
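To make the axis semantics concrete without needing the data file, here is a minimal sketch on a tiny hand-written array. The four rows are illustrative values chosen for this example, not a claim about the contents of ex1data2.txt, and the names `values`, `global_mean`, `column_means`, and `frame_means` are made up for the sketch. The pandas side calls `mean(axis=0)` explicitly rather than relying on what `np.mean` does by default with a DataFrame.

```python
import numpy as np
import pandas as pd

# Illustrative values only: two feature columns on very different scales.
values = np.array([[2104, 3],
                   [1600, 3],
                   [2400, 3],
                   [1416, 2]])

# Without an axis, NumPy reduces over every entry and returns one scalar.
global_mean = np.mean(values)

# With axis=0, NumPy reduces down the rows and returns one mean per column.
column_means = np.mean(values, axis=0)

# Per-column means of the same data held in a DataFrame, requested explicitly.
frame_means = pd.DataFrame(values).mean(axis=0).to_numpy()

print(global_mean)                              # 941.375
print(column_means)                             # per-column: 1880.0 and 2.75
print(np.allclose(column_means, frame_means))   # True
```

The scalar mean is dominated by the large first column, so subtracting it from the second column does not center that column at all, which is why the normalized features (and therefore the final cost) differ.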

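I haven't gone through every line of your numpy version to debug each difference, but this is the root cause. The data loaded by pd.read_csv and np.loadtxt is equivalent. The subsequent matrix operations differ because of how the axis is handled.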
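As a rough check of that claim, the sketch below normalizes per column with an explicit axis=0 and runs the same gradient descent on data held in a numpy array and in a DataFrame. It is an assumption-laden illustration, not the course's reference solution: the data is randomly generated rather than read from ex1data2.txt, and the helper names (`normalize_columns`, `cost`, `gradient_descent`, `fit_cost`) are invented here.

```python
import numpy as np
import pandas as pd

def normalize_columns(features):
    """Scale each column to zero mean and unit (population) standard deviation."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)

def cost(X, y, theta):
    """Squared-error cost, J = sum((X @ theta - y)^2) / (2m)."""
    residual = X @ theta - y
    return float(np.sum(residual ** 2) / (2 * len(y)))

def gradient_descent(X, y, theta, alpha, iterations):
    """Batch gradient descent with a fixed learning rate."""
    m = len(y)
    for _ in range(iterations):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta

def fit_cost(features, targets, alpha=0.01, iterations=400):
    """Normalize, add the intercept column, fit, and return the final cost."""
    X = np.hstack((np.ones((len(targets), 1)), normalize_columns(features)))
    y = np.asarray(targets, dtype=float).reshape(-1, 1)
    theta = gradient_descent(X, y, np.zeros((X.shape[1], 1)), alpha, iterations)
    return cost(X, y, theta)

# Synthetic stand-in for the data file: two features and one target column.
rng = np.random.default_rng(0)
raw = np.column_stack((rng.integers(800, 4500, size=47),
                       rng.integers(1, 6, size=47),
                       rng.integers(150000, 700000, size=47)))

cost_from_array = fit_cost(raw[:, 0:2], raw[:, 2])

frame = pd.DataFrame(raw)
cost_from_frame = fit_cost(frame.iloc[:, 0:2], frame.iloc[:, 2])

print(np.isclose(cost_from_array, cost_from_frame))
```

Because `normalize_columns` converts its input to a plain array and always passes axis=0, the result no longer depends on which container the data was loaded into.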
