I have a pandas DataFrame like this (there are more columns, and all the others are numeric):
import pandas as pd
from sklearn.dummy import DummyClassifier
df = pd.DataFrame({'Time':['2013-08-01 00:00:00', '2014-09-01 12:10:00', '2015-02-02 10:10:00', '2016-01-01 00:00:00'], 'Model_Targ':['a', 'b', 'a', 'b'], 'Col2':[-0.945000, -0.855000, -0.860000, -0.945000], 'Col3':[64.384028, 64.485417, 64.609028, 64.723611]})
df['Time'] = pd.to_datetime(df['Time'])
TrainSet = df[df['Time']<'2015-01-01']
TestSet = df[df['Time']>'2015-01-01']
If I use
Train_Y = TrainSet.iloc[:, 1]
Train_X = TrainSet.drop(TrainSet.columns[[0,1]], axis=1)
Test_y = TestSet.iloc[:,1]
Test_x = TestSet.drop(TestSet.columns[[0,1]], axis=1)
it works fine with sklearn's DummyClassifier().
But if I use
Columns_to_drop = df.filter(like='Targ', axis = 1).columns.values.tolist()
Columns_to_drop.append('Time')
Train_Y = TrainSet.filter(like='Targ', axis = 1)
Train_X = TrainSet.drop(Columns_to_drop, axis=1)
Test_y = TestSet.filter(like='Targ', axis = 1)
Test_x = TestSet.drop(Columns_to_drop, axis=1)
I get an error from the DummyClassifier:
clf = DummyClassifier()
clf.fit(Train_X , Train_Y)
Predict_y = clf.predict(Test_x)
I compared the two frames, and the comparison returned a large matrix of all True values.
/usr/local/lib/python2.7/dist-packages/sklearn/dummy.pyc in predict(self, X)
174
175 elif self.strategy == "stratified":
--> 176 ret = proba[k].argmax(axis=1)
177
178 elif self.strategy == "uniform":
AttributeError: 'list' object has no attribute 'argmax'
The code you show as working does not actually work, because you indexed the training and test sets incorrectly. It should look like this:
#! Index([u'Col2', u'Col3', u'Model_Targ', u'Time'], dtype='object')
Train_Y = TrainSet.iloc[:, 2]
Train_X = TrainSet.drop(TrainSet.columns[[2,3]], axis=1)
Test_y = TestSet.iloc[:,2]
Test_x = TestSet.drop(TestSet.columns[[2,3]], axis=1)
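The positional indices above depend on the column order shown in the `Index(...)` comment, which may differ on your setup. As a minimal sketch (the variable names `target_col`, `drop_cols`, `train`, and `test` are mine, not from the question), selecting by column name avoids relying on that order:

```python
import pandas as pd

df = pd.DataFrame({
    'Time': ['2013-08-01 00:00:00', '2014-09-01 12:10:00',
             '2015-02-02 10:10:00', '2016-01-01 00:00:00'],
    'Model_Targ': ['a', 'b', 'a', 'b'],
    'Col2': [-0.945, -0.855, -0.860, -0.945],
    'Col3': [64.384028, 64.485417, 64.609028, 64.723611],
})
df['Time'] = pd.to_datetime(df['Time'])

train = df[df['Time'] < '2015-01-01']
test = df[df['Time'] > '2015-01-01']

# Hypothetical helper names: pick the target and the columns to drop by label,
# so the result does not depend on where the columns sit in the frame.
target_col = 'Model_Targ'
drop_cols = ['Time', target_col]

Train_Y = train[target_col]            # single-bracket selection gives a Series
Train_X = train.drop(drop_cols, axis=1)
Test_y = test[target_col]
Test_x = test.drop(drop_cols, axis=1)
```

With this version the feature frames contain only `Col2` and `Col3` regardless of how the DataFrame orders its columns.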
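Now, the reason the second code example does not work is that you end up with DataFrames for the target sets (Train_Y, Test_y). This is a problem because the DummyClassifier predict method calls the argmax method, which DataFrames themselves do not have, but their columns (Series) do. So, to make the second code example work, you only need to specify the column name to extract the Series.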
import pandas as pd
from sklearn.dummy import DummyClassifier
df = pd.DataFrame({'Time':['2013-08-01 00:00:00', '2014-09-01 12:10:00', '2015-02-02 10:10:00', '2016-01-01 00:00:00'], 'Model_Targ':['a', 'b', 'a', 'b'], 'Col2':[-0.945000, -0.855000, -0.860000, -0.945000], 'Col3':[64.384028, 64.485417, 64.609028, 64.723611]})
df['Time'] = pd.to_datetime(df['Time'])
TrainSet = df[df['Time']<'2015-01-01']
TestSet = df[df['Time']>'2015-01-01']
Columns_to_drop = df.filter(like='Targ', axis = 1).columns.values.tolist()
Columns_to_drop.append('Time')
Train_Y = TrainSet.filter(like='Targ', axis = 1)['Model_Targ']  # changed: select the column to get a Series
Train_X = TrainSet.drop(Columns_to_drop, axis=1)
Test_y = TestSet.filter(like='Targ', axis = 1)['Model_Targ']  # changed: select the column to get a Series
Test_x = TestSet.drop(Columns_to_drop, axis=1)
clf = DummyClassifier()
clf.fit(Train_X , Train_Y)
Predict_y = clf.predict(Test_x)
print(Predict_y)
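To see the difference between the two target objects for yourself, here is a small sketch (using a cut-down frame and my own variable names `as_frame` and `as_series`) that checks what `filter` returns versus what you get after selecting the column:

```python
import pandas as pd

df = pd.DataFrame({
    'Model_Targ': ['a', 'b', 'a', 'b'],
    'Col2': [-0.945, -0.855, -0.860, -0.945],
})

# filter() keeps the result two-dimensional, even when only one column matches.
as_frame = df.filter(like='Targ', axis=1)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (4, 1)

# Selecting the single column by position yields a one-dimensional Series.
as_series = as_frame.iloc[:, 0]
print(type(as_series).__name__, as_series.shape)  # Series (4,)
```

Using `.iloc[:, 0]` here is just one way to avoid hard-coding the column name; it assumes the filter matches exactly one column.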