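I am trying to adapt the code found here: http://blog.yhathq.com/posts/random-forests-in--python.html to a fake dataset.
I am trying to predict whether a person is male (0) or female (1) based on their weight and height.
The data looks like: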
Weight Height Gender
150 60 1
250 85 0
175 75 0
100 62 1
90 58 1
200 80 0
... ... ...
165 66 0
Now I am trying to classify the test set as male or female.
Here is the code:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
xl = pd.ExcelFile('fakedata.xlsx')
df = xl.parse()
df.head()
# Randomly mark roughly 75% of the rows as training data
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
features = df.columns[:2]
clf = RandomForestClassifier(n_jobs=2)
# factorize returns (integer codes, unique labels); only the codes are kept here
y, _ = pd.factorize(train['Gender'])
clf.fit(train[features], y)
I understand what the code does up to this point, but I run into a problem here:
preds = train['Gender'][clf.predict(test[features])]
print(pd.crosstab(test['Gender'], preds, rownames=['actual'], colnames=['preds']))
which gives me the error
ValueError: cannot reindex from a duplicate axis
What exactly am I missing?
You should not index by the predictions, as you do in the line preds = train['Gender'][clf.predict(test[features])].
Your predictions should simply be
preds = clf.predict(test[features])
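
For reference, here is a minimal end-to-end sketch of the corrected flow. Since fakedata.xlsx is not available, it builds a synthetic stand-in; the means and spreads used to generate Weight and Height are made-up assumptions, as is the random_state argument added for reproducibility. It also keeps the second return value of pd.factorize (called labels here, a name not in the original). factorize numbers labels in order of first appearance, so code 0 is not guaranteed to mean Gender 0, and mapping the codes back avoids a silently flipped table.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for fakedata.xlsx (assumption: values are invented for illustration).
rng = np.random.default_rng(0)
n = 200
gender = rng.integers(0, 2, n)  # 0 = male, 1 = female
weight = np.where(gender == 0, rng.normal(200, 25, n), rng.normal(120, 20, n))
height = np.where(gender == 0, rng.normal(78, 4, n), rng.normal(62, 3, n))
df = pd.DataFrame({"Weight": weight, "Height": height, "Gender": gender})

# Same idea as the question: a random split with roughly 75% training rows.
df["is_train"] = rng.uniform(0, 1, len(df)) <= 0.75
train, test = df[df["is_train"]], df[~df["is_train"]]
features = df.columns[:2]  # Weight and Height

# Keep both outputs of factorize: integer codes and the labels they stand for.
y, labels = pd.factorize(train["Gender"])

clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], y)

# Use the predictions directly; do not index train['Gender'] with them.
codes = clf.predict(test[features])

# Map the factorize codes back to the original Gender values.
preds = np.asarray(labels)[codes]

print(pd.crosstab(test["Gender"], preds, rownames=["actual"], colnames=["preds"]))
```

Because Gender is already coded as 0/1, another option is to skip factorize and call clf.fit(train[features], train['Gender']); clf.predict then returns the original labels directly.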