Adding predictors to a random forest classifier (Pandas, Python 3, sklearn)

I am trying to adapt the code found here: http://blog.yhathq.com/posts/random-forests-in--python.html to a fake dataset.

I am trying to predict whether a person is male (0) or female (1) based on their weight and height.

The data looks like this:

  Weight     Height     Gender
  150         60          1
  250         85          0
  175         75          0
  100         62          1
  90          58          1
  200         80          0
  ...         ...         ...
  165         66          0

Now I am trying to classify the rows of the test set as male or female.

Here is the code:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
xl = pd.ExcelFile('fakedata.xlsx')
df = xl.parse()
df.head()
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
features = df.columns[:2]
clf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(train['Gender'])
clf.fit(train[features], y)

I understand what the code does up to this point, but I run into a problem here:

preds = train['Gender'][clf.predict(test[features])]
print(pd.crosstab(test['Gender'], preds, rownames=['actual'], colnames=['preds']))

which gives me the error

ValueError: cannot reindex from a duplicate axis

What exactly am I missing?

You should not index train['Gender'] by the predictions, as the line preds = train['Gender'][clf.predict(test[features])] does. Your predictions should simply be

preds = clf.predict(test[features])
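To see the fix in context, here is a minimal end-to-end sketch. It assumes a synthetic stand-in for fakedata.xlsx (the generated weights and heights are made up for illustration and are not the asker's data) and fixed random seeds so the run is repeatable. It also fits on train['Gender'] directly instead of going through pd.factorize: factorize numbers labels in order of first appearance, so its codes are not guaranteed to match the original 0/1 values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for fakedata.xlsx (values are invented for illustration).
rng = np.random.default_rng(0)
n = 200
gender = rng.integers(0, 2, n)  # 0 = male, 1 = female
weight = np.where(gender == 0, rng.normal(190, 25, n), rng.normal(130, 20, n))
height = np.where(gender == 0, rng.normal(78, 4, n), rng.normal(62, 4, n))
df = pd.DataFrame({"Weight": weight, "Height": height, "Gender": gender})

# Same style of random ~75/25 split as in the question.
df["is_train"] = rng.uniform(0, 1, len(df)) <= 0.75
train, test = df[df["is_train"]], df[~df["is_train"]]
features = df.columns[:2]  # Weight, Height

# Gender is already coded 0/1, so it can be used as the target as-is.
clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], train["Gender"])

# predict() returns one label per test row, in row order.
# Use the array directly; do not use it to index train["Gender"].
preds = clf.predict(test[features])

ct = pd.crosstab(test["Gender"], preds, rownames=["actual"], colnames=["preds"])
print(ct)
```

If you prefer to keep pd.factorize, keep its second return value as well (for example y, uniques = pd.factorize(train['Gender'])) and map the predicted codes back with uniques[clf.predict(test[features])], which mirrors what the original blog post does with its target names.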
