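I have the following data in a CSV file. The first row holds the column headers, the rows are indexed, and all values have been discretized. I need to build a decision tree classifier model. Can someone guide me?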
,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,"(16.927, 41.333]", State-gov,"(10806.885, 504990]", Bachelors,"(12, 16]", Never-married, Adm-clerical, Not-in-family, White, Male,"(0, 5000]",,"(30, 50]", United-States, <=50K
1,"(41.333, 65.667]", Self-emp-not-inc,"(10806.885, 504990]", Bachelors,"(12, 16]", Married-civ-spouse, Exec-managerial, Husband, White, Male,,,"(0, 30]", United-States, <=50K
2,"(16.927, 41.333]", Private,"(10806.885, 504990]", HS-grad,"(8, 12]", Divorced, Handlers-cleaners, Not-in-family, White, Male,,,"(30, 50]", United-States, <=50K
3,"(41.333, 65.667]", Private,"(10806.885, 504990]", 11th,"(-1, 8]", Married-civ-spouse, Handlers-cleaners, Husband, Black, Male,,,"(30, 50]", United-States, <=50K
4,"(16.927, 41.333]", Private,"(10806.885, 504990]", Bachelors,"(12, 16]", Married-civ-spouse, Prof-specialty, Wife, Black, Female,,,"(30, 50]", Cuba, <=50K
My code so far:
df, filen = decision_tree.readCSVFile("../Data/discretized.csv")
print df[:3]
newdf = decision_tree.catToInt(df)
print newdf[:3]
model = DecisionTreeClassifier(random_state=0)
print cross_val_score(model, newdf, newdf[:,14], cv=10)
The catToInt function:
def catToInt(df):
    mapper={}
    categorical_list = list(df.columns.values)
    newdf = pd.DataFrame(columns=categorical_list)
    #Converting Categorical Data
    for x in categorical_list:
        mapper[x]=preprocessing.LabelEncoder()
    for x in categorical_list:
        someinput = df.__getattr__(x)
        newcol = mapper[x].fit_transform(someinput)
        newdf[x]= newcol
    return newdf
The error:
    print cross_val_score(model, newdf, newdf[:,14], cv=10)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1787, in __getitem__
    return self._getitem_column(key)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1794, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1077, in _get_item_cache
    res = cache.get(item)
TypeError: unhashable type
So I was able to convert the categorical data to integers, but I think I am missing something in the next step.
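For reference, the traceback points at `newdf[:,14]`. Plain square brackets on a DataFrame look up column labels, so the key `(slice(None), 14)` is treated as a label and cannot be hashed. A minimal sketch of positional selection with `.iloc` follows, using a tiny made-up frame; the column names and values are illustrative, not the real data.

```python
import pandas as pd

# Tiny made-up frame standing in for the encoded data (illustrative values).
newdf = pd.DataFrame({
    "age": [0, 1, 0, 1],
    "sex": [1, 1, 0, 0],
    "class": [0, 0, 1, 1],
})

# Positional selection: all rows, last column (the class labels).
y = newdf.iloc[:, -1]

# Features are every column except the last one, so the target
# is not also handed to the model as an input.
X = newdf.iloc[:, :-1]

print(X.shape, y.shape)
```

With the real data, `newdf.iloc[:, 14]` would pick the `class` column by position, assuming the leading unnamed column has been read as the index so that `class` is the fifteenth data column.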
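Here is the solution I arrived at by following the comments above and doing some more searching. I get the expected results, but I know there must be more refined ways to do this.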
from sklearn.tree import DecisionTreeClassifier
try:
    # scikit-learn 0.18 and later
    from sklearn.model_selection import cross_val_score
except ImportError:
    # older releases
    from sklearn.cross_validation import cross_val_score
import pandas as pd
from sklearn import preprocessing

def main():
    df, _ = readCSVFile("../Data/discretized.csv")
    newdf, classl = catToInt(df)
    model = DecisionTreeClassifier()
    print(cross_val_score(model, newdf, classl, cv=10))

def readCSVFile(filepath):
    # the first CSV column holds the row index
    df = pd.read_csv(filepath, index_col=0)
    # take the text after the last backslash, then drop the file extension
    (_, _, sufix) = filepath.rpartition('\\')
    (prefix, _, _) = sufix.rpartition('.')
    print("csv read and converted to dataframe !!")
    # df['class'] = df['class'].apply(replaceLabel)
    return df, prefix

def catToInt(df):
    # replace NaN with "NA", which acts as a unique category
    df.fillna("NA", inplace=True)
    mapper = {}
    # make a list of all column headers
    categorical_list = list(df.columns.values)
    # exclude the class column
    categorical_list.remove('class')
    newdf = pd.DataFrame(columns=categorical_list)
    # convert categorical data to integer labels, one encoder per column
    for x in categorical_list:
        mapper[x] = preprocessing.LabelEncoder()
    for x in categorical_list:
        newdf[x] = mapper[x].fit_transform(df[x])
    # encode the class column separately
    le = preprocessing.LabelEncoder()
    myclass = le.fit_transform(df['class'])
    # newdf holds every column except the class column; myclass is the encoded class column
    return newdf, myclass

main()
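One possibly tidier variant of the same approach is sketched below, under assumptions: it imports `cross_val_score` from `sklearn.model_selection` and fits a fresh `LabelEncoder` per column through `DataFrame.apply`. The small in-memory frame and the `encode_features` helper are hypothetical stand-ins for the real CSV and for `catToInt`.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier


def encode_features(df, target="class"):
    """Hypothetical helper: integer-encode every column, then split off the target."""
    # Treat missing values as their own category, as in the code above.
    filled = df.fillna("NA")
    # A new LabelEncoder is fitted for each column independently.
    encoded = filled.apply(lambda col: LabelEncoder().fit_transform(col))
    return encoded.drop(columns=[target]), encoded[target]


# Made-up rows in the same shape as the discretized data (illustrative only).
df = pd.DataFrame({
    "age": ["(16.927, 41.333]", "(41.333, 65.667]"] * 10,
    "sex": [" Male", " Female", " Female", " Male"] * 5,
    "capital-gain": ["(0, 5000]", None] * 10,
    "class": [" <=50K", " >50K"] * 10,
})

X, y = encode_features(df)
model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(scores)
```

Unlike `catToInt`, this sketch does not modify the input frame in place, since `fillna` returns a copy here.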
Some links, other than the comments above, that helped me:
- http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
- http://biggyani.blogspot.com/2014/08/using-onehot-with-categorical.html
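Since both links deal with turning categories into numbers, here is a small sketch of the one-hot alternative to integer labels, using `pandas.get_dummies`. The frame is made up for illustration, and whether one-hot encoding improves the tree on this data set is not something tested here.

```python
import pandas as pd

# Made-up categorical columns (illustrative values only).
df = pd.DataFrame({
    "race": ["White", "Black", "White"],
    "sex": ["Male", "Male", "Female"],
})

# One indicator column per (column, category) pair, so no artificial
# ordering between categories is implied.
onehot = pd.get_dummies(df)

print(sorted(onehot.columns))
```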
Output:
csv read and converted to dataframe !!
[ 0.83418628 0.83930399 0.83172979 0.82804504 0.83930399 0.84254709
0.82985258 0.83022732 0.82428835 0.83678067]
This may help sklearn newcomers like me. Suggestions, edits, and better answers are welcome.