DictVectorizer problem: different numbers of features created for different inputs

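I am trying to write a machine learning algorithm that predicts whether the output is +50000 or -50000. To do this, I use a random forest classifier with 11 string features. Because the random forest classifier needs its input as floats/numbers, I use DictVectorizer to convert the string features to floats/numbers. But for different rows of the data, DictVectorizer creates a different number of features (240-260), which causes an error when the model makes predictions. A sample input row is: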

{'detailed household summary in household': ' Spouse of householder',
'tax filer stat': ' Joint both under 65',
'weeks worked in year': ' 52',
'age': '32', 
'sex': ' Female',
'marital status': ' Married-civilian spouse present',
'full or part time employment stat': ' Full-time schedules',
'detailed household and family stat': ' Spouse of householder', 
'education': ' Bachelors degree(BA AB BS)',
'num persons worked for employer': ' 3',
'major occupation code': ' Adm support including clerical'}

Is there any way to convert the input so that I can use the random forest classifier to predict the output?

Edit: the code I am using is:

import csv
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# columnNames (index -> feature name) is defined elsewhere and not shown here
X,Y=[],[]
features=[0,4,7,9,12,15,19,22,23,30,39]
with open("census_income_learn.csv","r") as fl:
    reader=csv.reader(fl)
    for row in reader:
        data={}
        for i in features:
            data[columnNames[i]]=str(row[i])
        X.append(data)
        Y.append(str(row[41]))
X_train, X_validate, Y_train, Y_validateActual = train_test_split(X, Y, test_size=0.2, random_state=32)
vec = DictVectorizer()
X_train=vec.fit_transform(X_train).toarray()
X_validate=vec.fit_transform(X_validate).toarray()
print("data ready")
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit( X_train, Y_train )
print("model created")
Y_predicted=forest.predict(X_validate)
print(Y_predicted)

So here, if I print the first element of the training set and of the validation set, I get 252 features in X_train[0] but only 249 features in X_validate[0].

Try this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Column positions of the 11 features, plus the target in column 41
cols = [0, 4, 7, 9, 12, 15, 19, 22, 23, 30, 39, 41]
names = [
    'detailed household summary in household',
    'sex',
    'full or part time employment stat',
    'age',
    'detailed household and family stat',
    'weeks worked in year',
    'num persons worked for employer',
    'major occupation code',
    'tax filer stat',
    'education',
    'marital status',
    'TARGET'
]
# Path to the CSV file; adjust it to wherever your copy lives
fn = r'D:temp.datacensus_income_learn.csv'
data = pd.read_csv(fn, header=None, usecols=cols, names=names)
# Encode every column as integer labels, see
# http://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
df = data.apply(LabelEncoder().fit_transform)
# Separate the 11 feature columns from the target column
X = df.drop(columns='TARGET')
Y = df['TARGET']
X_train, X_validate, Y_train, Y_validateActual = train_test_split(X, Y, test_size=0.2, random_state=32)
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(X_train, Y_train)
Y_predicted = forest.predict(X_validate)
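
As for why the original code gives 252 versus 249 features: `fit_transform` is called a second time on the validation set, so the vectorizer relearns its columns from the validation rows alone, and any category value missing from that subset gets no column. If you would rather keep DictVectorizer, fitting once on the training rows and calling only `transform` on the validation rows should keep the widths equal. Here is a minimal sketch of that idea. The toy rows are invented for illustration and are not taken from the census file:

```python
from sklearn.feature_extraction import DictVectorizer

# Toy rows invented for illustration; they are not taken from the census file.
train_rows = [
    {"sex": " Female", "education": " Bachelors degree(BA AB BS)"},
    {"sex": " Male", "education": " High school graduate"},
    {"sex": " Female", "education": " Masters degree(MA MS MEng MEd MSW MBA)"},
]
validate_rows = [
    {"sex": " Male", "education": " Bachelors degree(BA AB BS)"},
    # This education value never appears in train_rows, so it has no column.
    {"sex": " Female", "education": " Doctorate degree(PhD EdD)"},
]

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train_rows)    # learn the columns from the training rows
X_validate = vec.transform(validate_rows)  # reuse those columns, do not refit

# Refitting on the validation rows alone reproduces the mismatch from the question.
X_refit = DictVectorizer(sparse=False).fit_transform(validate_rows)

print(X_train.shape, X_validate.shape, X_refit.shape)
# (3, 5) (2, 5) (2, 4)
```

Values seen only in the validation rows are silently ignored by `transform`, so the second validation row above ends up with just its `sex` column set. In the original code this amounts to changing `vec.fit_transform(X_validate)` to `vec.transform(X_validate)`.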
