Can sklearn's LabelBinarizer work like DictVectorizer?

I have a dataset that contains both numeric and categorical features, where a categorical feature can hold a list of labels. For example:

RecipeId   Ingredients    TimeToPrep
1          Flour, Milk    20
2          Milk           5
3          Unobtainium    100

If I had only one ingredient per recipe, DictVectorizer would elegantly handle the encoding into the appropriate dummy variables:

from sklearn.feature_extraction import DictVectorizer
RecipeData=[{'RecipeID':1,'Ingredients':'Flour','TimeToPrep':20}, {'RecipeID':2,'Ingredients':'Milk','TimeToPrep':5}
,{'RecipeID':3,'Ingredients':'Unobtainium','TimeToPrep':100}]
dc=DictVectorizer()
dc.fit_transform(RecipeData).toarray()

Output:

array([[   1.,    0.,    0.,    1.,   20.],
[   0.,    1.,    0.,    2.,    5.],
[   0.,    0.,    1.,    3.,  100.]])

The integer features are handled correctly, and the categorical labels are encoded as boolean features.

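However, DictVectorizer cannot handle list-valued features and chokes on

RecipeData=[{'RecipeID':1,'Ingredients':['Flour','Milk'],'TimeToPrep':20}, {'RecipeID':2,'Ingredients':'Milk','TimeToPrep':5}, {'RecipeID':3,'Ingredients':'Unobtainium','TimeToPrep':100}]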

LabelBinarizer handles this correctly, but the categorical variable has to be extracted and processed separately:

from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
lb.fit_transform([('Flour','Milk'), ('Milk',), ('Unobtainium',)])
array([[1, 1, 0],
[0, 1, 0],
[0, 0, 1]])

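This is what I currently do: extract the categorical features that contain lists of labels from the mixed numeric/categorical input array, transform them with LabelBinarizer, and then glue the numeric features back on.

Is there a more elegant way to do this?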

LabelBinarizer is intended for class labels, not features (although with the right massaging it will handle categorical features as well).
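For reference, the manual route described in the question can be sketched as below. This is only an illustration under one assumption not in the original post: it uses MultiLabelBinarizer, the class that newer scikit-learn releases provide for samples carrying several labels, instead of LabelBinarizer. The variable names are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Categorical part: one tuple of ingredient labels per recipe.
ingredients = [('Flour', 'Milk'), ('Milk',), ('Unobtainium',)]
# Numeric part: TimeToPrep as a column vector.
time_to_prep = np.array([[20], [5], [100]])

# Assumption: MultiLabelBinarizer stands in for the LabelBinarizer call
# shown in the question.
mlb = MultiLabelBinarizer()
indicators = mlb.fit_transform(ingredients)

# Glue the numeric column back onto the indicator columns.
features = np.hstack([indicators, time_to_prep])
print(mlb.classes_)
print(features)
```

The indicator columns follow the order of `mlb.classes_`, so the column names have to be tracked by hand. That bookkeeping is what the DictVectorizer approach avoids.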

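DictVectorizer is meant to be used by mapping a data-specific function over your samples to extract useful features, where that function returns a dict. So the elegant way to solve this is to write a function that flattens your feature dicts and replaces each list with individual features whose value is True: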

>>> def flatten_ingredients(d):
...     # in-place version
...     if isinstance(d.get('Ingredients'), list):
...         for ingredient in d.pop('Ingredients'):
...             d['Ingredients=%s' % ingredient] = True
...     return d
... 
>>> RecipeData=[{'RecipeID':1,'Ingredients':['Flour','Milk'],'TimeToPrep':20}, {'RecipeID':2,'Ingredients':'Milk','TimeToPrep':5} ,{'RecipeID':3,'Ingredients':'Unobtainium','TimeToPrep':100}]
>>> list(map(flatten_ingredients, RecipeData))  # list() is needed on Python 3, where map is lazy
[{'RecipeID': 1, 'TimeToPrep': 20, 'Ingredients=Flour': True, 'Ingredients=Milk': True}, {'RecipeID': 2, 'Ingredients': 'Milk', 'TimeToPrep': 5}, {'RecipeID': 3, 'Ingredients': 'Unobtainium', 'TimeToPrep': 100}]

In action:

>>> from sklearn.feature_extraction import DictVectorizer
>>> dv = DictVectorizer()
>>> dv.fit_transform(flatten_ingredients(d) for d in RecipeData).toarray()
array([[   1.,    1.,    0.,    1.,   20.],
[   0.,    1.,    0.,    2.,    5.],
[   0.,    0.,    1.,    3.,  100.]])
>>> dv.feature_names_
['Ingredients=Flour', 'Ingredients=Milk', 'Ingredients=Unobtainium', 'RecipeID', 'TimeToPrep']
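As an aside, newer scikit-learn releases document that DictVectorizer iterates over a feature value that is a sequence or set of strings, so the flattening step may not be needed there. A minimal sketch, assuming an installed release with that behavior (older versions choke on the list value, as the question describes):

```python
from sklearn.feature_extraction import DictVectorizer

recipes = [
    {'RecipeID': 1, 'Ingredients': ['Flour', 'Milk'], 'TimeToPrep': 20},
    {'RecipeID': 2, 'Ingredients': 'Milk', 'TimeToPrep': 5},
    {'RecipeID': 3, 'Ingredients': 'Unobtainium', 'TimeToPrep': 100},
]

# Assumption: this scikit-learn version accepts list-valued string features.
dv_native = DictVectorizer()
X = dv_native.fit_transform(recipes).toarray()
print(dv_native.feature_names_)
print(X)
```

If the installed version rejects the list value, the flattening function above remains the portable option.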

(If I were you, I'd also remove RecipeID, since it's unlikely to be a useful feature and it can easily cause overfitting.)
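A variant that leaves the input dicts untouched and drops RecipeID in the same pass could look like the sketch below. The name `flatten_features` and its `list_key` and `drop` parameters are hypothetical, introduced only for this illustration.

```python
def flatten_features(d, list_key='Ingredients', drop=('RecipeID',)):
    """Return a new flattened dict; the input dict is not modified."""
    out = {}
    for key, value in d.items():
        if key in drop:
            continue  # skip identifier-like fields such as RecipeID
        if key == list_key and isinstance(value, (list, tuple)):
            # Emit one boolean feature per element of the list.
            for item in value:
                out['%s=%s' % (key, item)] = True
        else:
            out[key] = value
    return out


recipe = {'RecipeID': 1, 'Ingredients': ['Flour', 'Milk'], 'TimeToPrep': 20}
flat = flatten_features(recipe)
print(flat)
```

The resulting dicts can be passed to `DictVectorizer.fit_transform` just like the output of `flatten_ingredients`, for example as a generator expression over `RecipeData`.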
