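I am trying to use the CRX dataset from the UCI Machine Learning Repository. This particular dataset contains some features that are not continuous variables, so I need to convert them into numerical values before they can be passed to an SVM.
I initially looked into using a one-hot encoder, which takes integer values and converts them into a matrix (e.g. if a feature has three possible values, "red", "blue" and "green", it is converted into three binary features: 1,0,0 for "red", 0,1,0 for "blue" and 0,0,1 for "green"). This would be ideal for my needs, except for the fact that it can only deal with integer features.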
import csv

import numpy as np
from sklearn.feature_extraction import FeatureHasher


def get_crx_data(debug=False):
    # Read every row of the file as a list of string fields.
    with open("/Volumes/LocalDataHD/jt306/crx.data", "rU") as infile:
        features_array = []
        reader = csv.reader(infile, dialect=csv.excel_tab)
        for row in reader:
            features_array.append(str(row).translate(None, "[]'").split(","))
    features_array = np.array(features_array)
    print features_array.shape
    print features_array[0]
    # The last column holds the class label; the first 15 are features.
    labels_array = features_array[:, 15]
    features_array = features_array[:, :15]
    print features_array.shape
    print labels_array.shape
    print("FeatureHasher on frequency dicts")
    hasher = FeatureHasher(n_features=44)
    X = hasher.fit_transform(line for line in features_array)
    print X.shape


get_crx_data()
This returns:
Reading CRX data from disk
Traceback (most recent call last):
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 38, in <module>
    get_crx_data()
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 32, in get_crx_data
    X = hasher.fit_transform(line for line in features_array)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 426, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 129, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "_hashing.pyx", line 44, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1649)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 125, in <genexpr>
    raw_X = (_iteritems(d) for d in raw_X)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems
    return d.iteritems() if hasattr(d, "iteritems") else d.items()
AttributeError: 'numpy.ndarray' object has no attribute 'items'
(690, 16)
['0' ' 30.83' ' 0' ' u' ' g' ' w' ' v' ' 1.25' ' 1' ' 1' ' 1' ' 0' ' g'
' 202' ' 0' ' +']
(690, 15)
(690,)
FeatureHasher on frequency dicts
Process finished with exit code 1
How can I use feature hashing (or an alternative method) to convert this data from classes (some of which are strings, others are discrete numerical values) into data which can be handled by an SVM? I have also looked into using one-hot coding, but that only takes integers as input.
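The problem is that the FeatureHasher object expects each row of input to have a particular structure, or really one of three different possible structures. The first possibility is a dictionary of feature_name:value pairs. The second is a list of (feature_name, value) tuples. The third is a flat list of feature_names. In the first two cases, the feature names are mapped to columns in the matrix, and the given values are stored at those columns for each row. In the last case, the presence or absence of a feature in the list is implicitly understood to be a True or False value. Here are some simple, concrete examples: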
>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
...                                                   non_negative=True,
...                                                   input_type='dict')
>>> X_new = hasher.fit_transform([{'a':1, 'b':2}, {'a':0, 'c':5}])
>>> X_new.toarray()
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])
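This illustrates the default mode, which is what FeatureHasher expects if you don't pass input_type, as in your original code. As you can see, the expected input is a list of dictionaries, one for each input sample or row of data. Each dictionary contains an arbitrary number of feature names, mapped to the values for that row.
The output, X_new, contains a sparse representation of the array; calling toarray() returns the data as an ordinary numpy array.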
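As a rough illustration of how feature names end up in columns, the toy sketch below hashes each name to a column index and adds the value there. It is not how FeatureHasher works internally: toy_hash_row is a hypothetical helper, and it uses MD5 from hashlib purely so the result is reproducible, whereas scikit-learn uses a different hash function (MurmurHash3). The column positions will therefore not match the output above.

```python
import hashlib


def toy_hash_row(feature_dict, n_features):
    """Hash one {name: value} dict into a dense list of length n_features."""
    row = [0.0] * n_features
    for name, value in feature_dict.items():
        # Derive a stable column index from the feature name.
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        column = int(digest, 16) % n_features
        # Names that collide on the same column simply add up.
        row[column] += value
    return row


rows = [toy_hash_row(d, 10) for d in [{'a': 1, 'b': 2}, {'a': 0, 'c': 5}]]
```

The same name always lands in the same column, which is why no vocabulary has to be stored; the price is that two different names can share a column when n_features is small.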
If you want to pass pairs of tuples instead, pass input_type='pair'. Then you can do this:
>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
...                                                   non_negative=True,
...                                                   input_type='pair')
>>> X_new = hasher.fit_transform([[('a', 1), ('b', 2)], [('a', 0), ('c', 5)]])
>>> X_new.toarray()
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])
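The pair form carries the same information as the dict form, just flattened. A minimal sketch of going from one to the other, using the same toy rows as above (dict_rows and pair_rows are names chosen here for illustration):

```python
# The dict-style rows from the first example.
dict_rows = [{'a': 1, 'b': 2}, {'a': 0, 'c': 5}]

# Each dict becomes a list of (feature_name, value) tuples.
# sorted() only fixes the order for readability; hashing does not need it.
pair_rows = [sorted(d.items()) for d in dict_rows]
```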
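Finally, if you only have boolean values, you don't have to pass values explicitly at all: FeatureHasher will simply assume that if a feature name is present, its value is True (represented here as the floating point value 1.0).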
>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
...                                                   non_negative=True,
...                                                   input_type='string')
>>> X_new = hasher.fit_transform([['a', 'b'], ['a', 'c']])
>>> X_new.toarray()
array([[ 1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]])
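If your boolean data starts out as explicit True/False flags per feature, one way to get the flat name lists is to keep only the names whose flag is set. This is a small sketch with made-up flag rows (flag_rows is hypothetical data, not part of your dataset):

```python
# Hypothetical rows of boolean flags, keyed by feature name.
flag_rows = [{'a': True, 'b': True, 'c': False},
             {'a': True, 'b': False, 'c': True}]

# Keep only the names whose flag is set; absence then stands for False.
name_rows = [[name for name, present in sorted(row.items()) if present]
             for row in flag_rows]
```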
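Unfortunately, your data doesn't seem to be consistently in any one of these formats. However, it shouldn't be too hard to modify what you have to fit the 'dict' or 'pair' format. Let me know if you need help with that; in that case, please say more about the format of the data you're trying to convert.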
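As a starting point, here is one possible way to turn a row like the one you printed into the 'dict' format. It is only a sketch under some assumptions: row_to_feature_dict and the f0, f1, ... column names are invented here, and a field is treated as numeric whenever it parses as a float, which may be wrong for columns whose digits are really category codes. Non-numeric fields become one indicator feature per (column, value) combination, which has much the same effect as one-hot encoding.

```python
def row_to_feature_dict(row):
    """Turn one row of string fields into a {feature_name: value} dict."""
    features = {}
    for index, field in enumerate(row):
        field = field.strip()  # the parsed fields carry leading spaces
        try:
            # Assumption: anything that parses as a float is numeric.
            features["f%d" % index] = float(field)
        except ValueError:
            # Categorical field: one indicator per (column, value) pair.
            features["f%d=%s" % (index, field)] = 1.0
    return features


# The first row from the question, without the label column.
sample_row = ['0', ' 30.83', ' 0', ' u', ' g', ' w', ' v', ' 1.25',
              ' 1', ' 1', ' 1', ' 0', ' g', ' 202', ' 0']
feature_dict = row_to_feature_dict(sample_row)
```

Applying this to every row gives a list of dictionaries, which is the shape FeatureHasher expects in its default 'dict' mode. As far as I know, recent scikit-learn releases no longer accept the non_negative argument used in the examples above, and alternate_sign=False is the closest replacement there.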