scikit-learn: identifying the corresponding feature id values when using SelectKBest

I am using scikit-learn (version 0.11, Python version 2.7.3) to select the top K features from a binary classification dataset in svmlight format.

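I am trying to identify the feature id values of the selected features. I assumed this would be very simple, and it probably is! (By feature id, I mean the number that precedes each feature value in the file, e.g. the 4 in 4:1.000000.)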

The code below shows how I tried to do this:

from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2  # chi2 is used as score_func below

svmlight_format_train_file = 'contrived_svmlight_train_file.txt'  # the contents of this file are shown below
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file)
featureSelector = SelectKBest(score_func=chi2, k=2)
featureSelector.fit(X_train_data, Y_train_data)
# indices=False would just return a boolean mask (True, False, ...)
assumed_to_be_the_feature_ids_of_the_top_k_features = list(featureSelector.get_support(indices=True))
print assumed_to_be_the_feature_ids_of_the_top_k_features  # this gives: [0, 2]

Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature id values, since (see below) the feature id values in my input file start from 1.

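Now, I suspect that assumed_to_be_the_feature_ids_of_the_top_k_features may actually hold list indices into the feature id values sorted in increasing order. In my case, index 0 would correspond to feature-id=1 and so on, so the code would be telling me that feature-id=1 and feature-id=3 were selected.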

I would be grateful if someone could confirm or refute this.

Thanks in advance.

Contents of contrived_svmlight_train_file.txt:

1 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA
1 1:1.000000 2:1.000000#mB
0 5:1.000000#mC
1 1:1.000000 2:1.000000#mD
0 3:1.000000 4:1.000000#mE
0 3:1.000000#mF
0 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG
0 2:1.000000#mH

P.S. Apologies for the poor formatting (first time here); I hope this is clear and understandable!

"Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature id values, since (see below) the feature id values in my input file start from 1."

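Actually, they do, just shifted by one. The SVMlight format loader detects that your input file uses one-based indices and subtracts one from every index so as not to waste a column, so returned index i means feature id i + 1. If that's not what you want, pass zero_based=True to load_svmlight_file to make it treat the file as zero-based and keep the extra (empty) column 0; see its documentation for details.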
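
To see the difference between the two loading modes, here is a minimal sketch, assuming a recent scikit-learn and Python 3 (not the 0.11 / Python 2.7.3 setup from the question). The data is the file from the question with the trailing #m... comments dropped, fed in through an in-memory buffer rather than a file on disk; the variable names are made up for illustration.

```python
import io

from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2

# The question's data, minus the trailing "#mA" style comments.
SVMLIGHT_DATA = b"""\
1 1:1.000000 2:1.000000 4:1.000000 6:1.000000
1 1:1.000000 2:1.000000
0 5:1.000000
1 1:1.000000 2:1.000000
0 3:1.000000 4:1.000000
0 3:1.000000
0 2:1.000000 4:1.000000 5:1.000000 6:1.000000
0 2:1.000000
"""

# Default (zero_based="auto"): the loader sees that no index is 0,
# treats the file as one-based and shifts every index down by one.
X_auto, y = load_svmlight_file(io.BytesIO(SVMLIGHT_DATA))
print(X_auto.shape)  # (8, 6): column j holds feature id j + 1

# zero_based=True: indices are taken literally, so column 0 stays empty
# and column j holds feature id j.
X_zero, _ = load_svmlight_file(io.BytesIO(SVMLIGHT_DATA), zero_based=True)
print(X_zero.shape)  # (8, 7)

# With the default loading, add 1 to the selected column indices
# to recover the feature ids used in the file.
selector = SelectKBest(score_func=chi2, k=2).fit(X_auto, y)
selected_idx = selector.get_support(indices=True)
feature_ids = selected_idx + 1
# Feature ids 3 and 5 have identical chi2 scores on this data, so which
# of them is picked as the second feature may depend on the version.
print(selected_idx, feature_ids)
```

If you load with zero_based=True instead, the indices from get_support(indices=True) are the feature ids themselves, at the cost of one all-zero column.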
