ValueError when using a linear SVM in scikit-learn (Python)



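I am currently working on large-scale hierarchical text classification of ODP documents. The dataset provided to me is in libSVM format. I am trying to run scikit-learn's linear-kernel SVM in Python to build a model. Here is sample data from the training set: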

29 9454:1 11742:1 18884:14 26840:1 35147:1 52782:1 72083:1 73244:1 78945:1 79913:1 79986:1 86710:3 117286:1 139820:1 142458:1 146315:1 151005:2 161454:3 172237:1 1091130:1 1113562:1 1133451:1 1139046:1 1157534:1 1180618:2 1182024:1 1187711:1 1194345:3 
33 2474:1 8152:1 19529:2 35038:1 48104:1 59738:1 61854:3 67943:1 74093:1 78945:1 88558:1 90848:1 97087:1 113284:16 118917:1 122375:1 124939:1 

Below is the code I use to build the linear SVM model:

from sklearn.datasets import load_svmlight_file
from sklearn import svm
X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
X_test, y_test = load_svmlight_file("/path-to-file/test.txt")
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
print clf.score(X_test,y_test)

When I run clf.score(), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b285fbfb3efe> in <module>()
      1 start_time = time.time()
----> 2 print clf.score(X_test,y_test)
      3 print time.time() - start_time, "seconds"
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
    292         """
    293         from .metrics import accuracy_score
--> 294         return accuracy_score(y, self.predict(X))
    295 
    296 
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    464             Class labels for samples in X.
    465         """
--> 466         y = super(BaseSVC, self).predict(X)
    467         return self.classes_.take(y.astype(np.int))
    468 
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    280         y_pred : array, shape (n_samples,)
    281         """
--> 282         X = self._validate_for_predict(X)
    283         predict = self._sparse_predict if self._sparse else self._dense_predict
    284         return predict(X)
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_for_predict(self, X)
    402             raise ValueError("X.shape[1] = %d should be equal to %d, "
    403                              "the number of features at training time" %
--> 404                              (n_features, self.shape_fit_[1]))
    405         return X
    406 
ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

Can someone tell me what exactly is wrong with this code or with my data? Thanks in advance.

The values of X_train, y_train, X_test and y_test are attached below:

X_train:

  (0, 9453)         1.0
  (0, 11741)    1.0
  (0, 18883)    14.0
  (0, 26839)    1.0
  (0, 35146)    1.0
  (0, 52781)    1.0
  (0, 72082)    1.0
  (0, 73243)    1.0
  (0, 78944)    1.0
  (0, 79912)    1.0
  (0, 79985)    1.0
  (0, 86709)    3.0
  (0, 117285)   1.0
  (0, 139819)   1.0
  (0, 142457)   1.0
  (0, 146314)   1.0
  (0, 151004)   2.0
  (0, 161453)   3.0
  (0, 172236)   1.0
  (0, 187531)   2.0
  (0, 202462)   1.0
  (0, 210417)   1.0
  (0, 250581)   1.0
  (0, 251689)   1.0
  (0, 296384)   2.0
  : :
  (4462, 735469)    1.0
  (4462, 737059)    15.0
  (4462, 740127)    1.0
  (4462, 743798)    1.0
  (4462, 766063)    1.0
  (4462, 778958)    2.0
  (4462, 784004)    4.0
  (4462, 837264)    2.0
  (4462, 839095)    22.0
  (4462, 844735)    6.0
  (4462, 859721)    2.0
  (4462, 875267)    1.0
  (4462, 910761)    1.0
  (4462, 931244)    1.0
  (4462, 945069)    6.0
  (4462, 948728)    1.0
  (4462, 948850)    2.0
  (4462, 957682)    1.0
  (4462, 975170)    1.0
  (4462, 989192)    1.0
  (4462, 1014294)   1.0
  (4462, 1042424)   1.0
  (4462, 1049027)   1.0
  (4462, 1072931)   1.0
  (4462, 1145790)   1.0

y_train:

[  2.90000000e+01   3.30000000e+01   3.30000000e+01 ...,   1.65475000e+05
   1.65518000e+05   1.65518000e+05]

X_test:

  (0, 18573)    1.0
  (0, 23501)    1.0
  (0, 29954)    1.0
  (0, 42112)    1.0
  (0, 46402)    1.0
  (0, 63041)    2.0
  (0, 67942)    2.0
  (0, 83522)    1.0
  (0, 88413)    2.0
  (0, 99454)    1.0
  (0, 126041)   1.0
  (0, 139819)   1.0
  (0, 142678)   1.0
  (0, 151004)   1.0
  (0, 166351)   2.0
  (0, 173794)   1.0
  (0, 192162)   3.0
  (0, 210417)   2.0
  (0, 254468)   1.0
  (0, 263895)   2.0
  (0, 277567)   1.0
  (0, 278419)   2.0
  (0, 279181)   2.0
  (0, 281319)   2.0
  (0, 298898)   1.0
  : :
  (1857, 1100504)   3.0
  (1857, 1103247)   1.0
  (1857, 1105578)   1.0
  (1857, 1108986)   2.0
  (1857, 1118486)   1.0
  (1857, 1120807)   9.0
  (1857, 1129243)   2.0
  (1857, 1131786)   1.0
  (1857, 1134029)   2.0
  (1857, 1134410)   5.0
  (1857, 1134494)   1.0
  (1857, 1139045)   25.0
  (1857, 1142239)   3.0
  (1857, 1142651)   1.0
  (1857, 1144787)   1.0
  (1857, 1151891)   1.0
  (1857, 1152094)   1.0
  (1857, 1157533)   1.0
  (1857, 1159376)   1.0
  (1857, 1178944)   1.0
  (1857, 1181310)   2.0
  (1857, 1182023)   1.0
  (1857, 1187098)   1.0
  (1857, 1194344)   2.0
  (1857, 1195819)   9.0

y_test:

[  2.90000000e+01   3.30000000e+01   1.56000000e+02 ...,   1.65434000e+05
   1.65475000e+05   1.65518000e+05]

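The error message

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

explains itself: the number of features in the test data differs from the number in the training data used to fit the model. That is, X_train.shape[1] is not equal to X_test.shape[1].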

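You should check why they differ, because they are supposed to be equal.

One possibility is that both are loaded as sparse matrices and the number of features is inferred by load_svmlight_file. If the test data contains features that never appear in the training data, the resulting X_test can have more columns. To avoid this, specify the number of features by passing the n_features argument to load_svmlight_file.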
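To make the inference step concrete, here is a minimal pure-Python sketch of how a feature count can be derived from svmlight-formatted lines. The helper infer_n_features and the toy lines are hypothetical and are not scikit-learn's actual implementation; the sketch assumes 1-based feature indices, as in the sample data from the question.

```python
def infer_n_features(lines):
    """Return the largest 1-based feature index found in svmlight-style lines.

    Hypothetical helper for illustration only; it is not part of scikit-learn.
    """
    max_index = 0
    for line in lines:
        tokens = line.split()
        # tokens[0] is the label; the remaining tokens are "index:value" pairs.
        for token in tokens[1:]:
            index = int(token.split(":", 1)[0])
            max_index = max(max_index, index)
    return max_index


# Made-up toy data: the test line uses feature 9, which never occurs in training.
train_lines = ["29 3:1 7:2", "33 1:1 5:4"]
test_lines = ["29 2:1 9:1"]

n_train = infer_n_features(train_lines)
n_test = infer_n_features(test_lines)
print(n_train, n_test)
```

Because each file is scanned on its own, the two counts differ whenever the highest feature index in one file is missing from the other, which matches the kind of mismatch reported in the error.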

You can use the n_features option:

X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
X_test, y_test = load_svmlight_file("/path-to-file/test.txt", n_features=X_train.shape[1])

This error can also be resolved by using load_svmlight_files:
from sklearn.datasets import load_svmlight_files
X_train, y_train, X_test, y_test = load_svmlight_files(['/path-to-file/train.txt', '/path-to-file/test.txt'])
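Both approaches can be checked on a small in-memory dataset. The sketch below uses made-up toy data held in BytesIO objects instead of the files from the question. One caveat, as far as I can tell from scikit-learn's behavior: load_svmlight_file raises a ValueError when n_features is smaller than the largest feature index in the file, so when the test file contains the higher indices (as in the question), the chosen width has to cover both files.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file, load_svmlight_files

# Made-up toy data: the test set uses feature 9, which never occurs in training.
TRAIN = b"29 3:1 7:2\n33 1:1 5:4\n"
TEST = b"29 2:1 9:1\n"

# Loaded separately, each file gets its own inferred number of columns.
X_train_sep, _ = load_svmlight_file(BytesIO(TRAIN))
X_test_sep, _ = load_svmlight_file(BytesIO(TEST))
print(X_train_sep.shape[1], X_test_sep.shape[1])  # the widths differ

# Loaded together, both matrices share a single width.
X_train, y_train, X_test, y_test = load_svmlight_files(
    [BytesIO(TRAIN), BytesIO(TEST)]
)
print(X_train.shape[1], X_test.shape[1])

# Alternatively, pass one explicit width that covers both files.
N_FEATURES = X_train.shape[1]
X_train_fix, _ = load_svmlight_file(BytesIO(TRAIN), n_features=N_FEATURES)
X_test_fix, _ = load_svmlight_file(BytesIO(TEST), n_features=N_FEATURES)
print(X_train_fix.shape, X_test_fix.shape)
```

With either fix, a model fitted on the training matrix sees the same number of columns at prediction time. Columns that are empty in the training data carry no information for the model, but they keep the shapes consistent.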

The predict() function expects its input as a 2D array, but X_train.data[4] is a 1D array. You can simply wrap it in list brackets (e.g. [X_train.data[4]]) to turn the 1D array into a 2D one:

print(clf.predict([X_train.data[4]]))

Found the problem!

# -*- coding:utf-8 -*-
  1. The file should be UTF-8 encoded.
  2. The DataFrame object needs to be reshaped, e.g. X_train.values[4].reshape(1, -1)
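The reshaping step can be illustrated with plain NumPy. This is only a sketch with a made-up three-feature row standing in for X_train.values[4]; it shows that reshape(1, -1) and wrapping the row in a list both produce the (1, n_features) shape that predict() expects for a single sample.

```python
import numpy as np

# Made-up sample row with three features (stand-in for X_train.values[4]).
row = np.array([1.0, 0.0, 3.0])

# reshape(1, -1): one row, with the column count inferred from the data.
as_2d = row.reshape(1, -1)

# Wrapping the row in a list gives the same 2D shape.
wrapped = np.array([row])

print(row.shape, as_2d.shape, wrapped.shape)
```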

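In my case, this was solved by deleting the model that had already been created. It can happen if the --fixed_model_name option is used during training. Suppose the training data or the data format changes (in my case both did: the data AND md to json) ==> the model is created without any problem, but when we post a query, rasa fails with this message.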
