Discrepancy between WEKA instance predictions and confusion matrix results

I am not new to data mining, so I am completely stumped by these WEKA results. Hoping for some help. Thanks in advance!

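I have a data set of numeric vectors with a binary classification (S, H). I trained a NaiveBayes model (although the method really doesn't matter) using leave-one-out cross-validation. The results are below: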

=== Predictions on test data ===
inst#     actual  predicted error distribution
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        2:S   +   0,*1
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *1,0
1        1:H        1:H       *0.997,0.003
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1 
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1 
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        1:H   +   *1,0
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        2:S       0,*1
1        2:S        1:H   +   *1,0
=== Stratified cross-validation ===
=== Summary ===
Total Number of Instances               66
=== Confusion Matrix ===
a  b   <-- classified as
14  1 |  a = H
2 49 |  b = S

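As you can see, there are three errors in both the output and the confusion matrix. I then re-evaluated the model on an independent data set with the same attributes and the same two classes. Here are the results: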

=== Re-evaluation on test set ===
User supplied test set
Relation:     FCBC_New.TagProt
Instances:     unknown (yet). Reading incrementally
Attributes:   355
=== Predictions on user test set ===
inst#     actual  predicted error distribution
1        1:S        2:H   +   0,*1
2        1:S        1:S       *1,0
3        1:S        2:H   +   0,*1
4        2:H        1:S   +   *1,0
5        2:H        2:H       0,*1
6        1:S        2:H   +   0,*1
7        1:S        2:H   +   0,*1
8        2:H        2:H       0,*1
9        1:S        1:S       *1,0
10        1:S        2:H   +   0,*1
11        1:S        2:H   +   0,*1
12        2:H        1:S   +   *1,0
13        2:H        2:H       0,*1
14        1:S        2:H   +   0,*1
15        1:S        2:H   +   0,*1
16        1:S        2:H   +   0,*1
17        2:H        2:H       0,*1
18        2:H        2:H       0,*1
19        1:S        2:H   +   0,*1
20        1:S        2:H   +   0,*1
21        1:S        2:H   +   0,*1
22        1:S        1:S       *1,0
23        1:S        2:H   +   0,*1
24        1:S        2:H   +   0,*1
25        2:H        1:S   +   *1,0
26        1:S        2:H   +   0,*1
27        1:S        1:S       *1,0
28        1:S        2:H   +   0,*1
29        1:S        2:H   +   0,*1
30        1:S        2:H   +   0,*1
31        1:S        2:H   +   0,*1
32        1:S        2:H   +   0,*1
33        1:S        2:H   +   0,*1
34        1:S        1:S       *1,0
35        2:H        1:S   +   *1,0
36        1:S        2:H   +   0,*1
37        1:S        1:S       *1,0
38        1:S        1:S       *1,0
39        2:H        1:S   +   *1,0
40        1:S        2:H   +   0,*1
41        1:S        2:H   +   0,*1
42        1:S        2:H   +   0,*1
43        1:S        2:H   +   0,*1
44        1:S        2:H   +   0,*1
45        1:S        2:H   +   0,*1
46        1:S        2:H   +   0,*1
47        2:H        1:S   +   *1,0
48        1:S        2:H   +   0,*1
49        2:H        1:S   +   *1,0
50        2:H        1:S   +   *1,0
51        1:S        2:H   +   0,*1
52        1:S        2:H   +   0,*1
53        2:H        1:S   +   *1,0
54        1:S        2:H   +   0,*1
55        1:S        2:H   +   0,*1
56        1:S        2:H   +   0,*1
=== Summary ===
Correctly Classified Instances          44               78.5714 %
Incorrectly Classified Instances        12               21.4286 %
Kappa statistic                          0.4545
Mean absolute error                      0.2143
Root mean squared error                  0.4629
Coverage of cases (0.95 level)          78.5714 %
Total Number of Instances               56
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
0.643    0.167    0.563      0.643    0.600      0.456    0.828     0.566     H
0.833    0.357    0.875      0.833    0.854      0.456    0.804     0.891     S
Weighted Avg.    0.786    0.310    0.797      0.786    0.790      0.456    0.810     0.810
=== Confusion Matrix ===
a  b   <-- classified as
9  5 |  a = H
7 35 |  b = S

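This is where my problem lies. The output clearly shows many errors; in fact, there are 44. The confusion matrix and the result summary, on the other hand, report 12 errors. If the predicted classes were reversed, the confusion matrix would be correct.

Looking at the score distributions, I see that in the cross-validation results the value before the comma is for class H and the second value is for class S (so 1,0 means an H prediction). In the test results these are reversed, and 1,0 means an S prediction. So if I go by the score distributions, the confusion matrix is correct. If I go by the predicted labels (H or S), the confusion matrix is wrong.

I tried changing the class of every instance in the test file to H, or to S. This does not change the output or the confusion matrix totals: in the confusion matrix, 16 instances are always predicted as a (H) and 40 as b (S), even though the plain-text output actually shows 16 as b (S) and 40 as a (H). What is the problem? It must be something simple, but I am completely at a loss...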
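The arithmetic behind that observation can be checked with a small sketch. The four counts below were tallied by hand from the test-set listing above. The two index-to-label mappings are the ones visible in the two listings (1:H, 2:S in cross-validation; 1:S, 2:H on the test set). That the summary decodes predicted indices with the training order is an assumption being tested here, not something the output states.

```python
from collections import Counter

# Counts tallied by hand from the "Predictions on user test set" listing,
# keyed by (actual label, predicted index as printed).
listing_counts = {
    ("S", 1): 7,   # printed as actual 1:S, predicted 1:S
    ("S", 2): 35,  # printed as actual 1:S, predicted 2:H
    ("H", 1): 9,   # printed as actual 2:H, predicted 1:S
    ("H", 2): 5,   # printed as actual 2:H, predicted 2:H
}

# Index-to-label order as it appears in each listing.
TRAIN_ORDER = {1: "H", 2: "S"}  # cross-validation output
TEST_ORDER = {1: "S", 2: "H"}   # test-set output


def confusion(counts, order):
    """Build {(actual, predicted): n}, decoding predicted indices with `order`."""
    matrix = Counter()
    for (actual, pred_index), n in counts.items():
        matrix[(actual, order[pred_index])] += n
    return matrix


def errors(matrix):
    """Number of instances whose decoded prediction differs from the actual label."""
    return sum(n for (actual, pred), n in matrix.items() if actual != pred)


as_printed = confusion(listing_counts, TEST_ORDER)
as_trained = confusion(listing_counts, TRAIN_ORDER)

print("errors, test-file label order:", errors(as_printed))
print("errors, training label order: ", errors(as_trained))
for actual in ("H", "S"):
    # Row of the confusion matrix: [classified as H, classified as S]
    print(actual, [as_trained[(actual, p)] for p in ("H", "S")])
```

Decoding with the test-file order gives the 44 errors seen in the listing, while decoding with the training order gives the 12 errors and the 9/5, 7/35 rows of the reported confusion matrix. That is consistent with the class labels being declared in a different order in the two files, though it does not prove that this is the cause.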

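It would be worth taking a look at this WEKA tutorial on classifying instances: http://preciselyconcise.com/apis_and_installations/training_a_weka_classifier_in_java.php. The tutorial also deals with binary classification (positive, negative). Hope it helps.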
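Since the question notes that the listings show 1:H, 2:S for training and 1:S, 2:H for the test set, one quick thing to rule out is a different label order in the class attribute declarations of the two files. The sketch below assumes both files are ARFF with an unquoted nominal attribute; the attribute name `class` and the two header strings are hypothetical, because the real files are not shown.

```python
import re

# Matches a nominal attribute declaration such as: @attribute class {H,S}
# Simplification: handles unquoted attribute names and values only.
NOMINAL = re.compile(
    r"^\s*@attribute\s+(\S+)\s+\{([^}]*)\}",
    re.IGNORECASE | re.MULTILINE,
)


def nominal_labels(arff_text, attribute):
    """Return the declared label order for `attribute`, or None if absent."""
    for name, values in NOMINAL.findall(arff_text):
        if name == attribute:
            return [v.strip() for v in values.split(",")]
    return None


# Hypothetical headers standing in for the real training and test files.
train_header = "@relation train\n@attribute class {H,S}\n@data\n"
test_header = "@relation test\n@attribute class {S,H}\n@data\n"

train_labels = nominal_labels(train_header, "class")
test_labels = nominal_labels(test_header, "class")
print(train_labels, test_labels, "same order:", train_labels == test_labels)
```

If the two orders differ in the real files, declaring the labels in the same order in both headers is a simple thing to try before re-evaluating.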
