ValueError: inconsistent numbers of samples with MultinomialNB

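>I need to create a model that accurately classifies records based on a variable. For example, if a record has predictor A or B, I want it to be classified as having predicted value X. The actual data takes the following form: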

Predicted    Predictor
X            A
X            B
Y            D
X            A

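For my solution, I do the following:
1. Use LabelEncoder to create numeric values for the Predicted column.
2. The predictor variable has multiple categories, which I split into separate columns using get_dummies.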
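The two preprocessing steps described above can be sketched roughly as follows. This is only an illustration on the four sample rows from the question; the `raw` dataframe and the variable names are assumptions, not the asker's actual code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data mirroring the four rows shown in the question
# (hypothetical stand-in for the real dataset)
raw = pd.DataFrame({
    "Predicted": ["X", "X", "Y", "X"],
    "Predictor": ["A", "B", "D", "A"],
})

# Step 1: encode the target labels as integers
encoder = LabelEncoder()
y = encoder.fit_transform(raw["Predicted"])

# Step 2: one-hot encode the categorical predictor into separate columns
X = pd.get_dummies(raw["Predictor"], prefix="Predictor", dtype=int)

print(list(X.columns))
print(X.shape, y.shape)
```

Note that X keeps one row per record and y has one label per record, which is the layout scikit-learn estimators expect.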

Below is a subsection of the dataframe, containing the (dummy) Predictor and a couple of the predictor categories (pardon the misalignment):

Predicted Predictor_A    Predictor_B
9056    30  0   0
2482    74  1   0
3407    56  1   0
12882   15  0   0
7988    30  0   0
13032   12  0   0
9738    28  0   0
6739    40  0   0
373 131 0   0
3030    62  0   0
8964    30  0   0
691 125 0   0
6214    41  0   0
6438    41  1   0
5060    42  0   0
3703    49  0   0
12461   16  0   0
2235    75  0   0
5107    42  0   0
4464    46  0   0
7075    39  1   0
11891   16  0   0
9190    30  0   0
8312    30  0   0
10328   24  0   0
1602    97  0   0
8804    30  0   0
8286    30  0   0
6821    40  0   0
3953    46  1

After reshaping the data into a dataframe as shown above, I tried to use MultinomialNB from sklearn. When doing so, I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 8158]

I get the error when using a dataframe with only 2 columns -> Predicted and Predictor_A

My questions are:

What do I need to do to resolve the error?
Is my approach correct?

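To fit a MultinomialNB model, you need the training samples with their features, plus the corresponding labels (target values).
In your case, Predicted is the target variable, and Predictor_A and Predictor_B are the features (predictors).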
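The question does not show the call that failed, so the following is only a guess at the cause. A message ending in [1, 8158] means scikit-learn saw 1 sample in X and 8158 labels in y, which typically happens when a single feature column is laid out as one row instead of one column. A minimal sketch, using 8 made-up rows instead of 8158 and assumed variable names:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# First eight rows of the sample table: labels and the Predictor_A column
y = np.array([30, 74, 56, 15, 30, 12, 28, 40])
col = np.array([0, 1, 1, 0, 0, 0, 0, 0])

clf = MultinomialNB()

# Wrong layout (assumed cause): shape (1, 8) is read as 1 sample with 8 features
X_wrong = col.reshape(1, -1)
try:
    clf.fit(X_wrong, y)
    message = ""
except ValueError as err:
    message = str(err)
print(message)

# Right layout: shape (8, 1) is 8 samples with 1 feature each
X_right = col.reshape(-1, 1)
clf.fit(X_right, y)
```

With a pandas dataframe, selecting with double brackets, as in df[['Predictor_A']], gives the two-dimensional (n_samples, 1) layout directly.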

Example 1:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Read the whitespace-separated sample data
# (sep=r"\s+" replaces the deprecated delim_whitespace=True)
df = pd.read_csv("dt.csv", sep=r"\s+")

# X holds the features: 2-D, one row per sample
X = df[['Predictor_A', 'Predictor_B']]
# y holds the labels (targets, classes): one value per sample
y = df['Predicted']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)
clf.predict(X_test)
# array([30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30])
# This result makes sense if you look at X_test: all the samples are similar.

print(X_test)
#        Predictor_A  Predictor_B
# 8286             0            0
# 12461            0            0
# 6214             0            0
# 9190             0            0
# 373              0            0
# 3030             0            0
# 11891            0            0
# 9056             0            0
# 8804             0            0
# 6438             1            0

# Get the class probabilities for each test sample
clf.predict_proba(X_test)

Note 2: The data I used can be found here.

Edit

If you train the model on documents with 4 tokens (predictors), then any new document you want to predict must have the same number of tokens.

Example 2:

clf.fit(X, y)

Here, X is a [29, 2] array, so we have 29 training samples (documents), each with 2 tokens (predictors).

clf.predict(X_new)

Here, X_new can be [n, 2]. So we can predict the classes of n new documents, but those new documents must also have exactly 2 tokens (predictors).
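The feature-count requirement can be checked with a small sketch like the one below. The arrays are made-up toy data and the variable names are assumptions; only the shapes matter here.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: 6 samples, 2 features each
X = np.array([[0, 0], [1, 0], [1, 0], [0, 1], [0, 0], [0, 1]])
y = np.array([30, 74, 74, 15, 30, 15])

clf = MultinomialNB().fit(X, y)

# n = 3 new samples with the same 2 features: prediction works
X_new = np.array([[1, 0], [0, 1], [0, 0]])
pred = clf.predict(X_new)
print(pred.shape)

# A sample with 3 features instead of 2 is rejected
X_bad = np.array([[1, 0, 0]])
try:
    clf.predict(X_bad)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

If the features come from get_dummies, new data may produce a different set of columns than the training data. Reindexing the new dataframe to the training columns, for example with reindex(columns=X.columns, fill_value=0), is one way to keep the widths aligned.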
