Training on multiple classes with sklearn

I have a dataframe like this.

I am looking for a way to train a model on this dataset, so I tried the following code with sklearn:

train_x, test_x, train_y, test_y = train_test_split(df[['city','text']], df[['1','2','3','4']], test_size = 0.40, random_state = 21)
count_vect = CountVectorizer(analyzer='word', ngram_range=(2,3), max_features=20000)
count_vect.fit(df['text'])
x_train =  count_vect.transform(train_x)
x_test =  count_vect.transform(test_x)
classifier = DecisionTreeClassifier()
classifier.fit(x_train, train_y)

But I get this error:

ValueError: Number of labels=2348 does not match number of samples=1

I am also not sure whether it is valid to train on my data directly with its 4 label columns.

The error is caused by this line:

x_train =  count_vect.transform(train_x)

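Your train_x and test_x each have two columns (from df[['city','text']]), but CountVectorizer works on a single column only. It expects a single iterable of strings, nothing more. So this call is correct: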

count_vect.fit(df['text'])

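because you pass only one column there. But when you pass train_x to count_vect.transform(train_x), count_vect only sees the column names, not the actual data, since iterating over a DataFrame yields its column labels rather than its rows.

What you probably want is: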

x_train = count_vect.transform(train_x['text'])
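Putting the fix together, a minimal end-to-end sketch might look like the following. The toy dataframe is invented for illustration (the real data is not shown in the question); only the column names 'city', 'text', '1', '2', '3', '4' come from the question's code. It also suggests that DecisionTreeClassifier can be fit on the four label columns at once, since it accepts a 2-D target.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the question's dataframe.
df = pd.DataFrame({
    "city": ["paris", "rome", "oslo", "lima", "cairo",
             "tokyo", "quito", "delhi", "bern", "riga"],
    "text": [
        "good food and nice weather", "old streets and good food",
        "cold weather and quiet streets", "nice weather and busy markets",
        "busy markets and old streets", "quiet streets and good food",
        "good food and busy markets", "old streets and nice weather",
        "cold weather and good food", "quiet streets and nice weather",
    ],
    "1": [1, 1, 0, 0, 0, 1, 1, 0, 1, 0],
    "2": [0, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    "3": [1, 0, 1, 1, 0, 0, 0, 1, 1, 1],
    "4": [0, 0, 1, 0, 1, 1, 0, 0, 1, 1],
})

train_x, test_x, train_y, test_y = train_test_split(
    df[["city", "text"]], df[["1", "2", "3", "4"]],
    test_size=0.40, random_state=21,
)

count_vect = CountVectorizer(analyzer="word", ngram_range=(2, 3), max_features=20000)
count_vect.fit(df["text"])

# Pass the single text column, not the whole two-column DataFrame.
x_train = count_vect.transform(train_x["text"])
x_test = count_vect.transform(test_x["text"])

# The target has four columns, so the tree is fit as a multi-output classifier.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(x_train, train_y)

pred = classifier.predict(x_test)
print(x_train.shape, pred.shape)
```

With this change the number of rows in x_train matches the number of rows in train_y, which is the condition the ValueError was complaining about.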

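The error occurs because the input X is expected to have shape [n_samples, n_features]. If you check the shape of X, you will probably find it is (2348,).

The simplest way to convert X: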

X = X[:, np.newaxis]
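As a rough illustration of what that indexing does, here is a small sketch on a made-up 1-D NumPy array (the array contents are an assumption, not the asker's data). If X is a pandas Series rather than a NumPy array, converting it with .to_numpy() first is likely the safer route.

```python
import numpy as np

# Hypothetical 1-D feature array with shape (5,)
X = np.arange(5)

# Add a trailing axis so the array becomes [n_samples, n_features]
X_2d = X[:, np.newaxis]

# An equivalent spelling that is common in sklearn code
X_alt = X.reshape(-1, 1)

print(X.shape, X_2d.shape, X_alt.shape)
```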
