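I'm new to data science and NLP. I want to perform TF-IDF vectorization on some text documents and then use the result to train different machine learning models. But when I try to train an SVC model, I get ValueError: setting an array element with a sequence. Here is my code.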
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
df['vect_message'] = vectorizer.fit_transform(df['message_encoding'])
X = df['vect_message']
y = df['severity']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn import svm
model = svm.SVC()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
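I get the error on the line model.fit(X_train, y_train)
I've searched other similar questions and found one where they suggest using the .toarray() method to convert the sparse matrix to an np.array. But that didn't help me.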
When you execute the following line:
df['vect_message'] = vectorizer.fit_transform(df['message_encoding'])
pandas treats the result of vectorizer.fit_transform() as a scalar object. As a result, you end up with the same sparse matrix in every row of the vect_message column:
In [74]: df.loc[0, 'vect_message']
Out[74]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [75]: df.loc[0, 'vect_message'].A
Out[75]:
array([[ 0. , 1. , 0. , 0. ],
[ 0. , 0. , 0.70710678, 0.70710678],
[ 1. , 0. , 0. , 0. ]])
In [76]: df.loc[1, 'vect_message'].A
Out[76]:
array([[ 0. , 1. , 0. , 0. ],
[ 0. , 0. , 0.70710678, 0.70710678],
[ 1. , 0. , 0. , 0. ]])
In [77]: df.loc[2, 'vect_message'].A
Out[77]:
array([[ 0. , 1. , 0. , 0. ],
[ 0. , 0. , 0.70710678, 0.70710678],
[ 1. , 0. , 0. , 0. ]])
Basically the same thing happens when we do df['new_col'] = 0 - we get a column of zeros.
Solution:
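To make the scalar assignment concrete, here is a small sketch. The three-row DataFrame is made up for illustration and is not the asker's data:

```python
import pandas as pd

# Hypothetical three-row frame, just to show how a scalar is broadcast
df = pd.DataFrame({'message_encoding': ['first text', 'second text', 'third text']})

# A single scalar on the right-hand side is repeated for every row
df['new_col'] = 0

print(df['new_col'].tolist())
```

Every row receives the same value, which is the pattern visible in the Out[75] to Out[77] outputs above.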
X = vectorizer.fit_transform(df['message_encoding'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
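For reference, here is a self-contained sketch of the whole pipeline with this fix applied. The toy messages and severity labels are invented so the snippet can run on its own; the column names follow the question:

```python
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the asker's DataFrame
df = pd.DataFrame({
    'message_encoding': [
        'disk failure on node one', 'disk almost full', 'user logged in',
        'user logged out', 'disk failure on node two', 'user changed password',
        'disk error detected', 'user logged in again',
        'disk failure imminent', 'user session started',
    ],
    'severity': ['high', 'high', 'low', 'low', 'high',
                 'low', 'high', 'low', 'high', 'low'],
})

vectorizer = TfidfVectorizer(min_df=1, stop_words='english')

# Keep the 2D sparse matrix in its own variable instead of a DataFrame column
X = vectorizer.fit_transform(df['message_encoding'])
y = df['severity']

# train_test_split accepts scipy sparse matrices directly
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# SVC also accepts sparse input, so no .toarray() call is needed
model = svm.SVC()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
```

X has one row per document, so it lines up with y and can be split and fitted as usual.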
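PS IMO it makes little sense to save (or try to save) the result of the vectorizer.fit_transform() call (a 2D sparse matrix) in a pandas column (Series), which is a 1D structure.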