Using scikit-learn with both natural-language text and numeric data



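I am using sklearn for a project, and I have two columns I can use for prediction. One column, text, holds a series of articles; the other, equal_cnt, holds real numbers. I am trying to build a model that trains an SVM on both the text and the numbers, but I can't figure out how to use the two features together.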

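import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
vect = CountVectorizer(ngram_range=(1, 2))
tfidf = TfidfTransformer()
svm = SVC(kernel='linear', C=100, gamma=0.1)
text_clf = Pipeline([('vect', vect), ('tfidf', tfidf), ('svm', svm)])
scores = cross_val_score(text_clf, pd.concat([df['text'], df['equal_cnt']], axis=1), df['empirical'], cv=10)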

This is what I am currently trying: the pipeline is meant to handle the text, and the model is scored on how accurately it predicts df["empirical"].

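You can convert the sparse matrix produced by the TF-IDF transformer into a DataFrame and then simply assign the numeric column to that DataFrame as an extra column. Let me show this with an example: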

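from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# Densify the TF-IDF matrix and label each column with its term; string
# column names avoid mixing int and str labels, which scikit-learn rejects
textMatDf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
# df is the DataFrame from the question and must have one row per document
# (in practice, vectorize df['text'] rather than this toy corpus).
# to_numpy() assigns by position instead of aligning on index labels.
textMatDf['numCol'] = df['equal_cnt'].to_numpy()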

This textMatDf can now be used for training and validation.
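
On a large corpus, calling toarray() on the TF-IDF matrix can use a lot of memory. As an alternative sketch (a different technique from the DataFrame approach above), the numeric column can be appended while keeping everything sparse, using scipy.sparse.hstack. The equal_cnt values here are made-up placeholders standing in for df['equal_cnt'].

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Hypothetical numeric feature, one value per document
# (a stand-in for df['equal_cnt'] from the question)
equal_cnt = np.array([3.0, 1.0, 4.0, 1.5])

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(corpus)      # sparse matrix, one row per document

# Turn the numeric values into a single sparse column and append it
X_num = csr_matrix(equal_cnt.reshape(-1, 1))
X_combined = hstack([X_text, X_num]).tocsr()   # still sparse, one extra column
```

X_combined can be passed to an estimator that accepts sparse input, such as SVC, in place of the dense DataFrame.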

I think the modern, streamlined way to do this with scikit-learn is to use a ColumnTransformer, like this:

from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X = df.drop("empirical", axis=1)
y = df["empirical"]
preprocessor = ColumnTransformer(
    # We apply a TF-IDF vectorizer to the "text" column
    [("text", TfidfVectorizer(max_features=10), "text")],
    # the "passthrough" value for the "remainder" parameter lets 'equal_cnt' pass
    # through the first stage of the pipeline without modifying this column
    remainder="passthrough",
)
classifier = SVC(kernel='linear', C=1., gamma='scale')
pipeline = make_pipeline(preprocessor, classifier)
scores = cross_val_score(pipeline, X, y, cv=10)

Of course, you can tune the TF-IDF and SVM hyperparameters as needed.
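
One possible way to do that tuning is a grid search over the whole pipeline. The following is a minimal sketch, not a recommended configuration: the toy DataFrame, the grid values, and cv=3 are all assumptions chosen only to make the example self-contained. The parameter names rely on make_pipeline naming each step after its lowercased class name ("columntransformer", "svc").

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for the question's df
empirical_texts = ["we measured the effect in an experiment",
                   "the data show a significant result"]
other_texts = ["this essay argues from first principles",
               "a purely theoretical discussion of ideas"]
df = pd.DataFrame({
    "text": (empirical_texts + other_texts) * 3,
    "equal_cnt": [5.0, 4.0, 0.0, 1.0] * 3,
    "empirical": [1, 1, 0, 0] * 3,
})
X = df.drop("empirical", axis=1)
y = df["empirical"]

preprocessor = ColumnTransformer(
    [("text", TfidfVectorizer(), "text")],
    remainder="passthrough",  # 'equal_cnt' is passed through unchanged
)
pipeline = make_pipeline(preprocessor, SVC(kernel="linear"))

# Keys follow the pattern <step name>__<transformer name>__<parameter>
param_grid = {
    "columntransformer__text__max_features": [5, 10],
    "svc__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

On real data you would replace the toy df with your own and widen the grid; search.best_estimator_ then holds the pipeline refit with the best combination found.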
