NLP Workflow



In an NLP workflow, should the text preprocessing and matrix creation happen before or after train_test_split? Below is my sample code; I build the TF-IDF matrix before calling train_test_split. I want to know whether this causes data leakage.

import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

stemmer = PorterStemmer()                     # stemmer was undefined in the snippet; Porter assumed
stop_words = set(stopwords.words('english'))

# data1: DataFrame with a text column 'features' and a target column 'label'
# Clean, lower-case, stem and remove stop words from every document
corpus = []
for i in range(len(data1)):
    review = re.sub('[^a-zA-Z]', ' ', data1['features'][i])
    review = review.lower()
    review = review.split()
    review = [stemmer.stem(j) for j in review if j not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

# TF-IDF matrix built on the full corpus, i.e. before the split
cv = TfidfVectorizer(max_features=6000)
x = cv.fit_transform(corpus).toarray()

le = LabelEncoder()
y = le.fit_transform(data1['label'])

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=69, stratify=y)

spam_model = MultinomialNB().fit(train_x, train_y)
pred = spam_model.predict(test_x)
c_matrix = confusion_matrix(test_y, pred)
acc_score = accuracy_score(test_y, pred)

As mentioned in the official documentation, the TfidfVectorizer class with the max_features parameter keeps only the k best features:

max_features : int, default=None

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

If the vectorizer is fitted with the test set included, the test documents help decide which features are selected, and that is data leakage (this scenario is based on your question, but the same issue appears in most cases). The safest approach in machine learning is to ignore the test set until prediction/evaluation time and treat it as if it did not exist.
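As a concrete illustration, here is a tiny sketch (toy documents made up for this answer, not from your data; get_feature_names_out needs scikit-learn >= 1.0) that fits TfidfVectorizer(max_features=3) once on training texts only and once on training plus test texts. The selected vocabulary changes, because the test document's term counts take part in the top-k ranking:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, for illustration only
train_docs = ["offer offer offer offer free free free",
              "meeting meeting schedule"]
test_docs = ["schedule schedule schedule schedule schedule"]

# Vocabulary chosen from the training texts alone
vec_train = TfidfVectorizer(max_features=3).fit(train_docs)
print(vec_train.get_feature_names_out())   # ['free' 'meeting' 'offer']

# Vocabulary chosen with the test text included: its term counts push
# 'schedule' into the top 3 and drop 'meeting' -> information has leaked
vec_all = TfidfVectorizer(max_features=3).fit(train_docs + test_docs)
print(vec_all.get_feature_names_out())     # ['free' 'offer' 'schedule']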

(Update) You can see an example from Kaggle here that applies the vectorizer to an already-split dataset! More about this concept is mentioned here and here!
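For completeness, a minimal sketch of the leak-free order described above, reusing the variable names from your snippet (corpus, y, etc.): split the cleaned texts first, fit the vectorizer on the training portion only, and only transform the test portion.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Split the raw (already cleaned) texts first, so the test documents
# never influence the fitted vocabulary or the IDF weights
train_docs, test_docs, train_y, test_y = train_test_split(
    corpus, y, test_size=0.2, random_state=69, stratify=y)

cv = TfidfVectorizer(max_features=6000)
train_x = cv.fit_transform(train_docs)   # fit on training texts only
test_x = cv.transform(test_docs)         # reuse the training vocabulary/IDF

spam_model = MultinomialNB().fit(train_x, train_y)
pred = spam_model.predict(test_x)
print(confusion_matrix(test_y, pred))
print(accuracy_score(test_y, pred))

The same effect can be achieved with a scikit-learn Pipeline wrapping TfidfVectorizer and MultinomialNB, which refits the vectorizer on each training fold during cross-validation.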
