Unable to train a model with Naive Bayes



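I am trying to classify emails as spam or not spam using NLTK.

These are the steps I followed:

  1. Extract all the tokens

  2. Get all the features

  3. Extract features from the corpus of all unique words and map each one to True/False

  4. Train the Naive Bayes classifier on the data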

from nltk.classify.util import apply_features
from nltk import NaiveBayesClassifier
import pandas as pd
import collections
from sklearn.model_selection import train_test_split
from collections import Counter
data = pd.read_csv('https://raw.githubusercontent.com/venkat1017/Data/master/emails.csv')
"""fetch array of tuples where each tuple is defined by (tokenized_text, label)
"""
processed_tokens=data['text'].apply(lambda x:([x for x in x.split() if x.isalpha()]))
processed_tokens=processed_tokens.apply(lambda x:([x for x in x if len(x)>3]))
processed_tokens = [(i,j) for i,j in zip(processed_tokens,data['spam'])]

"""
dictword return a Set of unique words in complete corpus.
"""
list = zip(*processed_tokens)
dictionary = Counter(word for i, j in processed_tokens for word in i)
dictword = [word for word, count in dictionary.items() if count == 1]

"""maps each input text into feature vector"""
y_dict = ( [ (word, True) for word in dictword] )
feature_vec=dict(y_dict)
"""Training"""
training_set, testing_set = train_test_split(y_dict, train_size=0.7)
classifier = NaiveBayesClassifier.train(training_set)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\classify\naivebayes.py in train(cls, labeled_featuresets, estimator)
197         for featureset, label in labeled_featuresets:
198             label_freqdist[label] += 1
--> 199             for fname, fval in featureset.items():
200                 # Increment freq(fval|label, fname)
201                 feature_freqdist[label, fname][fval] += 1
AttributeError: 'str' object has no attribute 'items'

I get the error above when I try to train on the corpus of unique words.

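First, note that y_dict is just a list of (word, True) pairs, one for each word (a string) that occurs exactly once in the corpus; feature_vec is the dict built from those pairs. You pass y_dict to the classifier as the training set, whereas you should pass tuples of (feature dict for each text row, corresponding label). The classifier expects [({'feat1': 'value1', ... }, label_value), ...] as input, but you are passing [('word1', True), ...]. The train method calls .items() on the first element of each tuple, which here is a str. A str has no items attribute; only a dict does. Hence the error.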
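To make the mismatch concrete, here is a small sketch in plain Python (no NLTK needed). The two toy lists and the helper first_bad_item are hypothetical names made up for illustration; the loop only mimics the way the traceback shows train unpacking each (featureset, label) pair.

```python
# Shape passed in the question: a list of (word, True) pairs.
wrong_training_set = [("offer", True), ("meeting", True)]

# Shape the classifier iterates over: (featureset, label) pairs,
# where each featureset is a dict. Toy rows, assumed for illustration.
right_training_set = [
    ({"offer": True, "meeting": False}, 1),  # 1 = spam
    ({"offer": False, "meeting": True}, 0),  # 0 = not spam
]


def first_bad_item(labeled_featuresets):
    """Return the first featureset that lacks .items(), or None if all are dicts."""
    for featureset, label in labeled_featuresets:
        if not hasattr(featureset, "items"):
            return featureset
    return None


print(first_bad_item(wrong_training_set))  # the str 'offer' is treated as a featureset
print(first_bad_item(right_training_set))  # None: every featureset is a dict
```

In the first list the word itself lands in the featureset position and True lands in the label position, which is why the failure surfaces as a missing items attribute on a str.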

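Second, your data modelling is wrong. The training set should contain one feature dict derived from each data['text'] entry, paired with the corresponding data['spam'] value (since that is your label). See section 1.3 of the linked reference for how to perform document classification with nltk's classifiers.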
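A minimal sketch of that modelling might look like the following. It is not the asker's final code: the rows list is hypothetical toy data standing in for data['text'] and data['spam'], tokenize and document_features are illustrative helper names, and the vocabulary here uses every word in the corpus rather than only the words that occur once.

```python
from collections import Counter

from nltk import NaiveBayesClassifier

# Hypothetical toy rows standing in for (data['text'], data['spam']).
rows = [
    ("free prize claim your free prize today", 1),
    ("claim your cash prize today", 1),
    ("free cash offer claim today", 1),
    ("project meeting moved monday morning", 0),
    ("please review attached project report", 0),
    ("monday meeting agenda attached", 0),
]


def tokenize(text):
    """Same filter as in the question: alphabetic tokens longer than 3 characters."""
    return [w for w in text.split() if w.isalpha() and len(w) > 3]


labeled_tokens = [(tokenize(text), label) for text, label in rows]

# Vocabulary of the whole corpus (assumption: all words, not only count == 1).
counts = Counter(word for tokens, _ in labeled_tokens for word in tokens)
vocabulary = sorted(counts)


def document_features(tokens):
    """Map every vocabulary word to True/False for one document."""
    present = set(tokens)
    return {word: (word in present) for word in vocabulary}


# One (feature dict, label) tuple per email: the shape train() expects.
training_set = [(document_features(tokens), label) for tokens, label in labeled_tokens]
classifier = NaiveBayesClassifier.train(training_set)

prediction = classifier.classify(document_features(tokenize("claim your free prize")))
print(prediction)
```

With the real data, train_test_split would be applied to this list of (feature dict, label) tuples instead of to the word list, so both the training and testing sets keep their labels.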
