Unexpected end of data when stemming a dataset

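I am new to Python and I am trying to work on a small chunk of the Yelp! dataset, which I converted to CSV, using the pandas library and NLTK.

While preprocessing the data, I first try to remove all punctuation and the most common stop words. After that, I want to apply the Porter stemming algorithm, which is readily available in nltk.stem.

Here is my code: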

"""A method for removing the noise in the data and the most common stop.words (NLTK)."""
def stopWords(review):
    stopset = set(stopwords.words("english"))
    review = review.lower()
    review = review.replace(".","")
    review = review.replace("-"," ")
    review = review.replace(")","")
    review = review.replace("(","")
    review = review.replace("i'm"," ")
    review = review.replace("!","")
    review = re.sub("[$!@#*;:<+>~-]", '', review)
    row = review.split()
    tokens = ' '.join([word for word in row if word not in stopset])
    return tokens

I then feed the tokens from this function into the stemming method I wrote:

"""A method for stemming the words to their roots using Porter Algorithm (NLTK)"""
def stemWords(impWords):
    stemmer = stem.PorterStemmer()
    tok = stopWords(impWords)
    ========================================================================
    stemmed = " ".join([stemmer.stem(str(word)) for word in tok.split(" ")])
    ========================================================================
    return stemmed

But I get the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data. The line between the '==' markers is the one that raises it.

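I have tried cleaning the data and removing all special characters such as !@#$^&* to make this work. The stop-word removal works fine, but the stemming does not. Can anyone tell me what I am doing wrong?

If my data is not clean, or a unicode string is broken somewhere, is there a way I can clean or fix it so that it does not give me this error? I want to do stemming; any suggestions would be helpful.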

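Read up on unicode string handling in Python. There is the str type, but there is also the unicode type.

I suggest that you:

decode each line immediately after reading it, to narrow down the incorrect characters in your input data (real data contains errors)
work with unicode and u" " strings.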
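
A minimal sketch of the first suggestion, assuming Python 3 and UTF-8 input. The helper name decode_lines and the sample bytes are made up for illustration and are not part of the asker's code. Each raw line is decoded on its own, so a broken byte sequence is reported with its line number instead of surfacing later inside the stemmer.

```python
def decode_lines(raw_lines, encoding="utf-8"):
    """Decode byte lines one at a time; return (good_lines, bad_line_numbers)."""
    good, bad = [], []
    for number, raw in enumerate(raw_lines, start=1):
        try:
            good.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Record where the broken data is instead of failing later on.
            bad.append(number)
    return good, bad


# Hypothetical sample: the second line ends in a lone 0xc2 byte,
# the same kind of truncated sequence named in the error message.
sample = [b"great food", b"caf\xc2", b"nice caf\xc3\xa9"]
good, bad = decode_lines(sample)
print(good)
print(bad)
```

Knowing which lines fail makes it possible to inspect them in the source file, which is usually more informative than stripping characters blindly.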

There is a simple way to filter out these annoying errors. You can preprocess each review with

review = review.encode('ascii', errors='ignore')

to remove all invalid characters.
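
Under Python 3, str.encode returns bytes, so the result usually has to be decoded back before any further text processing. The sketch below assumes Python 3; the helper name strip_non_ascii and the sample string are hypothetical. Note that this approach silently drops every non-ASCII character, including legitimate accented letters.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be represented in ASCII."""
    # encode(..., errors="ignore") discards unencodable characters;
    # decode("ascii") turns the remaining bytes back into a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical review text containing an accented letter and a
# non-breaking space (U+00A0).
cleaned = strip_non_ascii("caf\u00e9 \u00a0was great")
print(cleaned)
```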
