How to remove stop words from a csv file



I am currently working on a project that analyzes Twitter data. I am at the preprocessing stage, and I am trying to get my application to remove stop words from the dataset.

import pandas as pd
import json
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

self.file_name = filedialog.askopenfilename(initialdir='/Desktop',
                                            title='Select file',
                                            filetypes=(('csv file', '*.csv'),
                                                       ('csv file', '*.csv')))
column_list = ["txt"]
clean_tw = []
df = pd.read_csv(self.file_name, usecols=column_list)
stop_words = set(stopwords.words('english'))
for tw in df["txt"]:
    tw = (re.sub("([^0-9A-Za-z \t])|(\w+://\S+(RT))", "", tw.lower()).split())
    if tw not in stop_words:
        filtered_tw = [w for w in tw if not w in stopwords.words('english')]
        clean_tw.append(filtered_tw)

I am now getting this error:

Exception in Tkinter callback
Traceback (most recent call last):
File "...", line 1884, in __call__
return self.func(*args)
File "...", line 146, in clean_csv
if tweet not in stop_words:
TypeError: unhashable type: 'list'

You are trying to check whether a list (the result of the regex) is in a set… which cannot be done. You need to loop through the list (or perform some kind of set operation, e.g. set(tw).difference(stop_words)). For clarity:

>>> tw = (re.sub("([^0-9A-Za-z \t])|(\w+://\S+(RT))", "", initial.lower()).split())
>>> tw
['this', 'is', 'an', 'example']
>>> set(tw).difference(stop_words)
{'example'}

Then append the difference to clean_tw :) Something like:

clean_tw = []
df = pd.read_csv(self.file_name, usecols=col_list)
stop_words = set(stopwords.words('english'))
for tw in df["txt"]:
    tw = (re.sub("([^0-9A-Za-z \t])|(\w+://\S+(RT))", "", tw.lower()).split())
    clean_tw.append(set(tw).difference(stop_words))

Finally, you can define stop_words outside of the loop, since it will always be the same set, so you gain some performance :)
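The hoisting matters for a second reason: membership tests against a set are O(1) on average, while testing against a list (or calling stopwords.words('english') on every iteration, which rebuilds a list each time) scans linearly. A rough sketch of the difference, using synthetic word lists:

```python
import timeit

words = [f"word{i}" for i in range(1000)]
stop_list = [f"word{i}" for i in range(0, 1000, 2)]  # list: O(n) lookups
stop_set = set(stop_list)                            # set: O(1) average lookups

t_list = timeit.timeit(lambda: [w for w in words if w not in stop_list], number=100)
t_set = timeit.timeit(lambda: [w for w in words if w not in stop_set], number=100)
print(f"list: {t_list:.4f}s, set: {t_set:.4f}s")  # the set version is far faster
```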

FYI, you shouldn't be removing stop words with regex when there are such good packages for it!

I suggest using nltk for tokenizing and detokenizing.

For each row in the csv:

import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')  # required by nltk.word_tokenize

# get your stopwords from nltk
stop_words = set(stopwords.words('english'))

# loop through your rows
for sent in sents:
    # tokenize
    tokenized_sent = nltk.word_tokenize(sent)
    # remove stops
    tokenized_sent_no_stops = [
        tok for tok in tokenized_sent
        if tok not in stop_words
    ]
    # untokenize
    untokenized_sent = TreebankWordDetokenizer().detokenize(
        tokenized_sent_no_stops
    )
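Applied to a DataFrame column like the one in the question, the same tokenize → filter → rejoin pipeline could look like the sketch below. To keep it self-contained it uses str.split() as a stand-in tokenizer and a small hard-coded stop-word set; with nltk installed you would swap in nltk.word_tokenize, TreebankWordDetokenizer().detokenize, and set(stopwords.words('english')):

```python
import pandas as pd

# Stand-in stop words; use set(stopwords.words('english')) in practice.
stop_words = {"the", "is", "a", "an", "this"}

df = pd.DataFrame({"txt": ["this is a great tweet", "an example is here"]})

def remove_stops(sent):
    # tokenize (stand-in for nltk.word_tokenize), filter, then rejoin
    tokens = [tok for tok in sent.split() if tok not in stop_words]
    return " ".join(tokens)

df["clean"] = df["txt"].apply(remove_stops)
print(df["clean"].tolist())  # ['great tweet', 'example here']
```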

Judging from the error message, tweet is most likely a list and stop_words a set or a dict.

>>> tweet = ['a','b']
>>> stop_words = set('abcdefg')
>>> tweet not in stop_words
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Try this:

if not stop_words.intersection(tweet):
    ...

or:

if stop_words.isdisjoint(tweet):
    ...
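The two checks answer slightly different questions: intersection builds the set of shared elements, while isdisjoint only reports whether any overlap exists (and can short-circuit as soon as it finds one). A quick comparison, reusing the toy sets from above:

```python
stop_words = set("abcdefg")

tweet = ["a", "b"]
clean = ["x", "y"]

print(stop_words.intersection(tweet))  # {'a', 'b'} -> the shared elements
print(stop_words.isdisjoint(tweet))   # False      -> some overlap exists
print(stop_words.isdisjoint(clean))   # True       -> no overlap at all
```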
