I am trying to remove stop words from df['Sentences'] because I need to plot the word frequencies. My sample looks like this:
... 13
London 12
holiday 11
photo 7
. 7
..
walk 1
dogs 1
I have built my own dictionary and would like to use it to remove the stop words from that list. This is what I did:
import matplotlib.pyplot as plt
df['Sentences'] = df['Sentences'].apply(lambda x: ' '.join([item for item in x.split() if item not in my_dict]))
w_freq=df.Sentences.str.split(expand=True).stack().value_counts()
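It does not raise any error, but the stop words and punctuation are still there. Also, I don't want to modify the column itself; I only want to look at the result for a short analysis (for example, by working on a copy of the original column).
How can I remove them?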
Assume you have this dataframe, with this really interesting conversation.
import pandas as pd
df = pd.DataFrame({'Sentences':['Hello, how are you?',
                                'Hello, I am fine. Have you watched the news',
                                'Not really the news ...']})
print (df)
Sentences
0 Hello, how are you?
1 Hello, I am fine. Have you watched the news
2 Not really the news ...
Now, to remove the punctuation and the stop words listed in my_dict, you can do this:
my_dict = ['a','i','the','you', 'am', 'are', 'have']
s = (df['Sentences'].str.lower() # lowercase to avoid any case mismatch
       .str.replace(r'[^\w\s]+', '', regex=True) # remove the punctuation
       .str.split(' ') # create a list of words (a trailing space yields an empty token)
       .explode() # create one row per word of the lists
       .value_counts() # count occurrences
    )
s = s[~s.index.isin(my_dict)] # remove the stop words
print (s) # neither punctuation nor stop words remain; df itself is unchanged
news 2
hello 2
watched 1
fine 1
not 1
really 1
how 1
1
Name: Sentences, dtype: int64
This may not be the fastest way, though.
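If speed matters, one possible alternative is to count with the standard library instead of building intermediate Series. The following is only a sketch under some assumptions: `word_counts` is a hypothetical helper name (not part of pandas), words are taken to be runs of `\w` characters, and the stop words are matched case-insensitively. It reads the sentences without changing them, so the original column stays as it is.

```python
import re
from collections import Counter

def word_counts(sentences, stop_words):
    """Count words across sentences, skipping stop words (hypothetical helper)."""
    stop = {w.lower() for w in stop_words}  # set lookup is faster than a list
    counts = Counter()
    for sentence in sentences:
        # \w+ keeps runs of letters/digits/underscores, so punctuation is dropped
        # and no empty tokens are produced
        for word in re.findall(r"\w+", sentence.lower()):
            if word not in stop:
                counts[word] += 1
    return counts

sentences = ['Hello, how are you?',
             'Hello, I am fine. Have you watched the news',
             'Not really the news ...']
my_dict = ['a', 'i', 'the', 'you', 'am', 'are', 'have']

counts = word_counts(sentences, my_dict)
print(counts.most_common())
```

With a dataframe you could pass df['Sentences'] as `sentences`, and wrap the result in pd.Series(counts) if you want to plot it. Whether this is actually faster than the explode/value_counts version depends on your data, so it is worth timing both.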