Trying to remove my own stopwords using join/split



I am trying to remove stopwords from df['Sentences'] because I need to plot the word frequencies. My sample is:

...        13
London     12
holiday    11
photo      7
.          7
..
walk       1
dogs       1

I have built my own dictionary (my_dict), and I want to use it to remove the stopwords from that list. This is what I did:

import matplotlib.pyplot as plt

df['Sentences'] = df['Sentences'].apply(lambda x: ' '.join([item for item in x.split() if item not in my_dict]))
w_freq=df.Sentences.str.split(expand=True).stack().value_counts()

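It does not raise any error, but the stopwords and punctuation are still there. Also, I do not want to change the column itself; I only want to look at the result for a short analysis (for example, by working on a copy of the original column).

How can I remove them?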

Suppose you have this dataframe, which contains this very interesting conversation.

import pandas as pd

df = pd.DataFrame({'Sentences':['Hello, how are you?', 
'Hello, I am fine. Have you watched the news', 
'Not really the news ...']})
print (df)
                                     Sentences
0                          Hello, how are you?
1  Hello, I am fine. Have you watched the news
2                      Not really the news ...
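Before fixing it, it may help to see why the join/split attempt in the question leaves tokens behind. The sketch below is only an illustration on one of the sample sentences, using plain Python and an assumed stopword list: splitting on whitespace keeps the original case and any attached punctuation, so 'I', 'Have' and 'Hello,' never match lowercase entries such as 'i' or 'have'.

```python
# Assumed stopword list and sample sentence, for illustration only.
my_dict = ['a', 'i', 'the', 'you', 'am', 'are', 'have']
sentence = 'Hello, I am fine. Have you watched the news'

# Whitespace splitting keeps case and attached punctuation on each token.
tokens = sentence.split()

# Same filter as in the question: exact, case-sensitive membership test.
kept = [t for t in tokens if t not in my_dict]
print(kept)
```

Only 'am', 'you' and 'the' are dropped here, because they are the only tokens that match a stopword exactly. Standalone tokens such as '.' or '...' also survive unless they are listed in my_dict.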

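Now, to remove the punctuation and the stopwords listed in my_dict, you can do the following. The result is stored in a new Series s, so the original column stays unchanged.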

my_dict = ['a','i','the','you', 'am', 'are', 'have']
s = (df['Sentences'].str.lower() # lowercase to avoid any case mismatch
    .str.replace(r'[^\w\s]+', '', regex=True) # remove the punctuation
    .str.split() # split on whitespace into a list of words (no empty tokens)
    .explode() # create one row per word of the lists
    .value_counts() # count occurrences
)
s = s[~s.index.isin(my_dict)] # remove the stopwords
print (s) # neither punctuation nor stopwords remain
news       2
hello      2
watched    1
fine       1
not        1
really     1
how        1
Name: Sentences, dtype: int64

This may not be the fastest approach.
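If speed matters, one alternative worth timing is to count words in plain Python with collections.Counter and a set of stopwords. This is a minimal sketch, not a benchmarked replacement: the helper name word_counts is hypothetical, and it tokenizes with \w+, which splits a word like "don't" into two tokens instead of stripping the apostrophe.

```python
import re
from collections import Counter

def word_counts(sentences, stopwords):
    """Count words across sentences, ignoring case, punctuation and stopwords."""
    stop = {w.lower() for w in stopwords}  # a set makes membership tests cheap
    counts = Counter()
    for text in sentences:
        if not isinstance(text, str):  # skip missing values such as NaN
            continue
        # \w+ matches runs of letters, digits and underscores, so punctuation is skipped
        words = re.findall(r"\w+", text.lower())
        counts.update(w for w in words if w not in stop)
    return counts

# Sample data copied from the answer above.
my_dict = ['a', 'i', 'the', 'you', 'am', 'are', 'have']
sentences = ['Hello, how are you?',
             'Hello, I am fine. Have you watched the news',
             'Not really the news ...']

counts = word_counts(sentences, my_dict)
print(counts.most_common())
```

With a dataframe, word_counts(df['Sentences'], my_dict) only reads the column and never modifies it. For plotting, pd.Series(counts).sort_values(ascending=False).plot(kind='bar') should give a bar chart of the frequencies.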
