How to remove stop words from a DataFrame with tokenized data



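I am trying to remove stop words from a DataFrame. Each row has a single column named text, in which I store all the paragraphs of an article.

This is the first approach I tried: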

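stop_words = ['cat', 'dog', 'lion', 'fox']
# split each article into a list of words
df['text'] = df['text'].apply(lambda x: str.split(x))
# drop every word whose lowercase form is a stop word
df['text'] = df['text'].apply(lambda x: [item for item in x if item.lower() not in stop_words])
# join the remaining words back into a single string per row
x = 0
for i in df['text']:
    df['text'][x] = ' '.join(i)
    x += 1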

df

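Strangely, this does not remove every stop word from df['text']. I could not work out why, so I switched to tokenization. After tokenizing, each paragraph is split up and its tokens become columns.

Some rows of this DataFrame have more than 50,000 columns. How can I remove the stop words from it?

Thanks

You can try the following: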

import pandas as pd

def remove_stop_words(sentence):
    stop_words = ['cat', 'dog', 'lion', 'fox']
    word_list = sentence.split()
    # keep only the words whose lowercase form is not a stop word
    clean_sentence = ' '.join([w for w in word_list if w.lower() not in stop_words])
    return clean_sentence


data = {'text':['the LION eat the cat','the dog is pretty','this Fox looks like a dog','there is no stop word here']}
df = pd.DataFrame(data)
#remove stopword
df['text'] = df['text'].apply(remove_stop_words)

Result:

text
0                 the eat the
1               the is pretty
2           this looks like a
3  there is no stop word here
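For the tokenized case raised in the question, the same filter can be applied to a column that already holds a list of tokens per row. The following is only a sketch under assumptions: the `tokens` and `clean_text` column names and the `drop_stop_tokens` helper are hypothetical, and a set is used in place of a list so that membership checks stay cheap on very long rows.

```python
import pandas as pd

# Hypothetical stop-word set; set lookups are O(1) on average.
STOP_WORDS = {'cat', 'dog', 'lion', 'fox'}

def drop_stop_tokens(tokens):
    """Return the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

df = pd.DataFrame({'text': ['the LION eat the cat', 'this Fox looks like a dog']})

# Assumed layout: one list of tokens per row instead of one column per token.
df['tokens'] = df['text'].str.split()
df['tokens'] = df['tokens'].apply(drop_stop_tokens)

# Rebuild a plain string from the filtered tokens if one is needed.
df['clean_text'] = df['tokens'].str.join(' ')
print(df[['tokens', 'clean_text']])
```

Keeping the tokens in a single list-valued column avoids creating tens of thousands of mostly empty columns for long articles.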

Another option is pandas' str.replace, but it can leave several consecutive spaces behind:

data = {'text':['the LION eat the cat','the dog is pretty','this Fox looks like a dog','there is no stop word here']}
df = pd.DataFrame(data)
stop_words  = ['cat', 'dog', 'lion', 'fox']
for stop in stop_words:
    # case=False makes the match case-insensitive; regex=True is set explicitly
    df['text'] = df['text'].str.replace(stop, '', case=False, regex=True)

Result:

text
0               the  eat the
1              the  is pretty
2         this  looks like a
3  there is no stop word here
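The extra spaces can be cleaned up in the same chain. This is a hedged sketch, not part of the original answer: it assumes it is acceptable to build one alternation pattern with word boundaries (so that 'cat' would not be stripped out of a longer word such as 'category'), then collapse whitespace afterwards. The `pattern` variable is introduced here for illustration.

```python
import re
import pandas as pd

stop_words = ['cat', 'dog', 'lion', 'fox']

# One alternation anchored on word boundaries: \b(?:cat|dog|lion|fox)\b
pattern = r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b'

df = pd.DataFrame({'text': ['the LION eat the cat',
                            'the dog is pretty',
                            'this Fox looks like a dog',
                            'there is no stop word here']})

df['text'] = (
    df['text']
    .str.replace(pattern, '', case=False, regex=True)  # remove whole stop words
    .str.replace(r'\s+', ' ', regex=True)               # collapse runs of whitespace
    .str.strip()                                        # trim leading/trailing spaces
)
print(df)
```

A single pass with one pattern also avoids looping over the stop-word list once per word.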

Update: you can use a regex to find all words that start with a stop word:

import pandas as pd
import re

def remove_stop_words(sentence):
    base_stop_words = ['cat', 'dog', 'lion', 'fox']
    stop_words = list(base_stop_words)
    # iterate over the base list only, since stop_words grows inside the loop
    for stop_word in base_stop_words:
        # to exclude words made of a stop word plus extra letters => Lions
        stop_words.extend(re.findall(r'\b' + stop_word + r'[a-zA-Z]*\w+', sentence.lower()))
        # to exclude words starting with a stop word => Lions, Lionsss
        regex = r'\b(#\w*[^#\W])\b'.replace('#', stop_word)
        stop_words.extend(re.findall(regex, sentence.lower(), re.I))
    word_list = sentence.split()
    clean_sentence = ' '.join([w for w in word_list if w.lower() not in stop_words])
    return clean_sentence


data = {'text':['the LIONsss eat the cats','the dogs is pretty','this Fox looks like a dog','there is no stop word here','lionz is not the plural of lion']}
df = pd.DataFrame(data)
print(df)
#remove stopword
df['text'] = df['text'].apply(remove_stop_words)
print(df)
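The prefix matching above can also be expressed as one compiled pattern. This is an alternative sketch, not the answer's own method: `prefix_pattern` and `remove_prefixed_stop_words` are hypothetical names, and the pattern assumes that any word beginning with a stop word should be dropped.

```python
import re
import pandas as pd

stop_words = ['cat', 'dog', 'lion', 'fox']

# Matches a whole word that begins with any stop word, e.g. 'lions' or 'LIONsss'.
prefix_pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\w*',
    re.IGNORECASE,
)

def remove_prefixed_stop_words(sentence):
    """Drop every word that starts with a stop word and normalise spacing."""
    cleaned = prefix_pattern.sub('', sentence)
    # split() + join() collapses the gaps left by the removed words
    return ' '.join(cleaned.split())

df = pd.DataFrame({'text': ['the LIONsss eat the cats',
                            'the dogs is pretty',
                            'lionz is not the plural of lion']})
df['text'] = df['text'].apply(remove_prefixed_stop_words)
print(df)
```

Note that this variant is deliberately greedy: an unrelated word such as 'category' would also be removed because it starts with 'cat', so the stop-word list needs to be chosen with that in mind.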
