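I am trying to remove stop words from a DataFrame. Each row has a single column named text, in which I store all the paragraphs of an article.
This is the first approach I tried: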
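stop_words = ['cat', 'dog', 'lion', 'fox']
# Split each article into a list of whitespace-separated tokens
df['text'] = df['text'].apply(lambda x: x.split())
# Keep only the tokens that are not stop words (case-insensitive)
df['text'] = df['text'].apply(lambda x: [item for item in x if item.lower() not in stop_words])
# Join the remaining tokens back into one string per row
df['text'] = df['text'].apply(' '.join)
df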
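Strangely, this did not remove every stop word from df['text']. I could not figure out why, so I switched to tokenization. After tokenizing, each paragraph was split up and its tokens became columns.
Some rows of the resulting DataFrame have more than 50,000 columns. How can I remove the stop words from it?
Thanks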
You can try the following:
import pandas as pd

def remove_stop_words(sentence):
    stop_words = ['cat', 'dog', 'lion', 'fox']
    word_list = sentence.split()
    # Keep every word whose lowercase form is not a stop word
    clean_sentence = ' '.join([w for w in word_list if w.lower() not in stop_words])
    return clean_sentence

data = {'text': ['the LION eat the cat', 'the dog is pretty', 'this Fox looks like a dog', 'there is no stop word here']}
df = pd.DataFrame(data)
# remove stop words
df['text'] = df['text'].apply(remove_stop_words)
Result:
                         text
0                 the eat the
1               the is pretty
2           this looks like a
3  there is no stop word here
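Note that the function above compares whole whitespace-separated tokens, so a token such as "cat," or "dog." is not equal to any stop word and is kept. That may be why the first attempt in the question missed some words, although this is only a guess since the real data is not shown. A minimal sketch that strips surrounding punctuation before comparing (the names STOP_WORDS, remove_stop_words_punct and df_punct are made up for this example):

```python
import string
import pandas as pd

# A set gives constant-time membership tests, which helps on long articles
STOP_WORDS = {'cat', 'dog', 'lion', 'fox'}

def remove_stop_words_punct(sentence):
    """Drop tokens whose punctuation-stripped, lowercased form is a stop word."""
    kept = []
    for token in sentence.split():
        # 'cat,' -> 'cat', 'LION!' -> 'lion'
        core = token.strip(string.punctuation).lower()
        if core not in STOP_WORDS:
            kept.append(token)
    return ' '.join(kept)

# Hypothetical sample data with punctuation attached to the stop words
df_punct = pd.DataFrame({'text': ['The cat, the dog. And a LION!', 'no stop word here']})
df_punct['clean'] = df_punct['text'].apply(remove_stop_words_punct)
print(df_punct['clean'].tolist())
```

This removes the whole token, punctuation included, so sentence-ending periods attached to a stop word disappear with it. Whether that is acceptable depends on what the text is used for afterwards.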
Another option is pandas' Series.str.replace, but it can leave runs of consecutive spaces behind:
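data = {'text': ['the LION eat the cat', 'the dog is pretty', 'this Fox looks like a dog', 'there is no stop word here']}
df = pd.DataFrame(data)
stop_words = ['cat', 'dog', 'lion', 'fox']
for stop in stop_words:
    # Case-insensitive replacement; regex=True is passed explicitly because its default differs between pandas versions
    df['text'] = df['text'].str.replace(stop, '', case=False, regex=True)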
Result:
                         text
0               the  eat the
1              the  is pretty
2         this  looks like a
3  there is no stop word here
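Besides the extra spaces, replacing a plain substring also cuts stop words out of longer words ("category" would become "egory"). One possible way around both issues, sketched here as an assumption rather than the only fix, is a single pattern with word boundaries followed by a whitespace cleanup. The names pattern and df_re and the sample row with "category" are made up for this example:

```python
import re
import pandas as pd

stop_words = ['cat', 'dog', 'lion', 'fox']
# \b ... \b matches whole words only, so "category" is left alone
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in stop_words) + r')\b'

df_re = pd.DataFrame({'text': ['the LION eat the cat',
                               'the dog is pretty',
                               'a category of fox',
                               'there is no stop word here']})
df_re['text'] = (
    df_re['text']
    .str.replace(pattern, '', flags=re.IGNORECASE, regex=True)  # drop the stop words
    .str.replace(r'\s+', ' ', regex=True)                       # collapse repeated whitespace
    .str.strip()                                                # trim leading/trailing spaces
)
print(df_re['text'].tolist())
```

Because everything stays inside the str accessor, there is no Python-level loop over the stop words, only one pass per replacement.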
Update: you can use a regex to find all the words that start with a stop word:
import pandas as pd
import re

def remove_stop_words(sentence):
    base_stop_words = ['cat', 'dog', 'lion', 'fox']
    # Start from the base list and add every variant found in this sentence
    stop_words = list(base_stop_words)
    lowered = sentence.lower()
    for stop_word in base_stop_words:
        # words made of the stop word plus at least one more character => lions, lionsss
        stop_words.extend(re.findall(r'\b' + stop_word + r'[a-zA-Z]*\w+', lowered))
        # words that start with the stop word and end in a character not in it => lions, lionz
        regex = r'\b(#\w*[^#\W])\b'.replace('#', stop_word)
        stop_words.extend(re.findall(regex, lowered, re.I))
    word_list = sentence.split()
    clean_sentence = ' '.join([w for w in word_list if w.lower() not in stop_words])
    return clean_sentence

data = {'text': ['the LIONsss eat the cats', 'the dogs is pretty', 'this Fox looks like a dog', 'there is no stop word here', 'lionz is not the plural of lion']}
df = pd.DataFrame(data)
print(df)
# remove stop words
df['text'] = df['text'].apply(remove_stop_words)
print(df)
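If the goal is simply to drop every word that begins with a stop word, a shorter variant is to compile one alternation pattern once and test each token against it. This is only a sketch of that idea, not a drop-in equivalent of the function above. The names prefix_pattern, remove_prefixed_stop_words and df_prefix are made up for this example:

```python
import re
import pandas as pd

stop_words = ['cat', 'dog', 'lion', 'fox']
# One of the stop words, followed by any further word characters
prefix_pattern = re.compile(
    r'(?:' + '|'.join(re.escape(w) for w in stop_words) + r')\w*',
    flags=re.IGNORECASE,
)

def remove_prefixed_stop_words(sentence):
    """Remove every whitespace-separated token that starts with a stop word."""
    # match() only succeeds at the start of the token, so "bobcat" is kept
    return ' '.join(w for w in sentence.split() if not prefix_pattern.match(w))

df_prefix = pd.DataFrame({'text': ['the LIONsss eat the cats',
                                   'the dogs is pretty',
                                   'this Fox looks like a dog',
                                   'there is no stop word here',
                                   'lionz is not the plural of lion']})
df_prefix['text'] = df_prefix['text'].apply(remove_prefixed_stop_words)
print(df_prefix['text'].tolist())
```

Keep in mind that prefix matching is aggressive: unrelated words such as "category" or "dogma" also start with a stop word and would be removed.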