Removing stop words from a pandas column


import nltk
nltk.download('punkt')
nltk.download('stopwords')
import datetime
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 
# Read data from file 'filename.csv' 
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later) 
data = pd.read_csv("march20_21.csv") 
# Preview the first 5 lines of the loaded data 
#drop NA rows
data.dropna()
#drop all columns not needed
droppeddata = data.drop(columns=['created_at'])
#drop NA rows
alldata = droppeddata.dropna()
ukdata = alldata[alldata.place.str.contains('England')]
ukdata.drop(columns=['place'])
ukdata['text'].apply(word_tokenize)
eng_stopwords = stopwords.words('english') 

I know there are a lot of redundant variables, but I'm trying to get it working first before going back and tidying it up.

I'm not sure how to remove the stop words (stored in a variable) from the tokenized column. Any help is appreciated, I'm brand new to Python! Thanks.

  1. After applying a function to a column, you need to assign the result back to that column; `apply` is not an in-place operation.

  2. After tokenization, ukdata['text'] holds a list of words for each row, so you can use a list comprehension inside `apply` to drop the stop words, as shown below.


ukdata['text'] = ukdata['text'].apply(word_tokenize)
eng_stopwords = stopwords.words('english') 
ukdata['text'] = ukdata['text'].apply(lambda words: [word for word in words if word not in eng_stopwords])
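
One additional note of my own, not covered in the original answer: because ukdata is produced by filtering alldata, assigning back into ukdata['text'] can trigger pandas' SettingWithCopyWarning. Taking an explicit copy of the filtered frame avoids that. A minimal sketch, assuming the same column names as in the question:

# Take an explicit copy of the filtered rows so that later column
# assignments modify this DataFrame rather than a view of alldata.
ukdata = alldata[alldata.place.str.contains('England')].copy()
ukdata = ukdata.drop(columns=['place'])
ukdata['text'] = ukdata['text'].apply(word_tokenize)
ukdata['text'] = ukdata['text'].apply(lambda words: [word for word in words if word not in eng_stopwords])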

Minimal example:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english') 
ukdata = pd.DataFrame({'text': ["This is a sentence."]})
ukdata['text'] = ukdata['text'].apply(word_tokenize)
ukdata['text'] = ukdata['text'].apply(lambda words: [word for word in words if word not in eng_stopwords])
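
As a side note (an addition, not part of the original answer): NLTK's English stop-word list is all lower case, so capitalised tokens such as "This" slip through the filter above. Lowercasing each token before the membership test, and converting the list to a set for faster lookups, handles that. A hedged sketch along the same lines:

# Use a set for O(1) membership tests and compare in lower case so that
# capitalised stop words ("This", "The", ...) are also removed.
eng_stopwords = set(stopwords.words('english'))
ukdata['text'] = ukdata['text'].apply(lambda words: [word for word in words if word.lower() not in eng_stopwords])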
