Pandas apply function works very slowly



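I am passing the following function to a pandas column with 3 million comments in order to extract the adjectives. I expected it to finish quickly, since the work could be done in parallel. Instead it takes around 5 hours, as if it were a plain for loop. Are there any possible solutions? Something like Cython?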

import nltk

def get_adjectives(row):
    clean_row = ''
    if type(row) == str:
        for word in row.split():
            # Tag each word on its own and keep it if it is an adjective
            if nltk.pos_tag([word])[0][1] in ['JJ', 'JJR', 'JJS']:
                clean_row = clean_row + word + ' '
    return clean_row

df['adjectives'] = df[text_column].apply(get_adjectives)

Building on @ead's comment, try this:

def get_adjectives(row):
    clean_row = []  # list, not str
    if type(row) == str:
        for word in row.split():
            if nltk.pos_tag([word])[0][1] in ['JJ', 'JJR', 'JJS']:
                clean_row.append(word)  # appending to list

    clean_row = ' '.join(clean_row)  # joining all words in list, separated by space
    return clean_row

df['adjectives'] = df[text_column].apply(get_adjectives)
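
Since the function above tags every word in isolation, the tag depends only on the word itself, so a word that appears many times across the comments only needs to be tagged once. The following is a minimal sketch of that idea, not part of the original answer. `make_adjective_extractor` and `tag_word` are hypothetical names, and a toy tagger stands in for NLTK so the example runs on its own.

```python
from functools import lru_cache

ADJECTIVE_TAGS = frozenset({"JJ", "JJR", "JJS"})

def make_adjective_extractor(tag_word, cache_size=None):
    """Build a get_adjectives function around a single-word tagger.

    tag_word is a hypothetical callable: one word in, one POS tag out.
    cache_size=None lets the cache grow without bound.
    """
    cached_tag = lru_cache(maxsize=cache_size)(tag_word)

    def get_adjectives(row):
        if not isinstance(row, str):
            return ""
        # Each distinct word reaches tag_word at most once; repeats hit the cache.
        return " ".join(w for w in row.split() if cached_tag(w) in ADJECTIVE_TAGS)

    return get_adjectives

# Toy stand-in for a real tagger (assumption, for illustration only).
calls = []
def toy_tagger(word):
    calls.append(word)
    return "JJ" if word in {"big", "red"} else "NN"

get_adjectives = make_adjective_extractor(toy_tagger)
result = get_adjectives("big red dog big cat")
```

With NLTK, the tagger passed in would presumably be `lambda w: nltk.pos_tag([w])[0][1]`, the same call used above, and the returned function can be handed to `df[text_column].apply(...)` unchanged. How much this helps depends on how often words repeat in the data.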

This is a quick fix, but if you import swifter, it can look for a vectorized solution for you:

import swifter  # importing swifter registers the .swifter accessor on pandas objects

def get_adjectives(row):
    clean_row = ''
    if type(row) == str:
        for word in row.split():
            if nltk.pos_tag([word])[0][1] in ['JJ', 'JJR', 'JJS']:
                clean_row = clean_row + word + ' '
    return clean_row

df['adjectives'] = df[text_column].swifter.apply(get_adjectives)
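
Independently of swifter, the number of tagger calls can be cut from one per word to one per comment by tagging the whole token list at once. This is a sketch under assumptions rather than a drop-in replacement. `get_adjectives_batched` and `tag_tokens` are hypothetical names, `tag_tokens` is expected to return `(token, tag)` pairs in the same shape as `nltk.pos_tag`, and a toy tagger is used so the block runs without NLTK. Note that a real tagger sees the surrounding words when given a full list, so some tags may differ from the word-by-word version.

```python
ADJECTIVE_TAGS = frozenset({"JJ", "JJR", "JJS"})

def get_adjectives_batched(row, tag_tokens):
    """Return the adjectives in row, using one tagger call per row.

    tag_tokens is a hypothetical callable: list of tokens in,
    list of (token, tag) pairs out.
    """
    if not isinstance(row, str):
        return ""
    tagged = tag_tokens(row.split())  # single call for the whole comment
    return " ".join(word for word, tag in tagged if tag in ADJECTIVE_TAGS)

# Toy stand-in for a real tagger (assumption, for illustration only).
def toy_tag_tokens(tokens):
    return [(t, "JJ" if t in {"quick", "lazy"} else "NN") for t in tokens]

batched_result = get_adjectives_batched("the quick fox and the lazy dog", toy_tag_tokens)
```

With NLTK this could be applied as `df[text_column].apply(get_adjectives_batched, tag_tokens=nltk.pos_tag)`, since `Series.apply` forwards extra keyword arguments to the function.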
