Automatically multiprocess a 'function apply' on a dataframe column



I have a simple dataframe with two columns.

+---------+-------+
| subject | score |
+---------+-------+
| wow     | 0     |
+---------+-------+
| cool    | 0     |
+---------+-------+
| hey     | 0     |
+---------+-------+
| there   | 0     |
+---------+-------+
| come on | 0     |
+---------+-------+
| welcome | 0     |
+---------+-------+
For each record in the "subject" column, I call a function and store the result in the "score" column:

df['score'] = df['subject'].apply(find_score)
Here, find_score is a function that processes a string and returns a score:
def find_score(row):
    # Imports the Google Cloud client library
    from google.cloud import language
    # Instantiates a client
    language_client = language.Client()
    import re
    # Strip HTML tags, then replace every non-word character with a space
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)
    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score

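This works as expected, but it is slow because it processes the records one by one.

Is there a way to parallelize this without manually splitting the dataframe into smaller chunks? Is there a library that does this automatically?

Cheers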

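Instantiating language.Client on every call to find_score is likely a major bottleneck. You don't need a new client instance for each call, so try creating it once, outside the function, before you call it: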

import re

# Imports the Google Cloud client library
from google.cloud import language

# Instantiates a client once, outside the function
language_client = language.Client()

def find_score(row):
    # Strip HTML tags, then replace every non-word character with a space
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)
    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score

df['score'] = df['subject'].apply(find_score)
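
The text-cleaning part of find_score can be checked on its own, without calling the API. This is a minimal sketch, assuming the second pattern is meant to be the negated word-character class [^\w]; clean_text and the two pattern names are hypothetical helpers, not part of the original code.

```python
import re

# Compile both patterns once so they are not re-parsed on every call.
TAG_PATTERN = re.compile(r'<[^>]*>')      # anything that looks like an HTML tag
NON_WORD_PATTERN = re.compile(r'[^\w]')   # any single non-word character


def clean_text(raw):
    """Remove HTML tags, then replace each non-word character with a space."""
    without_tags = TAG_PATTERN.sub('', raw)
    return NON_WORD_PATTERN.sub(' ', without_tags)


cleaned = clean_text('<b>come on</b>!')
print(repr(cleaned))
```

Each non-word character, including an existing space, becomes exactly one space, so the cleaned string keeps the length of the tag-free text.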

If you still want to parallelize it, you can use multiprocessing like this:

from multiprocessing import Pool
# <Define functions and datasets here>
pool = Pool(processes = 8) # or some number of your choice
df['score'] = pool.map(find_score, df['subject'])
pool.terminate()
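
Since each call mostly waits on a network response, a thread pool is a possible alternative to the process pool above. The sketch below uses the standard library's concurrent.futures rather than multiprocessing. parallel_map and fake_score are hypothetical names, and fake_score is only a stand-in for find_score so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, values, max_workers=8):
    """Apply func to every item of values using a pool of threads.

    executor.map yields results in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, values))


def fake_score(text):
    # Hypothetical stand-in for find_score: returns the string length.
    return len(text)


subjects = ["wow", "cool", "hey", "there", "come on", "welcome"]
scores = parallel_map(fake_score, subjects)
print(scores)
```

With the dataframe from the question, the call would look like df['score'] = parallel_map(find_score, df['subject']). Whether a single client object can be shared safely between threads is an assumption to check against the client library's documentation.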
