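I have a dataset containing 8,500 rows of text. I want to apply a function, pre_process, to each of these rows. When I do this serially, it takes about 42 minutes on my computer: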
import pandas as pd
import time
import re
### constructing a sample dataframe of 10 rows to demonstrate
df = pd.DataFrame(columns=['text'])
df.text = ["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
"The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
"You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
'Yet the act is still charming here .',
"Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
"a screenplay more ingeniously constructed than `` Memento ''",
"`` Extreme Ops '' exceeds expectations ."]
def pre_process(text):
    '''
    function to pre-process and clean text
    '''
    stop_words = ['in', 'of', 'at', 'a', 'the']
    # lowercase
    text = str(text).lower()
    # remove special characters except spaces, apostrophes and dots
    text = re.sub(r"[^a-zA-Z0-9.']+", ' ', text)
    # remove stopwords
    text = [word for word in text.split(' ') if word not in stop_words]
    return ' '.join(text)
t = time.time()
for i in range(len(df)):
    df.text[i] = pre_process(df.text[i])
print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))
>>> Time taken for pre-processing the data = 41.95724259614944 mins
So I wanted to use multiprocessing for this task. I got help from here and wrote the following code:
import pandas as pd
import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())
def func(text):
    return pre_process(text)
t = time.time()
results = pool.map(func, [df.text[i] for i in range(len(df))])
print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))
pool.close()
But the code just keeps running and never stops.
How can I make it work?
You can use pandas' apply (here Series.apply, since df.text is a single column):
df.text= df.text.apply(pre_process)
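As a minimal, self-contained sketch of that suggestion (assuming pandas is installed, and using a two-row stand-in for the real 8,500-row DataFrame), the whole column can be cleaned in one call and assigned back:

```python
import re

import pandas as pd


def pre_process(text):
    """Lowercase the text, replace special characters, drop stop words."""
    stop_words = ['in', 'of', 'at', 'a', 'the']
    text = str(text).lower()
    # Anything that is not a letter, digit, dot or apostrophe becomes a space.
    text = re.sub(r"[^a-zA-Z0-9.']+", ' ', text)
    return ' '.join(word for word in text.split(' ') if word not in stop_words)


# Two rows taken from the sample data in the question.
df = pd.DataFrame({'text': ['Yet the act is still charming here .',
                            "`` Extreme Ops '' exceeds expectations ."]})

# apply calls pre_process once per row and returns a new Series.
df['text'] = df['text'].apply(pre_process)
print(df['text'].tolist())
```

Assigning the result to the column in one step also avoids writing to df.text[i] row by row inside a Python loop.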
The following code worked for me. Instead of func, I use pre_process directly. I also use a context manager (a with statement) for the pool.
Below is the code as run in IPython.
In [1]: from multiprocessing import Pool, TimeoutError
   ...: import multiprocessing as mp
   ...: import re
   ...: import time
   ...: import os
In [2]: text = ["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
   ...:         "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
   ...:         'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
   ...:         "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
   ...:         'Yet the act is still charming here .',
   ...:         "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
   ...:         'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
   ...:         'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
   ...:         "a screenplay more ingeniously constructed than `` Memento ''",
   ...:         "`` Extreme Ops '' exceeds expectations ."]
In [3]: def pre_process(text):
   ...:     '''
   ...:     function to pre-process and clean text
   ...:     '''
   ...:     stop_words = ['in', 'of', 'at', 'a', 'the']
   ...:
   ...:     # lowercase
   ...:     text = str(text).lower()
   ...:
   ...:     # remove special characters except spaces, apostrophes and dots
   ...:     text = re.sub(r"[^a-zA-Z0-9.']+", ' ', text)
   ...:
   ...:     # remove stopwords
   ...:     text = [word for word in text.split(' ') if word not in stop_words]
   ...:
   ...:     return ' '.join(text)
In [4]: %%time
   ...: result = []
   ...: for x in text:
   ...:     result.append(pre_process(x))
   ...:
   ...:
CPU times: user 500 µs, sys: 59 µs, total: 559 µs
Wall time: 569 µs
In [5]: %%time
   ...: with Pool(mp.cpu_count()) as pool:
   ...:     results = pool.map(pre_process, text)
   ...:
   ...:
CPU times: user 4.58 ms, sys: 29 ms, total: 33.6 ms
Wall time: 137 ms
In [6]: results
Out[6]:
["rock is destined to be 21st century 's new conan '' and that he 's going to make splash even greater than arnold schwarzenegger jean claud van damme or steven segal .",
"gorgeously elaborate continuation lord rings '' trilogy is so huge that column words can not adequately describe co writer director peter jackson 's expanded vision j.r.r. tolkien 's middle earth .",
'singer composer bryan adams contributes slew songs few potential hits few more simply intrusive to story but whole package certainly captures intended er spirit piece .',
"you 'd think by now america would have had enough plucky british eccentrics with hearts gold .",
'yet act is still charming here .',
"whether or not you 're enlightened by any derrida 's lectures on other '' and self '' derrida is an undeniably fascinating and playful fellow .",
'just labour involved creating layered richness imagery this chiaroscuro madness and light is astonishing .',
'part charm satin rouge is that it avoids obvious with humour and lightness .',
"screenplay more ingeniously constructed than memento ''",
" extreme ops '' exceeds expectations ."]
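%%time is an IPython magic that measures the execution time of a cell. Of course, with sample data this small, multiprocessing actually runs slower because of the overhead of creating new processes.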
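One plausible explanation for the hang in the question, offered as an assumption rather than a diagnosis: the pool there is created before func is defined, so forked workers may never see func, and on platforms that start fresh worker processes a script also needs an if __name__ == "__main__": guard. The sketch below shows the same idea as a standalone script. The helper name pre_process_all and the chunksize value are illustrative choices, not part of the original code.

```python
import re
from multiprocessing import Pool

STOP_WORDS = frozenset(['in', 'of', 'at', 'a', 'the'])


def pre_process(text):
    """Lowercase the text, replace special characters, drop stop words."""
    text = str(text).lower()
    text = re.sub(r"[^a-zA-Z0-9.']+", ' ', text)
    return ' '.join(w for w in text.split(' ') if w not in STOP_WORDS)


def pre_process_all(texts, processes=None, chunksize=256):
    """Hypothetical helper: clean every text, optionally in parallel.

    processes=1 skips the pool entirely, which is usually the faster
    option for small inputs because no worker processes are started.
    """
    if processes == 1:
        return [pre_process(t) for t in texts]
    # The pool is created only after pre_process exists at module level,
    # and the with block shuts the workers down when the map is finished.
    with Pool(processes) as pool:
        # chunksize batches several rows into each task sent to a worker.
        return pool.map(pre_process, texts, chunksize=chunksize)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this section
    # when they import the main module.
    sample = ['Yet the act is still charming here .',
              'Part of the charm of Satin Rouge is that it avoids the obvious .']
    print(pre_process_all(sample))
```

With 8,500 short rows the serial path (processes=1) is likely to win, for the overhead reason given above; the pool only pays off when each row takes substantial time to process.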
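In any case, with a pandas.DataFrame you can simply convert the column (a Series) to a list with list(), as shown below, instead of iterating over it by index, which is more efficient.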
list(df.text)
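For anyone who wants to check this outside IPython, here is a rough sketch using timeit (assuming pandas is installed). The Series s and the two helper functions are made up for illustration, and the absolute timings will differ from machine to machine.

```python
import timeit

import pandas as pd

# A stand-in Series of 1,000 short strings; the real column would be df.text.
s = pd.Series(['Yet the act is still charming here .',
               "`` Extreme Ops '' exceeds expectations ."] * 500)


def by_index(series):
    """Build the list with one label lookup per row."""
    return [series[i] for i in range(len(series))]


def by_list(series):
    """Let list() iterate over the Series directly."""
    return list(series)


# Both approaches produce the same list; only the cost differs.
assert by_index(s) == by_list(s)

t_index = timeit.timeit(lambda: by_index(s), number=20)
t_list = timeit.timeit(lambda: by_list(s), number=20)
print(f'index loop: {t_index:.4f} s, list(): {t_list:.4f} s')
```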
Below is a performance comparison between using list() and iterating over the Series by index the way you did. The totals are 197 µs vs 564 µs.
In [52]: %%time
...: [s[i] for i in range(len(s))]
...:
...:
CPU times: user 499 µs, sys: 65 µs, total: 564 µs
Wall time: 506 µs
Out[52]:
["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
"The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
"You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
'Yet the act is still charming here .',
"Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
"a screenplay more ingeniously constructed than `` Memento ''",
"`` Extreme Ops '' exceeds expectations ."]
In [53]: %%time
...: list(s)
...:
...:
CPU times: user 174 µs, sys: 23 µs, total: 197 µs
Wall time: 215 µs
Out[53]:
["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
"The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
"You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
'Yet the act is still charming here .',
"Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
"a screenplay more ingeniously constructed than `` Memento ''",
"`` Extreme Ops '' exceeds expectations ."]