Optimizing the run time of n-gram cleaning



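I'm trying to clean n-grams obtained from a text column. I also have 2 lists of stopwords to remove, but only in specific positions (when a stopword appears as the first word of the n-gram or as the last word of the n-gram). I also want to remove n-grams that contain only numbers or %. The code below takes more than 10 minutes to process about 1 million n-grams.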

import re

def clean_ngram(ng):
    if 'percent' in ng:
        ng = ng.replace('percent', '%')
    if 'point' in ng:
        ng = ng.replace('point', '.')
    # keep the n-gram only if it neither starts nor ends with a stopword
    # and is not made up solely of digits, '.', '%' and spaces
    if (ng.split(' ')[0] not in stopwords['First'].dropna().values
            and ng.split(' ')[-1] not in stopwords['Last '].dropna().values
            and (bool(re.match(r"^[0-9.% ]+$", ng)) == False)):
        return ng

df['Word'] = df['Word'].apply(lambda x: clean_ngram(x))

I also tried multiprocessing, but I had to kill the process after 30 minutes because it was still running. Here is the code for that:

p = Pool(processes=2)
df['Word'] = p.map(clean_ngram, df['Word'])
p.close()
p.join()

Is there any way to optimize my code to cut the run time substantially? Any help would be much appreciated. Thanks in advance :)

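In the code below I create random stopwords (including some NaNs) and random words. Then, using the functions I wrote, I measured how long the membership check takes when the pandas Series is converted to a numpy array on every check, as your code does. I also measured how long the same check takes when the stopwords are converted to a plain Python set of similar length.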

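As the output shows, the set version is almost 2000 times faster. A set membership test is a hash lookup, whereas dropna().values rebuilds the array and scans it linearly on every call.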

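So instead of using not in stopwords['First'].dropna().values, I suggest converting stopwords['First'].dropna().values to a set (with stopwords_first = set(stopwords['First'].dropna().values)) and then simply checking not in stopwords_first. Note that the conversion to a set must be done outside the function (so that you don't convert the stopwords again for every n-gram you check).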

This is just an example, and the speedup you get will depend heavily on your data, but I believe it will help.

# imports used by the snippets below
>>> import numpy as np
>>> import pandas as pd
>>> from time import time
# first, generate 450 random stopwords plus 50 NaNs
>>> stopwords = np.array(['word_num'+str(i) for i in range(450)]+[np.nan for _ in range(50)])
# shuffle the stopwords and print some of them
>>> stopwords = pd.Series(stopwords).sample(frac=1)
>>> stopwords
304    word_num304
84      word_num84
215    word_num215
438    word_num438
276    word_num276
...     
217    word_num217
280    word_num280
69      word_num69
365    word_num365
404    word_num404
Length: 500, dtype: object
# generate random words to check against the stopwords
>>> ngrams = ['word_num{}'.format(int(np.random.rand()*1000)) for _ in range(20000)]
>>> ngrams = np.array(ngrams)
>>> ngrams
array(['word_num642', 'word_num729', 'word_num901', ..., 'word_num940',
       'word_num616', 'word_num58'], dtype='<U11')
# define a function that checks whether each word is in the stopwords pd.Series (the same way you did)
# it also returns the time the check took
>>> def check_ngrams_Series(ngrams,stopwords):
...     func = lambda ng: ng in stopwords.dropna().values
...     time_begin = time()
...     result = list(map(func,ngrams))
...     time_end = time()
...     return np.array(result), time_end-time_begin
# define a function that checks whether each word is in the stopwords converted to a set
# it also returns the time the check took
>>> def check_ngrams_set(ngrams,stopwords):
...     func = lambda ng: ng in stopwords
...     time_begin = time()
...     result = list(map(func,ngrams))
...     time_end = time()
...     return np.array(result), time_end-time_begin
# run both functions
>>> series_out = check_ngrams_Series(ngrams,stopwords)
>>> sets_out = check_ngrams_set(ngrams,set(stopwords))
# check that their first outputs (word presence in the stopwords) are identical
>>> np.all(sets_out[0] == series_out[0])
True
# time taken (in seconds) by the function that uses a set
>>> sets_out[1]
0.008014917373657227
# time taken (in seconds) by the function that uses the pd.Series
>>> series_out[1]
15.30849814414978
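
Applied to the function from the question, the suggestion might look roughly like the sketch below. The stopword lists and sample n-grams here are made-up stand-ins (assumptions, not the asker's data); in the real code the two sets would be built once from stopwords['First'] and stopwords['Last '] as shown in the comments. Compiling the regex once and splitting the n-gram only once are small extra changes on top of the set conversion.

```python
import re

# Hypothetical stand-ins for the two stopword columns. With the DataFrame
# from the question these sets would instead be built once, outside the function:
#   stopwords_first = set(stopwords['First'].dropna().values)
#   stopwords_last = set(stopwords['Last '].dropna().values)
stopwords_first = {'the', 'a'}
stopwords_last = {'of', 'and'}

# Compile the "only digits, '.', '%' and spaces" pattern a single time.
NUMERIC_ONLY = re.compile(r"^[0-9.% ]+$")


def clean_ngram(ng):
    """Return the normalised n-gram, or None if it should be dropped."""
    ng = ng.replace('percent', '%').replace('point', '.')
    words = ng.split(' ')  # split once instead of twice
    if (words[0] not in stopwords_first
            and words[-1] not in stopwords_last
            and not NUMERIC_ONLY.match(ng)):
        return ng
    return None


# Made-up sample n-grams, for illustration only.
sample = ['the cat', 'cat sat', '5 percent', 'growth of', 'rate 3 point 5']
cleaned = [clean_ngram(ng) for ng in sample]
print(cleaned)
```

With the DataFrame from the question, the last step would stay df['Word'] = df['Word'].apply(clean_ngram); the lambda wrapper is not needed.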
