Reducing computation time in a nested loop



Suppose I have a DataFrame containing 3 rows of text:

#(needed libraries):
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
from keyphrase_vectorizers import KeyphraseCountVectorizer
import pandas as pd
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
st_model = KeyBERT(model=sentence_model)
data = pd.DataFrame({'text':['Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.','Physics is the natural science that studies matter, its fundamental constituents, its motion and behavior through space and time, and the related entities of energy and force. Physics is one of the most fundamental scientific disciplines, with its main goal being to understand how the universe behaves.','Chemistry is the branch of science that deals with the properties, composition, and structure of elements and compounds, how they can change, and the energy that is released or absorbed when they change.']})

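I want to go through each text (row) and pick the most relevant keywords that follow specific patterns. For this I wrote a double (nested) loop. It does the job, but it takes a lot of computation time. My approach is as follows: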

pt = []
patterns = ['<J.*>*<N.*>+', '<V.*>+', '<N.*>*<V.*>+', '<J.*>*<N.*>*<V.*>+']
for j in range(len(data)):
    for i in range(len(patterns)):
        vectorizer = KeyphraseCountVectorizer(pos_pattern=patterns[i])
        pt.append(st_model.extract_keywords(data.text.iloc[j], stop_words="english", vectorizer=vectorizer, use_mmr=True, diversity=0.4))

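As you can see, there are 4 patterns that should be applied to 3 rows of text. I would like to know how to make this nested loop computationally lighter, because my original dataset contains 1000 rows and this approach takes hours to run.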

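Keep in mind that I have provided the libraries and the exact context so that my problem is reproducible, but the real issue is in the second code block, where the nested loop is.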

Update: I have now also tried itertools.product:

pt = []
vecz = []
patterns = ['<J.*>*<N.*>+', '<V.*>+', '<N.*>*<V.*>+', '<J.*>*<N.*>*<V.*>+']
for i in range(len(patterns)):
    vectorizer = KeyphraseCountVectorizer(pos_pattern=patterns[i])
    vecz.append(vectorizer)

from itertools import product
for j, i in product(range(len(data)), range(len(vecz))):
    pt.append(st_model.extract_keywords(data.text.iloc[j], stop_words="english", vectorizer=vecz[i], use_mmr=True, diversity=0.4))

But the performance did not improve significantly.

With any optimization problem, the first thing you should do is measure. Run the following from a terminal:

python -m cProfile -s tottime yourscript.py

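Then look at the rows with the highest tottime in the resulting table.

Anything that accounts for less than 5% of the total time is probably not worth looking at.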
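If running the whole script under the profiler is inconvenient, the same measurement can be taken from inside Python with the standard `cProfile` and `pstats` modules. The following is a minimal sketch; `slow_square_sum` and `main` are hypothetical stand-ins for the real workload, not part of the question's code.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    """Deliberately naive work so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    # Hypothetical workload standing in for the real script.
    return [slow_square_sum(20_000) for _ in range(50)]


profiler = cProfile.Profile()
profiler.enable()
results = main()
profiler.disable()

# Sort by tottime (time spent in the function itself, excluding callees)
# and keep only the ten most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(10)
report = buffer.getvalue()
print(report)
```

The function that dominates the tottime column is where optimization effort should go first.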

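I have written a series of articles about profiling and improving Python programs. Here is a link to the first article. The rest of the series is listed at the bottom of that page.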

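Built-in Python functions and methods are usually not worth looking at, unless you can change their arguments. Consider the following string formatting operations: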

In [1]: f = 139.3592529296875;
In [3]: %timeit str(f)
724 ns ± 0.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit "{0}".format(f)
734 ns ± 1.21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit "{0:.5f}".format(f)
314 ns ± 0.0365 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit "{0:f}".format(f)
313 ns ± 7.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit "{0:e}".format(f)
382 ns ± 0.171 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

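Observe how using a format specifier cuts the conversion time in half. The reason I looked into this is that profiling showed one of my programs was spending about 50% of its time formatting strings...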
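Outside IPython, a comparable measurement can be made with the standard `timeit` module. This is only a sketch: the `candidates` dictionary is an illustrative helper, the lambdas add a small constant call overhead to every measurement, and the absolute numbers will differ between machines and Python versions.

```python
import timeit

f = 139.3592529296875

# Candidate conversions, mirroring the IPython session above.
candidates = {
    "str(f)": lambda: str(f),
    '"{0}".format(f)': lambda: "{0}".format(f),
    '"{0:.5f}".format(f)': lambda: "{0:.5f}".format(f),
}

calls_per_run = 100_000
timings = {}
for label, func in candidates.items():
    # repeat() returns one total per run; the minimum is the least noisy.
    runs = timeit.repeat(func, number=calls_per_run, repeat=5)
    timings[label] = min(runs) / calls_per_run  # seconds per call

# Print the candidates from fastest to slowest.
for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{label:<22} {seconds * 1e9:8.1f} ns per call")
```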

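If the bottleneck is in an external library, it is time to read its documentation carefully. There may be a faster way of doing things than the one you are using now.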

Optimization

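A general observation is that a better algorithm tends to beat micro-optimization. Since I am not familiar with the libraries you are using, I cannot give specific advice here.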

General advice for speeding up loops

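  • Move all invariants (things that do not change between iterations) outside the loop.
  • Can you create invariants by rearranging the nested loops?
  • If the iterations of the loop are independent, i.e. the result of one iteration does not depend on a previous one, consider running them in parallel with multiprocessing. This does not reduce the total CPU time, but it makes use of the multiple cores that practically every computer has nowadays, so the elapsed time can drop.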
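The first two points can be illustrated with a small, library-independent sketch. Here `build_matcher` is a hypothetical stand-in for any expensive setup step (such as constructing a vectorizer); the counter exists only to show how often that step runs in each variant.

```python
import re


def build_matcher(pattern, counter):
    """Hypothetical expensive setup step; counts how often it is called."""
    counter["builds"] += 1
    return re.compile(pattern)


def naive(texts, patterns):
    """Rebuilds the matcher for every (text, pattern) pair."""
    counter = {"builds": 0}
    results = []
    for text in texts:
        for pattern in patterns:
            matcher = build_matcher(pattern, counter)  # does not depend on text
            results.append(matcher.findall(text))
    return results, counter["builds"]


def hoisted(texts, patterns):
    """Builds each matcher once, outside the loops, and reuses it."""
    counter = {"builds": 0}
    matchers = [build_matcher(pattern, counter) for pattern in patterns]
    results = []
    for text in texts:
        for matcher in matchers:
            results.append(matcher.findall(text))
    return results, counter["builds"]


texts = ["alpha beta 42", "gamma 7 delta", "no digits here"]
patterns = [r"\d+", r"[a-z]+"]

naive_results, naive_builds = naive(texts, patterns)
hoisted_results, hoisted_builds = hoisted(texts, patterns)

# Same results, but the setup step runs once per pattern
# instead of once per (text, pattern) pair.
print(naive_builds, hoisted_builds)
```

Both variants return identical results; only the number of setup calls changes, which matters when the setup is costly.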

Two things will probably speed this up a lot:

  1. Create the vectorizers outside the loop.
  2. Pass the whole column to st_model.extract_keywords, which accepts a list of documents.

patterns = ['<J.*>*<N.*>+', '<V.*>+', '<N.*>*<V.*>+', '<J.*>*<N.*>*<V.*>+']
vectorizers = [KeyphraseCountVectorizer(pos_pattern=pattern) for pattern in patterns]
for i, vectorizer in enumerate(vectorizers):
    # Pass the whole column.
    data[f'pattern{i}'] = st_model.extract_keywords(data.text, stop_words="english", vectorizer=vectorizer, use_mmr=True, diversity=0.4)
print(data)

Output:

text                                           pattern0                                           pattern1                                           pattern2                                           pattern3
0  Machine learning (ML) is a type of artificial ...  [(machine learning algorithms, 0.6082), (ai, 0...  [(predicting, 0.3156), (programmed, 0.2338), (...  [(machine learning algorithms use, 0.6833), (p...  [(machine learning algorithms use, 0.6833), (p...
1  Physics is the natural science that studies ma...  [(physics, 0.554), (fundamental scientific dis...  [(matter, 0.3478), (behaves, 0.1066), (underst...  [(physics is, 0.64), (studies matter, 0.3398),...  [(physics is, 0.64), (universe behaves, 0.3568...
2  Chemistry is the branch of science that deals ...  [(chemistry, 0.6888), (science, 0.4082), (elem...  [(absorbed, 0.2731), (change, 0.1097), (is rel...  [(chemistry is, 0.7016), (absorbed, 0.2731), (...  [(chemistry is, 0.7016), (absorbed, 0.2731), (...
