是否可以优化 numba 函数中的这个循环以运行得更快

在我的python代码中，我需要循环大约 2500 万次，我想尽可能多地优化。循环中的操作非常简单。为了使代码高效，我使用了numba模块，它有很大帮助，但如果可能的话，我想进一步优化代码。

这是一个完整的工作示例：

import numba as nb
import numpy as np
import time 
#######create some synthetic data for illustration purpose##################
size=5000
eps = 0.2
theta_c = 0.4
temp = np.ones(size)
neighbour = np.random.randint(size, size=(size, 3)) 
coschi = np.random.random_sample((size))
theta = np.random.random_sample((size))*np.pi/2
pwr = np.cos(theta)
###################end of dummy data##########################
###################-----main loop------###############
@nb.jit(fastmath=True)
def func(theta, pwr, neighbour, coschi, temp):
for k in range(np.argmax(pwr), 5000*(pwr.size)): 
n = k%pwr.size
if (np.abs(theta[n]-np.pi/2.)<np.abs(theta_c)):
adj = neighbour[n,1]
else:
adj = neighbour[n,0]
psi_diff = np.abs(np.arccos(coschi[adj])-np.arccos(coschi[n]))
temp5 = temp[adj]**5;
e_temp = 1.- np.exp(-temp5*psi_diff/np.abs(eps))
temp[n] = temp[adj] + (e_temp)/temp5*(pwr[n] - temp[adj]**4)
return temp
#check time
time1 = time.time()
temp = func(theta, pwr, neighbour, coschi, temp)
print("Took: ", time.time()-time1, " seconds.")

这需要我的机器3.49 seconds。

出于某种模型拟合目的，我需要运行此代码数千次，因此即使是 1 秒的优化也意味着为我节省数十小时。

可以做些什么来进一步优化此代码？

让我从一些一般性评论开始：

如果您使用 numba

并且真正关心性能，则应避免 numba 创建对象模式代码的任何可能性。这意味着您应该使用numba.njit(...)或numba.jit(nopython=True, ...)而不是numba.jit(...).
这在您的情况下没有区别，但它使意图更清晰，并在(快速)nopython 模式下不支持某些内容时立即引发异常。
你应该小心你的时间和方式。对 numba-jitted 函数(未提前编译)的首次调用将包括编译成本。因此，您需要在计时之前运行一次以获得准确的计时。为了获得更准确的计时，您应该多次调用该函数。我喜欢 Jupyter Notebooks/Lab 中%timeit的 IPythons 来大致了解性能。
所以我将使用：
```
res1 = func(theta, pwr, neighbour, coschi, np.ones(size))
res2 = # other approach
np.testing.assert_allclose(res1, res2)
%timeit func(theta, pwr, neighbour, coschi, np.ones(size))
%timeit # other approach
```
这样，我将第一次调用(包括编译时间)与断言一起使用，以确保它确实产生(几乎)相同的输出，然后使用更健壮的计时方法(与time相比)对函数进行计时。

吊装`np.arccos`

现在让我们从一些实际的性能优化开始：一个明显的是你可以提升一些"不变量"，例如，np.arccos(coschi[...])的计算频率比coschi中的实际元素要高得多。你以大约 5000 次coschi迭代每个元素，它每次循环执行两个np.arccos！因此，让我们计算一次coschi的arccos并将其存储在中间数组中，以便可以在循环中访问它：

@nb.njit(fastmath=True)
def func2(theta, pwr, neighbour, coschi, temp):
arccos_coschi = np.arccos(coschi)
for k in range(np.argmax(pwr), 5000 * pwr.size): 
n = k % pwr.size
if np.abs(theta[n] - np.pi / 2.) < np.abs(theta_c):
adj = neighbour[n, 1]
else:
adj = neighbour[n, 0]
psi_diff = np.abs(arccos_coschi[adj] - arccos_coschi[n])
temp5 = temp[adj]**5;
e_temp = 1. - np.exp(-temp5 * psi_diff / np.abs(eps))
temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
return temp

在我的电脑上，这已经快得多了：

1.73 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # original
811 ms ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func2

然而，这是有代价的：结果会有所不同！使用原始版本和带有fastmath=True的提升版本，我始终得到显着不同的结果。但是，结果(几乎)等于fastmath=False。似乎fastmath可以使用np.arccos(coschi[adj]) - np.arccos(coschi[n])进行一些严格的优化，这在吊装np.arccos时是不可能的。在我个人看来，如果您关心确切的结果，或者您已经测试过结果的准确性不受快速数学的显着影响，我会忽略fastmath=True！

吊装`adj`

接下来要吊装的是adj，它的计算频率也比必要的多得多：

@nb.njit(fastmath=True)
def func3(theta, pwr, neighbour, coschi, temp):
arccos_coschi = np.arccos(coschi)
associated_neighbour = np.empty(neighbour.shape[0], nb.int64)
for idx in range(neighbour.shape[0]):
if np.abs(theta[idx] - np.pi / 2.) < np.abs(theta_c):
associated_neighbour[idx] = neighbour[idx, 1]
else:
associated_neighbour[idx] = neighbour[idx, 0]
for k in range(np.argmax(pwr), 5000 * pwr.size): 
n = k % pwr.size
adj = associated_neighbour[n]
psi_diff = np.abs(arccos_coschi[adj] - arccos_coschi[n])
temp5 = temp[adj]**5;
e_temp = 1. - np.exp(-temp5 * psi_diff / np.abs(eps))
temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
return temp

这样做的效果不是那么大，但可以衡量：

1.75 s ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # original
761 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func2
660 ms ± 8.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func3

提升额外的计算似乎对我计算机的性能没有影响，所以我不在这里包括它们。所以这似乎是你在不改变算法的情况下能走多远。

重构为更小的功能(+小的附加更改)

但是，我建议在其他函数中分离提升并使所有变量成为函数参数，而不是查找全局变量。这可能不会导致加速，但它可以使代码更具可读性：

@nb.njit
def func4_inner(indices, pwr, associated_neighbour, arccos_coschi, temp, abs_eps):
for n in indices:
adj = associated_neighbour[n]
psi_diff = np.abs(arccos_coschi[adj] - arccos_coschi[n])
temp5 = temp[adj]**5;
e_temp = 1. - np.exp(-temp5 * psi_diff / abs_eps)
temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
return temp
@nb.njit
def get_relevant_neighbor(neighbour, abs_theta_minus_pi_half, abs_theta_c):
associated_neighbour = np.empty(neighbour.shape[0], nb.int64)
for idx in range(neighbour.shape[0]):
if abs_theta_minus_pi_half[idx] < abs_theta_c:
associated_neighbour[idx] = neighbour[idx, 1]
else:
associated_neighbour[idx] = neighbour[idx, 0]
return associated_neighbour
def func4(theta, pwr, neighbour, coschi, temp, theta_c, eps):
arccos_coschi = np.arccos(coschi)
abs_theta_minus_pi_half = np.abs(theta - (np.pi / 2.))
relevant_neighbors = get_relevant_neighbor(neighbour, abs_theta_minus_pi_half, abs(theta_c))
argmax_pwr = np.argmax(pwr)
indices = np.tile(np.arange(pwr.size), 5000)[argmax_pwr:]
return func4_inner(indices, pwr, relevant_neighbors, arccos_coschi, temp, abs(eps))

在这里，我还做了一些额外的更改：

事先使用np.tile和切片而不是range方法和%计算指数。
使用普通的 NumPy(numba 之外)来计算np.arccos。

最终时间和摘要

1.79 s ± 49.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # original
844 ms ± 41.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func2
707 ms ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func3
550 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func4

所以最后最新的方法大约比原来的方法快3倍(没有fastmath)。如果您确定要使用fastmath，那么只需在func4_inner上应用fastmath=True，它会更快：

499 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func4 with fastmath on func4_inner

但是，正如我已经说过的fastmath如果您想要确切(或至少不太不准确)的结果，这可能不合适。

此外，这里的一些优化在很大程度上取决于可用的硬件和处理器缓存(特别是对于代码中内存带宽有限的部分)。您必须检查这些方法在计算机上的相对性能。

Numba 真的很棒。但是你很绝望，记住你总是可以用C(youtube)写的。在我自己的问题上，我仅仅通过逐行翻译成 C 就比 numba 提高了 30%。

如果你想花这个精力，我建议使用特征进行向量运算(在编译时知道向量大小)和pybind11，因为它在numpy和特征之间原生转换。当然，将你的主循环保持在Python中。确保使用适当的编译器标志(如-O3-march=native、-mtune=native、-ffast-math)并尝试不同的编译器(对我来说gcc输出比clang快 2 倍，但同事报告了相反的情况)。

如果你不知道任何C++那么将自己限制在纯C而不是没有库可能更聪明(因为它降低了复杂性)。但是你将直接处理Python和numpy C API(不是那么难，但更多的代码，你会学到所有关于Python内部的知识)。

看起来您可能正在处理示例中的大量重复项。

在这个版本中，我没有重新计算我们已经看到的"n"的任何值。

我不知道这在您的情况下是否可以，但它为我节省了 ~0.4 秒。

#!/usr/bin/env python

import numba as nb
import numpy as np
import time
#######create some synthetic data for illustration purpose##################
size = 5000
eps = 0.2
theta_c = 0.4
temp = np.ones(size)
neighbour = np.random.randint(size, size=(size, 3))
coschi = np.random.random_sample((size))
theta = np.random.random_sample((size))*np.pi/2
pwr = np.cos(theta)
###################end of dummy data##########################
###################-----main loop------###############
@nb.jit(fastmath=True)
def func(theta, pwr, neighbour, coschi, temp):
hashtable = {}
for k in range(np.argmax(pwr), 5000*(pwr.size)):
n = k % pwr.size
if not hashtable.get(n, False):
hashtable[n] = 1
#taking into account regions with different super wind direction
if (np.abs(theta[n]-np.pi/2.) < np.abs(theta_c)):
adj = neighbour[n, 1]
else:
adj = neighbour[n, 0]
psi_diff = np.abs(np.arccos(coschi[adj])-np.arccos(coschi[n]))
temp5 = temp[adj]**5
e_temp = 1. - np.exp(-temp5*psi_diff/np.abs(eps))
retval = temp[adj] + (e_temp)/temp5*(pwr[n] - temp[adj]**4)
temp[n] = retval

return temp

#check time
time1 = time.time()
result = func(theta, pwr, neighbour, coschi, temp)
print("Took: ", time.time()-time1, "

原件：哈希表

2.3726098537445070 ： 1.8722639083862305

2.3447792530059814 ： 1.9053585529327393

2.3363733291625977 ： 1.9104151725769043

2.3447978496551514 ： 1.9298338890075684

2.4740016460418700 ： 1.9088914394378662

使用 np.ones 的 25M 项的裸骨循环：

#!/usr/bin/env python

import numba as nb
import numpy as np
import time
temp = np.ones(25000000)
@nb.jit(fastmath=True)
def func(temp):
return [n for n in temp]
time1 = time.time()
result = func(temp)
print("Took: ", time.time()-time1, " seconds for ", len(temp), " items")

获取：1.2502222061157227 秒，25000000 个项目

已获取：1.294729232788086 秒，用于 25000000 个项目

已获取：1.2670648097991943 秒，用于 25000000 个项目

已执行：1.2386720180511475 秒，用于 25000000 个项目

已获取：1.2517566680908203 秒，用于 25000000 个项目

吊装`np.arccos`

吊装`adj`

重构为更小的功能(+小的附加更改)

最终时间和摘要

相关内容

最新更新

热门标签：

是否可以优化 numba 函数中的这个循环以运行得更快

吊装np.arccos

吊装adj

重构为更小的功能(+小的附加更改)

最终时间和摘要

相关内容

最新更新

热门标签：

吊装`np.arccos`

吊装`adj`