How can I use numpy and numba to improve the performance of a Python script?




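I am trying to convert decimal numbers to a base-21 number system.

Input: [15, 18, 28, 11, 7, 5, 41, 139, 6, 507]

Output: [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]

My script works fine on the CPU.

How can I modify the script? I want to use the GPU to improve performance.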

import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import numba as nb
elements = [
    "n|0",
    "n|1",
    "n|2",
    "n|3",
    "n|4",
    "n|5",
    "n|6",
    "n|7",
    "n|8",
    "n|9",
    "n|10",
    "o|+",
    "o|*",
    "o|/",
    "om|-",
    "bl|(",
    "br|)",
    "e|**2",
    "e|**3",
    "e|**0.5",
    "e|**(1/3)",
]
elements_len = len(elements)


def decimal_to_custom(number):
    x = number % elements_len
    ch = [x]
    if number - x != 0:
        return decimal_to_custom(number // elements_len) + ch
    else:
        return ch


decimal_numbers = np.array([15, 18, 28, 11, 7, 5, 41, 139, 6, 507])  # very big array
custom_numers = []
for decimal_number in decimal_numbers:
    custom_numer = decimal_to_custom(decimal_number)
    custom_numers.append(custom_numer)
print(custom_numers)

Your code can be recapped as:

import numpy as np

def decimal_to_custom(number, k):
    x = number % k
    ch = [x]
    if number - x != 0:
        return decimal_to_custom(number // k, k) + ch
    else:
        return ch


def remainders_OP(arr, k):
    result = []
    for value in arr:
        result.append(decimal_to_custom(value, k))
    return result


decimal_numbers = np.array([15, 18, 28, 11, 7, 5, 41, 139, 6, 507])  # very big array
print(remainders_OP(decimal_numbers, elements_len))
# [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]

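This code can be sped up by replacing the costly recursive implementation of decimal_to_custom() with a simpler, iterative version, mod_list(), which appends and then reverses instead of doing the very expensive head insertion used in the OP (essentially equivalent to list.insert(0, x)):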

def mod_list(x, k):
    result = []
    while x >= k:
        result.append(x % k)
        x //= k
    result.append(x)
    return result[::-1]


def remainders(arr, k):
    result = []
    for x in arr:
        result.append(mod_list(x, k))
    return result


print(remainders(decimal_numbers, elements_len))
# [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
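
One quick way to sanity-check any of these variants is to convert the digit lists back to decimal and compare. The helper custom_to_decimal() below is not part of the original code; it is a hypothetical inverse, written only for illustration, that folds the digits back together with Horner's rule.

```python
def mod_list(x, k):
    """Digits of x in base k, most significant first (same logic as above)."""
    result = []
    while x >= k:
        result.append(x % k)
        x //= k
    result.append(x)
    return result[::-1]


def custom_to_decimal(digits, k):
    """Hypothetical inverse of mod_list(): fold the digits back into an integer."""
    value = 0
    for d in digits:
        value = value * k + d
    return value


k = 21
for number in [15, 18, 28, 11, 7, 5, 41, 139, 6, 507]:
    digits = mod_list(number, k)
    # The round trip must reproduce the original number.
    assert custom_to_decimal(digits, k) == number

print(mod_list(507, k))  # [1, 3, 3], i.e. 1 * 21**2 + 3 * 21 + 3
```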

Now, both can be accelerated with Numba to get some speed-up:

import numba as nb

@nb.njit
def mod_list_nb(x, k):
    result = []
    while x >= k:
        result.append(x % k)
        x //= k
    result.append(x)
    return result[::-1]


@nb.njit
def remainders_nb(arr, k):
    result = []
    for x in arr:
        result.append(mod_list_nb(x, k))
    return result


print(remainders_nb(decimal_numbers, elements_len))
# [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]

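A number of options can be passed to the decorator, including target_backend="cuda", which is intended to have the computation run on the GPU. As we shall see in the benchmarks, this is not going to be beneficial. The reason is that list.append() (as well as list.insert()) cannot easily be run in parallel, so you cannot easily exploit the massive parallelism of GPUs.

In any case, the above solutions are slowed down by the choice of the underlying data container.

Using fixed-size arrays instead of dynamically growing a list at each iteration results in much faster execution. Here m is the number of digits per value and must be large enough to hold the longest result: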

def remainders_fixed_np(arr, k, m):
    arr = arr.copy()  # do not modify the caller's array
    n = len(arr)
    result = np.empty((n, m), dtype=np.int_)
    # Fill the columns from the least significant digit to the most significant one.
    for i in range(m - 1, -1, -1):
        result[:, i] = arr % k
        arr //= k
    return result


print(remainders_fixed_np(decimal_numbers, elements_len, 3).T)
# [[ 0  0  0  0  0  0  0  0  0  1]
#  [ 0  0  1  0  0  0  1  6  0  3]
#  [15 18  7 11  7  5 20 13  6  3]]

Or, with Numba acceleration (and avoiding unnecessary computations):

@nb.njit
def remainders_fixed_nb(arr, k, m):
    n = len(arr)
    result = np.zeros((n, m), dtype=np.int_)
    for i in range(n):
        j = m - 1
        x = arr[i]
        while x >= k:
            q, r = divmod(x, k)
            result[i, j] = r
            x = q
            j -= 1
        result[i, j] = x
    return result


print(remainders_fixed_nb(decimal_numbers, elements_len, 3).T)
# [[ 0  0  0  0  0  0  0  0  0  1]
#  [ 0  0  1  0  0  0  1  6  0  3]
#  [15 18  7 11  7  5 20 13  6  3]]
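
The fixed-size variants take the width m as given (3 in the calls above). As a minimal sketch, assuming the largest input value is known in advance, m could be computed as below, and the zero padding could be stripped afterwards to recover the variable-length lists. Both min_digits() and strip_padding() are hypothetical helper names introduced here for illustration; they are not part of the original answer.

```python
def min_digits(max_value, k):
    """Smallest m such that max_value < k ** m (never less than 1)."""
    m = 1
    while max_value >= k:
        max_value //= k
        m += 1
    return m


def strip_padding(rows):
    """Drop leading zeros from each fixed-width row, keeping at least one digit."""
    result = []
    for row in rows:
        digits = [int(d) for d in row]  # also accepts rows of a NumPy array
        start = 0
        while start < len(digits) - 1 and digits[start] == 0:
            start += 1
        result.append(digits[start:])
    return result


values = [15, 18, 28, 11, 7, 5, 41, 139, 6, 507]
print(min_digits(max(values), 21))  # 3, since 21**2 <= 507 < 21**3
print(strip_padding([[0, 0, 15], [0, 6, 13], [1, 3, 3]]))
# [[15], [6, 13], [1, 3, 3]]
```

Note that stripping the padding reintroduces Python lists, so it is best done only at the point where the variable-length form is actually needed.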

Some benchmarks

Here are some benchmarks, run on Google Colab, that give indicative timings, where:

  • the _nb suffix indicates Numba acceleration
  • the _pnb suffix indicates Numba acceleration with parallel=True and the outermost range() replaced with nb.prange()
  • the _cunb suffix indicates Numba acceleration with the CUDA target, target_backend="cuda"
  • the _cupnb suffix indicates Numba acceleration with both parallelization and the CUDA target

k = elements_len  # the base used throughout (21)
m = 4
n = 100000
arr = np.random.randint(1, k ** m - 1, n)
funcs = remainders_OP, remainders, remainders_nb, remainders_cunb
base = funcs[0](arr, k)
for func in funcs:
    res = func(arr, k)
    is_good = base == res
    print(f"{func.__name__:>16s}  {is_good!s:>5s}  ", end="")
    %timeit -n 4 -r 4 func(arr, k)
#    remainders_OP   True  333 ms ± 4.38 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
#       remainders   True  268 ms ± 5.11 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
#    remainders_nb   True  46.9 ms ± 3.16 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
#  remainders_cunb   True  46.4 ms ± 1.71 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
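
%timeit is an IPython magic, so the loop above only runs in a notebook or an IPython shell. As a rough equivalent for a plain Python script, the sketch below compares the two pure-Python variants with the standard-library timeit module. The smaller n, the fixed seed, and the repeat counts are arbitrary choices made for illustration, and the timings it prints will differ from the ones reported above.

```python
import random
import timeit


def decimal_to_custom(number, k):
    """Recursive version from the question, generalized to base k."""
    x = number % k
    ch = [x]
    if number - x != 0:
        return decimal_to_custom(number // k, k) + ch
    return ch


def mod_list(x, k):
    """Iterative version: append the remainders, then reverse once."""
    result = []
    while x >= k:
        result.append(x % k)
        x //= k
    result.append(x)
    return result[::-1]


k, m, n = 21, 4, 10_000  # n is smaller than above to keep the run short
random.seed(0)
values = [random.randrange(1, k ** m) for _ in range(n)]

# Both versions must agree before it makes sense to time them.
assert [decimal_to_custom(v, k) for v in values] == [mod_list(v, k) for v in values]

for func in (decimal_to_custom, mod_list):
    # Best of 3 repeats of 5 calls each, reported per pass over the whole list.
    best = min(timeit.repeat(lambda: [func(v, k) for v in values], number=5, repeat=3))
    print(f"{func.__name__:>18s}  {best / 5 * 1e3:.2f} ms per loop")
```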

fixed_funcs = (
    remainders_fixed_np,
    remainders_fixed_nb,
    remainders_fixed_pnb,
    remainders_fixed_cunb,
    remainders_fixed_cupnb,
)
base = fixed_funcs[0](arr, k, m)
for func in fixed_funcs:
    res = func(arr, k, m)
    is_good = np.all(base == res)
    print(f"{func.__name__:>24s}  {is_good!s:>5s}  ", end="")
    %timeit -n 8 -r 8 func(arr, k, m)
#      remainders_fixed_np   True  10 ms ± 2.09 ms per loop (mean ± std. dev. of 8 runs, 8 loops each)
#      remainders_fixed_nb   True  3.6 ms ± 315 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
#     remainders_fixed_pnb   True  2.68 ms ± 550 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
#    remainders_fixed_cunb   True  3.49 ms ± 192 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
#   remainders_fixed_cupnb   True  2.63 ms ± 314 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)

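This indicates that running on the GPU has minimal impact. The largest speed-up comes from changing the data container to a pre-allocated one. Numba acceleration provides some speed-up for both the dynamically allocated and the pre-allocated versions.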
