How can I use NumPy and Numba to improve the performance of a Python script?
I am trying to convert decimal numbers into a base-21 digit system.
Input: [15, 18, 28, 11, 7, 5, 41, 139, 6, 507]
Output: [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
My script works fine on the CPU. How should I modify it? I would like to use the GPU to improve performance.
import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import numba as nb
elements = [
    "n|0",
    "n|1",
    "n|2",
    "n|3",
    "n|4",
    "n|5",
    "n|6",
    "n|7",
    "n|8",
    "n|9",
    "n|10",
    "o|+",
    "o|*",
    "o|/",
    "om|-",
    "bl|(",
    "br|)",
    "e|**2",
    "e|**3",
    "e|**0.5",
    "e|**(1/3)",
]
elements_len = len(elements)
def decimal_to_custom(number):
    x = (number % elements_len)
    ch = [x]
    if (number - x != 0):
        return decimal_to_custom(number // elements_len) + ch
    else:
        return ch

decimal_numbers = np.array([15, 18, 28, 11, 7, 5, 41, 139, 6, 507])  # very big array
custom_numbers = []
for decimal_number in decimal_numbers:
    custom_number = decimal_to_custom(decimal_number)
    custom_numbers.append(custom_number)
print(custom_numbers)
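As a quick sanity check (a standalone helper, not part of the script above), each digit list can be mapped back to its decimal value with Horner's rule, confirming the conversion is correct:

```python
def custom_to_decimal(digits, k=21):
    # Interpret a digit list (most-significant digit first) in base k.
    value = 0
    for d in digits:
        value = value * k + d
    return value

# e.g. custom_to_decimal([1, 3, 3]) recovers 507, since 1*21**2 + 3*21 + 3 == 507
```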
Your code can be generalized as:
import numpy as np
def decimal_to_custom(number, k):
    x = (number % k)
    ch = [x]
    if (number - x != 0):
        return decimal_to_custom(number // k, k) + ch
    else:
        return ch

def remainders_OP(arr, k):
    result = []
    for value in arr:
        result.append(decimal_to_custom(value, k))
    return result
decimal_numbers = np.array([15, 18, 28, 11, 7, 5, 41, 139, 6, 507]) #very big array
print(remainders_OP(decimal_numbers, elements_len))
# [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
This code can be sped up by replacing the expensive recursive implementation of decimal_to_custom() with a simpler iterative version, mod_list(), which appends and then reverses, instead of the very costly head insertion (equivalent to list.insert(0, x)) implemented in the OP:
def mod_list(x, k):
    result = []
    while x >= k:
        result.append(x % k)
        x //= k
    result.append(x)
    return result[::-1]

def remainders(arr, k):
    result = []
    for x in arr:
        result.append(mod_list(x, k))
    return result
print(remainders(decimal_numbers, elements_len))
# [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
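To make the cost difference concrete, here is a standalone sketch (not from the original answer) of the head-insertion strategy that the recursion effectively performs: each insert(0, ...) shifts all existing elements, so building an m-digit result costs O(m**2) element moves, while append-and-reverse is O(m):

```python
def mod_list_insert(x, k):
    # Head-insertion variant: every insert(0, ...) shifts the whole list.
    result = []
    while x >= k:
        result.insert(0, x % k)
        x //= k
    result.insert(0, x)
    return result

# Produces the same digits as the append-and-reverse version,
# just with a worse cost profile on long results.
```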
Now, both can be accelerated with Numba for some extra speed:
import numba as nb
@nb.njit
def mod_list_nb(x, k):
    result = []
    while x >= k:
        result.append(x % k)
        x //= k
    result.append(x)
    return result[::-1]

@nb.njit
def remainders_nb(arr, k):
    result = []
    for x in arr:
        result.append(mod_list_nb(x, k))
    return result
print(remainders_nb(decimal_numbers, elements_len))
# [[15], [18], [1, 7], [11], [7], [5], [1, 20], [6, 13], [6], [1, 3, 3]]
Many options can be passed to the decorator, including target_backend="cuda" to have the computation run on the GPU. As we will see in the benchmarks, this is not beneficial. The reason is that list.append() (as well as list.insert()) does not parallelize easily, so you cannot readily exploit the massive parallelism of the GPU!
In any case, the solutions above are slowed down by the choice of the underlying data container. Using a fixed-size array instead of a list that grows dynamically at each iteration leads to faster execution:
def remainders_fixed_np(arr, k, m):
    arr = arr.copy()
    n = len(arr)
    result = np.empty((n, m), dtype=np.int_)
    for i in range(m - 1, -1, -1):
        result[:, i] = arr % k
        arr //= k
    return result
print(remainders_fixed_np(decimal_numbers, elements_len, 3).T)
# [[ 0 0 0 0 0 0 0 0 0 1]
# [ 0 0 1 0 0 0 1 6 0 3]
# [15 18 7 11 7 5 20 13 6 3]]
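One difference from the earlier functions: the fixed-size version zero-pads every number to m digits. If the ragged lists of the original output are needed, a small helper (not part of the original answer) can strip the padding afterwards:

```python
def trim_leading_zeros(rows):
    # Convert zero-padded digit rows back into ragged digit lists,
    # keeping at least one digit per row.
    out = []
    for row in rows:
        digits = list(row)
        while len(digits) > 1 and digits[0] == 0:
            digits.pop(0)
        out.append(digits)
    return out
```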
Or, with Numba acceleration (and avoiding unnecessary computations):
@nb.njit
def remainders_fixed_nb(arr, k, m):
    n = len(arr)
    result = np.zeros((n, m), dtype=np.int_)
    for i in range(n):
        j = m - 1
        x = arr[i]
        while x >= k:
            q, r = divmod(x, k)
            result[i, j] = r
            x = q
            j -= 1
        result[i, j] = x
    return result
print(remainders_fixed_nb(decimal_numbers, elements_len, 3).T)
# [[ 0 0 0 0 0 0 0 0 0 1]
# [ 0 0 1 0 0 0 1 6 0 3]
# [15 18 7 11 7 5 20 13 6 3]]
---
Some benchmarks
Now, some benchmarks run on Google Colab give indicative timings, where:
- the _nb ending indicates Numba acceleration
- the _pnb ending indicates Numba acceleration with parallel=True and the outermost range() replaced by nb.prange()
- the _cunb ending indicates Numba acceleration with the CUDA target, target_backend="cuda"
- the _cupnb ending indicates Numba acceleration with both parallelization and the CUDA target
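The parallel and CUDA variants used in the benchmarks are only described, not shown. Here is a minimal sketch of what remainders_fixed_pnb would look like under those assumptions (with a plain-Python fallback so the sketch stays runnable when Numba is not installed; with Numba, prange parallelizes the outer loop):

```python
import numpy as np

try:
    import numba as nb
    njit_parallel = nb.njit(parallel=True)
    prange = nb.prange
except ImportError:  # fallback: run the same logic in plain Python
    njit_parallel = lambda f: f
    prange = range

@njit_parallel
def remainders_fixed_pnb(arr, k, m):
    n = len(arr)
    result = np.zeros((n, m), dtype=np.int_)
    for i in prange(n):  # outermost loop, parallelized under Numba
        j = m - 1
        x = arr[i]
        while x >= k:
            q, r = divmod(x, k)
            result[i, j] = r
            x = q
            j -= 1
        result[i, j] = x
    return result
```

The _cunb/_cupnb variants would be the same function decorated with target_backend="cuda" added to the decorator options; they require a CUDA-capable GPU.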
k = elements_len  # 21
m = 4
n = 100000
arr = np.random.randint(1, k ** m - 1, n)

funcs = remainders_OP, remainders, remainders_nb, remainders_cunb
base = funcs[0](arr, k)
for func in funcs:
    res = func(arr, k)
    is_good = base == res
    print(f"{func.__name__:>16s} {is_good!s:>5s} ", end="")
    %timeit -n 4 -r 4 func(arr, k)
# remainders_OP True 333 ms ± 4.38 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
# remainders True 268 ms ± 5.11 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
# remainders_nb True 46.9 ms ± 3.16 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
# remainders_cunb True 46.4 ms ± 1.71 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
fixed_funcs = remainders_fixed_np, remainders_fixed_nb, remainders_fixed_pnb, remainders_fixed_cunb, remainders_fixed_cupnb
base = fixed_funcs[0](arr, k, m)
for func in fixed_funcs:
    res = func(arr, k, m)
    is_good = np.all(base == res)
    print(f"{func.__name__:>24s} {is_good!s:>5s} ", end="")
    %timeit -n 8 -r 8 func(arr, k, m)
# remainders_fixed_np True 10 ms ± 2.09 ms per loop (mean ± std. dev. of 8 runs, 8 loops each)
# remainders_fixed_nb True 3.6 ms ± 315 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
# remainders_fixed_pnb True 2.68 ms ± 550 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
# remainders_fixed_cunb True 3.49 ms ± 192 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
# remainders_fixed_cupnb True 2.63 ms ± 314 µs per loop (mean ± std. dev. of 8 runs, 8 loops each)
This indicates that running on the GPU has minimal impact. The largest speedup is obtained by switching to a preallocated data container. Numba provides some speedup for both the dynamically-allocated and the preallocated versions.