Understanding shared memory usage to improve performance in Numba



I am trying to learn more about using shared memory to improve the performance of some CUDA kernels in Numba. To that end, I looked at the matrix multiplication example in the Numba documentation and tried implementing it to see the gain.

This is my test implementation. I know the example in the documentation has some issues, which I followed Here, so I copied the fixed example code.

from timeit import default_timer as timer
import numba
from numba import cuda, jit, int32, int64, float64, float32
import numpy as np
from numpy import *

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()

    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

size = 1024 * 4
tpbx, tpby = 16, 16
tpb = (tpbx, tpby)
bpgx, bpgy = int(size / tpbx), int(size / tpby)
bpg = (bpgx, bpgy)

a_in = cuda.to_device(np.arange(size * size, dtype=np.float32).reshape((size, size)))
b_in = cuda.to_device(np.ones(size * size, dtype=np.float32).reshape((size, size)))
c_out1 = cuda.device_array_like(a_in)
c_out2 = cuda.device_array_like(a_in)

s = timer()
cuda.synchronize()
matmul[bpg, tpb](a_in, b_in, c_out1)
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host1 = c_out1.copy_to_host()
print(c_host1)

s = timer()
cuda.synchronize()
fast_matmul[bpg, tpb](a_in, b_in, c_out2)
cuda.synchronize()
gpu_time = timer() - s
print(gpu_time)
c_host2 = c_out2.copy_to_host()
print(c_host2)

The execution times of the kernels above are essentially the same; in fact, matmul is even faster for some larger input matrices. I would like to know what I am missing in order to see the gain the documentation shows.

Thanks, Bruno.

I made a performance mistake in the code in my other answer. I have now fixed it. In short, this line:

tmp = 0.

causes numba to create a 64-bit floating-point variable tmp. This triggers the other arithmetic in the kernel to be promoted from 32-bit to 64-bit floating point. That is inconsistent with the rest of the arithmetic, and also with the intent of the demonstration in the other answer. This mistake affects both kernels.
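The promotion is easy to observe in Numba's type inference without a GPU. Below is a minimal sketch (the function names acc_default and acc_float32 are mine, for illustration); it uses @njit on the CPU, but the literal-typing rule is the same one that applies inside a CUDA kernel:

import numpy as np
from numba import njit, float32

@njit
def acc_default(a):
    tmp = 0.            # Python float literal: numba types this as float64
    for x in a:
        tmp += x        # float64 + float32 promotes to float64
    return tmp

@njit
def acc_float32(a):
    tmp = float32(0.)   # explicit 32-bit accumulator
    for x in a:
        tmp += x        # stays in float32
    return tmp

a = np.ones(8, dtype=np.float32)
acc_default(a)
acc_float32(a)
# The inferred return types show the difference: float64 vs. float32
print(acc_default.nopython_signatures)
print(acc_float32.nopython_signatures)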

When I change that in both kernels to:

tmp = float32(0.)

both kernels are noticeably faster, and on my GTX960 GPU, your test case shows the shared-memory code running about 2x faster than the non-shared code (but see below).
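For reference, here is the shared-memory kernel with that one-line change applied; everything else is identical to the version in the question:

from numba import cuda, float32

TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Shared-memory tiles; size and dtype must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    tmp = float32(0.)       # 32-bit accumulator: no float64 promotion
    for i in range(bpg):
        # Preload one tile of A and one tile of B into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        cuda.syncthreads()  # wait until all threads finish preloading

        # Partial dot product over the current tile
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        cuda.syncthreads()  # wait until all threads finish computing

    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp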

The non-shared kernel also has a performance issue related to its memory access pattern. Similar to the index swapping in that other answer, and for this particular scenario only, we can rectify it by reversing the assigned indices:

j, i = cuda.grid(2)

in the non-shared kernel. This allows that kernel to perform about as well as it can, and with that change the shared-memory kernel runs about 2x faster than the non-shared kernel. Without that additional change, the non-shared kernel's performance is much worse.
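Putting the two fixes together, a corrected version of the non-shared kernel looks like this:

from numba import cuda, float32

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    # Reversed indices: adjacent threads in x now map to adjacent
    # columns, so reads of B and writes of C are coalesced
    j, i = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = float32(0.)   # 32-bit accumulator, matching the data
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp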
