How does PyTorch multiply two 10000*10000 matrices in almost zero time? Why does the time change so much, from 349 ms down to 999 µs?

Here is an excerpt from Jupyter:

In [1]:

import torch, numpy as np, datetime
cuda = torch.device('cuda')

In [2]:

ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))

Wall time: 349 ms
tensor(17.0374, device='cuda:0') tensor(17.0376, device='cuda:0')

The time is low but still reasonable (0.35 s for 1e12 multiplications).

But if we repeat the same thing:

ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))

Wall time: 999 µs
tensor(-78.7172, device='cuda:0') tensor(-78.7173, device='cuda:0')

1e12 multiplications in 1 ms?!

Why did the time change from 349 ms to 1 ms?

Info:

Tested on a GeForce RTX 2070;
Can be reproduced on Google Colab.
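
As a rough sanity check on these numbers, the sketch below computes the throughput each timing would imply. It counts 2*N^3 floating-point operations per matmul (one multiply and one add per term), and the 7.5 TFLOP/s FP32 peak is an assumed ballpark for an RTX 2070, not a figure from this post.

```python
# Back-of-the-envelope check: what throughput would each wall time imply?
N = 10_000
FLOPS_PER_MATMUL = 2 * N**3  # one multiply and one add per inner-product term

# Assumption: approximate FP32 peak of an RTX 2070 (ballpark, not measured here).
ASSUMED_PEAK_TFLOPS = 7.5


def implied_tflops(seconds: float) -> float:
    """Throughput in TFLOP/s needed to finish one N x N matmul in `seconds`."""
    return FLOPS_PER_MATMUL / seconds / 1e12


for label, seconds in [("349 ms", 349e-3), ("0.999 ms", 999e-6)]:
    tflops = implied_tflops(seconds)
    verdict = "plausible" if tflops <= ASSUMED_PEAK_TFLOPS else "exceeds assumed peak"
    print(f"{label}: {tflops:,.1f} TFLOP/s ({verdict})")
```

Under that assumption the second timing would need hundreds of times more throughput than the card can deliver, which suggests the timer stopped before the multiplication had actually finished.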

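There is already a discussion about this on Discuss PyTorch: Measuring GPU tensor operation speed.

I'd like to highlight two comments from that thread:

From @apaszke:

[...] the GPU executes all operations asynchronously, so you need to insert proper barriers for your benchmarks to be correct

From @ngimel:

I believe cuBLAS handles are allocated lazily now, which means that the first operation requiring cuBLAS will have the overhead of creating the cuBLAS handle, and that includes some internal allocations. So there's no way to avoid it other than calling some function requiring cuBLAS before the timing loop.

Basically, you have to call synchronize() to get a proper measurement: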

import torch
x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()
%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 288 ms, sys: 191 ms, total: 479 ms
Wall time: 492 ms

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()
%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 237 ms, sys: 231 ms, total: 468 ms
Wall time: 469 ms

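The documentation says:

torch.cuda.synchronize()

Waits for all kernels in all streams on a CUDA device to complete.

In effect, this tells Python: stop, and wait until the operation has completely finished.

Otherwise, %time returns immediately after the command has been issued.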
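
To see the difference between "command issued" and "result ready" side by side, here is a minimal sketch. The helper name `launch_vs_total` and the smaller default size are illustrative choices of mine, and the code falls back to the CPU (where both numbers will be nearly equal) if no CUDA device is present.

```python
import time

import torch


def launch_vs_total(n: int = 2000):
    """Return (seconds until matmul returns, seconds until the result is ready)."""
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    x = torch.randn(n, n, device=device)
    w = torch.randn(n, n, device=device)

    torch.matmul(x, w)  # warm-up so one-time setup cost is not timed
    if use_cuda:
        torch.cuda.synchronize()

    start = time.perf_counter()
    torch.matmul(x, w)  # on CUDA this only queues the kernel and returns
    launched = time.perf_counter() - start
    if use_cuda:
        torch.cuda.synchronize()  # block until the GPU has really finished
    total = time.perf_counter() - start
    return launched, total


launched, total = launch_vs_total()
print(f"call returned after {launched * 1e3:.3f} ms, "
      f"result ready after {total * 1e3:.3f} ms")
```

On a GPU the first number is typically much smaller than the second, which is the same effect as the 999 µs reading in the question.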

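This would be the correct way to measure the time. Note the two calls to torch.cuda.synchronize(): the first waits for the tensors to be moved onto the CUDA device, and the second waits until the command has completed on the GPU.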

import torch
x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
torch.cuda.synchronize()
%timeit -n 10 y = x.matmul(w.t()); torch.cuda.synchronize() #10 loops, best of 3: 531 ms per loop
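
Outside a notebook, the same idea can be sketched without the %timeit magic. This version uses CUDA events (`torch.cuda.Event`), which is an alternative to the wall-clock approach above rather than what the answer itself used. The function name `time_matmul_ms` and its defaults are hypothetical, and it falls back to a plain CPU timer when CUDA is unavailable.

```python
import time

import torch


def time_matmul_ms(n: int = 2000, warmup: int = 2, repeats: int = 5) -> float:
    """Average time of one n x n matmul, in milliseconds."""
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    x = torch.randn(n, n, device=device)
    w = torch.randn(n, n, device=device)

    # Warm-up runs absorb one-time costs such as lazy cuBLAS handle creation.
    for _ in range(warmup):
        torch.matmul(x, w)

    if not use_cuda:
        start = time.perf_counter()
        for _ in range(repeats):
            torch.matmul(x, w)
        return (time.perf_counter() - start) * 1e3 / repeats

    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()  # make sure warm-up work has finished
    start_evt.record()
    for _ in range(repeats):
        torch.matmul(x, w)
    end_evt.record()
    torch.cuda.synchronize()  # wait until end_evt has actually been reached
    return start_evt.elapsed_time(end_evt) / repeats  # elapsed_time is in ms


print(f"{time_matmul_ms():.2f} ms per matmul")
```

Events are recorded on the GPU's own timeline, so the measurement covers only the queued work between the two record() calls.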

GPU memory caching, I would guess. Try torch.cuda.empty_cache() after each run.
