Why does it take so long to print the value of a GPU tensor in PyTorch?

I wrote this PyTorch program to compute a 5000*5000 matrix multiplication on the GPU, for 100 iterations.

import torch
import numpy as np
import time
N = 5000
x1 = np.random.rand(N, N)
######## a 5000*5000 matrix multiplication on GPU, 100 iterations #######
x2 = torch.tensor(x1, dtype=torch.float32).to("cuda:0")
start_time = time.time()
for n in range(100):
    G2 = x2.t() @ x2
print(G2.size())
print("It takes", time.time() - start_time, "seconds to compute")
print("G2.device:", G2.device)
start_time2 = time.time()
# G4 = torch.zeros((5,5),device="cuda:0")
G4 = G2[:5, :5]
print("G4.device:", G4.device)
print("G4======", G4)
# G5=G4.cpu()
# print("G5.device:",G5.device)
print("It takes", time.time() - start_time2, "seconds to transfer or display")

Here is the result on my laptop:

torch.Size([5000, 5000])
It takes 0.22243595123291016 seconds to compute
G2.device: cuda:0
G4.device: cuda:0
G4====== tensor([[1636.3195, 1227.1913, 1252.6871, 1242.4584, 1235.8160],
        [1227.1913, 1653.0522, 1260.2621, 1246.9526, 1250.2871],
        [1252.6871, 1260.2621, 1685.1147, 1257.2373, 1266.2213],
        [1242.4584, 1246.9526, 1257.2373, 1660.5951, 1239.5414],
        [1235.8160, 1250.2871, 1266.2213, 1239.5414, 1670.0034]], device='cuda:0')
It takes 60.13639569282532 seconds to transfer or display
Process finished with exit code 0

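I am confused about why it takes so long to display the variable G4 on the GPU, since it is only 5*5 in size. By the way, when I use "G5=G4.cpu()" to transfer the variable from the GPU to the CPU, that takes a long time as well.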

My development environment (a rather old laptop):

PyTorch 1.0.0
CUDA 8.0
Nvidia GeForce GT 730M
Windows 10 Professional

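When I increase the number of iterations, the compute time does not increase noticeably, but the transfer or display time clearly does. Why? Can anyone explain this? Thanks a lot.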

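PyTorch CUDA operations are asynchronous. Most operations on GPU tensors are non-blocking until a derived result is requested. This means that until you ask for a CPU version of a tensor, commands like matrix multiplication are queued on the GPU and processed in parallel with your Python code. When you stop the timer, there is no guarantee that the operations have completed. In your program the first timer mostly measures how long it takes to queue 100 matrix multiplications, and the second timer absorbs the time the GPU actually needs to finish them, which is why it grows with the number of iterations. You can read more about this in the docs.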
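A minimal sketch of this effect, not taken from the question: it times the loop and the read-back of a small slice separately. The matrix size, the iteration count, and the CPU fallback are assumptions chosen so the snippet runs quickly on any machine; on a CUDA device the second number is expected to dominate, while on the CPU both simply measure ordinary synchronous work.

```python
import time
import torch

# Assumption: fall back to the CPU when no CUDA device is present.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.rand(1000, 1000, device=device)

t0 = time.time()
for _ in range(10):
    g = x.t() @ x            # on CUDA this only queues the kernel and returns
queued = time.time() - t0

t1 = time.time()
corner = g[:5, :5].cpu()     # copying to host memory waits for the queued work
waited = time.time() - t1

print("time to queue the matmuls:", queued)
print("time to read back a 5x5 slice:", waited)
```

Printing a GPU tensor behaves like the `.cpu()` call here, because the values have to reach host memory before they can be formatted.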

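To time a block of code properly, you should add calls to torch.cuda.synchronize. This function should be called twice: once right before you start the timer, and once right before you stop it. Outside of profiling, you should avoid calling this function, because it may slow down overall performance.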
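The timing pattern described above could look roughly like the sketch below. The helper name `timed`, the 500*500 matrix size, and the CPU fallback are illustrative assumptions, not part of the original program; only `torch.cuda.synchronize` comes from the answer.

```python
import time
import torch

def timed(fn, device):
    """Run fn() and return (result, seconds), synchronizing around the timer."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)   # let previously queued work finish first
    start = time.time()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)   # wait until fn's kernels have completed
    return result, time.time() - start

# Assumption: fall back to the CPU when no CUDA device is present.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x2 = torch.rand(500, 500, device=device)

def gram_100():
    # Same computation as in the question, on a smaller matrix.
    for _ in range(100):
        g = x2.t() @ x2
    return g

G2, seconds = timed(gram_100, device)
print("It takes", seconds, "seconds to compute")
```

With the two synchronization points in place, the reported compute time should scale with the number of iterations, and displaying a small slice afterwards should be fast.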
