Can the inference time in PyTorch be large even though the number of parameters and FLOPs is low?



I computed the FLOPs of my network with PyTorch, using the function 'profile' from the 'thop' library.

In my experiment, my network shows:

Flops: 619.038M  Parameters: 4.191M  Inference time: 25.911

For comparison, I also checked the FLOPs and parameters of ResNet50, which shows:

Flops: 1.315G  Parameters: 2659.6M  Inference time: 8.553545

Is it possible for the inference time to be larger even though the FLOPs are lower? Or are there operations whose FLOPs the 'profile' function cannot measure? Using FlopCountAnalysis from fvcore.nn and get_model_complexity_info from ptflops gave similar results, though.
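
As an aside, here is a minimal sketch of how such a cross-check between the three counters might look (the tiny nn.Sequential model below is only a placeholder for my network):

import torch
from torch import nn
from thop import profile
from fvcore.nn import FlopCountAnalysis
from ptflops import get_model_complexity_info

# Placeholder model; substitute the actual network here.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3)).eval()
dummy_input = torch.randn(1, 3, 32, 32)

# thop: returns MACs and the parameter count
thop_macs, thop_params = profile(model, inputs=(dummy_input,))

# fvcore: per-operator counting; it warns about operators it has to skip
fvcore_flops = FlopCountAnalysis(model, dummy_input).total()

# ptflops: takes the input shape without the batch dimension
ptflops_macs, ptflops_params = get_model_complexity_info(
    model, (3, 32, 32), as_strings=False, print_per_layer_stat=False
)

print(thop_macs, fvcore_flops, ptflops_macs)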

Here is the code I used to measure the inference time with PyTorch.

import numpy as np
import torch
from thop import clever_format, profile
# from fvcore.nn import FlopCountAnalysis, flop_count_table

model.eval()
model.cuda()
dummy_input = torch.randn(1, 3, 32, 32).cuda()
# flops = FlopCountAnalysis(model, dummy_input)
# print(flop_count_table(flops))
# print(flops.total())
macs, params = profile(model, inputs=(dummy_input,))
macs, params = clever_format([macs, params], "%.3f")
print('Flops:', macs)
print('Parameters:', params)
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 300
timings = np.zeros((repetitions, 1))
# GPU warm-up
for _ in range(10):
    _ = model(dummy_input)
# MEASURE PERFORMANCE
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        _ = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)  # elapsed_time() returns milliseconds
        timings[rep] = curr_time
print('time(ms):', np.average(timings))
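
As a sanity check on the timing loop itself, torch.utils.benchmark.Timer, which handles CUDA synchronization internally, can be used for comparison; a minimal sketch, assuming model and dummy_input are already defined on the GPU as above:

import torch
from torch.utils import benchmark

# Cross-check of the manual CUDA-event timing (sketch; `model` and
# `dummy_input` are assumed to exist on the GPU, as in the code above).
timer = benchmark.Timer(
    stmt="with torch.no_grad(): model(x)",
    globals={"torch": torch, "model": model, "x": dummy_input},
)
result = timer.timeit(300)              # 300 timed runs
print("time(ms):", result.mean * 1e3)   # result.mean is reported in seconds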

This is absolutely normal. The point is that FLOPS (or MACs) are theoretical measures; they can be useful when you want to ignore hardware/software optimizations that make different operations run faster or slower on different hardware.

For example, in the case of neural networks, different architectures will have different CPU/GPU utilization. Let's consider two simple architectures with almost the same number of parameters/FLOPs:

  1. Deep model:

layers = [nn.Conv2d(3, 16, 3)]
for _ in range(12):
    layers.extend([nn.Conv2d(16, 16, 3, padding=1)])
deep_model = nn.Sequential(*layers)

  2. Wide model:

wide_model = nn.Sequential(nn.Conv2d(3, 1024, 3))

Modern GPUs let you parallelize a huge number of simple operations, but with a deep network you need the output of layer[i] before you can compute layer[i+1], so the depth becomes a blocking factor that lowers hardware utilization.

Full example:

import numpy as np
import torch
from thop import clever_format, profile
from torch import nn


def measure(model, name):
    model.eval()
    model.cuda()
    dummy_input = torch.randn(1, 3, 64, 64).cuda()
    macs, params = profile(model, inputs=(dummy_input,), verbose=0)
    macs, params = clever_format([macs, params], "%.3f")
    print("<" * 50, name)
    print("Flops:", macs)
    print("Parameters:", params)
    starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(
        enable_timing=True
    )
    repetitions = 300
    timings = np.zeros((repetitions, 1))
    for _ in range(10):
        _ = model(dummy_input)
    # MEASURE PERFORMANCE
    with torch.no_grad():
        for rep in range(repetitions):
            starter.record()
            _ = model(dummy_input)
            ender.record()
            # WAIT FOR GPU SYNC
            torch.cuda.synchronize()
            curr_time = starter.elapsed_time(ender)
            timings[rep] = curr_time
    print("time(ms) :", np.average(timings))


layers = [nn.Conv2d(3, 16, 3)]
for _ in range(12):
    layers.extend([nn.Conv2d(16, 16, 3, padding=1)])
deep_model = nn.Sequential(*layers)
measure(deep_model, "My deep model")

wide_model = nn.Sequential(nn.Conv2d(3, 1024, 3))
measure(wide_model, "My wide model")

Results:

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< My deep model
Flops: 107.940M
Parameters: 28.288K
time(ms) : 0.6160109861691793
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< My wide model
Flops: 106.279M
Parameters: 28.672K
time(ms) : 0.1514971748739481

As you can see, the models have a similar number of parameters/FLOPs, but the deep one takes about four times longer to compute.

This is just one of the possible reasons why the inference time can be large while the number of parameters and FLOPs is low; you may also need to take other underlying hardware/software optimizations into account.
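
For example (an illustrative sketch, not something benchmarked here), the cuDNN autotuner and the channels-last memory format are two such software-level knobs: they can change the measured latency of the very same convolutional model while its FLOPs and parameter counts stay identical.

import torch
from torch import nn

# Illustrative sketch: backend settings that affect runtime but not FLOPs/params.
# The tiny model is a placeholder; actual speedups depend on the model and GPU.
torch.backends.cudnn.benchmark = True    # let cuDNN auto-tune convolution algorithms

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).cuda().eval()
model = model.to(memory_format=torch.channels_last)                  # NHWC layout
x = torch.randn(1, 3, 64, 64, device="cuda").to(memory_format=torch.channels_last)

with torch.no_grad():
    _ = model(x)   # timing this again may give different numbers than the defaults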
