Running a PyTorch quantised model on a CUDA GPU



I am confused about whether it is possible to run an int8 quantised model on CUDA, or whether you can only train a quantised model on CUDA and then deploy it with fake-quantise on another backend such as a CPU.

I want to run the model on CUDA with actual int8 instructions instead of FakeQuantised float32 instructions, and enjoy the efficiency gains. The PyTorch documentation is oddly unspecific about this. If it is possible to run a quantised model on CUDA with a different framework such as TensorFlow, I would love to know.

Here is the code that prepares the model for quantisation (using post-training quantisation). The model is an ordinary CNN built from nn.Conv2d, nn.LeakyReLU and nn.MaxPool2d modules:

import copy

import torch
from torch.quantization import quantize_fx
from torch.utils.data import DataLoader

model_fp = torch.load(models_dir+net_file)
model_to_quant = copy.deepcopy(model_fp)
model_to_quant.eval()
model_to_quant = quantize_fx.fuse_fx(model_to_quant)
qconfig_dict = {"": torch.quantization.get_default_qconfig('qnnpack')}
model_prepped = quantize_fx.prepare_fx(model_to_quant, qconfig_dict)
model_prepped.eval()
model_prepped.to(device='cuda:0')

# Calibrate the observers on a couple of batches
train_data   = ImageDataset(img_dir, train_data_csv, 'cuda:0')
train_loader = DataLoader(train_data, batch_size=32, shuffle=True, pin_memory=True)
for i, (input, _) in enumerate(train_loader):
    if i > 1: break
    print('batch', i+1, end='\r')
    input = input.to('cuda:0')
    model_prepped(input)
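
For context, the network looks roughly like this (a minimal sketch: the module types match the real model, but the layer count, channel sizes and classifier head are made-up placeholders):

import torch.nn as nn

# Hypothetical stand-in for the actual CNN: nn.Conv2d / nn.LeakyReLU /
# nn.MaxPool2d feature blocks, with invented dimensions.
class SmallCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.LeakyReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.LeakyReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):
        return self.head(self.features(x))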

This is what actually quantises the model:

model_quantised = quantize_fx.convert_fx(model_prepped)
model_quantised.eval()
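
As a sanity check (my addition, not part of the script above), printing the converted module shows QuantizedConv2d layers in place of the float convolutions, and a forward pass on the CPU succeeds; the input shape below is an assumption:

# The converted GraphModule quantises its own input, so a float
# tensor can be fed directly on the CPU backend.
model_quantised = model_quantised.to('cpu')
print(model_quantised)                  # conv layers now show as QuantizedConv2d
dummy = torch.randn(1, 3, 256, 256)     # assumed input shape
print(model_quantised(dummy).shape)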

Here is the attempt to run the quantised model on CUDA; it raises a NotImplementedError, while running it on the CPU works fine:

model_quantised = model_quantised.to('cuda:0')
for input, _ in train_loader:
    input = input.to('cuda:0')
    out = model_quantised(input)
    print(out, out.shape)
    break

The error:

Traceback (most recent call last):
File "/home/adam/Desktop/thesis/Ship Detector/quantisation.py", line 54, in <module>
out = model_quantised(input)
File "/home/adam/.local/lib/python3.9/site-packages/torch/fx/graph_module.py", line 513, in wrapped_call
raise e.with_traceback(None)
NotImplementedError: Could not run 'quantized::conv2d.new' with arguments from the 'QuantizedCUDA' backend. 
This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). 
If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 
'quantized::conv2d.new' is only available for these backends: [QuantizedCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, Tracer, Autocast, Batched, VmapMode].

Judging from [this][1] blog post, it looks like you cannot run quantised models on the GPU:

> Quantization in PyTorch is currently CPU-only. Quantization is not a CPU-specific technique (e.g. NVIDIA's TensorRT can be used to implement quantization on GPU). However, inference time on GPU is usually already "fast enough", and CPUs are more attractive for large-scale model server deployments (due to complex cost factors that are out of the scope of this article). As such, as of PyTorch 1.6, only the CPU backend is available in the native API.
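
For now the only configuration that works for me is keeping the quantised model on the CPU. This is the working CPU variant of the failing loop above:

# Workaround: run the converted int8 model on the QuantizedCPU backend.
model_quantised = model_quantised.to('cpu')
for input, _ in train_loader:
    out = model_quantised(input.to('cpu'))
    print(out, out.shape)
    break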

[1]: https://spell.ml/blog/pytorch-quantization-X8e7wBAAACIAHPhT