fmad=false gives good performance

From the Nvidia release notes:

 The nvcc compiler switch, --fmad (short name: -fmad), to control the contraction of    
 floating-point multiplies and add/subtracts into floating-point multiply-add   
 operations (FMAD, FFMA, or DFMA) has been added: 
 --fmad=true and --fmad=false enables and disables the contraction respectively. 
 This switch is supported only when the --gpu-architecture option is set with     
 compute_20, sm_20, or higher. For other architecture classes, the contraction is     
  always enabled. 
 The --use_fast_math option implies --fmad=true, and enables the contraction.

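I have two kernels: one is purely compute-bound with a lot of multiplications, and the other is memory-bound. When I compile with -fmad=false, I see a consistent performance improvement (about 5%) in my compute-bound kernel. When I turn it off for the memory-bound kernel, performance drops by roughly the same percentage. So FMA is better for my memory-bound kernel, but my compute-bound kernel can squeeze out a little more performance with it turned off. What could be the reason? My device is an M2090 and I am using CUDA 4.2.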

Complete compile options: -arch,sm_20,-ftz=true,-prec-div=false,-prec-sqrt=false,-use_fast_math,-fmad=false (or I simply remove -fmad=false, since FMA contraction is enabled by default anyway).

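Using FMA may slightly increase register pressure, because three source operands must be available at the same time. Turning FMA generation on or off therefore leads to small differences in instruction scheduling and register allocation, which in turn lead to small performance differences. For a compute-bound kernel with many multiply-add idioms, -fmad=true should make a significant performance difference. But as you say, your kernel is dominated by multiplies, so there is little to gain from FMA, and any gain may be offset by register pressure and instruction scheduling effects.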
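To make the distinction concrete, here is a minimal sketch of the three cases discussed above. The kernel names, the file name `fmad_demo.cu`, and the buffer sizes are hypothetical and are not taken from the question. `__fmul_rn` and `__fadd_rn` are CUDA intrinsics that the compiler never merges into an FMA, so they give per-expression control regardless of the `--fmad` setting.

```cuda
// fmad_demo.cu (hypothetical file name)
// Possible ways to inspect the effect of the switch (adjust -arch to your GPU):
//   nvcc -arch=sm_20 -Xptxas -v --fmad=true  -o fmad_on  fmad_demo.cu
//   nvcc -arch=sm_20 -Xptxas -v --fmad=false -o fmad_off fmad_demo.cu
// -Xptxas -v prints per-kernel register usage; cuobjdump -sass on the
// resulting binary shows whether FFMA or separate FMUL/FADD was emitted.
#include <cstdio>
#include <cuda_runtime.h>

// Multiply-add idiom: with --fmad=true the compiler may contract
// a * x[i] + y[i] into a single FFMA instruction.
__global__ void axpy_contractable(float a, const float* x, const float* y,
                                  float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}

// Same arithmetic, but the explicit round-to-nearest intrinsics are never
// fused, so this stays a separate multiply and add even with --fmad=true.
__global__ void axpy_separate(float a, const float* x, const float* y,
                              float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __fadd_rn(__fmul_rn(a, x[i]), y[i]);
}

// Multiply-dominated body: there is no add to fuse with, so contraction
// has nothing to work on here.
__global__ void multiply_chain(float a, float b, const float* x,
                               float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * b * x[i] * x[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *x, *y, *out;
    cudaMalloc((void**)&x, bytes);
    cudaMalloc((void**)&y, bytes);
    cudaMalloc((void**)&out, bytes);
    cudaMemset(x, 0, bytes);   // contents are irrelevant for this sketch
    cudaMemset(y, 0, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    axpy_contractable<<<blocks, threads>>>(2.0f, x, y, out, n);
    axpy_separate<<<blocks, threads>>>(2.0f, x, y, out, n);
    multiply_chain<<<blocks, threads>>>(2.0f, 3.0f, x, out, n);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(x);
    cudaFree(y);
    cudaFree(out);
    return 0;
}
```

Comparing the register counts and the generated instructions for the two builds is one way to check whether register pressure or scheduling, rather than the FMA itself, accounts for a difference of a few percent in a given kernel.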
