The cudaGetDeviceProperties() API call does not seem to tell us much about global memory latency (not even a typical value, or a min/max pair).
Edit: By latency, I actually mean the different latencies of the various cases in which data has to be read from main device memory. So, if we take this paper, it is actually 6 numbers: { TLB L1 hit, TLB L2 hit, TLB miss } x { L1 data cache on, off }.
Q1: Is there a way to obtain these numbers, other than measuring them yourself?
Even a rule-of-thumb calculation based on the SM version, SM clock, and memory clock would do.
I'll also ask a second question, namely:
Q2: If not, is there a utility that performs the measurement for you?
(Though that may be off-topic for this site.)
The purpose of cudaGetDeviceProperties(), like that of the equivalent cpuid tool on x86 CPUs, is to return the relevant microarchitectural parameters. As with CPUs, GPUs with identical microarchitectural parameters can still differ in their performance characteristics, for example due to different clock frequencies, or due to different specifications of the attached DRAM and the ways in which these interact with the various buffering and caching mechanisms inside the processor. In general, there is no single "memory latency" one could assign, nor am I aware of a way to compute even a plausible range from the known microarchitectural parameters.
Consequently, on CPUs and GPUs alike, one has to use sophisticated microbenchmarks to determine performance parameters such as DRAM latency. How to construct such a microbenchmark for each parameter of interest is too broad a topic to cover here. Multiple papers have been published that discuss this in detail for NVIDIA GPUs. One of the earliest relevant publications is (online draft):
Wong, Henry, et al. "Demystifying GPU microarchitecture through microbenchmarking." Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 235-246
A more recent work that includes coverage of the Kepler architecture is (online draft):
Mei, Xinxin, and Xiaowen Chu. "Dissecting GPU Memory Hierarchy through Microbenchmarking." arXiv manuscript, September 2015, pp. 1-14
Short of constructing your own microbenchmarks, you have to rely on published results, such as those cited above, for the various implementation-specific performance parameters of a particular GPU.
In many years of optimizing for GPU platforms, I have not found it necessary to know such data; in general, the performance metrics of the CUDA profiler should be sufficient to track down specific bottlenecks.
A1: Good reading on the GPU microarchitecture latency landscape, including the numbers you asked about*, is all in this http://ufdc.ufl.edu/UFE0043739/00001
Fig. 4.1
[a symbolic multi-SM architecture] and Table 4-1.
[PTX instruction categories, launch and execution cycles] should be kept in mind whenever designing the cost/benefit equation, before one starts to build a GPU kernel.
Amdahl's Law ultimately helps decide quantitatively whether a CPU-GPU-CPU pipeline can get the task done faster.
A2:
A GPU instruction-code simulator is one way to obtain the expected number of GPU-CLKs
a particular microarchitecture has to spend (at minimum, since concurrent GPU kernels may be sharing hardware resources, thereby extending the end-to-end latency observed in vivo).
The numbers matter.
Category GPU
| Hardware
| Unit
| | Throughput
| | | Execution
| | | Latency
| | | | PTX instructions Note
|____________________________|____________|_______________|__________________|_____________________________________________________________________
Load_shared LSU 2 + 30 ld, ldu Note, .ss = .shared ; .vec and .type determine the size of load. Note also that we omit .cop since no cacheable in Ocelot
Load_global LSU 2 + 600 ld, ldu, prefetch, prefetchu Note, .ss = .global; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
Load_local LSU 2 + 600 ld, ldu, prefetch, prefetchu Note, .ss = .local; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
Load_const LSU 2 + 600 ld, ldu Note, .ss = .const; .vec and .type determine the size of load
Load_param LSU 2 + 30 ld, ldu Note, .ss = .param; .vec and .type determine the size of load
| |
Store_shared LSU 2 + 30 st Note, .ss = .shared; .vec and .type determine the size of store
Store_global LSU 2 + 600 st Note, .ss = .global; .vec and .type determine the size of store
Store_local LSU 2 + 600 st Note, .ss = .local; .vec and .type determine the size of store
Read_modify_write_shared LSU 2 + 600 atom, red Note, .space = shared; .type determine the size
Read_modify_write_global LSU 2 + 600 atom, red Note, .space = global; .type determine the size
| |
Texture LSU 2 + 600 tex, txq, suld, sust, sured, suq
| |
Integer ALU 2 + 24 add, sub, add.cc, addc, sub.cc, subc, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max, popc, clz, bfind, brev, bfe, bfi, prmt, mov
| | Note, these integer inst. with type = { .u16, .u32, .u64, .s16, .s32, .s64 };
| |
Float_single ALU 2 + 24 testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max Note, these Float-single inst. with type = { .f32 };
Float_double ALU 1 + 48 testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max Note, these Float-double inst. with type = { .f64 };
Special_single SFU 8 + 48 rcp, sqrt, rsqrt, sin, cos, lg2, ex2 Note, these special-single with type = { .f32 };
Special_double SFU 8 + 72 rcp, sqrt, rsqrt, sin, cos, lg2, ex2 Note, these special-double with type = { .f64 };
|
Logical ALU 2 + 24 and, or, xor, not, cnot, shl, shr
Control ALU 2 + 24 bra, call, ret, exit
|
Synchronization ALU 2 + 24 bar, member, vote
Compare & Select ALU 2 + 24 set, setp, selp, slct
|
Conversion ALU 2 + 24 Isspacep, cvta, cvt
Miscellanies ALU 2 + 24 brkpt, pmevent, trap
Video ALU 2 + 24 vadd, vsub, vabsdiff, vmin, vmax, vshl, vshr, vmad, vset
+====================| + 11-12 [usec] XFER-LATENCY-up HostToDevice ~~~ same as Intel X48 / nForce 790i
| |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
| |||||||||||||||||| ~ 5.5 GB/sec XFER-BW-up ~~~ same as DDR2/DDR3 throughput
| |||||||||||||||||| ~ 5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
| ||||||||||||||||||
| | PCIe-2.0 ( 4x) | ~ 4 GB/s over 4-Lanes ( PORT #2 )
| | PCIe-2.0 ( 8x) | ~16 GB/s over 8-Lanes
| | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x )
| ||||||||||||||||||
+====================|
| PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
| smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
| +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
| | ^^^^^^^^
| +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
| | ^^^^^^^^
| ~ +20 [ns] @1147 MHz FERMI ^^^^^^^^
| SM-REGISTERs/thread: max 63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
| max 63 for CC-3.0 - about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
| max 128 for CC-1.x PAR -- ||||||||~~~|
| max 255 for CC-3.5 PAR -- ||||||||||||||||||~~~~~~|
|
| smREGs___BW ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE << -Xptxas -v || nvcc -maxrregcount ( w|w/o spillover(s) )
| with about 8.0 TB/s BW [C:Pg.46]
| 1.3 TB/s BW shaMEM___ 4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
| 0.1 TB/s BW gloMEM___
| ________________________________________________________________________________________________________________________________________________________________________________________________________________________
+========| DEVICE:3 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+======| DEVICE:2 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+====| DEVICE:1 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+==| DEVICE:0 PERSISTENT gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
! | | + |
o | texMEM___|____________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
| | | + | |
| texL2cache_| .| _ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | 256_KB|
| | | + | ^ |
| | | + | ^ |
| | | + | ^ |
| texL1cache_| .| _ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ | _ _ _ _ _^ 5_KB|
| | | + ^ ^ ^ |
| shaMEM + conL3cache_| | _ _ _ _ conL3cache +220 [GPU_CLKs] ^ ^ ^ 32_KB|
| | | ^ + ^ ^ ^ |
| | | ^ + ^ ^ ^ |
| ______________________|__________________________________|_______________^__________+________________________________________________________________________________________|
| +220 [GPU-CLKs]_| |_ _ _ ___| _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ __ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
| L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB L2_|_ _ _ __|\ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ __ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
| L1-on-re-use-only +40 [GPU-CLKs]_| 8 KB L1_|_ _ _ _|\ ________________________________________________+ 40 [GPU_CLKs]_____________________________________________________________________________|
| L1-on-re-use-only + 8 [GPU-CLKs]_| 2 KB L1_|__________|\\_________________________________________________________+ 8 [GPU_CLKs]_________________________________________________________conL1cache 2_KB|
| on-chip|smREG +22 [GPU-CLKs]_| |t[0_______^:~~~~~~~~~~~~~~~~:________]
|CC- MAX |_|_|_|_|_|_|_|_|_|_|_| |t[1_______^ :________]
|2.x 63 |_|_|_|_|_|_|_|_|_|_|_| |t[2_______^ :________]
|1.x 128 |_|_|_|_|_|_|_|_|_|_|_| |t[3_______^ :________]
|3.5 255 REGISTERs|_|_|_|_|_|_|_|_| |t[4_______^ :________]
| per|_|_|_|_|_|_|_|_|_|_|_| |t[5_______^ :________]
| Thread_|_|_|_|_|_|_|_|_|_| |t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W0..|t[ F_______^____________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_| ..............
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[0_______^:~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[1_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[2_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[3_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[4_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[5_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W1..............|t[ F_______^___________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
|
| ________________ °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
| / CC-2.0|||||||||||||||||||||||||| ~masked ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| / 1.hW ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
| / 2.hW |^|^|^|^|^|^|^|^|^|^|^|^|^ |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
|_______________/ ______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
|~~~~~~~~~~~~~~/ SM:0.warpScheduler /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
| | //
| RR-mode //
| GREEDY-mode //
| ________________//
| ______________/
The GPU cluster self-reports:
* FERMI
* GF100 Server/HPC-GPU PCIe-2.0-16x
* GPU_CLK 1.15 GHz [Graphics 575 MHz]
* 3072 MB GDDR5 773MHz + ECC-correction
*
* 448-CUDA-COREs [SMX]-s --> 14 SM * warpSize == 448
* 48-ROPs
* 56-TEX-Units, 400 MHz RAMDAC
CUDA API reports .self to operate an API-Driver-version [5000]
on RT-version [5000]
CUDA API reports .self to operate a limited FIFO [Host] <-|buffer| <-[Device] of a size of 1048576 [B]
CUDA API reports .self to operate a limited HEAP for[Device]-side Dynamic __global__ Memory Allocations in a size of 8388608 [B] ( 8 MB if not specified in malloc() call )
CUDA API reports .self to operate cudaCreateStreamWithPriority() QUEUEs
with <_stream_PRIO_LOW____> == 725871085
with <_stream_PRIO_HIGH___> == 0
CUDA Device:0_ has <_compute capability_> == 2.0.
CUDA Device:0_ has [ Tesla M2050] .name
CUDA Device:0_ has [ 14] .multiProcessorCount [ Number of multiprocessors on device ]
CUDA Device:0_ has [ 2817982464] .totalGlobalMem [ __global__ memory available on device in Bytes [B] ]
CUDA Device:0_ has [ 65536] .totalConstMem [ __constant__ memory available on device in Bytes [B] ]
CUDA Device:0_ has [ 1147000] .clockRate [ GPU_CLK frequency in kilohertz [kHz] ]
CUDA Device:0_ has [ 32] .warpSize [ GPU WARP size in threads ]
CUDA Device:0_ has [ 1546000] .memoryClockRate [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
CUDA Device:0_ has [ 384] .memoryBusWidth [ GPU_DDR Global memory bus width in bits [b] ]
CUDA Device:0_ has [ 1024] .maxThreadsPerBlock [ MAX Threads per Block ]
CUDA Device:0_ has [ 32768] .regsPerBlock [ MAX number of 32-bit Registers available per Block ]
CUDA Device:0_ has [ 1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
CUDA Device:0_ has [ 786432] .l2CacheSize
CUDA Device:0_ has [ 49152] .sharedMemPerBlock [ __shared__ memory available per Block in Bytes [B] ]
CUDA Device:0_ has [ 2] .asyncEngineCount [ a number of asynchronous engines ]
CUDA Device:0_ has [ 1] .deviceOverlap [ if Device can concurrently copy memory and execute a kernel ]
CUDA Device:0_ has [ 0] .kernelExecTimeoutEnabled [ if there is a run time limit on kernel exec-s ]
CUDA Device:0_ has [ 1] .concurrentKernels [ if Device can possibly execute multiple kernels concurrently ]
CUDA Device:0_ has [ 1] .canMapHostMemory [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
CUDA Device:0_ has [ 3] .computeMode [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
CUDA Device:0_ has [ 1] .ECCEnabled [ if has ECC support enabled ]
CUDA Device:0_ has [ 2147483647] .memPitch [ MAX pitch in bytes allowed by memory copies [B] ]
CUDA Device:0_ has [ 65536] .maxSurface1D [ MAX 1D surface size ]
CUDA Device:0_ has [ 32768] .maxSurfaceCubemap [ MAX Cubemap surface dimensions ]
CUDA Device:0_ has [ 65536] .maxTexture1D [ MAX 1D Texture size ]
CUDA Device:0_ has [ 0] .pciBusID [ PCI bus ID of the device ]
CUDA Device:0_ has [ 0] .integrated [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
CUDA Device:0_ has [ 1] .unifiedAddressing [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]
CUDA Device:1_ has <_compute capability_> == 2.0.
CUDA Device:1_ has [ Tesla M2050] .name
CUDA Device:1_ has [ 14] .multiProcessorCount [ Number of multiprocessors on device ]
CUDA Device:1_ has [ 2817982464] .totalGlobalMem [ __global__ memory available on device in Bytes [B] ]
CUDA Device:1_ has [ 65536] .totalConstMem [ __constant__ memory available on device in Bytes [B] ]
CUDA Device:1_ has [ 1147000] .clockRate [ GPU_CLK frequency in kilohertz [kHz] ]
CUDA Device:1_ has [ 32] .warpSize [ GPU WARP size in threads ]
CUDA Device:1_ has [ 1546000] .memoryClockRate [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
CUDA Device:1_ has [ 384] .memoryBusWidth [ GPU_DDR Global memory bus width in bits [b] ]
CUDA Device:1_ has [ 1024] .maxThreadsPerBlock [ MAX Threads per Block ]
CUDA Device:1_ has [ 32768] .regsPerBlock [ MAX number of 32-bit Registers available per Block ]
CUDA Device:1_ has [ 1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
CUDA Device:1_ has [ 786432] .l2CacheSize
CUDA Device:1_ has [ 49152] .sharedMemPerBlock [ __shared__ memory available per Block in Bytes [B] ]
CUDA Device:1_ has [ 2] .asyncEngineCount [ a number of asynchronous engines ]
CUDA Device:1_ has [ 1] .deviceOverlap [ if Device can concurrently copy memory and execute a kernel ]
CUDA Device:1_ has [ 0] .kernelExecTimeoutEnabled [ if there is a run time limit on kernel exec-s ]
CUDA Device:1_ has [ 1] .concurrentKernels [ if Device can possibly execute multiple kernels concurrently ]
CUDA Device:1_ has [ 1] .canMapHostMemory [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
CUDA Device:1_ has [ 3] .computeMode [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
CUDA Device:1_ has [ 1] .ECCEnabled [ if has ECC support enabled ]
CUDA Device:1_ has [ 2147483647] .memPitch [ MAX pitch in bytes allowed by memory copies [B] ]
CUDA Device:1_ has [ 65536] .maxSurface1D [ MAX 1D surface size ]
CUDA Device:1_ has [ 32768] .maxSurfaceCubemap [ MAX Cubemap surface dimensions ]
CUDA Device:1_ has [ 65536] .maxTexture1D [ MAX 1D Texture size ]
CUDA Device:1_ has [ 0] .pciBusID [ PCI bus ID of the device ]
CUDA Device:1_ has [ 0] .integrated [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
CUDA Device:1_ has [ 1] .unifiedAddressing [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]