How to get/compute GPU memory latency without measuring it



The cudaGetDeviceProperties() API call does not seem to tell us much about global memory latency (not even a typical value, or a min/max pair).

Edit: When I say latency, I actually mean the different latencies of the various cases in which data must be read from main device memory. So, if we take this paper, it is actually 6 numbers: { TLB L1 hit, TLB L2 hit, TLB miss } x { L1 data cache on, off }.

Q1: Is there a way to obtain these numbers other than measuring them yourself?
Even a rule-of-thumb calculation based on the SM version, SM clock, and memory clock would do.

And a second question, namely:
Q2: If not, is there a utility that performs the measurement for you?
(Although that may be off-topic for this site.)

The purpose of cudaGetDeviceProperties(), much like the equivalent cpuid instruction on x86 CPUs, is to return the relevant microarchitectural parameters. As with CPUs, GPUs can differ in their performance characteristics even when the microarchitectural parameters are identical, for example due to different clock frequencies, or because the attached DRAM differs in its specifications and in how those interact with the various buffering and caching mechanisms inside the processor. In general there is no single "memory latency" that can be assigned, and I am not aware of a way to compute even a plausible range from the known microarchitectural parameters.
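What the reported properties do allow is a rule-of-thumb calculation of theoretical peaks, not latencies. As a hedged illustration, peak global-memory bandwidth follows from .memoryClockRate and .memoryBusWidth (the DDR factor of 2 is an assumption that holds for the GDDR5 parts discussed here):

```python
def peak_bandwidth_gbps(memory_clock_khz: int, bus_width_bits: int,
                        ddr_factor: int = 2) -> float:
    """Theoretical peak DRAM bandwidth in GB/s from device properties.

    memory_clock_khz -- .memoryClockRate as reported by CUDA (kHz)
    bus_width_bits   -- .memoryBusWidth as reported by CUDA (bits)
    ddr_factor       -- data transfers per clock; 2 assumed for GDDR5
    """
    bytes_per_clock = bus_width_bits / 8 * ddr_factor
    return memory_clock_khz * 1e3 * bytes_per_clock / 1e9

# Tesla M2050 values from the device report further down:
# 1546000 kHz memory clock, 384-bit bus
print(round(peak_bandwidth_gbps(1546000, 384), 1))  # ~148.4 GB/s
```

Note this is an upper bound on throughput and says nothing about the per-access latency asked about in Q1.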

Consequently, on both CPUs and GPUs one has to rely on sophisticated microbenchmarks to determine performance parameters such as DRAM latency. How to construct such a microbenchmark for each desired parameter is too broad a topic to cover here. Multiple papers have been published that discuss this in detail for NVIDIA GPUs. One of the earliest relevant publications is (online draft):
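The core technique behind such latency microbenchmarks is pointer chasing: each load depends on the result of the previous one, so the average time per step exposes the latency of whichever level of the memory hierarchy the working set lands in. A minimal CPU-side sketch of the traversal logic (the GPU versions in the papers cited below run the same dependent-load chain inside a kernel and read the SM clock around it):

```python
import time

def make_chain(n: int, stride: int) -> list[int]:
    """Build a cyclic pointer chain where a[i] points to (i + stride) % n."""
    return [(i + stride) % n for i in range(n)]

def chase(chain: list[int], steps: int) -> float:
    """Follow the chain; every load depends on the previous result,
    so the accesses cannot be overlapped or reordered."""
    idx = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]
    dt = time.perf_counter() - t0
    return dt / steps  # average seconds per dependent access

# vary array size to probe each cache level; an odd stride defeats
# simple prefetchers
chain = make_chain(1 << 16, 4099)
per_access = chase(chain, 100_000)
```

By sweeping the array size across cache-level boundaries and plotting the per-access time, the latency of each level (and, on GPUs, TLB effects) shows up as a step in the curve.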

Wong, Henry, et al. "Demystifying GPU microarchitecture through microbenchmarking." Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 235-246

More recent work includes coverage of the Kepler architecture (online draft):

Xinxin Mei, Xiaowen Chu. "Dissecting GPU Memory Hierarchy through Microbenchmarking." arXiv manuscript, September 2015, pp. 1-14

Short of constructing one's own microbenchmarks, one has to rely on published results, such as those cited above, for the various implementation-specific performance parameters of a particular GPU.

In many years of optimizing for GPU platforms I have never needed to know such data; in general, the performance metrics of the CUDA profiler should be sufficient to track down specific bottlenecks.

A1: A good read on GPU microarchitecture latency landscapes, including the numbers you asked about, is available at http://ufdc.ufl.edu/UFE0043739/00001

Fig. 4.1 [symbolic multi-SM architecture] and Table 4-1 (PTX instruction classes, launch and execution cycles) are worth keeping in mind whenever you set up the cost/benefit equation before starting to build a GPU kernel.

Amdahl's law ultimately helps decide, quantitatively, whether a CPU-GPU-CPU pipeline can get the job done faster.
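A hedged sketch of that decision: extend Amdahl's law with a fixed host-device transfer-overhead term (the ~11 + 10 µs latencies reported further down are used as example inputs, not authoritative figures):

```python
def gpu_pipeline_speedup(parallel_fraction: float, gpu_speedup: float,
                         t_total_s: float, xfer_overhead_s: float) -> float:
    """Amdahl's law with an added fixed transfer-overhead term.

    parallel_fraction -- share of the runtime the GPU accelerates
    gpu_speedup       -- speedup of that share on the GPU
    t_total_s         -- original CPU-only runtime in seconds
    xfer_overhead_s   -- HostToDevice + DeviceToHost cost in seconds
    """
    serial = (1.0 - parallel_fraction) * t_total_s
    parallel = parallel_fraction * t_total_s / gpu_speedup
    return t_total_s / (serial + parallel + xfer_overhead_s)

# e.g. 90% parallel work, 20x on the GPU, a 1 ms job,
# ~21 us of up+down PCIe latency (example values only)
print(round(gpu_pipeline_speedup(0.9, 20.0, 1e-3, 21e-6), 2))  # ~6.02
```

With zero transfer overhead the function reduces to plain Amdahl's law; with small jobs the fixed PCIe latency quickly dominates and can make the pipeline a net loss.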


A2: A GPU instruction-code simulator is one way to obtain the expected number of GPU-CLKs a particular microarchitecture has to spend (at minimum, since more GPU kernels may be using the hardware resources concurrently, which stretches the end-to-end latency observed in vivo).

The numbers matter:

   Category                     GPU
   |                            Hardware
   |                            Unit
   |                            |            Throughput
   |                            |            |               Execution
   |                            |            |               Latency
   |                            |            |               |                  PTX instructions                                                      Note 
   |____________________________|____________|_______________|__________________|_____________________________________________________________________
   Load_shared                  LSU          2               +  30              ld, ldu                                                               Note, .ss = .shared ; .vec and .type determine the size of load. Note also that we omit .cop since no cacheable in Ocelot
   Load_global                  LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .global; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_local                   LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .local; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_const                   LSU          2               + 600              ld, ldu                                                               Note, .ss = .const; .vec and .type determine the size of load
   Load_param                   LSU          2               +  30              ld, ldu                                                               Note, .ss = .param; .vec and .type determine the size of load
   |                            |                              
   Store_shared                 LSU          2               +  30              st                                                                    Note, .ss = .shared; .vec and .type determine the size of store
   Store_global                 LSU          2               + 600              st                                                                    Note, .ss = .global; .vec and .type determine the size of store
   Store_local                  LSU          2               + 600              st                                                                    Note, .ss = .local; .vec and .type determine the size of store
   Read_modify_write_shared     LSU          2               + 600              atom, red                                                             Note, .space = shared; .type determine the size
   Read_modify_write_global     LSU          2               + 600              atom, red                                                             Note, .space = global; .type determine the size
   |                            |                              
   Texture                      LSU          2               + 600              tex, txq, suld, sust, sured, suq
   |                            |                              
   Integer                      ALU          2               +  24              add, sub, add.cc, addc, sub.cc, subc, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max, popc, clz, bfind, brev, bfe, bfi, prmt, mov
   |                            |                                                                                                                     Note, these integer inst. with type = { .u16, .u32, .u64, .s16, .s32, .s64 };
   |                            |                              
   Float_single                 ALU          2               +  24              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-single inst. with type = { .f32 };
   Float_double                 ALU          1               +  48              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-double inst. with type = { .f64 };
   Special_single               SFU          8               +  48              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-single with type = { .f32 };
   Special_double               SFU          8               +  72              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-double with type = { .f64 };
   |                                                           
   Logical                      ALU          2               +  24              and, or, xor, not, cnot, shl, shr
   Control                      ALU          2               +  24              bra, call, ret, exit
   |                                                           
   Synchronization              ALU          2               +  24              bar, membar, vote
   Compare & Select             ALU          2               +  24              set, setp, selp, slct
   |                                                           
   Conversion                   ALU          2               +  24              Isspacep, cvta, cvt
   Miscellanies                 ALU          2               +  24              brkpt, pmevent, trap
   Video                        ALU          2               +  24              vadd, vsub, vabsdiff, vmin, vmax, vshl, vshr, vmad, vset
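To translate the cycle counts in the table above into wall-clock time, divide by the GPU clock. A small, hedged helper (the 1.147 GHz Fermi clock from the device report below is used as the example):

```python
def cycles_to_ns(cycles: float, gpu_clk_hz: float) -> float:
    """Convert a latency expressed in GPU clock cycles to nanoseconds."""
    return cycles / gpu_clk_hz * 1e9

# a ~600-cycle global load on a 1.147 GHz Fermi SM:
print(round(cycles_to_ns(600, 1.147e9), 1))  # ~523.1 ns

# a ~30-cycle shared-memory load on the same clock:
print(round(cycles_to_ns(30, 1.147e9), 1))   # ~26.2 ns
```

The 20x gap between shared and global memory access in the table is exactly why staging data through __shared__ memory pays off for re-used data.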

+====================| + 11-12 [usec] XFER-LATENCY-up   HostToDevice    ~~~ same as Intel X48 / nForce 790i
|   |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
|   |||||||||||||||||| ~  5.5 GB/sec XFER-BW-up                         ~~~ same as DDR2/DDR3 throughput
|   |||||||||||||||||| ~  5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD      ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
|   ||||||||||||||||||
|   | PCIe-2.0 ( 4x) | ~ 4 GB/s over  4-Lanes ( PORT #2  )
|   | PCIe-2.0 ( 8x) | ~16 GB/s over  8-Lanes
|   | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x )
|   ||||||||||||||||||
+====================|
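The transfer figures above combine into a simple first-order cost model, fixed latency plus size over bandwidth; a hedged sketch using the ~11 µs latency and ~5.5 GB/s bandwidth measured above:

```python
def xfer_time_s(size_bytes: int, latency_s: float,
                bandwidth_bps: float) -> float:
    """First-order PCIe transfer cost: fixed latency plus streaming time."""
    return latency_s + size_bytes / bandwidth_bps

# an 8 MB HostToDevice copy at 11 us latency and 5.5 GB/s:
t = xfer_time_s(8 << 20, 11e-6, 5.5e9)
print(round(t * 1e3, 3), "ms")  # ~1.536 ms
```

For large copies the bandwidth term dominates; for kilobyte-sized transfers the fixed ~11 µs latency is essentially the whole cost, which is why batching many small transfers into one large one pays off.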
|                                                                                                                                        PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
|                                                       smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
|                                                                                                              +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
|                                                                                                                          |                    ^^^^^^^^
|                                                                                                                       +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
|                                                                                                                          |                    ^^^^^^^^
|                                                                                                                   ~  +20 [ns] @1147 MHz FERMI ^^^^^^^^
|                                                             SM-REGISTERs/thread: max  63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
|                                                                                  max  63 for CC-3.0 -          about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
|                                                                                  max 128 for CC-1.x                                    PAR -- ||||||||~~~|
|                                                                                  max 255 for CC-3.5                                    PAR -- ||||||||||||||||||~~~~~~|
|
|                                                       smREGs___BW                                 ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE <<  -Xptxas -v          || nvcc -maxrregcount ( w|w/o spillover(s) )
|                                                                with about 8.0  TB/s BW            [C:Pg.46]
|                                                                           1.3  TB/s BW shaMEM___  4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
|                                                                           0.1  TB/s BW gloMEM___
|         ________________________________________________________________________________________________________________________________________________________________________________________________________________________
+========|   DEVICE:3 PERSISTENT                          gloMEM___
|       _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+======|   DEVICE:2 PERSISTENT                          gloMEM___
|     _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+====|   DEVICE:1 PERSISTENT                          gloMEM___
|   _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+==|   DEVICE:0 PERSISTENT                          gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
!  |                                                         |                                                                +                                                                                           |
o  |                                                texMEM___|____________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
   |                                                         |                                  |                           +                                               |                                          |
   |                                              texL2cache_|                                 .| _ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |                                    256_KB|
   |                                                         |                                  |                           +                                 |            ^                                          |
   |                                                         |                                  |                           +                                 |            ^                                          |
   |                                                         |                                  |                           +                                 |            ^                                          |
   |                                              texL1cache_|                                 .|     _ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ |   _ _ _ _ _^                                      5_KB|
   |                                                         |                                  |                           +                         ^      ^            ^                                         |
   |                                     shaMEM + conL3cache_|                                  |        _ _ _ _ conL3cache +220 [GPU_CLKs]           ^      ^            ^                                    32_KB|
   |                                                         |                                  |               ^          +                         ^      ^            ^                                         |
   |                                                         |                                  |               ^          +                         ^      ^            ^                                         |
   |                                   ______________________|__________________________________|_______________^__________+________________________________________________________________________________________|
   |                  +220 [GPU-CLKs]_|           |_ _ _  ___|           _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  _ _ _ __ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
   | L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB  L2_|_ _ _   __|\           _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _  _ _ _ __ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
   | L1-on-re-use-only +40 [GPU-CLKs]_|  8 KB  L1_|_ _ _    _|\          ________________________________________________+ 40 [GPU_CLKs]_____________________________________________________________________________|
   | L1-on-re-use-only + 8 [GPU-CLKs]_|  2 KB  L1_|__________|\\_________________________________________________________+  8 [GPU_CLKs]_________________________________________________________conL1cache      2_KB|
   |     on-chip|smREG +22 [GPU-CLKs]_|           |t[0_______^:~~~~~~~~~~~~~~~~:________]
   |CC-  MAX    |_|_|_|_|_|_|_|_|_|_|_|           |t[1_______^                  :________]
   |2.x   63    |_|_|_|_|_|_|_|_|_|_|_|           |t[2_______^                  :________] 
   |1.x  128    |_|_|_|_|_|_|_|_|_|_|_|           |t[3_______^                  :________]
   |3.5  255 REGISTERs|_|_|_|_|_|_|_|_|           |t[4_______^                  :________]
   |         per|_|_|_|_|_|_|_|_|_|_|_|           |t[5_______^                  :________]
   |         Thread_|_|_|_|_|_|_|_|_|_|           |t[6_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[7_______^     1stHalf-WARP :________]______________
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 9_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ A_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ B_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ C_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ D_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ E_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|       W0..|t[ F_______^____________WARP__:________]_____________
   |            |_|_|_|_|_|_|_|_|_|_|_|         ..............             
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[0_______^:~~~~~~~~~~~~~~~:________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[1_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[2_______^                 :________] 
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[3_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[4_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[5_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[6_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[7_______^    1stHalf-WARP :________]______________
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 9_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ A_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ B_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ C_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ D_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ E_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|       W1..............|t[ F_______^___________WARP__:________]_____________
   |            |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
   |
   |                   ________________          °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
   |                  /                   CC-2.0|||||||||||||||||||||||||| ~masked  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
   |                 /                    1.hW  ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
   |                /                     2.hW  |^|^|^|^|^|^|^|^|^|^|^|^|^          |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
   |_______________/                      ______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
   |~~~~~~~~~~~~~~/ SM:0.warpScheduler    /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
   |                        |           //
   |                        RR-mode    //
   |                    GREEDY-mode   //
   |                 ________________//
   |                   ______________/
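The "+N [GPU_CLKs] latency ( maskable by M WARPs )" annotations in the diagram follow Little's law: with an instruction issued roughly every cycle per scheduler, the number of resident warps needed to keep the SM busy during a latency of L cycles is about L divided by the issue interval. A hedged sketch of that estimate (the issue interval of 1 cycle is an assumption matching the diagram's 22-cycle/22-warp example):

```python
import math

def warps_to_hide(latency_cycles: int,
                  issue_interval_cycles: int = 1) -> int:
    """Little's-law estimate of resident warps needed to mask a latency."""
    return math.ceil(latency_cycles / issue_interval_cycles)

# the diagram's examples:
print(warps_to_hide(22))    # 22-cycle register dependency on CC-2.x
print(warps_to_hide(400))   # lower bound of a ~400-800 cycle spill
```

This is also why the register-spillover penalty above is so punishing: masking 400-800 cycles needs far more resident warps than the occupancy limits of an SM allow.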

GPU cluster self-report:

* FERMI
* GF100 Server/HPC-GPU PCIe-2.0-16x
*                      GPU_CLK 1.15 GHz [Graphics 575 MHz]
*                      3072 MB GDDR5 773MHz + ECC-correction
* 
*                       448-CUDA-COREs [SMX]-s --> 14 SM * warpSize == 448
*                        48-ROPs
*                        56-TEX-Units, 400 MHz RAMDAC
CUDA API reports .self to operate an API-Driver-version [5000]
                                  on RT-version         [5000]
CUDA API reports .self to operate a limited FIFO [Host] <-|buffer| <-[Device] of a size of 1048576 [B]
CUDA API reports .self to operate a limited HEAP                  for[Device]-side Dynamic __global__ Memory Allocations in a size of 8388608 [B] ( 8 MB if not specified in malloc() call )
CUDA API reports .self to operate cudaCreateStreamWithPriority() QUEUEs
                                                  with <_stream_PRIO_LOW____> == 725871085
                                                  with <_stream_PRIO_HIGH___> == 0

  CUDA Device:0_ has <_compute capability_> == 2.0.
  CUDA Device:0_ has [    Tesla M2050] .name
  CUDA Device:0_ has [             14] .multiProcessorCount         [ Number of multiprocessors on device ]
  CUDA Device:0_ has [     2817982464] .totalGlobalMem              [ __global__   memory available on device in Bytes [B] ]
  CUDA Device:0_ has [          65536] .totalConstMem               [ __constant__ memory available on device in Bytes [B] ]
  CUDA Device:0_ has [        1147000] .clockRate                   [ GPU_CLK frequency in kilohertz [kHz] ]
  CUDA Device:0_ has [             32] .warpSize                    [ GPU WARP size in threads ]
  CUDA Device:0_ has [        1546000] .memoryClockRate             [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
  CUDA Device:0_ has [            384] .memoryBusWidth              [ GPU_DDR Global memory bus width in bits [b] ]
  CUDA Device:0_ has [           1024] .maxThreadsPerBlock          [ MAX Threads per Block ]
  CUDA Device:0_ has [          32768] .regsPerBlock                [ MAX number of 32-bit Registers available per Block ]
  CUDA Device:0_ has [           1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
  CUDA Device:0_ has [         786432] .l2CacheSize
  CUDA Device:0_ has [          49152] .sharedMemPerBlock           [ __shared__   memory available per Block in Bytes [B] ]
  CUDA Device:0_ has [              2] .asyncEngineCount            [ a number of asynchronous engines ]
  CUDA Device:0_ has [              1] .deviceOverlap               [ if Device can concurrently copy memory and execute a kernel ]
  CUDA Device:0_ has [              0] .kernelExecTimeoutEnabled    [ if there is a run time limit on kernel exec-s ]
  CUDA Device:0_ has [              1] .concurrentKernels           [ if Device can possibly execute multiple kernels concurrently ]
  CUDA Device:0_ has [              1] .canMapHostMemory            [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
  CUDA Device:0_ has [              3] .computeMode                 [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
  CUDA Device:0_ has [              1] .ECCEnabled                  [ if has ECC support enabled ]
  CUDA Device:0_ has [     2147483647] .memPitch                    [ MAX pitch in bytes allowed by memory copies [B] ]
  CUDA Device:0_ has [          65536] .maxSurface1D                [ MAX 1D surface size ]
  CUDA Device:0_ has [          32768] .maxSurfaceCubemap           [ MAX Cubemap surface dimensions ]
  CUDA Device:0_ has [          65536] .maxTexture1D                [ MAX 1D Texture size ]
  CUDA Device:0_ has [              0] .pciBusID                    [ PCI bus ID of the device ]
  CUDA Device:0_ has [              0] .integrated                  [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
  CUDA Device:0_ has [              1] .unifiedAddressing           [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]
  CUDA Device:1_ has <_compute capability_> == 2.0.
  CUDA Device:1_ has [    Tesla M2050] .name
  CUDA Device:1_ has [             14] .multiProcessorCount         [ Number of multiprocessors on device ]
  CUDA Device:1_ has [     2817982464] .totalGlobalMem              [ __global__   memory available on device in Bytes [B] ]
  CUDA Device:1_ has [          65536] .totalConstMem               [ __constant__ memory available on device in Bytes [B] ]
  CUDA Device:1_ has [        1147000] .clockRate                   [ GPU_CLK frequency in kilohertz [kHz] ]
  CUDA Device:1_ has [             32] .warpSize                    [ GPU WARP size in threads ]
  CUDA Device:1_ has [        1546000] .memoryClockRate             [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
  CUDA Device:1_ has [            384] .memoryBusWidth              [ GPU_DDR Global memory bus width in bits [b] ]
  CUDA Device:1_ has [           1024] .maxThreadsPerBlock          [ MAX Threads per Block ]
  CUDA Device:1_ has [          32768] .regsPerBlock                [ MAX number of 32-bit Registers available per Block ]
  CUDA Device:1_ has [           1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
  CUDA Device:1_ has [         786432] .l2CacheSize
  CUDA Device:1_ has [          49152] .sharedMemPerBlock           [ __shared__   memory available per Block in Bytes [B] ]
  CUDA Device:1_ has [              2] .asyncEngineCount            [ a number of asynchronous engines ]
  CUDA Device:1_ has [              1] .deviceOverlap               [ if Device can concurrently copy memory and execute a kernel ]
  CUDA Device:1_ has [              0] .kernelExecTimeoutEnabled    [ if there is a run time limit on kernel exec-s ]
  CUDA Device:1_ has [              1] .concurrentKernels           [ if Device can possibly execute multiple kernels concurrently ]
  CUDA Device:1_ has [              1] .canMapHostMemory            [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
  CUDA Device:1_ has [              3] .computeMode                 [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
  CUDA Device:1_ has [              1] .ECCEnabled                  [ if has ECC support enabled ]
  CUDA Device:1_ has [     2147483647] .memPitch                    [ MAX pitch in bytes allowed by memory copies [B] ]
  CUDA Device:1_ has [          65536] .maxSurface1D                [ MAX 1D surface size ]
  CUDA Device:1_ has [          32768] .maxSurfaceCubemap           [ MAX Cubemap surface dimensions ]
  CUDA Device:1_ has [          65536] .maxTexture1D                [ MAX 1D Texture size ]
  CUDA Device:1_ has [              0] .pciBusID                    [ PCI bus ID of the device ]
  CUDA Device:1_ has [              0] .integrated                  [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
  CUDA Device:1_ has [              1] .unifiedAddressing           [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]
