为什么内存集慢



我的CPU规范说它应该获得5.336GB/s的内存带宽。为了测试这一点,我编写了一个简单的程序,在一个大数组上运行memset(或memcpy)并报告时间。我在memset上显示3.8GB/s,在memcpy上显示1.9GB/s。http://en.wikipedia.org/wiki/Intel_Core_(微体系结构)说我的Q9400应该是5.336MB/s。怎么了?

我试过用赋值循环替换memset或memcpy。我在谷歌上搜索了一下,试图了解记忆对齐。我尝试过不同的编译器标志。我在这件事上花了令人尴尬的几个小时。感谢您提供的任何帮助!

我使用的Ubuntu 12.04带有libc开发版本2.15-0ubuntu10.5和内核3.8.0-37-generic

代码:

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#define numBytes ((long)(1024*1024*1024))
#define numTransfers ((long)(8))
int main(int argc,char**argv){
    if(argc!=3){
        printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFERn",argv[0]);
        return -1;
    }
    char*__restrict__ source=(char*)malloc(numBytes);
    char*__restrict__ dest=(char*)malloc(numBytes);
    struct timespec start,end;
    long totalTimeMs;
    int i;
    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memset(source,0,numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);
    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memcpy( dest, source, numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).n",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);
    free(source);
    free(dest);
    return EXIT_SUCCESS;
}

编译命令:

gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.c.o -c memcpyStackOverflowNoParameters.c
gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.c.o -o memcpy -rdynamic -lrt 

样本输出:

memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).

我的硬件根据lshw:

  product: OptiPlex 960 ()
  vendor: Winbond Electronics
  width: 64 bits
*-core
     description: Motherboard
     product: 0Y958C
     vendor: Winbond Electronics
   *-firmware
        description: BIOS
        capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
   *-cpu
        product: Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
        physical id: 400
        size: 2666MHz
        width: 64 bits
        clock: 1333MHz
        capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
        configuration: cores=4 enabledcores=4 threads=4
      *-cache:0
           description: L1 cache
           physical id: 700
           size: 256KiB
           capacity: 256KiB
           capabilities: internal write-back unified
      *-cache:1
           description: L2 cache
           physical id: 701
           size: 6MiB
           capacity: 6MiB
           capabilities: internal varies unified
   *-memory
        description: System Memory
        physical id: 1000
        slot: System board or motherboard
        size: 4GiB
      *-bank:0
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
           product: CT51264AA667.M16FC
           vendor: 7F7F7F7F7F9B0000
           slot: DIMM_1
           size: 4GiB
           width: 64 bits
           clock: 667MHz (1.5ns)
      *-bank:1
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
      *-bank:2
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
      *-bank:3
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]

内存地址被"虚拟化",程序使用的地址被转换为真实地址。这种转换可以从当时方便的部分中分配程序所认为的连续内存。每个通用CPU都这样做。转换需要查找表,这需要访问内存。CPU有缓存,但长时间的虚拟地址很容易破坏其缓存"TLB"("翻译后备缓冲区")。因此,每4KB(linux系统上的2MB),CPU就会停下来寻找真正发送内存流量的地方。那些摊位可能需要相当长的时间。您可以尝试运行基准测试的两个副本,TLB未命中似乎是合理的,并且您将获得更接近额定容量的聚合带宽。

(编辑:嗯,你可能想用代替你的#define

size_t numBytes=atoi(argv[1]);
size_t numTransfers=atoi(argv[2]);

在主体中…)

编辑:顺便说一句:我在盒子上的这次测试中看到的带宽(并在评论中报告)远远低于我的cpu额定容量,这让我开始调查我自己的系统。我的盒子制造商在那些插槽里放了非常糟糕的内存。我早就用一个众所周知的好品牌取代了它们,报告的吞吐量增加了一倍多,机器的性能也明显提高了。

上次我检查memcpy和memset在GCC中没有优化。2012年仍然如此。参见C++第2.6节Agner Fog的优化软件2.6"函数库的选择"和表2.1。他比较了几种不同的编译器和操作系统。

GCC内置了执行memcpy的函数。显然,它们甚至比Glib中的memcpy还要糟糕。据我所知,GCC开发人员和Glib开发人员是独立工作的。要从Glib获取库,您需要使用-fno-builtin。然而,尽管Glib现在(或者至少曾经)更好,但它仍然不是最佳的。要获得最佳效果,请使用Agner Fog的asmlib。他优化了memcpy和memset以及汇编中的许多其他常见函数,以利用SSE和AVX等优化。

相关内容

  • 没有找到相关文章

最新更新