缓存错过压力测试:令人惊叹的结果.任何解释



为了使现代计算机的实际性能相对在缓存中(记忆中的数据是如何传播'),我进行了一个简单的测试,我分配了500 MB的RAM,然后执行恒定数量的读数,然后我通过增加的字节偏移执行该测试。最后,当我到达时,我在1000 MB缓冲区的末端包裹。

我对结果感到惊讶。看起来有32个字节的成本障碍,另一个大约32 kb。我想这与L1/L2/L3缓存负载或虚拟内存页面大小有关吗?最让我震惊的是,似乎只有16个完全不同的记忆位置正在缓存。那很低!!!任何解释(OS,硬件)?

这是3个不同计算机的结果,从最快的计算机到最便宜的计算机,其次是我的简单测试代码(仅使用标准LIB)。

16 GB RAM快速HP工作站(在32位Windows中进行测试):

time=0.364000 byteIncrement=4 numReadLocations=262144000 numReads=262144000
time=0.231000 byteIncrement=8 numReadLocations=131072000 numReads=262144000
time=0.339000 byteIncrement=16 numReadLocations=65536000 numReads=262144000
time=0.567000 byteIncrement=32 numReadLocations=32768000 numReads=262144000
time=1.177000 byteIncrement=64 numReadLocations=16384000 numReads=262144000
time=1.806000 byteIncrement=128 numReadLocations=8192000 numReads=262144000
time=2.293000 byteIncrement=256 numReadLocations=4096000 numReads=262144000
time=2.464000 byteIncrement=512 numReadLocations=2048000 numReads=262144000
time=2.621000 byteIncrement=1024 numReadLocations=1024000 numReads=262144000
time=2.775000 byteIncrement=2048 numReadLocations=512000 numReads=262144000
time=2.908000 byteIncrement=4096 numReadLocations=256000 numReads=262144000
time=3.007000 byteIncrement=8192 numReadLocations=128000 numReads=262144000
time=3.183000 byteIncrement=16384 numReadLocations=64000 numReads=262144000
time=3.758000 byteIncrement=32768 numReadLocations=32000 numReads=262144000
time=4.287000 byteIncrement=65536 numReadLocations=16000 numReads=262144000
time=6.366000 byteIncrement=131072 numReadLocations=8000 numReads=262144000
time=6.124000 byteIncrement=262144 numReadLocations=4000 numReads=262144000
time=5.295000 byteIncrement=524288 numReadLocations=2000 numReads=262144000
time=5.540000 byteIncrement=1048576 numReadLocations=1000 numReads=262144000
time=5.844000 byteIncrement=2097152 numReadLocations=500 numReads=262144000
time=5.785000 byteIncrement=4194304 numReadLocations=250 numReads=262144000
time=5.714000 byteIncrement=8388608 numReadLocations=125 numReads=262144000
time=5.825000 byteIncrement=16777216 numReadLocations=62 numReads=262144000
time=5.759000 byteIncrement=33554432 numReadLocations=31 numReads=262144000
time=2.222000 byteIncrement=67108864 numReadLocations=15 numReads=262144000
time=0.471000 byteIncrement=134217728 numReadLocations=7 numReads=262144000
time=0.377000 byteIncrement=268435456 numReadLocations=3 numReads=262144000
time=0.166000 byteIncrement=536870912 numReadLocations=1 numReads=262144000

4 GB RAM MACBOOCPRO 2010(32位Windows测试):

time=0.476000 byteIncrement=4 numReadLocations=262144000 numReads=262144000
time=0.357000 byteIncrement=8 numReadLocations=131072000 numReads=262144000
time=0.634000 byteIncrement=16 numReadLocations=65536000 numReads=262144000
time=1.173000 byteIncrement=32 numReadLocations=32768000 numReads=262144000
time=2.360000 byteIncrement=64 numReadLocations=16384000 numReads=262144000
time=3.469000 byteIncrement=128 numReadLocations=8192000 numReads=262144000
time=3.990000 byteIncrement=256 numReadLocations=4096000 numReads=262144000
time=3.549000 byteIncrement=512 numReadLocations=2048000 numReads=262144000
time=3.758000 byteIncrement=1024 numReadLocations=1024000 numReads=262144000
time=3.867000 byteIncrement=2048 numReadLocations=512000 numReads=262144000
time=4.275000 byteIncrement=4096 numReadLocations=256000 numReads=262144000
time=4.310000 byteIncrement=8192 numReadLocations=128000 numReads=262144000
time=4.584000 byteIncrement=16384 numReadLocations=64000 numReads=262144000
time=5.144000 byteIncrement=32768 numReadLocations=32000 numReads=262144000
time=6.100000 byteIncrement=65536 numReadLocations=16000 numReads=262144000
time=8.111000 byteIncrement=131072 numReadLocations=8000 numReads=262144000
time=6.256000 byteIncrement=262144 numReadLocations=4000 numReads=262144000
time=6.311000 byteIncrement=524288 numReadLocations=2000 numReads=262144000
time=6.416000 byteIncrement=1048576 numReadLocations=1000 numReads=262144000
time=6.635000 byteIncrement=2097152 numReadLocations=500 numReads=262144000
time=6.530000 byteIncrement=4194304 numReadLocations=250 numReads=262144000
time=6.544000 byteIncrement=8388608 numReadLocations=125 numReads=262144000
time=6.545000 byteIncrement=16777216 numReadLocations=62 numReads=262144000
time=5.272000 byteIncrement=33554432 numReadLocations=31 numReads=262144000
time=1.524000 byteIncrement=67108864 numReadLocations=15 numReads=262144000
time=0.538000 byteIncrement=134217728 numReadLocations=7 numReads=262144000
time=0.508000 byteIncrement=268435456 numReadLocations=3 numReads=262144000
time=0.817000 byteIncrement=536870912 numReadLocations=1 numReads=262144000

4GB RAM便宜的ACER"家庭计算机":

time=0.732000 byteIncrement=4 numReadLocations=262144000 numReads=262144000
time=0.549000 byteIncrement=8 numReadLocations=131072000 numReads=262144000
time=0.765000 byteIncrement=16 numReadLocations=65536000 numReads=262144000
time=1.196000 byteIncrement=32 numReadLocations=32768000 numReads=262144000
time=2.318000 byteIncrement=64 numReadLocations=16384000 numReads=262144000
time=2.483000 byteIncrement=128 numReadLocations=8192000 numReads=262144000
time=2.760000 byteIncrement=256 numReadLocations=4096000 numReads=262144000
time=3.194000 byteIncrement=512 numReadLocations=2048000 numReads=262144000
time=3.369000 byteIncrement=1024 numReadLocations=1024000 numReads=262144000
time=3.720000 byteIncrement=2048 numReadLocations=512000 numReads=262144000
time=4.842000 byteIncrement=4096 numReadLocations=256000 numReads=262144000
time=5.797000 byteIncrement=8192 numReadLocations=128000 numReads=262144000
time=9.865000 byteIncrement=16384 numReadLocations=64000 numReads=262144000
time=19.273000 byteIncrement=32768 numReadLocations=32000 numReads=262144000
time=19.282000 byteIncrement=65536 numReadLocations=16000 numReads=262144000
time=19.606000 byteIncrement=131072 numReadLocations=8000 numReads=262144000
time=20.242000 byteIncrement=262144 numReadLocations=4000 numReads=262144000
time=20.956000 byteIncrement=524288 numReadLocations=2000 numReads=262144000
time=22.627000 byteIncrement=1048576 numReadLocations=1000 numReads=262144000
time=24.336000 byteIncrement=2097152 numReadLocations=500 numReads=262144000
time=24.403000 byteIncrement=4194304 numReadLocations=250 numReads=262144000
time=23.060000 byteIncrement=8388608 numReadLocations=125 numReads=262144000
time=20.553000 byteIncrement=16777216 numReadLocations=62 numReads=262144000
time=14.460000 byteIncrement=33554432 numReadLocations=31 numReads=262144000
time=1.752000 byteIncrement=67108864 numReadLocations=15 numReads=262144000
time=0.963000 byteIncrement=134217728 numReadLocations=7 numReads=262144000
time=0.687000 byteIncrement=268435456 numReadLocations=3 numReads=262144000
time=0.453000 byteIncrement=536870912 numReadLocations=1 numReads=262144000

代码:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MEMBLOCSIZE ((2<<20)*500)//1000MB
int readMemory( int* data, int* dataEnd, int numReads, int incrementPerRead ) {
  int accum = 0;
  int* ptr = data;
  while(true) {
    accum += *ptr;
    if( numReads-- == 0)
      return accum;
    ptr += incrementPerRead;
    if( ptr >= dataEnd )
      ptr = data;
  }
}
int main()
{
  int* data = (int*)malloc(MEMBLOCSIZE);
  int* dataEnd = data+(MEMBLOCSIZE / sizeof(int));
  int numReads = (MEMBLOCSIZE / sizeof(int));
  int dummyTotal = 0;
  int increment = 1;
  for( int loop = 0; loop < 28; ++loop ) {
    int startTime = clock();
    dummyTotal += readMemory(data, dataEnd, numReads, increment);
    int endTime = clock();
    double deltaTime = double(endTime-startTime)/double(CLOCKS_PER_SEC);
    printf("time=%f byteIncrement=%d numReadLocations=%d numReads=%dn",
      deltaTime, increment*sizeof(int), MEMBLOCSIZE/(increment*sizeof(int)), numReads);
    increment *= 2;
  }
  //Use dummyTotal: make sure the optimizer is not removing my code...
  return dummyTotal == 666 ? 1: 0;
}

根据一些注释,我修改了测试以仅使用250 MB的RAM,并为每次"读取"进行16次读取,以防预取料。它仍然具有相似的结果,但是最后的测试是,读取距离很少的位置的测试具有更好的性能(2秒而不是5秒),因此可能是因为初始测试未激活预摘要。<<<<<<<<<<<<<<<</p>

#define MEMBLOCSIZE 262144000//250MB
int readMemory( int* data, int* dataEnd, int numReads, int incrementPerRead ) {
  int accum = 0;
  int* ptr = data;
  while(true) {
    accum += *ptr;
    if( numReads-- == 0)
      return accum;
    //Do 16 consecutive reads
    for( int i = 1; i < 17; ++i )
      accum += *(ptr+i);
    ptr += incrementPerRead;
    if( ptr >= dataEnd+17 )
      ptr = data;
  }
}

MacBookPro的此更新测试的结果2010:

time=0.691000 byteIncrement=4 numReadLocations=65536000 numReads=65536000
time=0.620000 byteIncrement=8 numReadLocations=32768000 numReads=65536000
time=0.715000 byteIncrement=16 numReadLocations=16384000 numReads=65536000
time=0.827000 byteIncrement=32 numReadLocations=8192000 numReads=65536000
time=0.917000 byteIncrement=64 numReadLocations=4096000 numReads=65536000
time=1.440000 byteIncrement=128 numReadLocations=2048000 numReads=65536000
time=2.646000 byteIncrement=256 numReadLocations=1024000 numReads=65536000
time=3.720000 byteIncrement=512 numReadLocations=512000 numReads=65536000
time=3.854000 byteIncrement=1024 numReadLocations=256000 numReads=65536000
time=4.673000 byteIncrement=2048 numReadLocations=128000 numReads=65536000
time=4.729000 byteIncrement=4096 numReadLocations=64000 numReads=65536000
time=4.784000 byteIncrement=8192 numReadLocations=32000 numReads=65536000
time=5.021000 byteIncrement=16384 numReadLocations=16000 numReads=65536000
time=5.022000 byteIncrement=32768 numReadLocations=8000 numReads=65536000
time=4.871000 byteIncrement=65536 numReadLocations=4000 numReads=65536000
time=5.163000 byteIncrement=131072 numReadLocations=2000 numReads=65536000
time=5.276000 byteIncrement=262144 numReadLocations=1000 numReads=65536000
time=4.699000 byteIncrement=524288 numReadLocations=500 numReads=65536000
time=1.997000 byteIncrement=1048576 numReadLocations=250 numReads=65536000
time=2.118000 byteIncrement=2097152 numReadLocations=125 numReads=65536000
time=2.071000 byteIncrement=4194304 numReadLocations=62 numReads=65536000
time=2.036000 byteIncrement=8388608 numReadLocations=31 numReads=65536000
time=1.923000 byteIncrement=16777216 numReadLocations=15 numReads=65536000
time=1.084000 byteIncrement=33554432 numReadLocations=7 numReads=65536000
time=0.607000 byteIncrement=67108864 numReadLocations=3 numReads=65536000
time=0.622000 byteIncrement=134217728 numReadLocations=1 numReads=65536000

请注意,以下大多数(如您得出的结论)都是投机性的。内存基准测试是超复杂的,并且像您这样做的方式相对幼稚的基准测试很少提供有关真实程序性能的大量确定信息。

您以32 KIB命名的主要"成本障碍"可能是可能的在64KIB(或两者的组合)。由于您不初始化内存,因此在阅读时,Windows会删除零页面。分配粒度为64 KIB,即使在64 KIB范围内的一个页面之一移入您的工作集中,页面始终被"准备"(如果您的存储器映射为手段)。我发现这是通过内存映射实验的东西。

Windows设置的过程设置的工作设置默认情况下是荒谬的,因此,当您在该内存块上迭代时,您将不断具有页面故障。有些价格较低,仅在页面描述符中更改标志,而另一些则(64 KIB)更昂贵,从零池中拉出16个新页面(或者,在最坏的情况下,如果池为空,则零页)。这很可能解释了您看到的"成本障碍"之一。

正如您正确注意到的,另一个成本障碍是缓存关联性。较大的两个偏移量的不同地址使用相同的缓存条目。鉴于"不健康"的偏移,可以一遍又一遍地驱逐同一缓存线。这是对准很好的两个主要原因之一,但是过度对准不好(另一个不是数据的位置)。

32个字节的成本障碍令人惊讶,如果有的话,人们可以想象它是64个字节(在测试架构上越过高速缓存线)。预摘要在大多数情况下应消除这种失速,但是通常仅激活预摘要(如果您不明确提示它)在第二个高速缓存线失踪后,

这对于"真实"程序是完全可以接受的,该程序仅读取一个位置,或者依次迭代大部分数据。另一方面,在进行人工测量时,它可能很容易产生令人困惑的结果。这也可能是一个可能的解释,为什么您会看到32 KIB的成本障碍。如果预摘要不起作用,那将是您在典型x86上用光的L1缓存的点。

最新更新