I'm trying to understand how hardware caches work by writing and running a test program:
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE 64
#define L1_WAYS 8
#define L1_SETS 64
#define L1_LINES 512

// 32 KB of memory to fill the L1 cache
uint8_t data[L1_LINES * LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;
    register uint64_t t1, t2;

    printf("data: %p\n", data);
    //_mm_clflush(data);

    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);       // timestamp before the load
        addr = &data[i];
        junk = *addr;               // load one byte from the same cache line
        t2 = __rdtscp(&junk) - t1;  // elapsed cycles for this access
        printf("i = %2lu, cycles: %lu\n", i, t2);
    }
}
I ran the code with and without _mm_clflush, and the results show that with _mm_clflush the first memory access is actually faster.

With _mm_clflush:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 280
i = 1, cycles: 84
i = 2, cycles: 91
i = 3, cycles: 77
i = 4, cycles: 91
Without _mm_clflush:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 3899
i = 1, cycles: 91
i = 2, cycles: 105
i = 3, cycles: 77
i = 4, cycles: 84
It just doesn't make sense that flushing the cache line makes the access faster. Can anyone explain why this happens? Thanks.
---------------- Further experiment -------------------
Suppose the 3899 cycles are caused by a TLB miss. To verify my understanding of cache hits and misses, I modified the code slightly to compare the memory access time of the L1 cache hit case with the L1 cache miss case. This time, the code strides by the cache line size (64 bytes), so each iteration accesses a different cache line.
*data = 1;           // write to the line before flushing it
_mm_clflush(data);

printf("accessing 16 bytes in a cache line:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2lu, cycles: %lu\n", i, t2);
}

// Invalidate and flush the cache line that contains data from all levels of the cache hierarchy.
_mm_clflush(data);

printf("accessing 16 bytes in different cache lines:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i * LINE_SIZE];   // stride by one cache line (64 bytes)
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2lu, cycles: %lu\n", i, t2);
}
My machine has an 8-way set-associative L1 data cache with 64 sets, 32 KB in total. If I access memory every 64 bytes, every access should be a cache miss. But it seems that many of those cache lines are already cached:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 273
i = 1, cycles: 70
i = 2, cycles: 70
i = 3, cycles: 70
i = 4, cycles: 70
i = 5, cycles: 70
i = 6, cycles: 70
i = 7, cycles: 70
i = 8, cycles: 70
i = 9, cycles: 70
i = 10, cycles: 77
i = 11, cycles: 70
i = 12, cycles: 70
i = 13, cycles: 70
i = 14, cycles: 70
i = 15, cycles: 140
accessing 16 bytes in different cache lines:
i = 0, cycles: 301
i = 1, cycles: 133
i = 2, cycles: 70
i = 3, cycles: 70
i = 4, cycles: 147
i = 5, cycles: 56
i = 6, cycles: 70
i = 7, cycles: 63
i = 8, cycles: 70
i = 9, cycles: 63
i = 10, cycles: 70
i = 11, cycles: 112
i = 12, cycles: 147
i = 13, cycles: 119
i = 14, cycles: 56
i = 15, cycles: 105
Is this caused by prefetching? Or is there something wrong with my understanding? Thanks.
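For reference, here is a minimal sketch (my own check, not part of the timing test; l1_set is a hypothetical helper, and the 64-set, 64-byte-line geometry is the one assumed above) that prints which L1D set each 64-byte-strided address maps to. The 16 strided accesses land in 16 different sets, so they cannot evict one another and any hits must have another cause:

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64
#define L1_SETS   64

// set index = address bits 6..11, assuming 64-byte lines and 64 sets
static unsigned l1_set(const void *p)
{
    return ((uintptr_t)p / LINE_SIZE) % L1_SETS;
}

int main()
{
    static uint8_t data[16 * LINE_SIZE];
    for (int i = 0; i < 16; i++)
        // each 64-byte-strided access maps to a different set
        printf("i = %2d -> set %2u\n", i, l1_set(&data[i * LINE_SIZE]));
}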
I modified the code by adding a write before _mm_clflush(data), and it shows that clflush does flush the cache line. The modified code:
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE 64
#define L1_LINES 512

// 32 KB of memory to fill the L1 cache
uint8_t data[L1_LINES * LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;
    register uint64_t t1, t2;

    data[0] = 1; // write before clflush
    //_mm_clflush(data);

    printf("data: %p\n", data);
    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2lu, cycles: %lu\n", i, t2);
    }
}
I ran the modified code on my computer (Intel(R) Core(TM) i5-8500 CPU) and got the results below. Over several runs, the latency of the first access is noticeably higher when the data has previously been flushed to memory than when it has not.
Without clflush:
data: 0000000000407980
accessing 16 bytes in a cache line:
i = 0, cycles: 64
i = 1, cycles: 46
i = 2, cycles: 49
i = 3, cycles: 48
i = 4, cycles: 46
With clflush:
data: 0000000000407980
accessing 16 bytes in a cache line:
i = 0, cycles: 214
i = 1, cycles: 41
i = 2, cycles: 40
i = 3, cycles: 42
i = 4, cycles: 40
Without clflush, the first load takes about 3899 cycles, which is roughly the time needed to handle a minor page fault. rdtscp serializes the load operations, which ensures that all of the later loads to the same line hit in the L1 cache. Now, when you add clflush just before the loop, the page fault is triggered and handled outside the loop. When the page fault handler returns and clflush is re-executed, the target cache line gets flushed. On Intel processors, rdtscp ensures that the line is flushed before the first load in the loop is issued. Therefore, the first load misses in the cache hierarchy and its latency is roughly that of a memory access. Just like in the previous case, the later loads are serialized by rdtscp, so they all hit in the L1D.
However, even if we account for the rdtscp overhead, the measured L1D hit latency is too high. Did you compile with -O3?
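(For example, assuming your source file is named l1.c:)

$ gcc -O3 -o l1 l1.c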
I was not able to reproduce your results (i.e., the minor page fault) on Linux 4.4.0-154 with gcc 5.5.0 when the cache line is statically allocated, only when I use mmap. If you tell me your compiler version and kernel version, maybe I can investigate further.
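A minimal sketch of what I mean by using mmap instead of the static array (standard POSIX mmap with an anonymous private mapping; the rest of my test harness is omitted):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define LINE_SIZE 64
#define L1_LINES 512

int main()
{
    // anonymous, demand-paged buffer instead of the static data[] array
    uint8_t *data = mmap(NULL, L1_LINES * LINE_SIZE,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    printf("data: %p\n", (void *)data);
    // the first touch of the page triggers the minor page fault
    data[0] = 1;

    munmap(data, L1_LINES * LINE_SIZE);
}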
Regarding the second question: the way you are measuring the load latency cannot distinguish an L1D hit from an L2 hit, because the measurement error can be as large as the difference between the two latencies. You can check with the MEM_LOAD_UOPS_RETIRED.L1_HIT and MEM_LOAD_UOPS_RETIRED.L2_HIT performance counters. A sequential access pattern is easily detected by the L1 and L2 hardware prefetchers, so it's not surprising to get hits if you don't turn off the prefetchers.
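Something along these lines, as a sketch (event names depend on the microarchitecture; on your Coffee Lake part perf may list them as mem_load_retired.l1_hit / l2_hit instead, and toggling the prefetchers through MSR 0x1A4 needs msr-tools and root):

$ perf stat -e mem_load_uops_retired.l1_hit,mem_load_uops_retired.l2_hit ./l1
$ sudo wrmsr -a 0x1a4 0xf   # disable the four L1/L2 hardware prefetchers
$ sudo wrmsr -a 0x1a4 0x0   # re-enable them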