clflush 通过 C 函数使缓存行失效



我正在尝试使用clflush手动逐出缓存行以确定缓存和行大小。我没有找到任何有关如何使用该指令的指南。我所看到的只是一些为此目的使用更高级函数的代码。

有一个内核函数void clflush_cache_range(void *vaddr, unsigned int size),但我仍然不知道在我的代码中包含什么以及如何使用它。我不知道该功能的size是什么。

不仅如此,我如何确定该行被逐出以验证我的代码的正确性?

更新:

这是我正在尝试执行的操作的初始代码。

#include <immintrin.h>
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
int array[ 100 ];
/* will bring array in the cache */
for ( int i = 0; i < 100; i++ )
array[ i ] = i;
/* FLUSH A LINE */
/* each element is 4 bytes */
/* assuming that cache line size is 64 bytes */
/* array[0] till array[15] is flushed */
/* even if line size is less than 64 bytes */
/* we are sure that array[0] has been flushed */
_mm_clflush( &array[ 0 ] );

int tm = 0;
register uint64_t time1, time2, time3;

time1 = __rdtscp( &tm ); /* set timer */
time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
printf( "miss latency = %lu n", time2 );
time3 = __rdtscp( &array[ 0 ] ) - time2; /* array[0] is a cache hit */
printf( "hit latency = %lu n", time3 );
return 0;
}

在运行代码之前,我想手动验证它是否正确代码。我走的路正确吗?我是否正确使用了_mm_clflush

更新:

感谢彼得的评论,我修复了代码如下

time1 = __rdtscp( &tm ); /* set timer */
time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
printf( "miss latency = %lu n", time2 );
time1 = __rdtscp( &tm ); /* set timer */
time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache hit */
printf( "hit latency = %lu n", time1 );

通过多次运行代码,我得到以下输出

$ ./flush
miss latency = 238
hit latency = 168
$ ./flush
miss latency = 154
hit latency = 140
$ ./flush
miss latency = 252
hit latency = 140
$ ./flush
miss latency = 266
hit latency = 252

第一次运行似乎是合理的。但第二次运行看起来很奇怪。通过从命令行运行代码,每次使用值初始化数组时,我都会显式逐出第一行。

UPDATE4:

我尝试了Hadi-Brais代码,这是输出

naderan@webshub:~$ ./flush3
address = 0x7ffec7a92220
array[ 0 ] = 0
miss section latency = 378
array[ 0 ] = 0
hit section latency = 175
overhead latency = 161
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 217 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffedbe0af40
array[ 0 ] = 0
miss section latency = 392
array[ 0 ] = 0
hit section latency = 231
overhead latency = 168
Measured L1 hit latency = 63 TSC cycles
Measured main memory latency = 224 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffead7fdc90
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 252 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffe51a77310
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 182
overhead latency = 161
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 203 TSC cycles

稍微不同的延迟是可以接受的。但是,与 21 和 14 相比,命中延迟为 63 也是可以观察到的。

UPDATE5:

当我检查 Ubuntu 时,没有启用省电功能。也许在 BIOS 中禁用了频率更改,或者缺少配置

$ cat /proc/cpuinfo  | grep -E "(model|MHz)"
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu MHz         : 2097.571
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz  
cpu MHz         : 2097.571
$ lscpu | grep MHz
CPU MHz:             2097.571

无论如何,这意味着频率设置为其最大值,这是我必须关心的。通过多次运行,我看到了一些不同的值。这些正常吗?

$ taskset -c 0 ./flush3
address = 0x7ffe30c57dd0
array[ 0 ] = 0
miss section latency = 602
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 455 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffd16932fd0
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 168
overhead latency = 147
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 252 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffeafb96580
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 161
overhead latency = 140
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffe58291de0
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 168
overhead latency = 140
Measured L1 hit latency = 28 TSC cycles
Measured main memory latency = 217 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7fffa76d20b0
array[ 0 ] = 0
miss section latency = 371
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffdec791580
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 189
overhead latency = 147
Measured L1 hit latency = 42 TSC cycles
Measured main memory latency = 210 TSC cycles

代码中有多个错误,可能会导致您看到的无意义的测量。我已经修复了错误,您可以在下面的评论中找到解释。

/* compile with gcc at optimization level -O3 */
/* set the minimum and maximum CPU frequency for all cores using cpupower to get meaningful results */ 
/* run using "sudo nice -n -20 ./a.out" to minimize possible context switches, or at least use "taskset -c 0 ./a.out" */
/* you can optionally use a p-state scaling driver other than intel_pstate to get more reproducable results */
/* This code still needs improvement to obtain more accurate measurements,
and a lot of effort is required to do that—argh! */
/* Specifically, there is no single constant latency for the L1 because of
the way it's designed, and more so for main memory. */
/* Things such as virtual addresses, physical addresses, TLB contents,
code addresses, and interrupts may have an impact that needs to be
investigated */
/* The instructions that GCC puts unnecessarily in the timed section are annoying AF */
/* This code is written to run on Intel processors! */
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
int array[ 100 ];
/* this is optional */
/* will bring array in the cache */
for ( int i = 0; i < 100; i++ )
array[ i ] = i;
printf( "address = %p n", &array[ 0 ] ); /* guaranteed to be aligned within a single cache line */
_mm_mfence();                      /* prevent clflush from being reordered by the CPU or the compiler in this direction */
/* flush the line containing the element */
_mm_clflush( &array[ 0 ] );
//unsigned int aux;
uint64_t time1, time2, msl, hsl, osl; /* initial values don't matter */
/* You can generally use rdtsc or rdtscp.
See: https://stackoverflow.com/questions/59759596/is-there-any-difference-in-between-rdtsc-lfence-rdtsc-and-rdtsc-rdtscp
I AM NOT SURE THOUGH THAT THE SERIALIZATION PROERTIES OF
RDTSCP ARE APPLICABLE AT THE COMPILER LEVEL WHEN USING THE
__RDTSCP INTRINSIC. THIS IS TRUE FOR PURE FENCES SUCH AS LFENCE. */
_mm_mfence();                      /* this properly orders both clflush and rdtsc*/
_mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
time1 = __rdtsc();                 /* set timer */
_mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
int temp = array[ 0 ];             /* array[0] is a cache miss */
/* measring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
/* no need for mfence because there are no stores in between */
_mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load*/
time2 = __rdtsc();
_mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
msl = time2 - time1;
printf( "array[ 0 ] = %i n", temp );             /* prevent the compiler from optimizing the load */
printf( "miss section latency = %lu n", msl );   /* the latency of everything in between the two rdtsc */
_mm_mfence();                      /* this properly orders both clflush and rdtsc*/
_mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
time1 = __rdtsc();                 /* set timer */
_mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
temp = array[ 0 ];                 /* array[0] is a cache hit as long as the OS, a hardware prefetcher, or a speculative accesses to the L1D or lower level inclusive caches don't evict it */
/* measring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
/* no need for mfence because there are no stores in between */
_mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load */
time2 = __rdtsc();
_mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
hsl = time2 - time1;
printf( "array[ 0 ] = %i n", temp );            /* prevent the compiler from optimizing the load */
printf( "hit section latency = %lu n", hsl );   /* the latency of everything in between the two rdtsc */

_mm_mfence();                      /* this properly orders both clflush and rdtsc */
_mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
time1 = __rdtsc();                 /* set timer */
_mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
/* no need for mfence because there are no stores in between */
_mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
time2 = __rdtsc();
_mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
osl = time2 - time1;
printf( "overhead latency = %lu n", osl ); /* the latency of everything in between the two rdtsc */

printf( "Measured L1 hit latency = %lu TSC cyclesn", hsl - osl ); /* hsl is always larger than osl */
printf( "Measured main memory latency = %lu TSC cyclesn", msl - osl ); /* msl is always larger than osl and hsl */
return 0;
}

强烈建议:使用时间戳计数器测量内存延迟。

相关: 如何在实践中创建一个幽灵小工具?

你知道你可以用cpuid查询行长,对吧? 如果您确实想以编程方式查找它,请执行此操作。 (否则,假设它是 64 个字节,因为它在 PIII 之后的所有内容上。

但是,如果出于任何原因想使用来自 C 的clflushclflushopt,请使用void _mm_clflush(void const *p)void _mm_clflushopt(void const *p),来自#include <immintrin.h>. (请参阅英特尔的 insn 集 ref 手册条目,了解clflushclflushopt(。

GCC、clang、ICC 和 MSVC 都支持英特尔的<immintrin.h>内联函数。


您也可以通过搜索英特尔的内部函数指南来查找clflush以查找该指令的内部函数的定义。

另请参阅 https://stackoverflow.com/tags/x86/info 以获取指南、文档和参考手册的更多链接。


不仅如此,我如何确定该行被逐出以验证我的代码的正确性?

查看编译器的 asm 输出,或在调试器中单步执行。 如果/当clflush执行时,该缓存行将在程序中的该点逐出。

最新更新