Intel x86: why do aligned and unaligned accesses have the same performance?



According to the Intel CPU manual (Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, section 8.1.1): "misaligned data accesses will seriously impact the performance of the processor". So I wrote a test to demonstrate this, but the result is that aligned and misaligned data accesses show the same performance. Why? Can anyone help? My code is shown below:

#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <string.h>
using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage: ./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    cout << "offset = " << offset << endl;
    const uint64_t BUFFER_SIZE = 800000000;
    uint8_t* data_ptr = new uint8_t[BUFFER_SIZE];
    if (data_ptr == nullptr) {
        cout << "apply for memory failed" << endl;
        return 0;
    }
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 300;
    cout << "start" << endl;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        for (uint64_t j = offset; j <= BUFFER_SIZE - 8; j += 8) { // aligned: offset = 0, misaligned: offset = 1-7
            volatile auto tmp = *(uint64_t*)&data_ptr[j]; // read from memory
            // mov rax,QWORD PTR [rbx+rdx*1]   // rbx+rdx*1 = 0x7fffc76fe019
            // mov QWORD PTR [rsp+0x8],rax
            ++tmp;
            // mov rcx,QWORD PTR [rsp+0x8]
            // add rcx,0x1
            // mov QWORD PTR [rsp+0x8],rcx
            *(uint64_t*)&data_ptr[j] = tmp; // write to memory
            // mov QWORD PTR [rbx+rdx*1],rcx
        }
    }
    auto end = get_time_ns();
    cout << "time elapse " << end - start << "ns" << endl;
    return 0;
}

Results:

offset = 0
start
time elapse 32991486013ns
offset = 1
start
time elapse 34089866539ns
offset = 2
start
time elapse 34011790606ns
offset = 3
start
time elapse 34097518021ns
offset = 4
start
time elapse 34166815472ns
offset = 5
start
time elapse 34081477780ns
offset = 6
start
time elapse 34158804869ns
offset = 7
start
time elapse 34163037004ns

On most modern x86 cores, aligned and unaligned accesses perform the same only when the access does not cross a specific internal boundary.

The exact size of that internal boundary varies by the core architecture of the CPU in question, but on Intel CPUs of the last decade the relevant boundary is the 64-byte cache line. That is, accesses which fall entirely within a 64-byte cache line perform the same regardless of whether they are aligned or not.

If an access crosses a cache-line boundary on an Intel chip, however (which necessarily means it is misaligned), it suffers a penalty of roughly 2x in both latency and throughput. The bottom-line impact of this penalty depends on the surrounding code, will often be less than 2x, and is sometimes close to zero. This modest penalty can get much larger if a 4K page boundary is also crossed.

Aligned accesses never cross these boundaries, so they can never suffer this penalty.
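
To make the boundary condition concrete, here is a minimal sketch (my own illustration, not from the manual; the helper name crosses_cache_line and the 64-byte constant are assumptions matching the Intel line size discussed above) of how to test whether a given access straddles a line:

#include <stdint.h>
#include <stddef.h>
#include <iostream>
using namespace std;

// An access of `size` bytes starting at `addr` straddles a 64-byte cache
// line exactly when its first and last bytes fall into different lines.
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    const uintptr_t LINE = 64; // line size on recent Intel CPUs
    return (addr / LINE) != ((addr + size - 1) / LINE);
}

int main()
{
    // An 8-byte access at offset 60 covers bytes 60..67 and crosses the
    // boundary at byte 64; bytes 56..63 fit entirely within one line.
    cout << crosses_cache_line(60, 8) << endl; // prints 1 (split)
    cout << crosses_cache_line(56, 8) << endl; // prints 0 (no split)
    return 0;
}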

The broad picture is similar on AMD chips, although some recent chips have a relevant boundary smaller than 64 bytes, and the boundary differs between loads and stores.

I have included more details in the load throughput and store throughput sections of a blog post I wrote.

Testing it

Your test was unable to show the effect, for several reasons:

  • The test never allocated aligned memory, and you can't reliably cross a cache line by applying an offset to a region of unknown alignment (see the sketch after this list)
  • You iterated 8 bytes at a time, so most of the writes (7 out of 8) would fall entirely within a cache line and pay no penalty, leaving only a small signal that could be detected only if the rest of the benchmark were very clean
  • You used a large buffer size which doesn't fit in any level of the cache. The split-line effect is only fairly obvious at L1, or when the splits imply bringing in twice the number of lines (e.g., random access). Since you access every line linearly in either case, throughput from DRAM to the core limits you regardless of splits: the split writes have plenty of time to complete while waiting for main memory
  • Using the local volatile auto tmp and tmp++ creates a volatile on the stack and a lot of loads and stores to preserve volatile semantics: these are all aligned and will swamp the effect you were trying to measure with your test
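
On the first point, here is a minimal sketch of allocating memory with known alignment (assuming a C++17/glibc environment where aligned_alloc is available; the buffer size is illustrative), so that an offset actually determines the position within a cache line:

#include <stdint.h>
#include <stdlib.h>
#include <iostream>
using namespace std;

int main()
{
    const size_t BUFFER_SIZE = 1 << 20;
    // aligned_alloc requires the size to be a multiple of the alignment.
    uint8_t* buf = (uint8_t*)aligned_alloc(64, BUFFER_SIZE);
    if (buf == nullptr) {
        return 1;
    }
    // buf is now 64-byte aligned, so buf + offset has a known position
    // within a cache line: offsets 57-63 reliably split an 8-byte access.
    cout << "buf % 64 = " << ((uintptr_t)buf % 64) << endl; // prints 0
    free(buf);
    return 0;
}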

Here is my modification of your test, which operates only in an L1-sized region and which advances 64 bytes at a time, so that every store will be split if any of them are:

#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <chrono>
#include <string.h>
#include <iomanip>
using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage: ./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    const uint64_t BUFFER_SIZE = 10000;
    alignas(64) uint8_t data_ptr[BUFFER_SIZE];
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 1000000;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        uint64_t src = rand();
        for (uint64_t j = offset; j + 64 <= BUFFER_SIZE; j += 64) { // offset selects the position within each 64-byte line
            memcpy(data_ptr + j, &src, 8);
        }
    }
    auto end = get_time_ns();
    cout << "time elapsed " << std::setprecision(2) << (end - start) / ((double)LOOP_CNT * BUFFER_SIZE / 64) <<
        "ns per write (rand:" << (int)data_ptr[rand() % BUFFER_SIZE] << ")" << endl;
    return 0;
}

Running this for all alignments from 0 to 64, I get:

$ g++ test.cpp -O2 && for off in {0..64}; do printf "%2d :" $off && ./a.out $off; done
0 :time elapsed 0.56ns per write (rand:0)
1 :time elapsed 0.57ns per write (rand:0)
2 :time elapsed 0.57ns per write (rand:0)
3 :time elapsed 0.56ns per write (rand:0)
4 :time elapsed 0.56ns per write (rand:0)
5 :time elapsed 0.56ns per write (rand:0)
6 :time elapsed 0.57ns per write (rand:0)
7 :time elapsed 0.56ns per write (rand:0)
8 :time elapsed 0.57ns per write (rand:0)
9 :time elapsed 0.57ns per write (rand:0)
10 :time elapsed 0.57ns per write (rand:0)
11 :time elapsed 0.56ns per write (rand:0)
12 :time elapsed 0.56ns per write (rand:0)
13 :time elapsed 0.56ns per write (rand:0)
14 :time elapsed 0.56ns per write (rand:0)
15 :time elapsed 0.57ns per write (rand:0)
16 :time elapsed 0.56ns per write (rand:0)
17 :time elapsed 0.56ns per write (rand:0)
18 :time elapsed 0.56ns per write (rand:0)
19 :time elapsed 0.56ns per write (rand:0)
20 :time elapsed 0.56ns per write (rand:0)
21 :time elapsed 0.56ns per write (rand:0)
22 :time elapsed 0.56ns per write (rand:0)
23 :time elapsed 0.56ns per write (rand:0)
24 :time elapsed 0.56ns per write (rand:0)
25 :time elapsed 0.56ns per write (rand:0)
26 :time elapsed 0.56ns per write (rand:0)
27 :time elapsed 0.56ns per write (rand:0)
28 :time elapsed 0.57ns per write (rand:0)
29 :time elapsed 0.56ns per write (rand:0)
30 :time elapsed 0.57ns per write (rand:25)
31 :time elapsed 0.56ns per write (rand:151)
32 :time elapsed 0.56ns per write (rand:123)
33 :time elapsed 0.56ns per write (rand:29)
34 :time elapsed 0.55ns per write (rand:0)
35 :time elapsed 0.56ns per write (rand:0)
36 :time elapsed 0.57ns per write (rand:0)
37 :time elapsed 0.56ns per write (rand:0)
38 :time elapsed 0.56ns per write (rand:0)
39 :time elapsed 0.56ns per write (rand:0)
40 :time elapsed 0.56ns per write (rand:0)
41 :time elapsed 0.56ns per write (rand:0)
42 :time elapsed 0.57ns per write (rand:0)
43 :time elapsed 0.56ns per write (rand:0)
44 :time elapsed 0.56ns per write (rand:0)
45 :time elapsed 0.56ns per write (rand:0)
46 :time elapsed 0.57ns per write (rand:0)
47 :time elapsed 0.57ns per write (rand:0)
48 :time elapsed 0.56ns per write (rand:0)
49 :time elapsed 0.56ns per write (rand:0)
50 :time elapsed 0.57ns per write (rand:0)
51 :time elapsed 0.56ns per write (rand:0)
52 :time elapsed 0.56ns per write (rand:0)
53 :time elapsed 0.56ns per write (rand:0)
54 :time elapsed 0.55ns per write (rand:0)
55 :time elapsed 0.56ns per write (rand:0)
56 :time elapsed 0.56ns per write (rand:0)
57 :time elapsed 1.1ns per write (rand:0)
58 :time elapsed 1.1ns per write (rand:0)
59 :time elapsed 1.1ns per write (rand:0)
60 :time elapsed 1.1ns per write (rand:0)
61 :time elapsed 1.1ns per write (rand:0)
62 :time elapsed 1.1ns per write (rand:0)
63 :time elapsed 1ns per write (rand:0)
64 :time elapsed 0.56ns per write (rand:0)

Note that offsets 57 through 63 all take roughly 2x longer per write, and those are exactly the offsets at which an 8-byte write crosses a 64-byte (cache line) boundary.
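
As a sanity check on that range (my own arithmetic, not part of the benchmark): an 8-byte store starting at byte o of a line occupies bytes o..o+7, so it straddles the 64-byte boundary exactly when (o % 64) + 8 > 64, i.e., when o % 64 is 57 through 63:

#include <iostream>
using namespace std;

int main()
{
    // Enumerate the offsets the benchmark used and report which ones
    // should split: an 8-byte store at offset o spans bytes o..o+7.
    for (int o = 0; o <= 64; ++o) {
        if ((o % 64) + 8 > 64) {
            cout << "offset " << o << " splits a cache line" << endl;
        }
    }
    return 0; // prints offsets 57 through 63, matching the ~2x rows above
}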
