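I want to implement a two-thread model in which one thread counts (incrementing a value in an infinite loop) while the other records the counter, does some work, records the counter a second time, and measures the time elapsed between the two readings.
This is what I have done so far: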
// global counter
register unsigned long counter asm("r13");
// unsigned long counter;
void* counter_thread(){
    // affinity is set to some isolated CPU so the noise will be minimal
    while(1){
        //counter++; // Line 1*
        asm volatile("add $1, %0" : "+r"(counter) : ); // Line 2*
    }
}
void* measurement_thread(){
    // affinity is set somewhere over here
    unsigned long meas = 0;
    unsigned long a = 5;
    unsigned long r1,r2;
    sleep(1.0);
    while(1){
        mfence();
        r1 = counter;
        a *=3; // dummy operation that I want to measure
        r2 = counter;
        mfence();
        meas = r2-r1;
        printf("counter:%ld \n", counter);
        break;
    }
}
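Let me explain what I have done so far:
Since I want the counter to be accurate, I set its affinity to an isolated CPU. Also, if I use the counter as in Line 1*, the disassembled function is: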
d4c: 4c 89 e8 mov %r13,%rax
d4f: 48 83 c0 01 add $0x1,%rax
d53: 49 89 c5 mov %rax,%r13
d56: eb f4 jmp d4c <counter_thread+0x37>
This is not a 1-cycle operation. That is why I used inline assembly to get rid of the 2 mov instructions. With inline assembly:
d4c: 49 83 c5 01 add $0x1,%r13
d50: eb fa jmp d4c <counter_thread+0x37>
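The problem is that neither implementation works: the other thread cannot see the counter being updated. If I make the global counter a normal variable instead of a register, it works, but I want to be precise. If I declare the global counter as unsigned long counter, the disassembled code of the counter thread is: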
d4c: 48 8b 05 ed 12 20 00 mov 0x2012ed(%rip),%rax # 202040 <counter>
d53: 48 83 c0 01 add $0x1,%rax
d57: 48 89 05 e2 12 20 00 mov %rax,0x2012e2(%rip) # 202040 <counter>
d5e: eb ec jmp d4c <counter_thread+0x37>
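It works, but it doesn't give me the granularity I want.
EDIT:
My environment:
- CPU: AMD Ryzen 3600
- Kernel: 5.0.0-32-generic
- OS: Ubuntu 18.04
EDIT 2: I isolated 2 neighboring CPU cores (cores 10 and 11) and ran the experiment on them. The counter runs on one core and the measurement on the other. The isolation is done by adding an isolcpus line to the /etc/default/grub file.
EDIT 3: I know that one measurement is not enough. I have run the experiment 10 million times and looked at the results.
Experiment 1: Setup: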
unsigned long counter =0;//global counter
void* counter_thread(){
    mfence();
    while(1)
        counter++;
}
void* measurement_thread(){
    unsigned long i=0, r1=0,r2=0;
    unsigned int a=0;
    sleep(1.0);
    while(1){
        mfence();
        r1 = counter;
        a +=3;
        r2 = counter;
        mfence();
        measurements[r2-r1]++;
        i++;
        if(i == MILLION_ITER)
            break;
    }
}
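Result 1: In 99.99% of the cases I got 0, which I expected, because either the first thread is not running or the OS or other interrupts disturb the measurement. Discarding the 0s and the very high values gives me an average of 20 cycles per measurement. (I was expecting 3-4, since I am only doing an integer addition.)
Experiment 2:
Setup: Same as above, with one difference: instead of a global counter in memory, I use the counter as a register:
register unsigned long counter asm("r13");
Result 2: The measurement thread always reads 0. In the disassembled code I can see that both threads work with the R13 register (counter), but I believe it is somehow not shared.
Experiment 3:
Setup: Identical to Setup 2, except that in the counter thread I use inline assembly instead of counter++ to make sure I am doing a 1-cycle operation. My disassembled file looks like this: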
cd1: 49 83 c5 01 add $0x1,%r13
cd5: eb fa jmp cd1 <counter_thread+0x37>
Result 3: The measurement thread reads 0, as above.
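Each thread has its own registers. Each logical CPU core has its own architectural registers, which a thread uses while it is running on that core. Only signal handlers (or, on bare metal, interrupts) can modify their thread's registers.
Declaring a GNU C asm register global like your ... asm("r13") in a multi-threaded program effectively gives you thread-local storage, not a truly shared global.
Only memory is shared between threads, not registers. That is how multiple threads can run at the same time without stepping on each other: each one uses its own registers.
Registers that you don't declare as register globals can be used freely by the compiler, so sharing them between cores couldn't work at all. (And there is nothing GCC could do to make registers shared vs. private depending on how you declare them.)
Even apart from that, the register global isn't volatile or atomic, so r1 = counter; and r2 = counter; can CSE (common-subexpression elimination), making r2-r1 a compile-time constant zero even if your local R13 were changing from a signal handler.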
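A minimal sketch of the memory-based alternative, not the asker's code: the counter is a C11 _Atomic variable in memory, so a second thread can actually observe it changing. The names shared_counter, stop_flag, count_until_stopped and observe_counter_progress are made up for illustration, and the sketch assumes POSIX threads (build with -pthread).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Shared state lives in memory, so every thread sees the same object. */
static _Atomic uint64_t shared_counter;
static atomic_bool stop_flag;

/* Counter thread: it is the only writer, so a relaxed load followed by a
 * relaxed store is enough and no locked read-modify-write is needed. */
static void *count_until_stopped(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
        uint64_t v = atomic_load_explicit(&shared_counter, memory_order_relaxed);
        atomic_store_explicit(&shared_counter, v + 1, memory_order_relaxed);
    }
    return NULL;
}

/* Start the counter thread, spin until the counter is seen to move past its
 * starting value, stop the thread, and return how far it had advanced.
 * Returns 0 only if the thread could not be created. */
static uint64_t observe_counter_progress(void)
{
    pthread_t tid;
    atomic_store(&stop_flag, false);
    uint64_t first = atomic_load_explicit(&shared_counter, memory_order_relaxed);
    if (pthread_create(&tid, NULL, count_until_stopped, NULL) != 0)
        return 0;

    uint64_t second;
    do {
        /* Atomic loads are re-read on every iteration, unlike a plain
         * non-volatile variable that the compiler could read just once. */
        second = atomic_load_explicit(&shared_counter, memory_order_relaxed);
    } while (second == first);

    atomic_store(&stop_flag, true);
    pthread_join(tid, NULL);
    return second - first;
}
```

This only demonstrates visibility. Each read from the other core still has to fetch the cache line from the writer, so it does not provide cycle-level precision.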
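How can I make sure that both threads use registers for read/write operations on the counter value?
You can't. There is no shared state between cores that can be read or written with lower latency than cache.
If you want to time something, consider using rdtsc to get reference cycles, or rdpmc to read a performance counter (which you might have set up to count core clock cycles).
Using another thread to increment a counter is unnecessary and doesn't help, because there is no very-low-overhead way to read something from another core.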
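The rdtscp instruction on my machine gives at best a 36-72-108... cycle resolution. So I cannot distinguish between 2 cycles and 35 cycles, because both of them will give 36 cycles.
Then you're using rdtsc wrong. It isn't serializing, so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc is only somewhat lower overhead.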
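A minimal sketch of fencing the timed region, assuming GCC or Clang on x86-64 and the __rdtsc and _mm_lfence intrinsics from <x86intrin.h>. The helper names tsc_fenced and measure_overhead are hypothetical. It also assumes lfence waits for earlier instructions to finish before later ones start; on AMD that depends on a setting that recent Linux kernels are expected to enable.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Read the time-stamp counter with lfence on both sides, so the read is not
 * reordered with the instructions before or after it. */
static inline uint64_t tsc_fenced(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Time an empty region to estimate the cost of the measurement itself. */
static uint64_t measure_overhead(void)
{
    uint64_t start = tsc_fenced();
    uint64_t end = tsc_fenced();
    return end - start;
}
```

The value returned by measure_overhead() is in TSC reference ticks, not core clock cycles, and can be subtracted from later measurements as a rough baseline.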
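But more importantly, you can't usefully measure a *=3; in C in terms of a single cost in cycles. First of all, it can compile differently depending on context.
But assuming a normal lea eax, [rax + rax*2], a realistic instruction cost model has 3 dimensions: uop count (front end), back-end port pressure, and latency from input(s) to output. https://agner.org/optimize/
See my answer on RDTSCP in NASM always returns the same value for more about timing a single instruction. Put it in a loop in different ways to measure throughput vs. latency, and look at perf counters to get uops->ports. Or look at Agner Fog's instruction tables and https://uops.info/, because people have already done those tests.
Also
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
- Modern x86 cost model
Again, these are how you time a single asm instruction, not a C statement. With optimization enabled, the cost of a C statement can depend on how it optimizes into the surrounding code (and/or on whether the latency of surrounding operations hides its cost, on an out-of-order execution CPU like all modern x86 CPUs).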
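The loop approach can be sketched like this, assuming GCC or Clang on x86-64 and a build with optimization enabled (for example -O2). ticks_per_dependent_mul3 is an illustrative name, not something from the linked answers. Each iteration depends on the previous one, so the average approximates the latency of the multiply-by-3 rather than its throughput.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Average TSC ticks per iteration of a dependent chain of a *= 3. */
static double ticks_per_dependent_mul3(uint64_t iterations)
{
    uint64_t a = 5;

    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();

    for (uint64_t i = 0; i < iterations; i++) {
        a *= 3;
        /* Empty asm statement: the compiler must treat `a` as read and
         * modified here, so it cannot fold or remove the multiplications. */
        __asm__ volatile("" : "+r"(a));
    }

    _mm_lfence();
    uint64_t end = __rdtsc();
    _mm_lfence();

    return (double)(end - start) / (double)iterations;
}
```

Calling ticks_per_dependent_mul3(1000000) amortizes the rdtsc and lfence overhead over a million iterations. The result is in TSC reference ticks, which differ from core clock cycles whenever the core is not running at the TSC frequency.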
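Then you're using rdtsc wrong. It isn't serializing, so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc is only somewhat lower overhead.
OK. I did my homework.
First things first. I knew that rdtscp is a serializing instruction. I am not talking about rdtsc; there is a letter P at the end.
I have checked both the Intel and AMD manuals for that.
- Intel manual, page 83, Table 2-3. Summary of System Instructions
- AMD manual, pages 403-406
Correct me if I'm wrong, but from what I read, I understand that I don't need fence instructions before and after rdtscp, because it is a serializing instruction, right?
The second thing is that I did run some experiments on 3 of my machines. Here are the results.
Ryzen experiments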
======================= AMD RYZEN EXPERIMENTS =========================
RYZEN 3600
100_000 iteration
Using a *=3
Note that almost all sums are divisible by 36, which is my machine's timer resolution.
I also checked the cases where the sums are not divisible by 36.
These are the cases where I don't use fence instructions with rdtsc.
It turns out that the read value is then either 35 or 1,
which I believe means the instruction (rdtsc) cannot read the value correctly.
Mfenced rtdscP reads:
Sum: 25884432
Avg: 258
Sum, removed outliers: 25800120
Avg, removed outliers: 258
Mfenced rtdsc reads:
Sum: 17579196
Avg: 175
Sum, removed outliers: 17577684
Avg, removed outliers: 175
Lfenced rtdscP reads:
Sum: 7511688
Avg: 75
Sum, removed outliers: 7501608
Avg, removed outliers: 75
Lfenced rtdsc reads:
Sum: 7024428
Avg: 70
Sum, removed outliers: 7015248
Avg, removed outliers: 70
NOT fenced rtdscP reads:
Sum: 6024888
Avg: 60
Sum, removed outliers: 6024888
Avg, removed outliers: 60
NOT fenced rtdsc reads:
Sum: 3274866
Avg: 32
Sum, removed outliers: 3232913
Avg, removed outliers: 35
======================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 36217404
Avg: 362
Sum, removed outliers: 36097164
Avg, removed outliers: 361
Mfenced rtdsc reads:
Sum: 22973400
Avg: 229
Sum, removed outliers: 22939236
Avg, removed outliers: 229
Lfenced rtdscP reads:
Sum: 13178196
Avg: 131
Sum, removed outliers: 13177872
Avg, removed outliers: 131
Lfenced rtdsc reads:
Sum: 12631932
Avg: 126
Sum, removed outliers: 12631932
Avg, removed outliers: 126
NOT fenced rtdscP reads:
Sum: 12115548
Avg: 121
Sum, removed outliers: 12103236
Avg, removed outliers: 121
NOT fenced rtdsc reads:
Sum: 3335997
Avg: 33
Sum, removed outliers: 3305333
Avg, removed outliers: 35
=================== END OF AMD RYZEN EXPERIMENTS =========================
Here are the experiments on the Bulldozer architecture.
======================= AMD BULLDOZER EXPERIMENTS =========================
AMD A6-4455M
100_000 iteration
Using a *=3;
Mfenced rtdscP reads:
Sum: 32120355
Avg: 321
Sum, removed outliers: 27718117
Avg, removed outliers: 278
Mfenced rtdsc reads:
Sum: 23739715
Avg: 237
Sum, removed outliers: 23013028
Avg, removed outliers: 230
Lfenced rtdscP reads:
Sum: 14274916
Avg: 142
Sum, removed outliers: 13026199
Avg, removed outliers: 131
Lfenced rtdsc reads:
Sum: 11083963
Avg: 110
Sum, removed outliers: 10905271
Avg, removed outliers: 109
NOT fenced rtdscP reads:
Sum: 9361738
Avg: 93
Sum, removed outliers: 8993886
Avg, removed outliers: 90
NOT fenced rtdsc reads:
Sum: 4766349
Avg: 47
Sum, removed outliers: 4310312
Avg, removed outliers: 43
=================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 38748536
Avg: 387
Sum, removed outliers: 36719312
Avg, removed outliers: 368
Mfenced rtdsc reads:
Sum: 35106459
Avg: 351
Sum, removed outliers: 33514331
Avg, removed outliers: 335
Lfenced rtdscP reads:
Sum: 23867349
Avg: 238
Sum, removed outliers: 23203849
Avg, removed outliers: 232
Lfenced rtdsc reads:
Sum: 21991975
Avg: 219
Sum, removed outliers: 21394828
Avg, removed outliers: 215
NOT fenced rtdscP reads:
Sum: 19790942
Avg: 197
Sum, removed outliers: 19701909
Avg, removed outliers: 197
NOT fenced rtdsc reads:
Sum: 10841074
Avg: 108
Sum, removed outliers: 10583085
Avg, removed outliers: 106
=================== END OF AMD BULLDOZER EXPERIMENTS =========================
The Intel results are:
======================= INTEL EXPERIMENTS =========================
INTEL 4710HQ
100_000 iteration
Using a *=3
Mfenced rtdscP reads:
Sum: 10914893
Avg: 109
Sum, removed outliers: 10820879
Avg, removed outliers: 108
Mfenced rtdsc reads:
Sum: 7866322
Avg: 78
Sum, removed outliers: 7606613
Avg, removed outliers: 76
Lfenced rtdscP reads:
Sum: 4823705
Avg: 48
Sum, removed outliers: 4783842
Avg, removed outliers: 47
Lfenced rtdsc reads:
Sum: 3634106
Avg: 36
Sum, removed outliers: 3463079
Avg, removed outliers: 34
NOT fenced rtdscP reads:
Sum: 2216884
Avg: 22
Sum, removed outliers: 1435830
Avg, removed outliers: 17
NOT fenced rtdsc reads:
Sum: 1736640
Avg: 17
Sum, removed outliers: 986250
Avg, removed outliers: 12
===================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 22008705
Avg: 220
Sum, removed outliers: 16097871
Avg, removed outliers: 177
Mfenced rtdsc reads:
Sum: 13086713
Avg: 130
Sum, removed outliers: 12627094
Avg, removed outliers: 126
Lfenced rtdscP reads:
Sum: 9882409
Avg: 98
Sum, removed outliers: 9753927
Avg, removed outliers: 97
Lfenced rtdsc reads:
Sum: 8854943
Avg: 88
Sum, removed outliers: 8435847
Avg, removed outliers: 84
NOT fenced rtdscP reads:
Sum: 7302577
Avg: 73
Sum, removed outliers: 7190424
Avg, removed outliers: 71
NOT fenced rtdsc reads:
Sum: 1726126
Avg: 17
Sum, removed outliers: 1029630
Avg, removed outliers: 12
=================== END OF INTEL EXPERIMENTS =========================
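From my point of view, the AMD Ryzen should have executed faster. My Intel CPU is almost 5 years old, while the AMD CPU is brand new.
I couldn't find the exact source, but I have read that AMD changed/decreased the resolution of the rdtsc and rdtscp instructions when it updated the architecture from Bulldozer to Ryzen. That is why I get results that are multiples of 36 when I try to measure the timing of code. I don't know why they did that, or where I found the information, but that is the case. If you have an AMD Ryzen machine, I suggest you run the experiments and look at the timer outputs.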
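One way to check a suspected timer granularity such as the 36 above is to take the greatest common divisor of the observed nonzero deltas. This is a sketch with hypothetical helper names (gcd_u64, tick_granularity), not code from the experiments.

```c
#include <stddef.h>
#include <stdint.h>

/* Euclid's algorithm; gcd_u64(0, x) returns x. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Greatest common divisor of all nonzero deltas, or 0 if there are none.
 * If every delta is a multiple of some step, the result is a multiple of
 * that step as well. */
static uint64_t tick_granularity(const uint64_t *deltas, size_t n)
{
    uint64_t g = 0;
    for (size_t i = 0; i < n; i++) {
        if (deltas[i] != 0)
            g = gcd_u64(g, deltas[i]);
    }
    return g;
}
```

For the deltas {36, 72, 108, 36} this yields 36. A single odd reading such as 35 collapses the result to 1, so outliers would have to be filtered out before drawing conclusions.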
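I haven't looked at rdpmc yet; I will try to update this when I have read about it.
EDIT:
Following up on the comments below.
About warming up: all the experiments are in a single C program. So even if they are not warmed up in the mfenced rdtscp case (the first experiment), they are certainly warmed up later.
I am using a mixture of C and inline assembly. I just use gcc main.c -o main to compile the code. AFAIK, that compiles with -O0 optimization. GCC is version 7.4.0.
To reduce the time even further, I declared my functions as #define macros so that they are not called as functions, which means faster execution.
Example code showing how I did the experiments: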
#define lfence() asm volatile("lfence\n");
#define mfence() asm volatile("mfence\n");
// reading the low end is enough for the measurement because I don't measure too complex result.
// For complex measurements, I need to shift and OR
#define rdtscp(_readval) asm volatile("rdtscp\n": "=a"(_readval)::"rcx", "rdx");
void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2-read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
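The "shift and OR" mentioned in the comment could look like the following sketch, which assumes GCC or Clang inline assembly on x86-64. rdtscp64 is a made-up name. The instruction returns the low half of the counter in EAX, the high half in EDX, and the IA32_TSC_AUX value in ECX.

```c
#include <stddef.h>
#include <stdint.h>

/* Full 64-bit time-stamp counter via rdtscp.
 * If aux is not NULL, it receives the IA32_TSC_AUX value from ECX. */
static inline uint64_t rdtscp64(uint32_t *aux)
{
    uint32_t lo, hi, cpu;
    /* All three registers written by rdtscp are declared as outputs. */
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(cpu));
    if (aux != NULL)
        *aux = cpu;
    /* Shift the high half into place and OR in the low half. */
    return ((uint64_t)hi << 32) | lo;
}
```

Using the full 64 bits avoids wrong differences when the low 32 bits wrap around between the two reads.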
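EDIT 2:
Why are you using mfence?
I wasn't using mfence in the first place. I was just using rdtscp, doing the work, and using rdtscp again to find the difference.
Not sure what you hope to learn here from cycle-accurate timing of anti-optimized gcc -O0 output.
I am not using any optimization because I want to measure how many cycles an instruction takes to finish. I will be measuring code blocks that include branches. If I enable optimization, the compiler may change a branch into a condmove, which would defeat the whole point of the measurement.
I wouldn't be surprised if the non-inline function call and other memory accesses (from disabling optimization, /facepalm) being mfenced is what makes it a multiple of 36 on your Ryzen.
Also, below is the disassembled version of the code. During the measurement there are no memory accesses (except for read1 and read2, which I believe are in the cache) and no calls to other functions.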
9fd: 0f ae f0 mfence
a00: 0f 01 f9 rdtscp
a03: 48 89 05 36 16 20 00 mov %rax,0x201636(%rip) # 202040 <read1>
a0a: 0f ae f0 mfence
a0d: 8b 05 15 16 20 00 mov 0x201615(%rip),%eax # 202028 <a21>
a13: 83 c0 03 add $0x3,%eax #Either this or division operations for measurement
a16: 89 05 0c 16 20 00 mov %eax,0x20160c(%rip) # 202028 <a21>
a1c: 0f ae f0 mfence
a1f: 0f 01 f9 rdtscp
a22: 48 89 05 0f 16 20 00 mov %rax,0x20160f(%rip) # 202038 <read2>
a29: 0f ae f0 mfence
a2c: 48 8b 15 05 16 20 00 mov 0x201605(%rip),%rdx # 202038 <read2>
a33: 48 8b 05 06 16 20 00 mov 0x201606(%rip),%rax # 202040 <read1>
a3a: 48 29 c2 sub %rax,%rdx
a3d: 8b 85 ec ca f3 ff mov -0xc3514(%rbp),%eax
Code:
register unsigned long a21 asm("r13");
#define calculation_to_measure(){\
    a21 +=3;\
}
#define initvars(){\
    read1 = 0;\
    read2 = 0;\
    a21= 21;\
}
// =========== RDTSCP, double mfence ================
// Reference code, others are similar
void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2-read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
Results. I only ran these on the AMD Ryzen machine.
Using gcc main.c -O0 -o rdtsc, no optimization. It moves r13 to rax.
Disassembled code:
9ac: 0f ae f0 mfence
9af: 0f 01 f9 rdtscp
9b2: 48 89 05 7f 16 20 00 mov %rax,0x20167f(%rip) # 202038 <read1>
9b9: 0f ae f0 mfence
9bc: 4c 89 e8 mov %r13,%rax
9bf: 48 83 c0 03 add $0x3,%rax
9c3: 49 89 c5 mov %rax,%r13
9c6: 0f ae f0 mfence
9c9: 0f 01 f9 rdtscp
9cc: 48 89 05 5d 16 20 00 mov %rax,0x20165d(%rip) # 202030 <read2>
9d3: 0f ae f0 mfence
Results:
Mfenced rtdscP reads:
Sum: 32846796
Avg: 328
Sum, removed outliers: 32626008
Avg, removed outliers: 327
Mfenced rtdsc reads:
Sum: 18235980
Avg: 182
Sum, removed outliers: 18108180
Avg, removed outliers: 181
Lfenced rtdscP reads:
Sum: 14351508
Avg: 143
Sum, removed outliers: 14238432
Avg, removed outliers: 142
Lfenced rtdsc reads:
Sum: 11179368
Avg: 111
Sum, removed outliers: 10994400
Avg, removed outliers: 115
NOT fenced rtdscP reads:
Sum: 6064488
Avg: 60
Sum, removed outliers: 6064488
Avg, removed outliers: 60
NOT fenced rtdsc reads:
Sum: 3306394
Avg: 33
Sum, removed outliers: 3278450
Avg, removed outliers: 35
Using gcc main.c -Og -o rdtsc_global
Disassembled code:
934: 0f ae f0 mfence
937: 0f 01 f9 rdtscp
93a: 48 89 05 f7 16 20 00 mov %rax,0x2016f7(%rip) # 202038 <read1>
941: 0f ae f0 mfence
944: 49 83 c5 03 add $0x3,%r13
948: 0f ae f0 mfence
94b: 0f 01 f9 rdtscp
94e: 48 89 05 db 16 20 00 mov %rax,0x2016db(%rip) # 202030 <read2>
955: 0f ae f0 mfence
Results:
Mfenced rtdscP reads:
Sum: 22819428
Avg: 228
Sum, removed outliers: 22796064
Avg, removed outliers: 227
Mfenced rtdsc reads:
Sum: 20630736
Avg: 206
Sum, removed outliers: 19937664
Avg, removed outliers: 199
Lfenced rtdscP reads:
Sum: 13375008
Avg: 133
Sum, removed outliers: 13374144
Avg, removed outliers: 133
Lfenced rtdsc reads:
Sum: 9840312
Avg: 98
Sum, removed outliers: 9774036
Avg, removed outliers: 97
NOT fenced rtdscP reads:
Sum: 8784684
Avg: 87
Sum, removed outliers: 8779932
Avg, removed outliers: 87
NOT fenced rtdsc reads:
Sum: 3274209
Avg: 32
Sum, removed outliers: 3255480
Avg, removed outliers: 36
Using -O1 optimization: gcc main.c -O1 -o rdtsc_o1
Disassembled code:
a89: 0f ae f0 mfence
a8c: 0f 31 rdtsc
a8e: 48 89 05 a3 15 20 00 mov %rax,0x2015a3(%rip) # 202038 <read1>
a95: 0f ae f0 mfence
a98: 49 83 c5 03 add $0x3,%r13
a9c: 0f ae f0 mfence
a9f: 0f 31 rdtsc
aa1: 48 89 05 88 15 20 00 mov %rax,0x201588(%rip) # 202030 <read2>
aa8: 0f ae f0 mfence
Results:
Mfenced rtdscP reads:
Sum: 28041804
Avg: 280
Sum, removed outliers: 27724464
Avg, removed outliers: 277
Mfenced rtdsc reads:
Sum: 17936460
Avg: 179
Sum, removed outliers: 17931024
Avg, removed outliers: 179
Lfenced rtdscP reads:
Sum: 7110144
Avg: 71
Sum, removed outliers: 7110144
Avg, removed outliers: 71
Lfenced rtdsc reads:
Sum: 6691140
Avg: 66
Sum, removed outliers: 6672924
Avg, removed outliers: 66
NOT fenced rtdscP reads:
Sum: 5970888
Avg: 59
Sum, removed outliers: 5965236
Avg, removed outliers: 59
NOT fenced rtdsc reads:
Sum: 3402920
Avg: 34
Sum, removed outliers: 3280111
Avg, removed outliers: 35