C: one thread counts while another thread does the work and measures



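I want to implement a two-thread model in which one thread counts (increments a value forever) and the other reads the counter, does some work, reads the counter a second time, and measures the time elapsed between the two reads.

Here is what I have so far: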

// global counter
register unsigned long counter asm("r13");
// unsigned long counter;
void* counter_thread(){
    // affinity is set to some isolated CPU so the noise will be minimal
    while(1){
        //counter++; // Line 1*
        asm volatile("add $1, %0" : "+r"(counter) : ); // Line 2*
    }
}
void* measurement_thread(){
    // affinity is set somewhere over here
    unsigned long meas = 0;
    unsigned long a = 5;
    unsigned long r1,r2;
    sleep(1);
    while(1){
        mfence();
        r1 = counter;
        a *=3; // dummy operation that I want to measure
        r2 = counter;
        mfence();
        meas = r2-r1;
        printf("counter:%lu\n", counter);
        break;
    }
}

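Let me explain what I have done so far:

Since I want the counter to be accurate, I pin the counting thread to an isolated CPU. Also, if I use the counter as in Line 1*, the disassembled function is: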

d4c:   4c 89 e8                mov    %r13,%rax
d4f:   48 83 c0 01             add    $0x1,%rax
d53:   49 89 c5                mov    %rax,%r13
d56:   eb f4                   jmp    d4c <counter_thread+0x37>

That is not a 1-cycle operation, which is why I used inline assembly to get rid of the 2 mov instructions. With inline assembly:

d4c:   49 83 c5 01             add    $0x1,%r13
d50:   eb fa                   jmp    d4c <counter_thread+0x37>

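The problem is that neither implementation works: the other thread cannot see the counter being updated. If I make the global counter a non-register variable, it works, but I want to be precise. If I declare the global counter as unsigned long counter, the disassembly of the counter thread is: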

d4c:   48 8b 05 ed 12 20 00    mov    0x2012ed(%rip),%rax        # 202040 <counter>
d53:   48 83 c0 01             add    $0x1,%rax
d57:   48 89 05 e2 12 20 00    mov    %rax,0x2012e2(%rip)        # 202040 <counter>
d5e:   eb ec                   jmp    d4c <counter_thread+0x37>

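It works, but it doesn't give me the granularity I want.

EDIT

My environment:

  • CPU: AMD Ryzen 3600
  • Kernel: 5.0.0-32-generic
  • OS: Ubuntu 18.04

EDIT 2: I isolated 2 neighboring CPU cores (cores 10 and 11) and ran the experiments on them. The counter runs on one core and the measurement on the other. The isolation is done by adding an isolcpus line to the /etc/default/grub file.

EDIT 3: I know that one measurement is not enough. I ran the experiment 10 million times and looked at the results.

Experiment 1

Setup: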

unsigned long counter = 0; // global counter
void* counter_thread(){
    mfence();
    while(1)
        counter++;
}
void* measurement_thread(){
    unsigned long i=0, r1=0, r2=0;
    unsigned int a=0;
    sleep(1);
    while(1){
        mfence();
        r1 = counter;
        a += 3;
        r2 = counter;
        mfence();
        measurements[r2-r1]++;
        i++;
        if(i == MILLION_ITER)
            break;
    }
}

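Result 1: In 99.99% of the cases I got 0, which I expected, either because the first thread was not running or because the OS or other interrupts disturbed the measurement. Discarding the 0s and the very high values gives me an average of 20 cycles per measurement. (I was expecting 3-4, since I am only doing an integer addition.)

Experiment 2

Setup: Identical to the above, with one difference: instead of a global counter in memory, I use a register as the counter: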

register unsigned long counter asm("r13");

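Result 2: The measurement thread always reads 0. In the disassembled code I can see that both threads operate on the R13 register (the counter); however, I believe it is somehow not shared.

Experiment 3

Setup: Identical to setup 2, except that in the counter thread, instead of doing counter++, I use inline assembly to make sure I am doing a 1-cycle operation. My disassembled file looks like this: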

cd1:   49 83 c5 01             add    $0x1,%r13
cd5:   eb fa                   jmp    cd1 <counter_thread+0x37>

Result 3: The measurement thread reads 0, as above.

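Each thread has its own registers. Each logical CPU core has its own architectural registers, which a thread uses while it is running on that core. Only signal handlers (or, on bare metal, interrupts) can modify the registers of their thread.

Declaring a GNU C asm register-global like your ... asm("r13") in a multi-threaded program effectively gives you thread-local storage, not a truly shared global.

Only memory is shared between threads, not registers. That is how multiple threads can run at the same time without stepping on each other, each using its own registers.

Registers that you don't declare as register-global can be used freely by the compiler, so it wouldn't work at all for them to be shared between cores. (And there is nothing GCC can do to make them shared vs. private depending on how you declare them.)

Even apart from that, the register-global is neither volatile nor atomic, so r1 = counter; and r2 = counter; can be CSE'd, making r2-r1 a compile-time constant zero even if your local R13 were changing from a signal handler.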
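To illustrate the point that only memory is shared, here is a minimal sketch, assuming C11 atomics and POSIX threads, of a counter that lives in memory so that a second thread can actually observe it. The names shared_counter, stop_flag, counter_loop and counter_is_visible are made up for this example and do not come from the question. It demonstrates visibility only; it does not make reads from another core cheap.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: the counter lives in memory, not in a register. */
static _Atomic uint64_t shared_counter;
static atomic_bool stop_flag;

/* Single writer: relaxed load, add, relaxed store (no lock prefix needed). */
static void *counter_loop(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
        uint64_t v = atomic_load_explicit(&shared_counter, memory_order_relaxed);
        atomic_store_explicit(&shared_counter, v + 1, memory_order_relaxed);
    }
    return NULL;
}

/* Returns 1 once the reader has seen the writer's updates, -1 on error. */
int counter_is_visible(void) {
    pthread_t tid;
    uint64_t r1, r2;

    atomic_store(&shared_counter, 0);
    atomic_store(&stop_flag, false);
    if (pthread_create(&tid, NULL, counter_loop, NULL) != 0)
        return -1;

    /* Each iteration performs a fresh atomic load of the shared variable. */
    do {
        r1 = atomic_load_explicit(&shared_counter, memory_order_relaxed);
    } while (r1 == 0);
    do {
        r2 = atomic_load_explicit(&shared_counter, memory_order_relaxed);
    } while (r2 == r1);

    atomic_store(&stop_flag, true);   /* ask the writer to stop */
    pthread_join(tid, NULL);
    return r2 != r1;
}
```

On x86-64 the relaxed load/store pair should compile to roughly the same mov/add/mov loop as the plain unsigned long counter in the question, so the cost lies in moving the cache line between cores, not in the atomics themselves.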


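How can I make sure that both threads use registers for the read/write operations on the counter value?

You can't do that. There is no shared state between cores that can be read or written with lower latency than cache.

If you want to time something, consider using rdtsc to get reference cycles, or rdpmc to read a performance counter (which you might have set up to count core clock cycles).

Using another thread to increment a counter is unnecessary and doesn't help, because there is no very-low-overhead way to read something from another core.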


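The rdtscp instruction on my machine gives a 36-72-108... cycle resolution at best. So I cannot distinguish between 2 cycles and 35 cycles, because both of them give 36 cycles.

Then you're using rdtsc wrong. It is not serializing, so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc has only somewhat lower overhead.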
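A minimal sketch of the fencing described above, assuming GCC or Clang on x86-64 and the __rdtsc and _mm_lfence intrinsics from <x86intrin.h>. The names fenced_tsc_delta and empty_region are made up for illustration. The result is in TSC reference ticks, not core clock cycles, and it includes the overhead of the fences and of rdtsc itself.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical helper: time one call of work() in TSC reference ticks. */
uint64_t fenced_tsc_delta(void (*work)(void)) {
    _mm_lfence();                 /* let earlier instructions finish first */
    uint64_t start = __rdtsc();
    _mm_lfence();                 /* keep the timed work from starting before rdtsc */
    work();
    _mm_lfence();                 /* wait for the timed work to finish */
    uint64_t end = __rdtsc();
    _mm_lfence();                 /* keep later code from overlapping the second read */
    return end - start;
}

/* Empty region: timing this estimates the measurement overhead itself. */
void empty_region(void) {}
```

Calling fenced_tsc_delta(empty_region) many times and taking the minimum gives a rough baseline to subtract from other measurements.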

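But more importantly, you cannot usefully measure a *=3; in C in terms of a single cost in cycles. First of all, it can compile differently depending on context.

But assuming a normal lea eax, [rax + rax*2], a realistic instruction cost model has 3 dimensions: uop count (front end), back-end port pressure, and latency from input(s) to output. https://agner.org/optimize/

See my answer on RDTSCP in NASM always returns the same value for more about timing a single instruction. Put it in a loop in different ways to measure throughput vs. latency, and look at perf counters to get uops->ports. Or look at Agner Fog's instruction tables and https://uops.info/ because people have already done those tests.

  • How many CPU cycles are needed for each assembly instruction?
  • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
  • Modern x86 cost model

Again, these are how you time a single asm instruction, not a C statement. With optimization enabled, the cost of a C statement can depend on how it optimizes into the surrounding code. (And/or whether the latency of surrounding operations hides its cost, on an out-of-order execution CPU like all modern x86 CPUs.)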
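The "put it in a loop" advice can be sketched as follows: a dependent chain of multiply-by-3 steps is timed as a whole and divided by the iteration count, which approximates the latency per step (plus loop overhead) instead of the cost of one isolated statement. This is only a sketch assuming GCC or Clang on x86-64; mul3_chain and ticks_per_step are hypothetical names, and the empty asm statement is there only to stop the compiler from collapsing or removing the chain. The unit is TSC reference ticks, which differ from core clock cycles whenever the core is not running at the TSC frequency.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical helper: n dependent x *= 3 steps; returns x * 3^n (mod 2^64). */
uint64_t mul3_chain(uint64_t x, uint64_t n) {
    for (uint64_t i = 0; i < n; i++) {
        x *= 3;
        /* Empty asm that claims to modify x, so each step is really computed. */
        __asm__ volatile("" : "+r"(x));
    }
    return x;
}

/* Average TSC ticks per dependent step, for n > 0. */
double ticks_per_step(uint64_t n) {
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();
    volatile uint64_t sink = mul3_chain(1, n);   /* keep the result alive */
    (void)sink;
    _mm_lfence();
    uint64_t end = __rdtsc();
    return (double)(end - start) / (double)n;
}
```

With a large n (say a million), the fence and rdtsc overhead is amortized over the whole chain. Making the steps independent of each other instead would measure throughput, not latency.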

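Then you're using rdtsc wrong. It is not serializing, so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc has only somewhat lower overhead.

OK. I did my homework.

First things first. I knew that rdtscp is a serializing instruction. I am not talking about rdtsc; there is a P at the end.

I have checked both the Intel and the AMD manuals for that.

  • Intel manual, page 83, Table 2-3. Summary of System Instructions
  • AMD manual, pages 403-406

Correct me if I am wrong, but from what I read, I understand that I don't need fence instructions before and after rdtscp, because it is a serializing instruction, right?

The second thing is that I ran some experiments on 3 of my machines. Here are the results.

Ryzen experiments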

======================= AMD RYZEN EXPERIMENTS =========================
RYZEN 3600
100_000 iteration
Using a *=3
Note that almost all sums are divisible by 36, which is my machine's timer resolution. 
I also checked the cases where the sums are not divisible by 36. 
This happens when I don't use fence instructions with rdtsc. 
It turns out that the read value is then either 35 or 1, 
which I believe means the instruction (rdtsc) cannot read the value correctly.
Mfenced rtdscP reads:
Sum:            25884432
Avg:            258
Sum, removed outliers:  25800120
Avg, removed outliers:  258
Mfenced rtdsc reads:
Sum:            17579196
Avg:            175
Sum, removed outliers:  17577684
Avg, removed outliers:  175
Lfenced rtdscP reads:
Sum:            7511688
Avg:            75
Sum, removed outliers:  7501608
Avg, removed outliers:  75
Lfenced rtdsc reads:
Sum:            7024428
Avg:            70
Sum, removed outliers:  7015248
Avg, removed outliers:  70
NOT fenced rtdscP reads:
Sum:            6024888
Avg:            60
Sum, removed outliers:  6024888
Avg, removed outliers:  60
NOT fenced rtdsc reads:
Sum:            3274866
Avg:            32
Sum, removed outliers:  3232913
Avg, removed outliers:  35
======================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum:            36217404
Avg:            362
Sum, removed outliers:  36097164
Avg, removed outliers:  361
Mfenced rtdsc reads:
Sum:            22973400
Avg:            229
Sum, removed outliers:  22939236
Avg, removed outliers:  229
Lfenced rtdscP reads:
Sum:            13178196
Avg:            131
Sum, removed outliers:  13177872
Avg, removed outliers:  131
Lfenced rtdsc reads:
Sum:            12631932
Avg:            126
Sum, removed outliers:  12631932
Avg, removed outliers:  126
NOT fenced rtdscP reads:
Sum:            12115548
Avg:            121
Sum, removed outliers:  12103236
Avg, removed outliers:  121
NOT fenced rtdsc reads:
Sum:            3335997
Avg:            33
Sum, removed outliers:  3305333
Avg, removed outliers:  35
=================== END OF AMD RYZEN EXPERIMENTS =========================

Here are the Bulldozer architecture experiments.

======================= AMD BULLDOZER EXPERIMENTS =========================
AMD A6-4455M
100_000 iteration
Using a *=3;
Mfenced rtdscP reads:
Sum:            32120355
Avg:            321
Sum, removed outliers:  27718117
Avg, removed outliers:  278
Mfenced rtdsc reads:
Sum:            23739715
Avg:            237
Sum, removed outliers:  23013028
Avg, removed outliers:  230
Lfenced rtdscP reads:
Sum:            14274916
Avg:            142
Sum, removed outliers:  13026199
Avg, removed outliers:  131
Lfenced rtdsc reads:
Sum:            11083963
Avg:            110
Sum, removed outliers:  10905271
Avg, removed outliers:  109
NOT fenced rtdscP reads:
Sum:            9361738
Avg:            93
Sum, removed outliers:  8993886
Avg, removed outliers:  90
NOT fenced rtdsc reads:
Sum:            4766349
Avg:            47
Sum, removed outliers:  4310312
Avg, removed outliers:  43

=================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum:            38748536
Avg:            387
Sum, removed outliers:  36719312
Avg, removed outliers:  368
Mfenced rtdsc reads:
Sum:            35106459
Avg:            351
Sum, removed outliers:  33514331
Avg, removed outliers:  335
Lfenced rtdscP reads:
Sum:            23867349
Avg:            238
Sum, removed outliers:  23203849
Avg, removed outliers:  232
Lfenced rtdsc reads:
Sum:            21991975
Avg:            219
Sum, removed outliers:  21394828
Avg, removed outliers:  215
NOT fenced rtdscP reads:
Sum:            19790942
Avg:            197
Sum, removed outliers:  19701909
Avg, removed outliers:  197
NOT fenced rtdsc reads:
Sum:            10841074
Avg:            108
Sum, removed outliers:  10583085
Avg, removed outliers:  106
=================== END OF AMD BULLDOZER EXPERIMENTS =========================

The Intel results are:

======================= INTEL EXPERIMENTS =========================
INTEL 4710HQ
100_000 iteration
Using a *=3
Mfenced rtdscP reads:
Sum:            10914893
Avg:            109
Sum, removed outliers:  10820879
Avg, removed outliers:  108
Mfenced rtdsc reads:
Sum:            7866322
Avg:            78
Sum, removed outliers:  7606613
Avg, removed outliers:  76
Lfenced rtdscP reads:
Sum:            4823705
Avg:            48
Sum, removed outliers:  4783842
Avg, removed outliers:  47
Lfenced rtdsc reads:
Sum:            3634106
Avg:            36
Sum, removed outliers:  3463079
Avg, removed outliers:  34
NOT fenced rtdscP reads:
Sum:            2216884
Avg:            22
Sum, removed outliers:  1435830
Avg, removed outliers:  17
NOT fenced rtdsc reads:
Sum:            1736640
Avg:            17
Sum, removed outliers:  986250
Avg, removed outliers:  12
===================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum:            22008705
Avg:            220
Sum, removed outliers:  16097871
Avg, removed outliers:  177
Mfenced rtdsc reads:
Sum:            13086713
Avg:            130
Sum, removed outliers:  12627094
Avg, removed outliers:  126
Lfenced rtdscP reads:
Sum:            9882409
Avg:            98
Sum, removed outliers:  9753927
Avg, removed outliers:  97
Lfenced rtdsc reads:
Sum:            8854943
Avg:            88
Sum, removed outliers:  8435847
Avg, removed outliers:  84
NOT fenced rtdscP reads:
Sum:            7302577
Avg:            73
Sum, removed outliers:  7190424
Avg, removed outliers:  71
NOT fenced rtdsc reads:
Sum:            1726126
Avg:            17
Sum, removed outliers:  1029630
Avg, removed outliers:  12
=================== END OF INTEL EXPERIMENTS =========================

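From my point of view, the AMD Ryzen should have executed faster. My Intel CPU is almost 5 years old, and the AMD CPU is brand new.

I couldn't find the exact source, but I have read that AMD changed/decreased the resolution of the rdtsc and rdtscp instructions when they moved from the Bulldozer architecture to Ryzen. That is why I get results that are multiples of 36 when I try to time the code. I don't know why they did that, and I can't remember where I found the information, but that is the case. If you have an AMD Ryzen machine, I suggest you run the experiments and look at the timer outputs.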
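One way to check the "multiples of 36" observation on a particular machine is to take the greatest common divisor of many measured deltas: if the counter only advances in steps of some fixed size, the GCD of the nonzero deltas should come out as a multiple of that step. A sketch in plain C; gcd_u64 and delta_granularity are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Euclid's algorithm; gcd_u64(0, x) == x. */
static uint64_t gcd_u64(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD of all deltas; zero deltas do not affect the result. */
uint64_t delta_granularity(const uint64_t *deltas, size_t n) {
    uint64_t g = 0;   /* 0 is the identity element for gcd */
    for (size_t i = 0; i < n; i++)
        g = gcd_u64(g, deltas[i]);
    return g;
}
```

A single stray reading such as the 35 or 1 mentioned in the results pulls the GCD down to 1, so it is worth looking at a histogram of the deltas as well.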

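I haven't looked at rdpmc yet; I'll try to update once I have read up on it.

EDIT:

Following up on the comments below.

About warming up: all the experiments are in a single C program. So even if things are not warmed up in the mfenced rdtscp case (the first experiment), they certainly are warmed up later.

I am using C mixed with inline assembly. I just use gcc main.c -o main to compile the code. AFAIK, it compiles with O0 optimization. GCC is version 7.4.0.

To reduce the time even further, I declared my functions as #define macros so that they are not called as functions, which means faster execution.

An example of how I ran the experiments: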

#define lfence() asm volatile("lfence\n");
#define mfence() asm volatile("mfence\n");
// reading the low end is enough for the measurement because I don't measure too complex result. 
// For complex measurements, I need to shift and OR
#define rdtscp(_readval) asm volatile("rdtscp\n": "=a"(_readval)::"rcx", "rdx");
void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2-read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
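The rdtscp macro above keeps only the value returned in RAX, which holds the low 32 bits of the time stamp counter, so a delta can be wrong if those 32 bits wrap around between the two reads. The "shift and OR" mentioned in the comment could look like the following sketch (GCC/Clang inline asm on x86-64; combine_tsc and read_tscp are hypothetical names, not the author's code). rdtscp also writes the IA32_TSC_AUX value to ECX, so ECX is declared as an output here.

```c
#include <stdint.h>

/* Combine the two 32-bit halves returned in EDX:EAX into one 64-bit value. */
uint64_t combine_tsc(uint32_t lo, uint32_t hi) {
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: read the full 64-bit time stamp counter with rdtscp. */
uint64_t read_tscp(void) {
    uint32_t lo, hi, aux;
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    (void)aux;   /* IA32_TSC_AUX, unused in this sketch */
    return combine_tsc(lo, hi);
}
```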

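EDIT 2:

Why are you using mfence?

I wasn't using mfence in the first place. I was just using rdtscp, doing the work, and using rdtscp again to find the difference.

No idea what you're hoping to learn here from cycle-accurate timing of anti-optimized gcc -O0 output.

I am not using any optimization because I want to measure how many cycles the instructions take to finish. I will be measuring a code block that includes branches. If I enable optimization, the compiler may turn the branch into a cmov, and that would defeat the whole point of the measurement.

I wouldn't be surprised if the non-inline function call and other memory access (from disabling optimization, /facepalm) being mfenced is what makes it a multiple of 36 on your Ryzen.

Also, below is the disassembled version of the code. During the measurements there is no memory access (except for read1 and read2, which I believe are in cache) and no call to other functions.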

9fd:   0f ae f0                mfence 
a00:   0f 01 f9                rdtscp 
a03:   48 89 05 36 16 20 00    mov    %rax,0x201636(%rip)        # 202040 <read1>
a0a:   0f ae f0                mfence 
a0d:   8b 05 15 16 20 00       mov    0x201615(%rip),%eax        # 202028 <a21>
a13:   83 c0 03                add    $0x3,%eax #Either this or division operations for measurement
a16:   89 05 0c 16 20 00       mov    %eax,0x20160c(%rip)        # 202028 <a21>
a1c:   0f ae f0                mfence 
a1f:   0f 01 f9                rdtscp 
a22:   48 89 05 0f 16 20 00    mov    %rax,0x20160f(%rip)        # 202038 <read2>
a29:   0f ae f0                mfence 
a2c:   48 8b 15 05 16 20 00    mov    0x201605(%rip),%rdx        # 202038 <read2>
a33:   48 8b 05 06 16 20 00    mov    0x201606(%rip),%rax        # 202040 <read1>
a3a:   48 29 c2                sub    %rax,%rdx
a3d:   8b 85 ec ca f3 ff       mov    -0xc3514(%rbp),%eax

Code:

register unsigned long a21 asm("r13");
#define calculation_to_measure() { \
    a21 += 3; \
}
#define initvars() { \
    read1 = 0; \
    read2 = 0; \
    a21 = 21; \
}
// =========== RDTSCP, double mfence ================
// Reference code, others are similar
void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2-read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
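calculate_sum_avg is not shown in the post. Purely as an illustration of what such a summary step might look like, here is a sketch that computes the sum and integer average of the samples, treating samples outside a caller-chosen [lo, hi] range as outliers. The struct, the names, and the outlier rule are all assumptions made for this example, not the author's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for the summary. */
typedef struct {
    uint64_t sum;    /* sum of the samples that were kept */
    uint64_t avg;    /* integer average of the kept samples, 0 if none */
    size_t   kept;   /* number of samples inside [lo, hi] */
} score_summary;

/* Summarize scores[0..n-1], skipping samples below lo or above hi. */
score_summary summarize_scores(const uint64_t *scores, size_t n,
                               uint64_t lo, uint64_t hi) {
    score_summary s = {0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        if (scores[i] < lo || scores[i] > hi)
            continue;             /* treated as an outlier */
        s.sum += scores[i];
        s.kept++;
    }
    if (s.kept > 0)
        s.avg = s.sum / s.kept;
    return s;
}
```

Passing lo = 0 and hi = UINT64_MAX gives the unfiltered sum and average, which corresponds to the "Sum" and "Avg" lines in the output; a narrower range gives a "removed outliers" variant.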

Results: I only ran these on the AMD Ryzen machine.

Using gcc main.c -O0 -o rdtsc, no optimization. It moves r13 to rax.

Disassembled code:

9ac:   0f ae f0                mfence 
9af:   0f 01 f9                rdtscp 
9b2:   48 89 05 7f 16 20 00    mov    %rax,0x20167f(%rip)        # 202038 <read1>
9b9:   0f ae f0                mfence 
9bc:   4c 89 e8                mov    %r13,%rax
9bf:   48 83 c0 03             add    $0x3,%rax
9c3:   49 89 c5                mov    %rax,%r13
9c6:   0f ae f0                mfence 
9c9:   0f 01 f9                rdtscp 
9cc:   48 89 05 5d 16 20 00    mov    %rax,0x20165d(%rip)        # 202030 <read2>
9d3:   0f ae f0                mfence 

Results:

Mfenced rtdscP reads:
Sum:            32846796
Avg:            328
Sum, removed outliers:  32626008
Avg, removed outliers:  327
Mfenced rtdsc reads:
Sum:            18235980
Avg:            182
Sum, removed outliers:  18108180
Avg, removed outliers:  181
Lfenced rtdscP reads:
Sum:            14351508
Avg:            143
Sum, removed outliers:  14238432
Avg, removed outliers:  142
Lfenced rtdsc reads:
Sum:            11179368
Avg:            111
Sum, removed outliers:  10994400
Avg, removed outliers:  115
NOT fenced rtdscP reads:
Sum:            6064488
Avg:            60
Sum, removed outliers:  6064488
Avg, removed outliers:  60
NOT fenced rtdsc reads:
Sum:            3306394
Avg:            33
Sum, removed outliers:  3278450
Avg, removed outliers:  35

Using gcc main.c -Og -o rdtsc_global

Disassembled code:

934:   0f ae f0                mfence 
937:   0f 01 f9                rdtscp 
93a:   48 89 05 f7 16 20 00    mov    %rax,0x2016f7(%rip)        # 202038 <read1>
941:   0f ae f0                mfence 
944:   49 83 c5 03             add    $0x3,%r13
948:   0f ae f0                mfence 
94b:   0f 01 f9                rdtscp 
94e:   48 89 05 db 16 20 00    mov    %rax,0x2016db(%rip)        # 202030 <read2>
955:   0f ae f0                mfence 

Results:

Mfenced rtdscP reads:
Sum:            22819428
Avg:            228
Sum, removed outliers:  22796064
Avg, removed outliers:  227
Mfenced rtdsc reads:
Sum:            20630736
Avg:            206
Sum, removed outliers:  19937664
Avg, removed outliers:  199
Lfenced rtdscP reads:
Sum:            13375008
Avg:            133
Sum, removed outliers:  13374144
Avg, removed outliers:  133
Lfenced rtdsc reads:
Sum:            9840312
Avg:            98
Sum, removed outliers:  9774036
Avg, removed outliers:  97
NOT fenced rtdscP reads:
Sum:            8784684
Avg:            87
Sum, removed outliers:  8779932
Avg, removed outliers:  87
NOT fenced rtdsc reads:
Sum:            3274209
Avg:            32
Sum, removed outliers:  3255480
Avg, removed outliers:  36

Using O1 optimization: gcc main.c -O1 -o rdtsc_o1

Disassembled code:

a89:   0f ae f0                mfence 
a8c:   0f 31                   rdtsc  
a8e:   48 89 05 a3 15 20 00    mov    %rax,0x2015a3(%rip)        # 202038 <read1>
a95:   0f ae f0                mfence 
a98:   49 83 c5 03             add    $0x3,%r13
a9c:   0f ae f0                mfence 
a9f:   0f 31                   rdtsc  
aa1:   48 89 05 88 15 20 00    mov    %rax,0x201588(%rip)        # 202030 <read2>
aa8:   0f ae f0                mfence 

Results:

Mfenced rtdscP reads:
Sum:            28041804
Avg:            280
Sum, removed outliers:  27724464
Avg, removed outliers:  277
Mfenced rtdsc reads:
Sum:            17936460
Avg:            179
Sum, removed outliers:  17931024
Avg, removed outliers:  179
Lfenced rtdscP reads:
Sum:            7110144
Avg:            71
Sum, removed outliers:  7110144
Avg, removed outliers:  71
Lfenced rtdsc reads:
Sum:            6691140
Avg:            66
Sum, removed outliers:  6672924
Avg, removed outliers:  66
NOT fenced rtdscP reads:
Sum:            5970888
Avg:            59
Sum, removed outliers:  5965236
Avg, removed outliers:  59
NOT fenced rtdsc reads:
Sum:            3402920
Avg:            34
Sum, removed outliers:  3280111
Avg, removed outliers:  35
