Why does CPU "insn per cycle" differ between similar CPUs, and how does "MONITOR-MWAIT" work in Linux?



Background: I have 2 servers, both running kernel version 4.18.7 with CONFIG_BPF_SYSCALL=y.

I created a shell script "x.sh":

i=0 
while (( i < 1000000 )) 
do (( i ++ )) 
done

and ran the command: perf stat ./x.sh

The shell version on both servers is "4.2.6(1)-release".

S1: CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, microcode: 0xb00002e, and the perf stat result:

5391.653531      task-clock (msec)         #    1.000 CPUs utilized          
4      context-switches          #    0.001 K/sec                  
0      cpu-migrations            #    0.000 K/sec                  
107      page-faults               #    0.020 K/sec                  
12,910,036,202      cycles                    #    2.394 GHz                    
27,055,073,385      instructions              #    2.10  insn per cycle         
6,527,267,657      branches                  # 1210.624 M/sec                  
34,787,686      branch-misses             #    0.53% of all branches        
5.392121575 seconds time elapsed

S2: CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, microcode: 0xb00002e, and the perf stat result:

10688.669439      task-clock (msec)         #    1.000 CPUs utilized          
6      context-switches          #    0.001 K/sec                  
0      cpu-migrations            #    0.000 K/sec                  
105      page-faults               #    0.010 K/sec                  
24,583,857,467      cycles                    #    2.300 GHz                    
27,117,299,405      instructions              #    1.10  insn per cycle         
6,571,204,123      branches                  #  614.782 M/sec                  
32,996,513      branch-misses             #    0.50% of all branches        
10.688907278 seconds time elapsed

Question: the CPUs are similar and the OS kernel is the same, so why are the cycle counts reported by perf stat so different?
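
As a sanity check on the numbers above, the "insn per cycle" figure is simply instructions divided by cycles. A minimal sketch of that arithmetic follows; the helper name ipc is made up for illustration and is not part of perf. It strips the thousands separators that perf stat prints.

```shell
#!/bin/sh
# ipc INSTRUCTIONS CYCLES
# Prints instructions / cycles rounded to two decimals.
# Hypothetical helper for illustration, not part of perf.
ipc() {
    insns=$(printf '%s' "$1" | tr -d ',')
    cycles=$(printf '%s' "$2" | tr -d ',')
    awk -v i="$insns" -v c="$cycles" 'BEGIN { printf "%.2f\n", i / c }'
}

# Counter values copied from the two perf stat runs above.
echo "S1: $(ipc 27,055,073,385 12,910,036,202)"
echo "S2: $(ipc 27,117,299,405 24,583,857,467)"
```

Both servers retire almost the same number of instructions, but S2 needs roughly twice as many cycles to do so, which is why its run takes about twice as long.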

Edit: I modified the script x.sh and the command, lowering the loop count to shorten the run time:

i=0
while (( i < 10000 )) 
do
(( i ++))
done

The command, with more detail and repeated runs: perf stat -d -d -d -r 100 ~/1.sh

Result for S1:

54.007015      task-clock (msec)         #    0.993 CPUs utilized            ( +-  0.09% )
0      context-switches          #    0.002 K/sec                    ( +- 29.68% )
0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
106      page-faults               #    0.002 M/sec                    ( +-  0.12% )
128,380,832      cycles                    #    2.377 GHz                      ( +-  0.09% )  (30.52%)
252,497,672      instructions              #    1.97  insn per cycle           ( +-  0.01% )  (39.75%)
60,741,861      branches                  # 1124.703 M/sec                    ( +-  0.01% )  (40.63%)
451,011      branch-misses             #    0.74% of all branches          ( +-  0.29% )  (40.72%)
66,621,188      L1-dcache-loads           # 1233.565 M/sec                    ( +-  0.01% )  (40.76%)
52,248      L1-dcache-load-misses     #    0.08% of all L1-dcache hits    ( +-  4.55% )  (39.86%)
1,568      LLC-loads                 #    0.029 M/sec                    ( +-  9.58% )  (29.75%)
168      LLC-load-misses           #   21.47% of all LL-cache hits     ( +-  3.87% )  (29.66%)
<not supported>      L1-icache-loads                                             
672,212      L1-icache-load-misses                                         ( +-  0.85% )  (29.62%)
67,630,589      dTLB-loads                # 1252.256 M/sec                    ( +-  0.01% )  (29.62%)
1,051      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 33.11% )  (29.62%)
13,929      iTLB-loads                #    0.258 M/sec                    ( +- 17.85% )  (29.62%)
44,327      iTLB-load-misses          #  318.24% of all iTLB cache hits   ( +-  8.12% )  (29.62%)
<not supported>      L1-dcache-prefetches
<not supported>      L1-dcache-prefetch-misses                                   
0.054370018 seconds time elapsed                                          ( +-  0.08% )

S2:

106.405511      task-clock (msec)         #    0.996 CPUs utilized            ( +-  0.07% )
0      context-switches          #    0.002 K/sec                    ( +- 18.92% )
0      cpu-migrations            #    0.000 K/sec                  
106      page-faults               #    0.994 K/sec                    ( +-  0.09% )
242,242,714      cycles                    #    2.277 GHz                      ( +-  0.07% )  (30.55%)
260,394,910      instructions              #    1.07  insn per cycle           ( +-  0.01% )  (39.00%)
62,877,430      branches                  #  590.923 M/sec                    ( +-  0.01% )  (39.65%)
407,887      branch-misses             #    0.65% of all branches          ( +-  0.25% )  (39.81%)
68,137,265      L1-dcache-loads           #  640.355 M/sec                    ( +-  0.01% )  (39.84%)
70,330      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    ( +-  2.91% )  (39.38%)
3,526      LLC-loads                 #    0.033 M/sec                    ( +-  7.33% )  (30.28%)
153      LLC-load-misses           #    8.69% of all LL-cache hits     ( +-  6.29% )  (30.12%)
<not supported>      L1-icache-loads                                             
878,021      L1-icache-load-misses                                         ( +-  0.43% )  (30.09%)
68,442,021      dTLB-loads                #  643.219 M/sec                    ( +-  0.01% )  (30.07%)
9,518      dTLB-load-misses          #    0.01% of all dTLB cache hits   ( +-  2.58% )  (30.07%)
233,190      iTLB-loads                #    2.192 M/sec                    ( +-  3.73% )  (30.07%)
17,837      iTLB-load-misses          #    7.65% of all iTLB cache hits   ( +- 13.21% )  (30.07%)
<not supported>      L1-dcache-prefetches
<not supported>      L1-dcache-prefetch-misses                                   
0.106858870 seconds time elapsed                                          ( +-  0.07% )

Edit: I checked that the md5sum of /usr/bin/sh is the same on both servers and added the script header #! /usr/bin/sh; the result is the same as before.

Edit: using the command perf diff perf.data.s2 perf.data.s1 I found some valuable differences.

It first showed some warnings:

/usr/lib64/ld-2.17.so with build id 93d2e4a501823d041413eeb652b89044d1f680ee not found, continuing without symbols
/usr/lib64/libc-2.17.so with build id b04a54c443d36058702ab4060c63f4ab3273eae9 not found, continuing without symbols

and I found that the rpm versions differ.

The perf diff output:

# Event 'cycles'
#
# Baseline    Delta  Shared Object      Symbol
# ........  .......  .................  ..............................................
#
21.20%   +3.83%  bash               [.] 0x000000000002c0f0
10.22%           libc-2.17.so       [.] _int_free
9.11%           libc-2.17.so       [.] _int_malloc
7.97%           libc-2.17.so       [.] malloc
4.09%           libc-2.17.so       [.] __gconv_transform_utf8_internal
3.71%           libc-2.17.so       [.] __mbrtowc
3.48%   -1.63%  bash               [.] execute_command_internal
3.48%   +1.18%  [unknown]          [k] 0xfffffe0000032000
3.25%   -1.87%  bash               [.] xmalloc
3.12%           libc-2.17.so       [.] __strcpy_sse2_unaligned
2.44%   +2.22%  [kernel.kallsyms]  [k] syscall_return_via_sysret
2.09%   -0.24%  bash               [.] evalexp
2.09%           libc-2.17.so       [.] __ctype_get_mb_cur_max
1.92%           libc-2.17.so       [.] free
1.41%   -0.95%  bash               [.] dequote_string
1.19%   +0.23%  bash               [.] stupidly_hack_special_variables
1.16%           libc-2.17.so       [.] __strlen_sse2_pminub
1.16%           libc-2.17.so       [.] __memcpy_ssse3_back
1.16%           libc-2.17.so       [.] __strcmp_sse42
0.93%   -0.01%  bash               [.] mbschr
0.93%   -0.47%  bash               [.] hash_search
0.70%           libc-2.17.so       [.] __sigprocmask
0.70%   -0.23%  bash               [.] dispose_words
0.70%   -0.23%  bash               [.] execute_command
0.70%   -0.23%  bash               [.] set_pipestatus_array
0.70%           bash               [.] run_pending_traps
0.47%           bash               [.] malloc@plt
0.47%           bash               [.] var_lookup
0.47%           bash               [.] fmtumax
0.47%           bash               [.] do_redirections
0.46%           bash               [.] dispose_word
0.46%   -0.00%  bash               [.] alloc_word_desc
0.46%   -0.00%  [kernel.kallsyms]  [k] _copy_to_user
0.46%           libc-2.17.so       [.] __ctype_b_loc
0.46%           bash               [.] new_fd_bitmap
0.46%           bash               [.] add_unwind_protect
0.46%   -0.00%  bash               [.] discard_unwind_frame
0.46%           bash               [.] memcpy@plt
0.46%           bash               [.] __ctype_get_mb_cur_max@plt
0.46%           bash               [.] signal_in_progress
0.40%           libc-2.17.so       [.] _IO_vfscanf
0.40%           ld-2.17.so         [.] do_lookup_x
0.27%           bash               [.] mbrtowc@plt
0.24%   +1.60%  [kernel.kallsyms]  [k] __x64_sys_rt_sigprocmask
0.23%           bash               [.] list_append
0.23%           bash               [.] bind_variable
0.23%   +0.69%  [kernel.kallsyms]  [k] entry_SYSCALL_64_stage2
0.23%   +0.69%  [kernel.kallsyms]  [k] do_syscall_64
0.23%           libc-2.17.so       [.] _dl_mcount_wrapper_check
0.23%   +0.69%  bash               [.] make_word_list
0.23%   +0.69%  [kernel.kallsyms]  [k] copy_user_generic_unrolled
0.23%           [kernel.kallsyms]  [k] unmap_page_range
0.23%           libc-2.17.so       [.] __sigjmp_save
0.23%   +0.23%  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
0.20%           [kernel.kallsyms]  [k] swapgs_restore_regs_and_return_to_usermode
0.03%           [kernel.kallsyms]  [k] page_fault
0.00%           [kernel.kallsyms]  [k] xfs_bmapi_read
0.00%           [kernel.kallsyms]  [k] xfs_release
0.00%   +0.00%  [kernel.kallsyms]  [k] native_write_msr
+45.33%  libc-2.17.so       [.] 0x0000000000027cc6
+0.52%  [kernel.kallsyms]  [k] __mod_node_page_state
+0.46%  bash               [.] free@plt
+0.46%  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
+0.46%  bash               [.] begin_unwind_frame
+0.46%  bash               [.] make_bare_word
+0.46%  bash               [.] find_variable_internal
+0.37%  ld-2.17.so         [.] 0x0000000000009b13

Maybe the difference in glibc is the answer.

Edit: finally, I checked the BIOS configuration and found that the S2 server was using power-saving mode. That is the real answer!

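However, the BIOS setting MONITOR-MWAIT confused me: even in "max performance mode", S2 still performed badly as long as "MONITOR-MWAIT" was enabled. The command cpupower idle-info -o shows the CPU using "C-states", which are supposed to be disabled already in "max performance mode". Performance only improves when MONITOR-MWAIT is disabled in addition to selecting "max performance mode".

The description of "MONITOR-MWAIT" says that some operating systems check this option to re-enable "C-states", but I could not find how the Linux kernel uses it to change "C-states"...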

I found the answer.

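First, let's look at the BIOS MONITOR/MWAIT option under kernel 4.18.7. That kernel uses the intel_idle driver, which only checks whether the system supports the mwait instruction and does not care whether C-states are enabled in the BIOS. As long as the MONITOR/MWAIT instructions are available, the intel_idle driver is used and C-states are forced on, which makes the machine look as if it were in power-saving mode.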
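
To see which situation a given machine is in, one can check whether the CPU advertises MONITOR/MWAIT and which cpuidle driver the kernel actually picked. The following is a minimal sketch, assuming the usual Linux locations /proc/cpuinfo (where the feature appears as the monitor flag) and /sys/devices/system/cpu/cpuidle/current_driver. The function names and the optional file arguments are illustrative only.

```shell
#!/bin/sh
# has_monitor_flag [CPUINFO_FILE]
# Succeeds if the first "flags" line lists "monitor", which is how
# Linux reports MONITOR/MWAIT support on x86.
has_monitor_flag() {
    cpuinfo=${1:-/proc/cpuinfo}
    grep -m1 '^flags' "$cpuinfo" 2>/dev/null | grep -qw monitor
}

# idle_driver [DRIVER_FILE]
# Prints the active cpuidle driver (for example intel_idle), or "none"
# if the file is missing or unreadable.
idle_driver() {
    drv=${1:-/sys/devices/system/cpu/cpuidle/current_driver}
    if [ -r "$drv" ]; then cat "$drv"; else echo none; fi
}

if has_monitor_flag; then
    echo "MONITOR/MWAIT: advertised"
else
    echo "MONITOR/MWAIT: not advertised"
fi
echo "cpuidle driver: $(idle_driver)"
```

Running this before and after toggling the BIOS option should show whether the flag and the driver change together, which is the behaviour described above.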

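Second, why is insn per cycle different? Because the tuned service is in use and its active profile is "latency-performance", whose force_latency is 1us. When C-states are in use, only C-state levels whose latency is less than force_latency will be used: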

# cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 0:
Number of idle states: 5
Available idle states: POLL C1 C1E C3 C6
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 13034605
Duration: 820867557
C1:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 349471619
Duration: 344311623672
C1E:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 237
Duration: 55999
C3:
Flags/Description: MWAIT 0x10
Latency: 40
Usage: 350
Duration: 168988
C6:
Flags/Description: MWAIT 0x20
Latency: 133
Usage: 3696
Duration: 17809893

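As you can see, only the POLL level has a latency below 1us, and the POLL level keeps the CPU busy executing NOP instructions. In that case, with Hyper-Threading enabled, the instruction execution rate drops by half: the two logical cores share one ALU, and while one of them is executing NOP instructions the other has to wait for it.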
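
The selection rule described above can be sketched as a filter over the per-state latency files that cpuidle exposes in sysfs. This only illustrates the rule as stated here (a strict "less than" comparison against the limit); it is not the kernel's governor code, and whether the kernel treats the bound as strict is not something this sketch settles. The function name and the optional directory argument are made up for the example.

```shell
#!/bin/sh
# allowed_idle_states LIMIT_US [CPUIDLE_DIR]
# Prints the names of idle states whose exit latency (microseconds)
# is below LIMIT_US. Assumes the sysfs layout stateN/name and
# stateN/latency under one CPU's cpuidle directory.
allowed_idle_states() {
    limit=$1
    dir=${2:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state*; do
        # Skip silently if the directory or file does not exist.
        [ -r "$state/latency" ] || continue
        latency=$(cat "$state/latency")
        if [ "$latency" -lt "$limit" ]; then
            cat "$state/name"
        fi
    done
}

# With a 1 us limit (the force_latency value quoted above):
allowed_idle_states 1
```

With the latencies from the cpupower output above (POLL 0, C1 2, C1E 10, C3 40, C6 133), a 1 us limit leaves only POLL.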

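If you set the MONITOR/MWAIT option to disabled, the intel_idle driver is not used, so the tuned service's force_latency no longer applies, and one of the logical cores halts, leaving the ALU exclusively to the other.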

Finally, thanks to everyone, especially @Peter Cordes and @osgx, for prompting me to check the BIOS. The command echo 2^1234567%2 | bc is very nice!
