为什么通过"std::vector"迭代比通过"std::array"迭代更快

我最近问了这个问题：为什么迭代 std：：array 比迭代 std：：vector 快得多？

正如人们很快指出的那样，我的基准测试有很多缺陷。因此，当我试图修复我的基准测试时，我注意到std::vector并不比std::array慢，事实上，情况恰恰相反。

#include <vector>
#include <array>
#include <stdio.h>
#include <chrono>
using namespace std;
constexpr int n = 100'000'000;
vector<int> v(n);
//array<int, n> v;
int main()
{
int res = 0;
auto start = chrono::steady_clock::now();
for(int x : v)
res += x;
auto end = chrono::steady_clock::now();
auto diff = end - start;
double elapsed =
std::chrono::duration_cast<
std::chrono::duration<double, std::milli>
>(end - start).count();
printf("result: %dntime: %fn", res, elapsed);
}

我试图从以前的基准测试中改进的东西：

确保我正在使用结果，因此整个循环没有被优化
使用-O3标志表示速度
使用std::chrono而不是time命令。这样我们就可以隔离要测量的部分(只是for循环)。变量的静态初始化和类似的东西不会被测量。

测量时间：

数组：

$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 99.554109

向量：

$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 30.734491

我只是想知道这次我做错了什么。

观看神螺栓中的拆卸

差异是由于array的内存页不驻留在进程地址空间中(全局作用域数组存储在可执行文件.bss尚未分页的部分，它是零初始化的)。而vector刚刚被分配并被零填充，所以它的内存页已经存在。

如果添加

std::fill_n(v.data(), n, 1); // included in <algorithm>

作为将页面带入(预错误)的第一行main，这使得array时间与vector时间相同。

在 Linux 上，您可以执行mlock(v.data(), v.size() * sizeof(v[0]));将页面引入地址空间，而不是这样做。有关完整详细信息，请参阅man mlock。

内存映射/分配是惰性的：第一次访问页面将导致页面错误异常(在x86上#PF)。这包括 BSS 以及文件支持的映射，如可执行文件的文本段。这些页面错误是"有效的"，因此它们不会导致 SIGSEGV 被交付;相反，内核在必要时分配一个物理页，并连接硬件页表，以便加载或存储可以重新运行，并且不会在第二次出错。

这是昂贵的，特别是如果内核没有"绕过"并在一个页面错误期间准备多个页面。 (特别是在启用了 Spectre + Meltdown 缓解的情况下，在当前的 x86-64 硬件上，用户<>内核往返成本更高。

您让std:vector的构造函数在动态分配¹之后将零写入数组。std::vector会在定时循环之外执行所有页面错误。这发生在 main 之前，而实现正在为静态对象运行构造函数。

但是数组是零初始化的，因此它被放置在 BSS 中。首先要触摸它的是你的循环。您的array<>循环为定时区域内的所有页面错误付费。

如果使用new int[n]动态分配但不初始化内存块，则会看到与静态array<>相同的行为。 (如果Linux更愿意使用透明的大页面进行动态分配而不是BSS映射，可能会稍微好一点。

libstdc++ 和 libc++ 中的

脚注 1std::vector太愚蠢了，无法利用从操作系统获取已经归零的页面，就像它使用calloc或等效物一样。如果库为清零内存提供new/delete兼容的分配器，这是可能的。

C++new/delete是残缺的 vs. malloc/free/calloc/realloc。我不知道为什么 ISO C++省略了 calloc 和 realloc：两者都对于大型分配非常有用，尤其是 realloc 用于调整可复制对象的 std：：vector 的大小，这些对象可能有空间在不复制的情况下增加其映射。但是由于new/delete不能保证与malloc/free兼容，并且new是可替换的，因此即使在引擎盖下，库也不能非常轻松地使用calloc和realloc。

另一个因素：只读使页面 CoW 映射到相同的物理零页面

当延迟分配由读取(而不是写入)触发时，它读取为零。 (BSS 页面读取为零，mmap(MAP_ANONYMOUS)的新页面读取为全零。

连接硬件页表的(软)页面错误处理程序不需要实际分配物理页面(又名页面框架)来支持该虚拟页面。相反，Linux 将干净(未写入)的匿名页面映射到单个物理清零页面。 (这适用于所有任务。

如果我们对数组进行多次传递，这会导致奇怪的情况，我们可以得到 TLB 未命中但 L1d 或 L3 命中(取决于是否巨大页面)，因为我们有多个虚拟页面指向相同的物理位置。

(一些CPU，例如AMD Ryzen，在L1d缓存中使用微标记来保存，代价是即使相同的内存映射到多个虚拟地址，缓存也只能命中一个虚拟地址。英特尔CPU使用真正的VIPT L1d缓存，真的可以得到这个效果)，

我为 Linux 制作了一个测试程序，它将使用madvise(MADV_HUGEPAGE)(鼓励内核对大页面进行碎片整理)或madvise(MADV_NOHUGEPAGE)(即使在只读情况下也禁用大页面)。

出于某种原因，Linux BSS 页面在编写它们时不使用大页面。只有阅读它们才会使用 2M 大页面(对于 L1d 或 L2 来说太大了，但确实适合 L3。但我们确实得到了所有的TLB点击)。在/proc/PID/smaps中很难看到这一点，因为未写入的记忆根本不显示为"常驻"。 (请记住，它由系统范围的零共享区域物理支持)。

根据命令行参数，我对基准代码进行了一些更改，以便在读取或写入数组的 init 传递后多次重新运行 sum 循环。重复循环使它运行的时间更长，因此我们可以获得更精确的计时，并摊销 init，以便我们从 perf 中获得有用的结果。

#include <vector>
#include <array>
#include <stdio.h>
#include <chrono>
#include <sys/mman.h>
using namespace std;
constexpr int n = 100'000'000;
//vector<int> v(n);
alignas(4096) array<int, n> v;
//template<class T>
__attribute__((noinline))
int toucharray(volatile int *vv, int write_init) {
int res=vv[0];
for(int i=32 ; i<n ; i+=128)
if(write_init)
vv[i] = 0;
else
res += vv[i];
//      volatile int sum = res;  // noinline is fine, we don't need to stop multiple calls from CSEing
return res;
}
template <class T>
__attribute__((noinline,noclone))
int sum_container(T &vv) {
unsigned int res=0;
for(int x : vv)
res += x;
__attribute__((used)) static volatile int sink;
sink = res;  // a side-effect stops IPA from deciding that this is a pure function
return res;
}
int main(int argc, char**argv)
{
int write_init = 0;
int hugepage = 0;
if (argc>1) {
hugepage = argv[1][0] & 1;
write_init = argv[1][0] & 2;
}
int repcount = 1000;
if (argc>2)
repcount = atoi(argv[2]);
// TODO: option for no madvise.
madvise(v.data(), n*sizeof(v[0]), MADV_SEQUENTIAL);
madvise(v.data(), n*sizeof(v[0]), hugepage ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);  
madvise(v.data(), n*sizeof(v[0]), MADV_WILLNEED); 
// SEQ and WILLNEED probably only matter for file-backed mappings to reduce hard page faults.
//  Probably not encouraging faultahead / around for lazy-allocation soft page fault
toucharray(v.data(), write_init);
int res = 0;
auto start = chrono::steady_clock::now();
for(int i=0; i<repcount ; i++)
res = sum_container(v);
auto end = chrono::steady_clock::now();
double elapsed =
std::chrono::duration_cast<
std::chrono::duration<double, std::milli>
>(end - start).count();
printf("result: %dntime: %fn", res, elapsed);
}

最好的情况：clang++ -O3 -march=native (Skylake) 实际上使用多个累加器展开，不像 gcc -funroll-loops 那样做一个愚蠢的工作。

在我的 Skylake i7-6700k 上，配备 DDR4-2666 DRAM，配置为 4.2GHz 最大睿频和调速器=性能 -

# using std::array<int,n>
# 0&1 = 0 -> MADV_NOHUGEPAGE.  0&2 = 0 -> read-only init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 0 1000
result: 0
time: 1961.952394
Performance counter stats for './touchpage-array-madv-nohuge-argc.clang 0 1000':
2,017.34 msec task-clock:u              #    1.000 CPUs utilized          
50      context-switches          #    0.025 K/sec                  
0      cpu-migrations            #    0.000 K/sec                  
97,774      page-faults               #    0.048 M/sec                  
8,287,680,837      cycles                    #    4.108 GHz                    
14,500,762,859      instructions              #    1.75  insn per cycle         
13,688      mem_load_retired.l2_hit:u #    0.007 M/sec                  
12,501,329,912      mem_load_retired.l1_hit:u # 6196.927 M/sec                  
144,559      mem_inst_retired.stlb_miss_loads:u #    0.072 M/sec                  
2.017765632 seconds time elapsed
1.979410000 seconds user
0.036659000 seconds sys

请注意相当多的 TLB 未命中(mem_inst_retired.stlb_miss_loads:u在用户空间中计算第 2 级 TLB 未命中)。和 97k 页面错误。这几乎与覆盖 100M * 4 = 400MB 阵列所需的 4k 页面数量完全相同，因此我们每页有 1 个错误，没有预故障/故障。

幸运的是，Skylake有两个页面浏览单元，因此它可以并行执行两个推测性页面漫游。此外，所有数据访问都在 L1d 中命中，因此页表至少在 L2 中将保持热，从而加快了页面浏览速度。

# using array
# MADV_HUGEPAGE,  read-only init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 1 1000
result: 0
time: 5947.741408
Performance counter stats for './touchpage-array-argc.clang 1 1000':
5,951.40 msec task-clock:u              #    1.000 CPUs utilized          
9      context-switches          #    0.002 K/sec                  
0      cpu-migrations            #    0.000 K/sec                  
687      page-faults               #    0.115 K/sec                  
24,377,094,416      cycles                    #    4.096 GHz                    
14,397,054,228      instructions              #    0.59  insn per cycle         
2,183,878,846      mem_load_retired.l2_hit:u #  366.952 M/sec                  
313,684,419      mem_load_retired.l1_hit:u #   52.708 M/sec                  
13,218      mem_inst_retired.stlb_miss_loads:u #    0.002 M/sec                  
5.951530513 seconds time elapsed
5.944087000 seconds user
0.003284000 seconds sys

请注意 ~1/10 TLB 未命中，但在相同的 ~12G mem 负载中，只有 2G 在 L2 中命中，这可能要归功于成功的硬件预取。 (其余的确实在L3中击中。而且我们只有 687 个页面错误;错误和巨大页面的结合使这更加高效。

请注意，由于 L3 带宽的瓶颈，所花费的时间要高出 3 倍。

数组的写入为我们提供了两全其美的最糟糕的情况：

# using array
# MADV_HUGEPAGE (no apparent effect on BSS)  and write-init
taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 3 1000
result: 0
time: 16510.222762
Performance counter stats for './touchpage-array-argc.clang 3 1000':
17,143.35 msec task-clock:u              #    1.000 CPUs utilized          
341      context-switches          #    0.020 K/sec                  
0      cpu-migrations            #    0.000 K/sec                  
95,218      page-faults               #    0.006 M/sec                  
70,475,978,274      cycles                    #    4.111 GHz                    
17,989,948,598      instructions              #    0.26  insn per cycle         
634,015,284      mem_load_retired.l2_hit:u #   36.983 M/sec                  
107,041,744      mem_load_retired.l1_hit:u #    6.244 M/sec                  
37,715,860      mem_inst_retired.stlb_miss_loads:u #    2.200 M/sec                  
17.147615898 seconds time elapsed
16.494211000 seconds user
0.625193000 seconds sys

很多页面错误。还有更多的TLB错过。

std：：vector 版本与数组基本相同：

strace表明 madvice 不起作用，因为我没有对齐指针。 glibc/libstdc++new倾向于返回一个页面对齐 + 16 的指针，前 16 个字节中有分配器簿记。对于数组，我使用alignas(4096)来确保我可以将其传递给 madvise。

madvise(0x7f760d133010, 400000000, MADV_HUGEPAGE) = -1 EINVAL (Invalid argument)

所以无论如何，使用我的内核调整设置，它只尝试对 madvise 上的大页面进行碎片整理内存，内存是相当碎片化的 ATM。所以它最终没有使用任何巨大的页面。

taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-vector-argv.clang 3 1000
result: 0
time: 16020.821517
Performance counter stats for './touchpage-vector-argv.clang 3 1000':
16,159.19 msec task-clock:u              #    1.000 CPUs utilized          
17      context-switches          #    0.001 K/sec                  
0      cpu-migrations            #    0.000 K/sec                  
97,771      page-faults               #    0.006 M/sec                  
66,146,780,261      cycles                    #    4.093 GHz                    
15,294,999,994      instructions              #    0.23  insn per cycle         
217,426,277      mem_load_retired.l2_hit:u #   13.455 M/sec                  
842,878,166      mem_load_retired.l1_hit:u #   52.161 M/sec                  
1,788,935      mem_inst_retired.stlb_miss_loads:u #    0.111 M/sec                  
16.160982779 seconds time elapsed
16.017206000 seconds user
0.119618000 seconds sys

我不确定为什么 TLB 未命中率比 THP 只读测试高得多。也许通过接触更多内存来争用内存访问和/或逐出缓存页表最终会减慢页面浏览速度，因此 TLB 预取无法跟上。

在 ~12G 负载中，硬件预取能够使大约 1G 的负载在 L1d 或 L2 缓存中命中。

另一个因素：只读使页面 CoW 映射到相同的物理零页面

std：：vector 版本与数组基本相同：

相关内容

最新更新

热门标签：