I am testing memory bandwidth on a desktop and a server.
Skylake desktop: 4 cores / 8 hardware threads
Skylake server: Xeon 8168, dual socket, 48 cores (24 per socket) / 96 hardware threads
The peak bandwidths of the two systems are
Peak bandwidth desktop = 2-channels*8*2400 = 38.4 GB/s
Peak bandwidth server = 6-channels*2-sockets*8*2666 = 255.94 GB/s
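The arithmetic above can be written as a small helper. This is only a sketch: the function name `peak_bandwidth_gbs` is made up for illustration, and it assumes each memory channel is 64 bits (8 bytes) wide, which is where the factor of 8 in the formulas comes from.

```c
#include <assert.h>

/* Theoretical peak DRAM bandwidth in GB/s.
   channels: total number of memory channels across all sockets
   mts:      transfer rate in MT/s (2400 for DDR4-2400, 2666 for DDR4-2666)
   Assumption: every channel moves 8 bytes (64 bits) per transfer. */
double peak_bandwidth_gbs(int channels, double mts) {
    assert(channels > 0 && mts > 0.0);
    const double bytes_per_transfer = 8.0;
    return channels * bytes_per_transfer * mts * 1e-3; /* MB/s -> GB/s */
}
```

Calling `peak_bandwidth_gbs(2, 2400)` gives 38.4 for the desktop, and `peak_bandwidth_gbs(6 * 2, 2666)` gives about 255.94 for the dual-socket server, matching the figures above.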
I am using my own triad function to measure the bandwidth (full code below):
void triad(double *a, double *b, double *c, double scalar, size_t n) {
    #pragma omp parallel for
    for(size_t i=0; i<n; i++) a[i] = b[i] + scalar*c[i];
}
Here are the results I get. Thread counts in parentheses are for the server.
Bandwidth (GB/s)
threads   Desktop   Server
1            28        16
2(24)        29       146
4(48)        25       177
8(96)        24       189
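To put these numbers next to the theoretical peaks, each measurement can be expressed as a percentage of peak. The helper name `percent_of_peak` below is hypothetical and only illustrates the division.

```c
#include <assert.h>

/* Measured bandwidth as a percentage of the theoretical peak. */
double percent_of_peak(double measured_gbs, double peak_gbs) {
    assert(peak_gbs > 0.0);
    return 100.0 * measured_gbs / peak_gbs;
}
```

With the peaks of 38.4 GB/s and 255.94 GB/s, one thread reaches roughly 73% of peak on the desktop (28 / 38.4) but only about 6% on the server (16 / 255.94), while 96 threads on the server reach roughly 74% (189 / 255.94).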
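For 1 thread, I don't understand why the desktop is so much faster than the server. According to this answer, https://stackoverflow.com/a/18159503/2542702, SSE is sufficient to get the full bandwidth of a dual-channel system. This is what I observe on the desktop: two threads help only slightly, and 4 and 8 threads give worse results. On the server, however, the single-threaded bandwidth is much lower. Why is this?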
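On the server I get the best result using 96 threads. I would have expected the bandwidth to saturate with far fewer threads. Why does it take so many threads to saturate the bandwidth on the server? There is a large margin of error in my results, and I have not included error estimates. I took the best result from several runs.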
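The measurement procedure described here (fix a thread count, repeat the triad several times, keep the best run) could be sketched as follows. This is not the code that produced the table: the names `wall_time`, `best_triad_gbs`, and `sink` are hypothetical, plain `malloc` is used instead of `_mm_malloc`, and the thread count is set with the OpenMP `num_threads` clause. Without `-fopenmp` the pragmas are ignored and the loop runs on one thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Keeps the compiler from discarding the triad result. */
volatile double sink;

/* Wall-clock time in seconds; uses C11 timespec_get when OpenMP is off. */
double wall_time(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* Best triad bandwidth in GB/s over `repeats` runs using `threads` threads.
   Returns a negative value if the arrays cannot be allocated. */
double best_triad_gbs(size_t n, int threads, int repeats) {
    assert(n > 0 && threads > 0 && repeats > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double best = -1.0;
    if (a && b && c) {
        /* Initialize with the same thread count as the timed loop. */
        #pragma omp parallel for num_threads(threads)
        for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        for (int r = 0; r < repeats; r++) {
            double t = -wall_time();
            #pragma omp parallel for num_threads(threads)
            for (size_t i = 0; i < n; i++) a[i] = b[i] + 3.0 * c[i];
            t += wall_time();
            sink = a[n / 2];
            /* Same accounting as triad() in the full code: 4 doubles per element. */
            double gbs = 4.0 * sizeof(double) * (double)n * 1e-9 / t;
            if (gbs > best) best = gbs;
        }
    }
    free(a); free(b); free(c);
    return best;
}
```

Calling `best_triad_gbs(n, t, 5)` for each thread count `t` of interest (for example 1, 24, 48, and 96 on the server) would produce one column of a table like the one above. The arrays need to be much larger than the last-level cache for the result to reflect DRAM bandwidth.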
The code
//gcc -O3 -march=native triad.c -fopenmp
//gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 triad.c -fopenmp
#include <stdio.h>
#include <omp.h>
#include <x86intrin.h>
// Fill all three arrays in parallel.
void triad_init(double *a, double *b, double *c, double k, size_t n) {
    #pragma omp parallel for
    for(size_t i=0; i<n; i++) a[i] = k, b[i] = k, c[i] = k;
}
// Triad with regular stores.
void triad(double *a, double *b, double *c, double scalar, size_t n) {
    #pragma omp parallel for
    for(size_t i=0; i<n; i++) a[i] = b[i] + scalar*c[i];
}
// Triad with non-temporal (streaming) stores.
// Requires suitably aligned arrays and n divisible by the vector width (8 or 4 doubles).
void triad_stream(double *a, double *b, double *c, double scalar, size_t n) {
#if defined ( __AVX512F__ ) || defined ( __AVX512__ )
    __m512d scalarv = _mm512_set1_pd(scalar);
    #pragma omp parallel for
    for(size_t i=0; i<n/8; i++) {
        __m512d bv = _mm512_load_pd(&b[8*i]), cv = _mm512_load_pd(&c[8*i]);
        _mm512_stream_pd(&a[8*i], _mm512_add_pd(bv, _mm512_mul_pd(scalarv, cv)));
    }
#else
    __m256d scalarv = _mm256_set1_pd(scalar);
    #pragma omp parallel for
    for(size_t i=0; i<n/4; i++) {
        __m256d bv = _mm256_load_pd(&b[4*i]), cv = _mm256_load_pd(&c[4*i]);
        _mm256_stream_pd(&a[4*i], _mm256_add_pd(bv, _mm256_mul_pd(scalarv, cv)));
    }
#endif
}
int main(void) {
    size_t n = 1LL << 31LL; // 2^31 doubles = 16 GiB per array
    double *a = _mm_malloc(sizeof *a * n, 64), *b = _mm_malloc(sizeof *b * n, 64), *c = _mm_malloc(sizeof *c * n, 64);
    if(!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }
    //double peak_bw = 2*8*2400*1E-3; // desktop: 2 channels * 8 bytes/transfer * 2400 MT/s
    double peak_bw = 2*6*8*2666*1E-3; // server: 2 sockets * 6 channels * 8 bytes/transfer * 2666 MT/s
    double dtime, mem, bw;
    printf("peak bandwidth %.2f GB/s\n", peak_bw);
    triad_init(a, b, c, 3.14159, n);
    dtime = -omp_get_wtime();
    triad(a, b, c, 3.14159, n);
    dtime += omp_get_wtime();
    // 4 doubles per element: read b, read c, write a, plus the read of a's cache line before a regular store
    mem = 4*sizeof(double)*n*1E-9, bw = mem/dtime;
    printf("triad: %3.2f GB, %3.2f s, %8.2f GB/s, bw/peak_bw %8.2f %%\n", mem, dtime, bw, 100*bw/peak_bw);
    triad_init(a, b, c, 3.14159, n);
    dtime = -omp_get_wtime();
    triad_stream(a, b, c, 3.14159, n);
    dtime += omp_get_wtime();
    // 3 doubles per element: streaming stores write a without reading it first
    mem = 3*sizeof(double)*n*1E-9, bw = mem/dtime;
    printf("triads: %3.2f GB, %3.2f s, %8.2f GB/s, bw/peak_bw %8.2f %%\n", mem, dtime, bw, 100*bw/peak_bw);
    _mm_free(a); _mm_free(b); _mm_free(c);
    return 0;
}
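Hardware prefetchers are tuned differently on server CPUs than on workstation CPUs. Servers are expected to handle many threads, so the prefetcher requests smaller chunks from RAM. Here is a paper that goes into detail about the issue you are running into, but from the other side of the coin:
Hardware Prefetcher Aggressiveness Controllers: Do We Need Them All the Time?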