c - How to optimize cache performance

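I have written a piece of C code using arrays to understand the cache behavior on my Intel i7 8750 with L1d = 32k, L2 = 258k, L3: 912k, a line size of 64 bytes, and a set size of 8. I am trying to understand the output I get from my code. If LRU is the cache replacement policy, what else can I do in my code to make sure I get the fewest cache misses?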

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>
#include <sys/types.h>   /* u_int64_t */
#define BILLION 1000000000L

/* One record is exactly one 64-byte cache line. */
struct student
{
    char name[64];
};

int main(int argc, char* argv[])
{
    int m;
    char* n;
    char mn[64];
    u_int64_t diff;
    struct timespec start, end;

    /* Number of records, taken from the first command-line argument. */
    m = strtol(argv[1], &n, 0);
    struct student* arr_student = malloc(m * sizeof(struct student));
    for (u_int64_t i = 0; i < m; i++)
    {
        strcpy(arr_student[i].name, "abc");
    }
    /* 100 runs to ensure cache warmup and linear access time calculation */
    for (int j = 0; j < 100; j++) {
        /* start and end are overwritten on every run, so only the last run is reported. */
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
        for (u_int64_t i = 0; i < m; i += 8) {
            strcpy(mn, arr_student[i].name);
            if (i < (m - 8)) {
                strcpy(mn, arr_student[i + 1].name);
                strcpy(mn, arr_student[i + 2].name);
                strcpy(mn, arr_student[i + 3].name);
                strcpy(mn, arr_student[i + 4].name);
                strcpy(mn, arr_student[i + 5].name);
                strcpy(mn, arr_student[i + 6].name);
                strcpy(mn, arr_student[i + 7].name);
            }
        }
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    }
    diff = BILLION * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
    printf("Time taken for linear read operation only: %llu nanoseconds\n", (long long unsigned int) diff / 8);
    free(arr_student);
    return 0;
}

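I see a trend that, as the array size grows, the loop with a stride of 8 takes more and more time to execute. I expected it to stay constant and to increase only when the CPU has to look in L2, that is, when the array size grows beyond what L1 can hold. I expected to see a result like this: https://www.google.com/search?q=cache+性能+趋势+l1+l2&rlz=1C1GCEA_enUS831US831&source=lnms&tbm=isch&sa=X&ved=0ahUKEwi9jqqApYrgAhXYFjQIHR39BtwQ_AUIDygC&biw=1280&bih=913#imgrc=5JVNAazx3drZvM: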

Why do I get the opposite trend when I divide diff by m? I don't understand this trend.

Please help?

Here are some useful tips on memory alignment and code optimization:

The Lost Art of Structure Packing
Optimization of Computer Programs in C

Generally speaking, code optimization is a matter of time and experience.
