C: Why does using multiple threads make execution slower?

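I'm using a 2020 MacBook Air M1 (Apple M1 with a 7-core GPU, 8 GB of RAM).

The problem: I am comparing pairs of arrays, which takes about 11 minutes when executed sequentially. Strangely, the more threads I put to work, the longer it takes to finish (even without the mutex). So far I have tried running it with 2 threads and with 4 threads.

What could be the problem? I assumed that using 4 threads would be more efficient, since I have 7 cores available and the execution time seems (to me) long enough to make up for the overhead of managing multiple threads.

Here is the part of the code that I consider relevant to the question: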

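#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { xylen = 1024 };  /* in C, a const int cannot size a file-scope array */
static uint8_t voxelGroups[321536][xylen];
int threadCount = 4;

/* The next two declarations were not shown in the excerpt; their exact
   form (including the counter's type) is an assumption. */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long identicalVoxelGroupsCount = 0;

bool areVoxelGroupsIdentical(uint8_t firstArray[xylen], uint8_t secondArray[xylen]){
    return memcmp(firstArray, secondArray, xylen*sizeof(uint8_t)) == 0;
}

/* Thread t (1-based) handles rows t-1, t-1+threadCount, ... (interleaved). */
void* getIdenticalVoxelGroupsCount(void* threadNumber){
    /* Go through intptr_t: casting a pointer straight to int is not portable. */
    for(int i = (int)(intptr_t)threadNumber-1; i < 321536-1; i += threadCount){
        for(int j = i+1; j < 321536; j++){
            if(areVoxelGroupsIdentical(voxelGroups[i], voxelGroups[j])){
                /* Every match takes the shared lock. */
                pthread_mutex_lock(&mutex);
                identicalVoxelGroupsCount++;
                pthread_mutex_unlock(&mutex);
            }
        }
    }
    return 0;
}

int main(){
    pthread_t thread1, thread2, thread3, thread4;  /* assumed; not shown in the excerpt */
    // some code
    pthread_create(&thread1, NULL, getIdenticalVoxelGroupsCount, (void *)1);
    pthread_create(&thread2, NULL, getIdenticalVoxelGroupsCount, (void *)2);
    pthread_create(&thread3, NULL, getIdenticalVoxelGroupsCount, (void *)3);
    pthread_create(&thread4, NULL, getIdenticalVoxelGroupsCount, (void *)4);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_join(thread3, NULL);
    pthread_join(thread4, NULL);
    // some code
}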

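First, the lock serializes every increment of identicalVoxelGroupsCount. Using more threads will not speed this part up. On the contrary, it gets slower because of cache line bouncing: the cache lines holding the lock and the counter move serially from one core to another (see: cache coherence protocols). This is generally much slower than doing all the work sequentially, because moving a cache line from one core to another introduces considerable latency. You do not need a lock. Instead, you can increment a local variable and perform a single final reduction (for example, by updating an atomic variable at the end of getIdenticalVoxelGroupsCount).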
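As a rough illustration of the local-counter idea, here is a minimal sketch using pthreads and C11 atomics. It keeps the interleaved row assignment from the question so that only the counting changes. The names CountTask, count_worker, count_identical_pairs_parallel, GROUP_LEN and MAX_THREADS are hypothetical and not part of the original code, and the flat const uint8_t * layout is an assumption made to keep the example small.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GROUP_LEN 1024   /* plays the role of xylen in the question */
#define MAX_THREADS 16   /* arbitrary limit chosen for this sketch */

/* Hypothetical per-thread work descriptor. */
typedef struct {
    const uint8_t *groups;   /* group_count rows of GROUP_LEN bytes each */
    size_t group_count;
    size_t thread_index;     /* 0-based */
    size_t thread_count;
    atomic_size_t *total;    /* shared result, touched once per thread */
} CountTask;

static void *count_worker(void *arg) {
    const CountTask *task = arg;
    size_t local = 0;  /* private to this thread: no lock, no shared cache line */
    for (size_t i = task->thread_index; i + 1 < task->group_count;
         i += task->thread_count) {
        for (size_t j = i + 1; j < task->group_count; j++) {
            if (memcmp(task->groups + i * GROUP_LEN,
                       task->groups + j * GROUP_LEN, GROUP_LEN) == 0) {
                local++;
            }
        }
    }
    atomic_fetch_add(task->total, local);  /* single final reduction */
    return NULL;
}

static size_t count_identical_pairs_parallel(const uint8_t *groups,
                                             size_t group_count,
                                             size_t thread_count) {
    assert(thread_count > 0 && thread_count <= MAX_THREADS);
    pthread_t threads[MAX_THREADS];
    bool created[MAX_THREADS];
    CountTask tasks[MAX_THREADS];
    atomic_size_t total;
    atomic_init(&total, 0);

    for (size_t t = 0; t < thread_count; t++) {
        tasks[t] = (CountTask){groups, group_count, t, thread_count, &total};
        created[t] =
            pthread_create(&threads[t], NULL, count_worker, &tasks[t]) == 0;
        if (!created[t]) {
            count_worker(&tasks[t]);  /* fall back to running this share inline */
        }
    }
    for (size_t t = 0; t < thread_count; t++) {
        if (created[t]) {
            pthread_join(threads[t], NULL);
        }
    }
    return atomic_load(&total);
}
```

With the array from the question, a call could look like count_identical_pairs_parallel(&voxelGroups[0][0], 321536, 4). Each thread touches the shared counter exactly once, so the cache line holding it no longer bounces between cores on every match.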

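In addition, interleaving the loop iterations is inefficient, because most of the cache lines that hold voxelGroups end up shared between threads. This is less critical than the first point, since the threads only read those cache lines. Still, it can increase the required memory throughput and create a bottleneck. A more efficient approach is to split the iterations into relatively large contiguous blocks. Splitting the blocks further into medium-grained tiles to use the caches more efficiently may be even better (though this optimization is independent of the parallelization strategy).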
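One possible way to hand each thread a contiguous block of outer-loop rows is sketched below. RowRange, contiguous_rows and count_pairs_in_rows are hypothetical helpers, not part of the original code. Note that because the inner loop gets shorter as i grows, equally sized blocks do unequal amounts of work; this sketch makes no attempt to balance that, and it does not implement the tiling mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GROUP_LEN 1024   /* plays the role of xylen in the question */

/* Hypothetical half-open range [begin, end) of outer-loop rows. */
typedef struct {
    size_t begin;
    size_t end;
} RowRange;

/* Splits row_count rows into thread_count contiguous blocks whose sizes
   differ by at most one; the first (row_count % thread_count) blocks get
   the extra row. thread_index is 0-based. */
static RowRange contiguous_rows(size_t row_count, size_t thread_count,
                                size_t thread_index) {
    size_t base = row_count / thread_count;
    size_t extra = row_count % thread_count;
    RowRange r;
    r.begin = thread_index * base
            + (thread_index < extra ? thread_index : extra);
    r.end = r.begin + base + (thread_index < extra ? 1 : 0);
    return r;
}

/* Counts identical pairs (i, j) with i inside rows and i < j < group_count.
   Intended to be called by one thread for its own block. */
static size_t count_pairs_in_rows(const uint8_t *groups, size_t group_count,
                                  RowRange rows) {
    size_t local = 0;
    for (size_t i = rows.begin; i < rows.end; i++) {
        for (size_t j = i + 1; j < group_count; j++) {
            if (memcmp(groups + i * GROUP_LEN,
                       groups + j * GROUP_LEN, GROUP_LEN) == 0) {
                local++;
            }
        }
    }
    return local;
}
```

Each thread would call contiguous_rows with its own index, run count_pairs_in_rows on the result, and add the returned value to the shared total once at the end.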

Note that you can use OpenMP to do this kind of operation easily and efficiently in C.
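For reference, a minimal OpenMP version might look like the sketch below. The function name count_identical_pairs_omp and the flat array layout are assumptions for illustration. The reduction clause gives each thread a private counter and sums them at the end; the schedule(dynamic, 64) clause is one possible choice (contiguous chunks of 64 rows handed out on demand), not something prescribed by the answer. The code needs a compiler with OpenMP enabled (for example -fopenmp with GCC); without it the pragma is ignored and the loop simply runs sequentially.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GROUP_LEN 1024   /* plays the role of xylen in the question */

/* Counts pairs of identical rows in a flat array of group_count rows. */
static size_t count_identical_pairs_omp(const uint8_t *groups,
                                        size_t group_count) {
    size_t count = 0;
    /* Each thread accumulates into its own private copy of count;
       OpenMP adds the copies together when the loop finishes. */
    #pragma omp parallel for reduction(+:count) schedule(dynamic, 64)
    for (size_t i = 0; i < group_count; i++) {
        for (size_t j = i + 1; j < group_count; j++) {
            if (memcmp(groups + i * GROUP_LEN,
                       groups + j * GROUP_LEN, GROUP_LEN) == 0) {
                count++;
            }
        }
    }
    return count;
}
```

With the array from the question, the call would be count_identical_pairs_omp(&voxelGroups[0][0], 321536).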
