OpenMP reduction inside a parallel region

How can I perform an OpenMP reduction (a sum) inside a parallel region? The result is needed only on the master thread.

Algorithm prototype:

#pragma omp parallel
{
    int t = omp_get_thread_num();
    while (iterate)
    {
        float f = get_local_result(t);
        // fsum is required on master only
        float fsum = /* ? sum of f over all threads */;
        if (t == 0)
            MPI_Bcast(&fsum, ...);
    }
}

If I put the OpenMP region inside the while (iterate) loop instead, the parallel-region overhead on every iteration degrades performance...

Here is the simplest approach:

float sharedFsum = 0.f;
#pragma omp parallel
{
    const int t = omp_get_thread_num();
    while (iteration_condition)
    {
        float f = get_local_result(t);
        // Manual reduction: atomically add this thread's value
        #pragma omp atomic
        sharedFsum += f;
        // Ensure the reduction is complete before the master reads it
        #pragma omp barrier
        #pragma omp master
        {
            MPI_Bcast(&sharedFsum, ...);
            // sharedFsum must be reinitialized for the next iteration
            sharedFsum = 0.f;
        }
        // Ensure no other thread updates sharedFsum during the MPI_Bcast
        #pragma omp barrier
    }
}

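If you have many threads (for example, hundreds), the atomic operations can be expensive. A better approach is to let the runtime perform the reduction for you. Here is a better version: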

float sharedFsum = 0.f;
#pragma omp parallel
{
    const int threadCount = omp_get_num_threads();
    while (iteration_condition)
    {
        // Execute get_local_result on each thread and
        // perform the reduction into sharedFsum
        #pragma omp for reduction(+:sharedFsum) schedule(static,1)
        for (int i = 0; i < threadCount; ++i)
            sharedFsum += get_local_result(i);
        // The implicit barrier at the end of the loop guarantees
        // that sharedFsum is complete at this point
        #pragma omp master
        {
            MPI_Bcast(&sharedFsum, ...);
            // sharedFsum must be reinitialized for the next iteration
            sharedFsum = 0.f;
        }
        // Ensure no other thread updates sharedFsum during the MPI_Bcast
        #pragma omp barrier
    }
}
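
To try the runtime-reduction pattern without an MPI installation, a minimal sketch could look like the following. It is only an illustration under stated assumptions: get_local_result is a hypothetical stand-in that returns i + 1, the MPI_Bcast call is replaced by copying the sum into lastSum, and a fixed iteration count replaces iteration_condition. The #ifdef fallback only lets the file build when OpenMP is disabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP. */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical stand-in for get_local_result: small exact values. */
static float get_local_result(int i) { return (float)(i + 1); }

/* Performs `iterations` reductions inside ONE parallel region.
   Returns the last sum seen by the master thread and stores the
   team size in *nthreads. */
float reduce_in_region(int iterations, int *nthreads)
{
    assert(iterations > 0 && nthreads != 0);
    float sharedFsum = 0.f;
    float lastSum = 0.f;
    int team = 1;
    #pragma omp parallel
    {
        const int threadCount = omp_get_num_threads();
        for (int k = 0; k < iterations; ++k)
        {
            /* One loop iteration per thread; the runtime combines the
               per-thread partial sums into sharedFsum. */
            #pragma omp for reduction(+:sharedFsum) schedule(static,1)
            for (int i = 0; i < threadCount; ++i)
                sharedFsum += get_local_result(i);
            /* Implicit barrier above: the sum is complete here. */
            #pragma omp master
            {
                lastSum = sharedFsum;   /* MPI_Bcast would go here */
                team = threadCount;
                sharedFsum = 0.f;       /* reset for the next iteration */
            }
            /* Keep other threads away from sharedFsum until the reset is done. */
            #pragma omp barrier
        }
    }
    *nthreads = team;
    return lastSum;
}
```

With n threads, every iteration should produce n(n+1)/2. With GCC or Clang, OpenMP is enabled with the -fopenmp flag.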

Side notes:

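t is not protected in your code: add private(t) to the #pragma omp parallel directive to avoid undefined behavior caused by a race condition. Alternatively, declare t inside the parallel region so that it is a scoped, per-thread variable.
#pragma omp master should be preferred over a condition on the thread ID.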
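
Both side notes can be illustrated with a short sketch. The function name master_thread_id is hypothetical and exists only for this example; the #ifdef fallback only lets the file build when OpenMP is disabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Returns the thread ID observed inside the master block. */
int master_thread_id(void)
{
    int seen = -1;   /* shared: written by the master thread only */
    #pragma omp parallel
    {
        /* Declared inside the region, so every thread gets its own copy.
           This has the same effect as declaring t outside the region
           and adding private(t) to the directive. */
        const int t = omp_get_thread_num();

        /* Preferred over "if (t == 0)": the intent is explicit. */
        #pragma omp master
        seen = t;
    }
    assert(seen == 0);   /* the master thread always has ID 0 */
    return seen;
}
```

If t were instead declared before the parallel directive without private(t), all threads would write to the same variable at once, which is the race described above.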

"the parallel-region overhead on every iteration degrades performance..."

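Most of the time, this is caused by (implicit) synchronization or communication, or by work imbalance. The code above may suffer from the same problem, because it is fully synchronous. If it makes sense in your application, you can make it less synchronous (and therefore possibly faster) by removing or moving the barriers, depending on the relative speed of MPI_Bcast and get_local_result. However, doing this correctly is far from easy. One way to do it is to use OpenMP tasks and multi-buffering.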
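
As a rough illustration of the buffering idea (without tasks), the sketch below applies double buffering to the atomic version. Iteration k accumulates into buf[k % 2], so threads that have passed the barrier can start iteration k + 1 on the other buffer while the master is still consuming and resetting the current one; this removes the second barrier. Everything here is an assumption made for the example: local_result is a hypothetical stand-in for get_local_result, the MPI_Bcast call is replaced by a store into out[k], and the iteration count is fixed.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define ITERATIONS 8

/* Hypothetical stand-in for get_local_result: small exact values. */
static float local_result(int t, int k) { return (float)((t + 1) * (k + 1)); }

/* Stores the sum of every iteration in out[] and returns the team size. */
int run_double_buffered(float out[ITERATIONS])
{
    assert(out != 0);
    float buf[2] = {0.f, 0.f};   /* shared accumulators */
    int nthreads = 1;
    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        #pragma omp master
        nthreads = omp_get_num_threads();
        for (int k = 0; k < ITERATIONS; ++k)
        {
            const int cur = k % 2;
            const float f = local_result(t, k);
            #pragma omp atomic
            buf[cur] += f;
            /* Single barrier per iteration: buf[cur] is complete after it. */
            #pragma omp barrier
            #pragma omp master
            {
                out[k] = buf[cur];   /* MPI_Bcast would go here */
                /* buf[cur] is next written in iteration k + 2, which the
                   other threads can only reach after the master has passed
                   the barrier of iteration k + 1, so this reset is safe. */
                buf[cur] = 0.f;
            }
        }
    }
    return nthreads;
}
```

With n threads, out[k] should equal (k + 1) * n(n+1)/2. Whether this is actually faster depends on how long the master spends in the broadcast compared with the time the other threads spend in get_local_result.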
