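OpenMP slows down an unrelated serial loop

I have two unrelated for loops: one runs serially and the other runs in parallel with OpenMP.

The more OpenMP threads I use, the slower the serial code that follows becomes.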

#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>
#include <omp.h>

class Foo {
public:
    Foo(size_t size) {
        parallel_vector.resize(size, 0.0);
        serial_vector.resize(size, 0.0);
    }
    // Fills serial_vector with random numbers on the calling thread only.
    void do_serial_work() {
        std::mt19937 random_number_generator;
        std::uniform_real_distribution<double> random_number_distribution{ 0.0, 1.0 };
        for (size_t i = 0; i < serial_vector.size(); i++) {
            serial_vector[i] = random_number_distribution(random_number_generator);
        }
    }
    // Updates parallel_vector with OpenMP; never touches serial_vector.
    void do_parallel_work() {
        // OpenMP 2.0 (MSVC) requires a signed loop index, so convert the size once.
        const int size = static_cast<int>(parallel_vector.size());
        #pragma omp parallel for
        for (int i = 0; i < size; ++i) {
            for (int integration_steps = 0; integration_steps < 30; integration_steps++) {
                parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
            }
        }
    }
private:
    std::vector<double> parallel_vector;
    std::vector<double> serial_vector;
};

void test_with_size(size_t size, int num_threads) {
    std::cout << "Testing with " << num_threads << " and size: " << size << "\n";
    omp_set_num_threads(num_threads);
    Foo foo{ size };
    long long total_dur_1 = 0;  // accumulated serial time in microseconds
    long long total_dur_2 = 0;  // accumulated parallel time in microseconds
    for (auto i = 0; i < 500; i++) {
        const auto tp_1 = std::chrono::high_resolution_clock::now();
        foo.do_serial_work();

        const auto tp_2 = std::chrono::high_resolution_clock::now();
        foo.do_parallel_work();
        const auto tp_3 = std::chrono::high_resolution_clock::now();
        const auto dur_1 = std::chrono::duration_cast<std::chrono::microseconds>(tp_2 - tp_1).count();
        const auto dur_2 = std::chrono::duration_cast<std::chrono::microseconds>(tp_3 - tp_2).count();
        total_dur_1 += dur_1;
        total_dur_2 += dur_2;
    }
    std::cout << total_dur_1 << "\t" << total_dur_2 << "\n";
}

int main(int argc, char** argv) {
    test_with_size(100000, 1);
    test_with_size(100000, 2);
    test_with_size(100000, 4);
    test_with_size(100000, 8);
    return 0;
}

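The slowdown occurs on my local machine, a Win10 laptop with a 4-core, hyperthreaded Intel Core i7-7700 and 24 GB RAM. The compiler is the latest one shipped with Visual Studio 2019. The build uses RelWithDebInfo mode (from CMake; it includes /O2 and /openmp).

It does not happen on a more powerful machine: CentOS 8 with 2x Intel Xeon Platinum 9242, 48 cores each, no hyperthreading, 769 GB RAM. The compiler is gcc/8.3.1, and the code is compiled with g++ --std=c++17 -O3 -fopenmp.

Timings on the Win10 i7-7700 (total serial time, then total parallel time, in microseconds over 500 iterations):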

Testing with 1 and size: 100000
3043846 10536315
Testing with 2 and size: 100000
3276611 5350204
Testing with 4 and size: 100000
3937311 2735655
Testing with 8 and size: 100000
5002727 1598775

And on CentOS 8, 2x Xeon Platinum 9242:

Testing with 1 and size: 100000
727756  4111363
Testing with 2 and size: 100000
731649  2069257
Testing with 4 and size: 100000
734019  1056157
Testing with 8 and size: 100000
752584  544373

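So my first thought was "too much cache pressure". However, the slowdown still occurred after I removed almost everything from the parallel section except the loop itself.

Updated parallel section, with the work taken out: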

void do_parallel_work() {
    #pragma omp parallel for
    for (auto i = 0; i < 8; ++i) {
        //for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
        //    parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
        //}
    }
}

Timings on Win10 with the updated parallel section:

Testing with 1 and size: 100000
3206293 636
Testing with 2 and size: 100000
3218667 2672
Testing with 4 and size: 100000
3928818 8689
Testing with 8 and size: 100000
5106605 10797

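Looking at the OpenMP 2.0 standard (VS only supports 2.0) (found here: https://www.openmp.org/specifications/), section 2.7.2.5, lines 7-8, says:

In the absence of an explicit default clause, the default behavior is the same as if default(shared) were specified.

And section 2.7.2.4, line 30:

All threads within a team access the same storage area for shared variables.

To me, this rules out each OpenMP thread getting its own copy of serial_vector, which was the last explanation I could think of.

I'd be glad to hear any explanation or discussion of this, even if I'm simply missing something obvious.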

Edit:

Out of curiosity, I also tested with WSL on my Win10 machine, running gcc/9.3.0. The timings are:

Testing with 1 and size: 100000
833678  2752
Testing with 2 and size: 100000
762877  1863
Testing with 4 and size: 100000
816440  1860
Testing with 8 and size: 100000
991184  2350

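Honestly, I'm not sure why the Windows executable takes so much longer than the Linux one on the same machine (/O2 is the maximum optimization level for VC++), but interestingly the same artifact does not show up here.

OpenMP on Windows uses a 200 ms spin-wait by default. This means that when you leave an omp block, all omp worker threads keep spinning while they wait for new work. That is beneficial if you have many omp blocks back to back. In your example, the threads just burn CPU while the serial loop runs, which likely takes core resources away from it.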

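To disable or control the spinning, you have several options:

Define the environment variable OMP_WAIT_POLICY and set it to PASSIVE to disable spinning entirely.
Switch to the Intel OMP runtime that ships with oneAPI. You can then control the spin time by defining the KMP_BLOCKTIME environment variable.
Install the Visual Studio 2019 preview (it should reach the official release soon) and use the LLVM OMP runtime. You can then also control the spin time by defining the KMP_BLOCKTIME environment variable.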
