如何优化n-queens OpenMP并行程序



我正在使用OpenMP并行处理n-queens问题,但我的顺序程序也同样快。我一直在尝试使用num_threads,但我认为我做得不对。

有人能看看我的代码,告诉我我做错了什么,或者给我一些提示吗?非常感谢。

这是我的并行程序:

// Parallel version of the N-Queens problem.

#include <iostream>  
#include <omp.h>
#include <time.h>
#include <sys/time.h>
// Timing execution
double startTime, endTime;
// Number of solutions found
int numofSol = 0;
// Board size and number of queens
#define N 11
void placeQ(int queens[], int row, int column) {

for(int i = 0; i < row; i++) {
// Vertical
if (queens[i] == column) {
return;
}

// Two queens in the same diagonal
if (abs(queens[i] - column) == (row-  i))  {
return;
}
}

// Set the queen
queens[row] = column;

if(row == N-1) {

#pragma omp atomic 
numofSol++;  //Placed the final queen, found a solution

#pragma omp critical
{
std::cout << "The number of solutions found is: " << numofSol << std::endl; 
for (int row = 0; row < N; row++) {
for (int column = 0; column < N; column++) {
if (queens[row] == column) {
std::cout << "X";
}
else {
std::cout << "|";
}
}
std::cout  << "n"  << std::endl; 
}
}
}

else {

// Increment row
for(int i = 0; i < N; i++) {
placeQ(queens, row + 1, i);
}
}
} // End of placeQ()
void solve() {
#pragma omp parallel num_threads(30)
#pragma omp single
{
for(int i = 0; i < N; i++) {
// New task added for first row and each column recursion.
#pragma omp task
{ 
placeQ(new int[N], 0, i);
}
}
}
} // end of solve()
int main(int argc, char*argv[]) {
startTime = omp_get_wtime();   
solve();
endTime = omp_get_wtime();

// Print board size, number of solutions, and execution time. 
std::cout << "Board Size: " << N << std::endl; 
std::cout << "Number of solutions: " << numofSol << std::endl; 
std::cout << "Execution time: " << endTime - startTime << " seconds." << std::endl; 

return 0;
}

程序95%以上的执行时间都花在了控制台中打印字符串上,并且此操作被放在关键部分中,因此一次只能有一个线程执行。IO操作和关键部分的开销会随着使用的线程数量而增加。因此,顺序执行时间比并行执行时间好。

实际上,更准确地说,不是打印速度慢,而是由std::endl引起的与控制台的同步隐式执行std::flush,以及字符串格式化。因此,为了解决这个问题,您可以并行地准备一个线程本地字符串(std::ostringstream对此很好(。然后可以将本地字符串附加到一个大的全局字符串上,并且可以在主线程中顺序打印其内容(以防止并行IO引起的任何额外开销(并在定时部分之外打印。

除此之外,您有11个任务,并且您在代码中为此创建了30个线程,而您可能只有不到30个内核(甚至30个硬件线程(。创建过多的线程代价高昂(主要是由于线程抢占/调度(。此外,指定程序中的线程数通常是一种不好的做法。请使用可移植环境变量OMP_NUM_THREADS

以下是考虑到上述备注的代码:

// Parallel version of the N-Queens problem.
#include <iostream>  
#include <omp.h>
#include <time.h>
#include <sys/time.h>
#include <sstream>
// Timing execution
double startTime, endTime;
// Number of solutions found
int numofSol = 0;
std::ostringstream globalOss;
// Board size and number of queens
#define N 11
void placeQ(int queens[], int row, int column) {

for(int i = 0; i < row; i++) {
// Vertical
if (queens[i] == column) {
return;
}

// Two queens in the same diagonal
if (abs(queens[i] - column) == (row-  i))  {
return;
}
}

// Set the queen
queens[row] = column;

if(row == N-1) {

#pragma omp atomic 
numofSol++;  //Placed the final queen, found a solution

std::ostringstream oss;
oss << "The number of solutions found is: " << numofSol << std::endl; 
for (int row = 0; row < N; row++) {
for (int column = 0; column < N; column++) {
if (queens[row] == column) {
oss << "X";
}
else {
oss << "|";
}
}
oss  << std::endl << std::endl; 
}
#pragma omp critical
globalOss << oss.str();
}

else {

// Increment row
for(int i = 0; i < N; i++) {
placeQ(queens, row + 1, i);
}
}
} // End of placeQ()
void solve() {
#pragma omp parallel //num_threads(30)
#pragma omp single
{
for(int i = 0; i < N; i++) {
// New task added for first row and each column recursion.
#pragma omp task
{ 
placeQ(new int[N], 0, i);
}
}
}
} // end of solve()
int main(int argc, char*argv[]) {
startTime = omp_get_wtime();   
solve();
endTime = omp_get_wtime();
std::cout << globalOss.str();

// Print board size, number of solutions, and execution time. 
std::cout << "Board Size: " << N << std::endl; 
std::cout << "Number of solutions: " << numofSol << std::endl; 
std::cout << "Execution time: " << endTime - startTime << " seconds." << std::endl; 

return 0;
}

以下是在我的机器上产生的执行时间:

Time of the reference implementation (30 threads): 0.114309 s
Optimized implementation:
1 thread: 0.018634 s (x1.00)
2 thread: 0.009978 s (x1.87)
3 thread: 0.006840 s (x2.72)
4 thread: 0.005766 s (x3.23)
5 thread: 0.004941 s (x3.77)
6 thread: 0.003963 s (x4.70)

如果你想要更快的并行代码,你可以:

  • 向OpenMP提供更多任务(以改进负载平衡的工作(,但不会太多(由于每个任务的开销(
  • 减少(隐式(分配的数量
  • numofSol上执行线程本地精简,并且每个任务只使用一个原子更新

最新更新