c-为什么使用OpenMP的多线程嵌套for循环不会更快



我编写了一个具有以下5个参数的函数:

  1. 文本的行数组
  2. 一些单词的数组
  3. 线路数量(约40k)
  4. 字数(约4-5)
  5. 要创建的线程数(k,尝试了许多值使得2<k<20)

函数在for循环中搜索给定的单词,在每次迭代中扫描每行一个单词(outer for)。

内部for环路:

  • 循环通过线条
  • 检查该行是否包含给定的单词
  • 计算单词在该行中的出现次数
  • 如果在该行中找到单词,则打印一条消息
  • 增加单词在整个文本中的总出现次数,使其出现在行中

我首先将其作为串行程序编写,然后尝试使用OpenMP对其进行并行化,并比较分析全文所需的时间。不幸的是,与串行版本相比,我无法提高并行程序的时间。

为了测量时间,我使用以下功能:

double ClockCounter;
void start()
{
ClockCounter = omp_get_wtime();
}
void stop()
{
ClockCounter = omp_get_wtime() - ClockCounter;
printf("Elapsed time: %fn", ClockCounter);
}

以及文本分析功能:

void analyze_text(char **lines, char *words[], size_t nr_lines, int nr_words, int k) {
int i, total_word_count, word_occurence;
start();
#       pragma omp parallel for num_threads(k) 
default(none) shared(total_word_count, word_occurence) private(i, j, nr_words, nr_lines, word_occurence, counter)
for (i=0; i<nr_words; i++) {
total_word_count = 0;
size_t j;
printf("The word is: %s.nSEARCHING...n", words[i]);
//#             pragma omp parallel for num_threads(k) 
//              default(none) shared(total_word_count) private(j, nr_lines, word_occurence, counter)
for (j=0; j<nr_lines; j++) {
char *line = NULL;
line = (char *) malloc (strlen(lines[j]) + 1);
strcpy(line, lines[j]);
word_occurence = 0;
if (strstr(line, words[i]) != NULL) {
printf("Word found at line %ld, position(s): ", j);
char *found_word = NULL;
found_word = (char *) malloc (strlen(words[i]) +1);
char delim[] = " -";
char *ptr = strtok(line, delim);
int counter =  0;
while (ptr != NULL) {
counter++;
strncpy(found_word, ptr, strlen(words[i]));
found_word[strlen(found_word)] = '';
if (strcmp(words[i], found_word) == 0) {
word_occurence++;
printf("%d ", counter);
}
ptr =  strtok(NULL, delim);
}
printf("nline[%3zu]: %sn", j, lines[j]);
}
total_word_count += word_occurence;
}
printf("The word appears %d times in the text in total.nn", total_word_count);
}
stop();
}

正如您所看到的,我也只尝试并行化内部for循环,但也没有得到任何改进。由于嵌套的for循环有点混乱,我不知道在哪里以及如何声明parallel for指令,也不知道如何正确地将变量设置为shared或private。

顺序版本看起来完全相同,但没有声明omp指令。

时间(~=)与顺序版本:

Elapsed time: 0.025583

时间(~=)与并行版本(k=2,4,6,8,始终相同):

Elapsed time: 0.025690

有人能向我解释一下我在这里缺了什么吗?

TL;DR答案:您的代码有内存限制(读取或写入大量内存,实际上没有CPU密集型计算)。CPU很容易超过可用内存带宽,增加线程数量不会提高性能。这种情况可能发生在您的硬件上,令人惊讶的是,这种情况也会发生在编译器资源管理器上,但不会发生在我的计算机上。我真的很困惑。

详细答案:

您的代码有一些问题(数据争用、内存泄漏、变量在数据子句中多次出现等),所以我做了以下操作来修复这些错误并使您的代码更快:

  1. 在其所需的最小范围内使用了变量。它有助于避免数据竞争,并帮助编译器进行更好的优化
  2. 已更正OpenMP指令。在i循环之前,您必须添加以下行:
#pragma omp parallel for num_threads(k) default(none) shared(lines, words, nr_words, nr_lines)

另一种选择是并行化j循环。为了避免比赛条件,你必须使用减少(reduction(+:total_word_count)):

#pragma omp parallel for num_threads(k) default(none) shared(i,lines, words, nr_words, nr_lines) reduction(+:total_word_count)
  1. 删除了耗时的malloc,并将其替换为静态分配:
//Maximum length of a line - modify it as necessary and make sure that no lines are longer than this value
#define MAX_LINE_LEN 1000
...
char line[MAX_LINE_LEN]; 
....
char found_word[MAX_LINE_LEN]; //it may be smaller
  1. strcpy(line, lines[j]);在随后的if语句之后被移动。这相当耗时,并且只有在找到一个单词时才需要。所以
char *line = NULL;
line = (char *) malloc (strlen(lines[j]) + 1);
strcpy(line, lines[j]);
word_occurence = 0;
if (strstr(line, words[i]) != NULL) {

已更改为

int word_occurence = 0;
if (strstr(lines[j], words[i]) != NULL) {
char line[MAX_LINE_LEN];                        
strcpy(line, lines[j]);
...
//Note that the code below is not critical (rarely executed)...
  1. 将所有printf函数更改为
#ifdef USE_PRINTF
printf(...);
#endif

因此,很容易打开/关闭打印输出,也更容易测试其性能。请注意,如果使用了更多的线程,打印输出将被混合,因此您必须首先收集输出,然后再打印。

  1. 请注意,最好重写代码,首先在文本行上循环,然后在一行中搜索每个单词。在这种情况下,缓存利用率会更好,这将提高性能。

  2. 添加了一个创建文本的主要功能(通过从5个预定义行中随机添加行)来测试其性能

代码如下:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>
//Maximum length of a line - modify it as necessary and make sure that no lines are longer than this value
#define MAX_LINE_LEN 1000
//Uncomment this line to see printout
//#define USE_PRINTF
double ClockCounter;
void start()
{
ClockCounter = omp_get_wtime();
}
void stop()
{
ClockCounter = omp_get_wtime() - ClockCounter;
printf("Elapsed time: %fn", ClockCounter);
}

void analyze_text(char **lines, const char *words[], const size_t nr_lines, const int nr_words, const int k) {
start();
#pragma omp parallel for num_threads(k) default(none) shared(lines, words, nr_words, nr_lines) 
for (int i=0; i<nr_words; i++) {
int total_word_count = 0;
#ifdef USE_PRINTF
printf("The word is: %s.nSEARCHING...n", words[i]);
#endif
//#pragma omp parallel for num_threads(k) default(none) shared(i,lines, words, nr_words, nr_lines) reduction(+:total_word_count)
for (size_t j=0; j<nr_lines; j++) {
int word_occurence = 0;
if (strstr(lines[j], words[i]) != NULL) {
char line[MAX_LINE_LEN];                        
strcpy(line, lines[j]);
#ifdef USE_PRINTF
printf("Word found at line %ld, position(s): ", j);
#endif
char found_word[MAX_LINE_LEN]; 
char delim[] = " -";
char *ptr = strtok(line, delim);
int counter =  0;
while (ptr != NULL) {
counter++;
strncpy(found_word, ptr, strlen(words[i]));
found_word[strlen(found_word)] = '';
if (strcmp(words[i], found_word) == 0) {
word_occurence++;
#ifdef USE_PRINTF
printf("%d ", counter);
#endif
}
ptr =  strtok(NULL, delim);
}
#ifdef USE_PRINTF
printf("nline[%3zu]: %sn", j, lines[j]);
#endif
}
total_word_count += word_occurence;
}
#ifdef USE_PRINTF
printf("The word appears %d times in the text in total.nn", total_word_count);
#endif
}
stop();
}
int main(){    
const size_t nr_lines=40000;
char* lines[nr_lines];
//sample lines, the text have these lines in a random order
const char* samples[]={
"xxx yyy zzz qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
"one yyy zzz qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
"xxx two zzz qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
"xxx yyy three qqq sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
"xxx yyy zzz four sss dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt",
"xxx yyy zzz qqq five dddd aaaaa bbbbb cccc dd eeee fff ggg hh iii jjj kkkk llll mmm nnn oo pppp qqq rrrr ssss ttt"
};

for(size_t i=0;i<nr_lines;++i){
unsigned long int rnd=rand()%1000;
rnd = rnd >= sizeof(samples)/sizeof(samples[0]) ? 0 : rnd;
lines[i]=(char*) malloc(strlen(samples[rnd])+1);
strcpy(lines[i],samples[rnd]);
}

printf("nr_lines=%ldn",nr_lines);
const char* words[]={"one", "two", "three", "four","five"};
const size_t nr_words=sizeof(words)/sizeof(words[0]);
printf("nr_words=%ldn",nr_words);
for(int i=1;i<6;i++)
{
printf("Threads used=%d  --- ",i);
analyze_text(lines, words, nr_lines, nr_words, i);
}
// You should free allocated memory here.
return 0;
}

我在编译器资源管理器上测试了代码,首先使用400000行(注意40000行太快了,所以我使用了比您多10倍的行)和5个单词并行化了第一个(i)循环,结果是:

nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.032551
Threads used=2  --- Elapsed time: 0.028461
Threads used=3  --- Elapsed time: 0.029304
Threads used=4  --- Elapsed time: 0.027002
Threads used=5  --- Elapsed time: 0.026587

在我的笔记本电脑上,使用核心的缩放效果要好得多:

gcc 5.c -fopenmp -O3 -mavx2  && ./a.out
nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.046173
Threads used=2  --- Elapsed time: 0.028958
Threads used=3  --- Elapsed time: 0.019951
Threads used=4  --- Elapsed time: 0.015462
Threads used=5  --- Elapsed time: 0.020994

我还并行化了第二个循环(在这种情况下,为了避免嵌套并行,第一个循环没有并行化)。编译器资源管理器上的结果:

nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.032781
Threads used=2  --- Elapsed time: 0.026288
Threads used=3  --- Elapsed time: 0.026601
Threads used=4  --- Elapsed time: 0.027013
Threads used=5  --- Elapsed time: 0.027567

在我的笔记本电脑上:

gcc 5.c -fopenmp -O3 -mavx2  && ./a.out
nr_lines=400000
nr_words=5
Threads used=1  --- Elapsed time: 0.047092
Threads used=2  --- Elapsed time: 0.024240
Threads used=3  --- Elapsed time: 0.016849
Threads used=4  --- Elapsed time: 0.014475
Threads used=5  --- Elapsed time: 0.012770
Threads used=6  --- Elapsed time: 0.011703
Threads used=7  --- Elapsed time: 0.009543
Threads used=8  --- Elapsed time: 0.012953

你可以看到

  1. 在编译器资源管理器上获得的结果令人失望。我只是不知道它的原因是什么。很可能与内存带宽/内存访问的可用通道有关,但我认为高端处理器被用来托管编译器资源管理器。

  2. 在我的笔记本电脑上,缩放效果要好得多,尤其是如果第二个循环是并行的。这并不奇怪,因为在这种情况下缓存利用率要好得多。因此,我建议你重写你的代码,首先逐行搜索,在一行中搜索所有单词。

最后一点注意:如果性能至关重要,那么很可能您可以找到一个并行库,它旨在尽可能高效地进行这种类型的搜索。

最新更新