Multithreading: why is this serial code faster than its parallel version?

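I'm experimenting with multithreading in C++11, and I don't understand why, in the code below, the serial version is much faster than the parallel one.

I know that in this minimal example the compute function is not worth parallelizing, but I want to use a similar approach to parallelize pixel rendering in a ray-tracing algorithm, where each computation takes much longer. Yet in that other case I see the same difference in duration.

I guess I'm missing something. Any help or guidance would be appreciated.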

#include <iostream>
#include <thread>
#include <vector>
#include <chrono>

void compute(double& res)
{
    res = 2*res;
}

void computeSerial(std::vector<double>& res, const size_t& nPoints)
{
    for (size_t i = 0; i < nPoints; i++)
    {
        compute(res[i]);
    }
}

void computeParallel(std::vector<double>& res, const size_t& nPoints)
{
    int numThreads = std::thread::hardware_concurrency() - 1;
    std::vector<std::thread*> pool(numThreads, nullptr);
    size_t nPointsComputed = 0;
    while(nPointsComputed < nPoints)
    {
        size_t firstIndex = nPointsComputed;
        for (size_t i = 0; i < numThreads; i++)
        {
            size_t index = firstIndex + i;
            if(index < nPoints)
            {
                pool[i] = new std::thread(compute, std::ref(res[index]));
            }
        }
        for (size_t i = 0; i < numThreads; i++)
        {
            size_t index = firstIndex + i;
            if(index < nPoints)
            {
                pool[i]->join();
                delete pool[i];
            }
        }
        nPointsComputed += numThreads;
    }
}

int main(void)
{
    size_t pbSize = 1000;
    std::vector<double> vSerial(pbSize, 0);
    std::vector<double> vParallel(pbSize, 0);
    for (size_t i = 0; i < pbSize; i++)
    {
        vSerial[i] = i;
        vParallel[i] = i;
    }
    int numThreads = std::thread::hardware_concurrency();
    std::cout << "Number of threads: " << numThreads << std::endl;
    std::chrono::steady_clock::time_point begin, end;
    begin = std::chrono::steady_clock::now();
    computeSerial(vSerial, pbSize);
    end = std::chrono::steady_clock::now();
    std::cout << "duration serial   = " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "[ns]" << std::endl;
    begin = std::chrono::steady_clock::now();
    computeParallel(vParallel, pbSize);
    end = std::chrono::steady_clock::now();
    std::cout << "duration parallel = " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "[ns]" << std::endl;
    return 0;
}

After compiling with clang++ -pthread main.cc, I get the following output:

Number of threads: 6
duration serial   = 23561[µs]
duration parallel = 12219928[µs]

The serial version is always much faster than the parallel one, no matter how many doubles are computed.

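Starting a thread (or even just doing a dynamic allocation) takes far more instructions than doubling a number.

You need to split your work into bigger pieces... Starting a separate CPU thread just to compute CCD_ 2 will never be an optimization.

It does make sense to split the work of doubling 4000000 numbers across 4 threads, with each thread computing 1000000 results in a loop.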
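As a rough illustration of that split, here is a minimal sketch in which each thread doubles one contiguous chunk in a loop. The names doubleRange and doubleInChunks are made up for this example, and the sketch assumes the vector size divides evenly by the number of threads (as 4000000 does by 4).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Doubles every element in [first, last). One thread runs this whole loop,
// so the cost of starting the thread is paid once per chunk, not per element.
void doubleRange(std::vector<double>& res, std::size_t first, std::size_t last)
{
    for (std::size_t i = first; i < last; ++i)
    {
        res[i] = 2 * res[i];
    }
}

// Hypothetical helper: splits the vector into numThreads equal chunks and
// starts one thread per chunk. Assumes the size divides evenly.
void doubleInChunks(std::vector<double>& res, std::size_t numThreads)
{
    assert(numThreads > 0 && res.size() % numThreads == 0);
    const std::size_t chunk = res.size() / numThreads;
    std::vector<std::thread> workers;
    workers.reserve(numThreads);
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        workers.emplace_back(doubleRange, std::ref(res), t * chunk, (t + 1) * chunk);
    }
    for (std::thread& w : workers)
    {
        w.join();
    }
}
```

Calling doubleInChunks(v, 4) on a vector of 4000000 doubles starts exactly 4 threads, each handling 1000000 elements.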

The situation is very different on a GPU, where running one thread per pixel, for example, is fine.

It looks as if you are creating 1000 threads rather than 6, because your line

nPointsComputed += numThreads;

increments by the number of threads, and the loop keeps running while nPointsComputed < 1000.
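To make the count concrete, the sketch below replays the control flow of computeParallel without actually starting any thread and simply counts how many std::thread objects would be created. The function name countThreadsCreated is an assumption introduced for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: mirrors the while/for structure of computeParallel
// and returns the number of threads the original code would have created.
std::size_t countThreadsCreated(std::size_t nPoints, std::size_t numThreads)
{
    assert(numThreads > 0);
    std::size_t created = 0;
    std::size_t nPointsComputed = 0;
    while (nPointsComputed < nPoints)
    {
        for (std::size_t i = 0; i < numThreads; ++i)
        {
            if (nPointsComputed + i < nPoints)
            {
                ++created; // the original does `new std::thread(...)` here
            }
        }
        nPointsComputed += numThreads;
    }
    return created;
}
```

With nPoints = 1000 this returns 1000 for any positive numThreads: every point gets its own short-lived thread, and numThreads only controls how many of them are alive at the same time.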

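Instead, you have to

create batches of CCD_ 3
then create numberOfThreads threads, each of which processes one batch of size numberOfPointsPerThread at its own offset, i.e. thread i processes indices k = i * numberOfPointsPerThread, ..., (i+1)*numberOfPointsPerThread-1

You have to be careful if the division numberOfPoints/numberOfThreads leaves a remainder. Use the ceil function to create slightly larger batches, and clamp the last batch to the end of the array.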
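The batching steps above, including the remainder case, can be sketched as follows. The names batchRange and computeParallelBatched are hypothetical, and the ceil is done with integer arithmetic rather than std::ceil; treat this as one possible reading of the advice, not a definitive fix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: half-open index range [begin, end) handled by thread i.
// The batch size is ceil(numberOfPoints / numberOfThreads), and both ends are
// clamped so the last batch stops at the end of the array.
std::pair<std::size_t, std::size_t> batchRange(std::size_t i,
                                               std::size_t numberOfPoints,
                                               std::size_t numberOfThreads)
{
    assert(numberOfThreads > 0);
    const std::size_t numberOfPointsPerThread =
        (numberOfPoints + numberOfThreads - 1) / numberOfThreads; // integer ceil
    const std::size_t begin = std::min(i * numberOfPointsPerThread, numberOfPoints);
    const std::size_t end = std::min(begin + numberOfPointsPerThread, numberOfPoints);
    return {begin, end};
}

// Starts at most numberOfThreads threads, each doubling one whole batch.
void computeParallelBatched(std::vector<double>& res, std::size_t numberOfThreads)
{
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < numberOfThreads; ++i)
    {
        const std::pair<std::size_t, std::size_t> range =
            batchRange(i, res.size(), numberOfThreads);
        if (range.first == range.second)
        {
            continue; // nothing left for this thread
        }
        threads.emplace_back([&res, range] {
            for (std::size_t k = range.first; k < range.second; ++k)
            {
                res[k] = 2 * res[k];
            }
        });
    }
    for (std::thread& t : threads)
    {
        t.join();
    }
}
```

For 10 points and 3 threads, the batches are [0, 4), [4, 8) and [8, 10): the first two threads take 4 points each and the last one takes the remaining 2.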

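Creating and destroying threads is time-consuming, because the kernel has to do a lot of work for the context switches. That is why it is better to use one thread per core together with a FIFO, into which you can feed many tasks that are then scheduled among the running threads.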
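A minimal sketch of such a FIFO-based pool is shown below. It is not the code from any particular library; the class name SimplePool and its submit method are assumptions made for this illustration, and error handling (for example, tasks that throw) is left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: a fixed set of worker threads pulls tasks
// from a shared FIFO queue, so threads are created only once.
class SimplePool
{
public:
    explicit SimplePool(std::size_t numWorkers)
    {
        assert(numWorkers > 0);
        for (std::size_t i = 0; i < numWorkers; ++i)
        {
            workers_.emplace_back([this] { workerLoop(); });
        }
    }

    // Lets the workers drain the queue, then joins them.
    ~SimplePool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wakeUp_.notify_all();
        for (std::thread& w : workers_)
        {
            w.join();
        }
    }

    // Appends a task to the back of the FIFO and wakes one worker.
    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wakeUp_.notify_one();
    }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wakeUp_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty())
                {
                    return; // stopping was requested and no work is left
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable wakeUp_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

Each submitted task should still cover a sizeable block of work (for example a row or tile of pixels rather than a single pixel), since every task pays for a lock and a queue operation.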

Have a look at this code: https://github.com/sancelot/thread_pool
