Dramatic slowdown when executing multiple processes at the same time

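I wrote a very simple program that sums arrays, in both Fortran and Python. When I submit multiple (independent) jobs from the shell, there is a dramatic slowdown as soon as more than one job runs at the same time.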

The Fortran version of my code is shown below:

program main
implicit none
real*8 begin, end, Ht(2, 2), ls(4)
integer i, j, k, ii, jj, kk
integer,parameter::N_tiles = 20
integer,parameter::N_tilings = 100
integer,parameter::max_t_steps = 50
real*8,dimension(N_tiles*N_tilings,max_t_steps,5)::test_e, test_theta
real*8 rand_val
call random_seed()
do i = 1, N_tiles*N_tilings
  do j = 1, max_t_steps
    do k = 1, 5
      call random_number(rand_val)
      test_e(i, j, k) = rand_val
      call random_number(rand_val)
      test_theta(i, j, k) = rand_val
    end do
  end do
end do
call CPU_TIME(begin)
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
call CPU_TIME(end)
write(*, *) 'total time cost is : ', end-begin
end program main

and the shell script is as follows:

#!/bin/bash
gfortran -o result test.f90
nohup ./result &
nohup ./result &
nohup ./result &

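As we can see, the main operation is the summation of the arrays test_theta and test_e. These arrays are not large (about 3 MB), and my computer has enough memory for the job. My workstation has 6 cores with 12 threads. I tried submitting 1, 2, 3, 4 and 5 jobs at once from the shell, and the time costs are as follows: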

| #jobs   |  1   |   2   |   3    |  4    |  5   |
| time(s) |  21  |   31  |   161  |  237  |  357 |

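I expected that, as long as the number of jobs is smaller than the number of cores (6 on my computer), running n jobs at once would take the same time as running a single job. However, we see a dramatic slowdown here.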

The problem persists when I implement the same task in Python:

import numpy as np 
import time
N_tiles = 20
N_tilings = 100
max_t_steps = 50
theta = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
e = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
# time.clock() was removed in Python 3.8; time.process_time() also reports CPU time
begin = time.process_time()
for i in range(1001):
    for j in range(50):
        theta += 0.5*e
end = time.process_time()
print('total time cost is {} s'.format(end-begin))

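I don't know the reason, and I wonder whether it is related to the size of the CPU's L3 cache, that is, the cache being too small for several such jobs at once. Maybe it is also related to the so-called "false sharing" problem. How can I fix this?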

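This question is related to an earlier one about a dramatic slowdown when using multiprocessing and numpy in Python; here I just post a simple and typical example.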

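The code is probably slow when run several times at once because more and more memory has to flow through the memory bus, which has limited bandwidth.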
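The bandwidth argument can be checked with a back-of-envelope estimate. The sketch below uses only the array dimensions and loop counts from the question. It assumes that each update makes three passes over memory (read test_theta, read test_e, write test_theta) and that no pass is served from cache, so the traffic figure is an upper bound rather than a measurement. The function names are made up for illustration.

```python
# Back-of-envelope estimate based on the sizes given in the question.
N_TILES, N_TILINGS, MAX_T_STEPS, LAST_DIM = 20, 100, 50, 5
BYTES_PER_REAL8 = 8


def array_bytes():
    """Size in bytes of one (N_tiles*N_tilings, max_t_steps, 5) real*8 array."""
    return N_TILES * N_TILINGS * MAX_T_STEPS * LAST_DIM * BYTES_PER_REAL8


def working_set_bytes(n_jobs):
    """Each independent job holds two arrays (test_theta and test_e)."""
    return 2 * array_bytes() * n_jobs


def total_traffic_bytes(passes_per_update=3):
    """Bytes moved by one job if every pass goes to main memory.

    Assumption: one update reads test_theta and test_e and writes
    test_theta, i.e. three passes over a full array.
    """
    n_updates = 1001 * 50  # loop counts from the timed section
    return n_updates * passes_per_update * array_bytes()


if __name__ == "__main__":
    for n_jobs in range(1, 6):
        mb = working_set_bytes(n_jobs) / 1e6
        print(f"{n_jobs} job(s): combined working set {mb:.0f} MB")
    print(f"traffic per job: {total_traffic_bytes() / 1e9:.1f} GB")
```

Under these assumptions each array is 4,000,000 bytes, so one job touches 8 MB and three jobs touch 24 MB. If the combined working set no longer fits in the shared last-level cache, each job's roughly 600 GB of traffic has to come from main memory, which would be consistent with the jump between 2 and 3 jobs in the table. The question does not state the workstation's L3 size, so this is only a plausibility check.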

If you run just one process, which works on just one pair of arrays at a time, but enable OpenMP threading, it can get faster:

integer*8 :: begin, end, rate
...
call system_clock(count_rate=rate)
call system_clock(count=begin)
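! Caution: this splits the i iterations across threads, but every thread
! updates the shared array test_theta, which is a data race.
!$omp parallel do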
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
!$omp end parallel do
call system_clock(count=end)
write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

On a quad-core CPU:

> gfortran -O3 testperformance.f90 -o result
> ./result 
 total time cost is :    15.135917384000001
> gfortran -O3 testperformance.f90 -fopenmp -o result
> ./result 
 total time cost is :    3.9464441830000001
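For the Python version in the question, another option, assuming memory traffic really is the bottleneck, is to reduce the number of passes over memory per update. In `theta += 0.5*e`, NumPy first writes `0.5*e` into a temporary array and then adds that temporary to `theta`. Because `e` does not change inside the loop, the scaled array can be computed once. The sketch below is illustrative only: the function names are hypothetical, and no speedup has been measured here.

```python
import numpy as np


def update_with_temporary(theta, e, n_updates):
    """Original form: every iteration allocates and fills a temporary 0.5*e."""
    for _ in range(n_updates):
        theta += 0.5 * e
    return theta


def update_in_place(theta, e, n_updates):
    """Same arithmetic, but 0.5*e is computed once outside the loop."""
    half_e = 0.5 * e  # valid only because e is constant during the loop
    for _ in range(n_updates):
        np.add(theta, half_e, out=theta)  # writes the result directly into theta
    return theta


if __name__ == "__main__":
    shape = (20 * 100, 50, 5)
    e = np.ones(shape, dtype=np.float64)
    a = update_with_temporary(np.ones(shape), e, 10)
    b = update_in_place(np.ones(shape), e, 10)
    print(np.allclose(a, b))
```

Both functions produce the same values. The second one reads two arrays and writes one per update instead of also writing and re-reading a temporary, so it should put less load on the memory bus when several jobs run side by side.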
