我如何提高这个Fortran代码与Python接口的性能?

我正在研究在Python中使用Fortran代码接口的可能性。我知道f2py，但由于我没有成功地将它与外部库(如lapack)一起使用，所以我恢复使用ctypes。我的一切都在工作，但我注意到这样的行为:对于相同的算法(两个矩阵元素的简单乘法)，Fortran代码比numpy慢，而我期望它更快。我试用了两台不同的电脑，一台安装Windows 10，另一台安装Windows 11。我使用的gfortran是从MinGW-64通过winlibs.com版本gcc-12.2.0-llvm-15.0.7-mingw-w64ucrt-10.0.0-r4安装，我的Python版本是3.9.13

使用timeit测量的时间为(对于矩阵1000x1000)

使用Numpy: 2.77 ms±110µs/loop(平均±std. dev. of 7次运行，每次100个循环)

使用Fortran: 12.4 ms±314µs/loop(平均值±std. dev. of 7次运行，每次100个循环)

Fortran代码在文件"libfortran3.f90">

subroutine prodotto(a, b, n) bind(c, name='prodotto')
! declarations
use iso_c_binding, only: c_double, c_int
integer(c_int), intent(in) :: n
real(c_double), dimension(n,n), intent(in) :: b
real(c_double), dimension(n,n), intent(inout) :: a

do i = 1,n
do j = 1,n
a(i,j) = a(i,j)*b(i,j)
end do
end do

end subroutine prodotto

，用gfortran -O3 -funroll-loops -ffast-math -fPIC -shared -o libfortran3.so libfortran3.f90编译。请注意，我对输入和输出使用了相同的矩阵，因此我不必为结果传递另一个矩阵(因为我预计这会使问题恶化)。

Python代码(在笔记本中运行)为

from ctypes import CDLL, POINTER, c_int, c_double
import numpy as np
import time
import sys
fortran = CDLL('./libfortran3.so')
fortran.prodotto.argtypes = [ POINTER(c_double), 
POINTER(c_double), 
POINTER(c_int)]
fortran.prodotto.restype = None
#
N = 10
A = np.random.rand(N,N)
B = np.random.rand(N,N)
#
print('With Numpy')
%timeit A*B
#
print('With Fortran')
A = np.asfortranarray(A, dtype=c_double)
B = np.asfortranarray(B, dtype=c_double)
Act = A.ctypes.data_as(POINTER(c_double))
Bct = B.ctypes.data_as(POINTER(c_double))
%timeit fortran.prodotto( Act, Bct, c_int(N) )

当然我的Fortran代码不像Numpy那样优化，但我想知道是否有一些我可以修改以获得更好的性能。

编辑2023年4月3日

感谢mkrieger1和janneb，我确实在嵌套循环中做了一个糟糕的决定!在libfortran3.f90&quot中切换i与j;给我这个计时(矩阵为1000x1000):

使用Numpy: 2.67 ms±148µs/loop(平均±std. dev. of 7次运行，每次100个循环)

使用Fortran: 1.01 ms±34.9µs/loop(平均值±std. dev. of 7次运行，每次循环1000次)

我还尝试使用f2py，在文件"libfortran1.f90">

中使用以下代码

subroutine prodotto(a, b, c, n) 
! declarations
integer, intent(in) :: n
real(kind=8), dimension(n,n), intent(in) :: a, b
real(kind=8), dimension(n,n), intent(out) :: c

do j = 1,n
do i = 1,n
c(i,j) = a(i,j)*b(i,j)
end do
end do

end subroutine prodotto

用python -m numpy.f2py --compiler=mingw32 --fcompiler=gfortran -c libfortran1.f90 -m libfortran1编译，在笔记本上运行，得到:

与f2py: 12.3 ms±260µs每回路(平均值±std. dev. 7次运行，每100个回路)

比纯numpy慢多了。

为了完整，如果我尝试在"libfortran0.f90">

中编译使用Lapack(我在我的系统上构建的)的fortran代码

subroutine risolvi(A,b,x,n) bind(c, name='risolvi')
! declarations
use iso_c_binding, only: c_double, c_int
integer(c_int), intent(in) :: n
real(c_double), dimension(n,n), intent(inout) :: A
real(c_double), dimension(n), intent(inout) :: b
real(c_double), dimension(n), intent(out) :: x
integer(c_int), dimension(n) :: pivot(n)
integer(c_int) :: ok
! call LAPACK subroutine
call DGESV(n, 1, A, n, pivot, b, n, ok) 
x = b

end subroutine risolvi

用gfortran -O3 -funroll-loops -ffast-math -fPIC -shared -L. -lblas -llapack -o libfortran0.so libfortran0.f90编译，与numpy

比较

from ctypes import CDLL, POINTER, c_int, c_double
import numpy as np
import time
import sys
import libfortran1
fortran = CDLL('./libfortran0.so')
fortran.risolvi.argtypes = [ POINTER(c_double), 
POINTER(c_double), 
POINTER(c_double), 
POINTER(c_int)]
fortran.risolvi.restype = None
#
N = 1000
A = np.random.rand(N,N)
b = np.random.rand(N)
#
print('With Numpy')
%timeit np.linalg.solve(A,b)
#
print('With Fortran')
A = np.asfortranarray(A, dtype=c_double)
b = np.asfortranarray(b, dtype=c_double)
x = np.asfortranarray(np.zeros(N), dtype=c_double)
Act = A.ctypes.data_as(POINTER(c_double))
bct = b.ctypes.data_as(POINTER(c_double))
xct = x.ctypes.data_as(POINTER(c_double))
fortran.risolvi( Act, bct, xct, c_int(N) )
%timeit fortran.risolvi( Act, bct, xct, c_int(N) )

我得到以下计时(N=1000)

使用Numpy: 13.6 ms±596µs/loop(平均值±std. dev. of 7次运行，每次100个循环)

使用Fortran: 221 ms±9.74 ms/循环(平均±std. dev. 7次运行，每次1个循环)

无论如何，当N=30时，使用Fortran会更快

使用Numpy: 15.6µs±244 ns/loop(平均±std. dev. of 7次运行，每次100000次循环)

使用Fortran: 11.4µs±237 ns/循环(平均±std. dev. 7次运行，每次100000循环)

是可能的，有一些错误在我的代码传递大数组时?或者就像我读到的那样，"ctypes是为了兼容性而不是速度"?

正如在注释中提到的，您在Fortran代码中循环的方式是错误的，因为Fortran是列为主的。话虽如此，在这种情况下，由于它是一个元素对元素的乘法运算，您只需将嵌套的do循环替换为简单的

a = a * b

其次，N=10非常短，调用外部函数的额外开销掩盖了实际代码的性能。

设置N=100并用数组乘法替换手动循环，我得到

With Numpy
5.00679704999493
With Fortran
3.699547360010911

相关内容

最新更新

热门标签：