这是一个64位MASM程序,可乘以两个64位编号,并添加64位编号以给出128位的结果(使用标准的64位调用公约):
; public static extern ulong MulAdd64(ulong U, ulong V, ref ulong k);
; Return (U*V + k) % ß and set k = (U*V + k) / ß.
; U in rcx, V in rdx, &k in r8
; Note 0 <= 0*0 + 0 <= (ß-1)*(ß-1) + (ß-1) = ß*(ß-1) < ß^2
MulAdd64 proc public
mov rax,rcx
mul rdx
add rax,qword ptr [r8] ; low part of product
adc rdx,0
mov qword ptr [r8],rdx ; high part of product
ret
MulAdd64 endp
通过:
将其导入C#代码 [DllImport(@"C:pathMulAdd64.dll")]
public static extern ulong MulAdd64 (ulong U, ulong V, ref ulong k);
现在,这里是C#中编写的相同函数以及测试程序:
public static void TestCS_Masm_Speed ()
{
ulong x = 3141592653589793238, y = 2718281828459045, aux = 1234567890123456789, lo = 0;
// just in case the first invocation is excessivly slow
//lo = MulAdd64(x, y, ref aux);
lo = CS_MulAdd64(x, y, ref aux);
Stopwatch sw = new Stopwatch();
sw.Restart();
for (int i = 0; i < 10000000; i++) {
//lo = MulAdd64(x, y, ref aux);
lo = CS_MulAdd64(x, y, ref aux);
}
Console.WriteLine("Ticks = {0}", sw.ElapsedTicks);
// verify low-order results hi:lo = x*y + aux;
if (x*y+aux != lo) Console.WriteLine("Error in low order result");
else Console.WriteLine("Low order result is OK");
}
[DllImport(@"C:pathMulAdd64.dll")]
public static extern ulong MulAdd64 (ulong U, ulong V, ref ulong k);
/*
Multiplication. We need to multiply two unsigned 64-bit integers, obtaining an
unsigned 128-bit product. Using Algorithm 4.3.1M of Seminumerical Algorithms,
with ß = 2^32, the following subroutine computes hi:lo = y*z + aux . Then
sets aux to hi and return lo.
*/
public static ulong CS_MulAdd64 (ulong y, ulong z, ref ulong aux)
{
ulong[] u = new ulong[2], v = new ulong[2], w = new ulong[4];
// Unpack the multiplier, multiplicand, and aux to u, v, and w
u[1] = (ulong)y >> 32; u[0] = (ulong)y & 0xFFFFFFFF;
v[1] = (ulong)z >> 32; v[0] = (ulong)z & 0xFFFFFFFF;
w[1] = (ulong)aux >> 32; w[0] = (ulong)aux & 0xFFFFFFFF;
// Multiply
for (int j = 0; j < 2; j++) {
ulong k = 0;
for (int i = 0; i < 2; i++) {
ulong t = u[i] * v[j] + w[i+j] + k;
w[i+j] = t & 0xFFFFFFFF; k = t >> 32;
}
w[j+2] = k;
}
// Pack w into the outputs aux and return w
aux = ((ulong)w[3] << 32) + (ulong)w[2];
return ((ulong)w[1] << 32) + (ulong)w[0];
}
优化的C#代码明显长于MASM代码(153个说明与6个说明),但运行速度几乎是速度的两倍(941694 tick ticks vs 1722289 ticks)!怎么会这样?一切都传递在寄存器中,没有记忆可以固定!显然,C#的呼叫与MASM的执行之间发生了一些事情,但是什么?我无法介入该代码。
通过使用Google,您可以更好地了解与托管/本地过渡相关的间接费用。这是一些文章:
- Interop(C )的性能注意事项
- 设法不受管理的开销
- 在"管理到本地的过渡"期间到底会发生什么?
在像您这样的案例中,调整呼叫约定和其他必需调整的开销可能高于呼叫本身。
通常,如果您进行了足够的乘法,则将它们批量批量以最小化开销。而且您将进行一些测试以找到最佳的大小。,如果您真的很认真地对性能进行认真,那么您会考虑所有诸如缓存大小,并行处理等所有内容...
我相信,C#编译器可以相对优化您的代码,以便可以正确测序在CPU级别的说明中,因此几乎没有浪费的周期。它可能能够使用SIMD或其他技巧来优化代码。否则,您可能可以在C#中写出更有效的乘法。
顺便说一句,当Windows已经提供了执行此类乘法的函数时,为什么要编写自己的代码:unsignedmultiply128函数?
您是否比较了BigInteger性能?