我尝试使用CyclicDist
模块实现矩阵乘法。
当我使用一个区域设置与两个区域设置进行测试时,一个区域设置要快得多。是因为两个 Jetson 纳米板之间的通信时间真的很长,还是我的实现没有利用CyclicDist
的工作方式?
这是我的代码:
use Random, Time, CyclicDist;
var t : Timer;
t.start();
config const size = 10;
const Space = {1..size, 1..size};
const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid: [gridSpace] real;
fillRandom(grid);
const gridSpace2 = Space dmapped Cyclic(startIdx=Space.low);
var grid2: [gridSpace2] real;
fillRandom(grid2);
const gridSpace3 = Space dmapped Cyclic(startIdx=Space.low);
var grid3: [gridSpace] real;
forall i in 1..size do {
forall j in 1..size do {
forall k in 1..size do {
grid3[i,j] += grid[i,k] * grid2[k,j];
}
}
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(),"seconds");
writeln("Size of matrix was:", size);
t.clear()
我知道我的实现不是分布式内存系统的最佳选择。
这个程序不缩放的主要原因可能是计算从不使用初始语言环境以外的任何语言环境。 具体来说,forall 循环遍历范围,就像代码中的循环一样:
forall i in 1..size do
始终使用在当前区域设置上执行的任务运行其所有迭代。 这是因为范围不是 Chapel 中的分布值,因此,它们的并行迭代器不会跨区域设置分配工作。 因此,循环体的所有大小**3执行:
grid3[i,j] += grid[i,k] * grid2[k,j];
将在区域设置 0 上运行,而它们都不会在区域设置 1 上运行。 通过将以下内容放入最内层循环的主体中,您可以看到这种情况:
writeln("locale ", here.id, " running ", (i,j,k));
(其中here.id
打印出运行当前任务的区域设置的 ID(。 这将显示区域设置 0 正在运行所有迭代:
0 running (9, 1, 1)
0 running (1, 1, 1)
0 running (1, 1, 2)
0 running (9, 1, 2)
0 running (1, 1, 3)
0 running (9, 1, 3)
0 running (1, 1, 4)
0 running (1, 1, 5)
0 running (1, 1, 6)
0 running (1, 1, 7)
0 running (1, 1, 8)
0 running (1, 1, 9)
0 running (6, 1, 1)
...
与此对比的是,在分布式域上运行forall循环,如gridSpace
:
forall (i,j) in gridSpace do
writeln("locale ", here.id, " running ", (i,j));
迭代将在区域设置之间分布:
locale 0 running (1, 1)
locale 0 running (9, 1)
locale 0 running (1, 2)
locale 0 running (9, 2)
locale 0 running (1, 3)
locale 0 running (9, 3)
locale 0 running (1, 4)
locale 1 running (8, 1)
locale 1 running (10, 1)
locale 1 running (8, 2)
locale 1 running (2, 1)
locale 1 running (8, 3)
locale 1 running (10, 2)
...
由于所有计算都在区域设置 0 上运行,但一半的数据位于区域设置 1 上(由于数组是分布式的(,因此会生成大量通信以将远程值从区域设置 1 的内存获取到区域设置 0 的内存,以便对其进行计算。
问:是因为两块 Jetson 纳米板之间的通信时间(1( 真的很长,还是我的实现(2(没有利用
CyclicDist
的工作方式?
第二个选项是肯定的赌注:~ 100 x
小尺寸的CyclicDist
数据上实现了更差的性能。
文档明确警告了这一点,说:
循环分布将索引映射到从给定索引开始的循环模式中的区域设置。
...
限制
此分发尚未针对性能进行优化。
在单区域设置平台上可以证明对处理效率的不利影响,其中所有数据都驻留在区域设置本地内存空间中,因此不会增加任何 NUMA 板间通信附加成本。与 Vass 的单forall{}
D3
迭代和积相比,实现的性能仍然差~ 100 x
(直到现在才注意到 Vass 的性能促使从原始forall-in-D3-do-{}
更改为另一个配置forall-in-D2-do-for{}
-串联迭代修订版 - 到目前为止,小尺寸 --快速 --ccflags -O3 执行的测试显示长度几乎是长度的一半forall-in-D2-do-for{}
迭代器迭代器结果的性能更差,甚至比 O/P 三重forall{}
原始提案更差,除了 512x512 以下和 -O3 优化发生后, 但对于最小的尺寸 128x128
原始 Vass-D3 独迭代器实现了每个单元~ 850 [ns]
的最高性能,令人惊讶的是没有 --ccflags -O3(对于正在处理的较大--size={ 1024 | 2048 | 4096 | 8192 }
数据布局,这可能会明显改变,如果将更宽的 NUMA 多语言环境和更高的并行度设备放入竞争中,则更多(
TiO.run platform uses 1 numLocales,
having 2 physical CPU-cores accessible (numPU-s)
with 2 maxTaskPar parallelism limit
使用CyclicDist
会影响数据到内存的布局,不是吗?
通过小尺寸测量进行验证--size={128 | 256 | 512 | 640}
有或没有轻微的--ccflags -O3
效应
// --------------------------------------------------------------------------------------------------------------------------------
// --fast
// ------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 255818 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3075 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 3040 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2198 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1974 [us] excl. fillRandom()-ops <-- 127x SLOWER with CyclicDist dmapped DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2122 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 252439 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2141444 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 27095 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 25339 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 23493 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 21631 [us] excl. fillRandom()-ops <-- 98x SLOWER then w/o CyclicDist dmapped data
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 21971 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2122417 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16988685 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17448207 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 268111 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 270289 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 250896 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 239898 [us] excl. fillRandom()-ops <-- 71x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 257479 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17391049 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16932503 [us] excl. fillRandom()-ops <~~ ~2e5 [us] faster without --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35136377 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 362205 [us] incl. fillRandom()-ops <-- 97x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 367651 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345865 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 337896 [us] excl. fillRandom()-ops <-- 103x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 351101 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35052849 [us] excl. fillRandom()-ops <~~ ~3e4 [us] faster without --ccflags -O3
//
// --------------------------------------------------------------------------------------------------------------------------------
// --fast --ccflags -O3
// --------------------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 250372 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3189 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2966 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2284 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1949 [us] excl. fillRandom()-ops <-- 126x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2072 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 246965 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2114615 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 37775 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 38866 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 32384 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 29264 [us] excl. fillRandom()-ops <-- 71x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 33973 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2098344 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17136826 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081273 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 251786 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 266766 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 239301 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 233003 [us] excl. fillRandom()-ops <~~ ~6e3 [us] faster with --ccflags -O3
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 253642 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17025339 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081352 [us] excl. fillRandom()-ops <~~ ~2e5 [us] slower with --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35164630 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 363060 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 489529 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345742 [us] excl. fillRandom()-ops <-- 104x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 353353 [us] excl. fillRandom()-ops <-- 102x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 471213 [us] excl. fillRandom()-ops <~~~12e5 [us] slower with --ccflags -O3
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35075435 [us] excl. fillRandom()-ops
无论如何,Chapel团队的见解(设计方面和测试方面(都很重要。 @Brad被要求提供善意的帮助,为主要更高尺寸的--size={1024 | 2048 | 4096 | 8192 | ...}
和具有多区域设置和多区域设置解决方案的"方式更宽"的 NUMA 平台提供类似的测试覆盖率和比较,这些解决方案可在 Cray 用于 Chapel 团队的研发,不会受到硬件和公众~ 60 [s]
限制的影响, 赞助,共享TiO.RUN平台。