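I am trying to sum the elements of an array in parallel using SIMD. To avoid locking I use a combinable thread-local, but it is not always aligned on a 16-byte boundary, and because of that _mm_add_epi32 throws an exception.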
concurrency::combinable<__m128i> sum_combine;
int length = 40; // multiple of 8
concurrency::parallel_for(0, length , 8, [&](int it)
{
__m128i v1 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it));
__m128i v2 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it + sizeof(uint32_t)));
auto temp = _mm_add_epi32(v1, v2);
auto &sum = sum_combine.local(); // here is the problem
TRACE(L"%d\n", it);
TRACE(L"add %x\n", &sum);
ASSERT(((unsigned long)&sum & 15) == 0);
sum = _mm_add_epi32(temp, sum);
}
);
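For reference, here is a minimal serial sketch of what the loop above computes, followed by the final reduction of the four lanes to a single number. The names horizontal_sum and sum_array_sse are hypothetical and not part of the original code. The sketch uses unaligned loads so it makes no assumption about how input_arr was allocated. Note that the second vector starts 4 elements after the first; sizeof(uint32_t) in the original also evaluates to 4, but only by coincidence with the number of 32-bit lanes in a __m128i.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Adds up the four 32-bit lanes of v (hypothetical helper).
inline std::uint32_t horizontal_sum(__m128i v) {
    alignas(16) std::uint32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), v); // lanes[] is aligned
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

// Serial equivalent of the parallel loop: 8 elements per step (hypothetical helper).
inline std::uint32_t sum_array_sse(const std::uint32_t* input, int length) {
    assert(length % 8 == 0); // same precondition as in the question
    __m128i sum = _mm_setzero_si128();
    for (int it = 0; it < length; it += 8) {
        // Unaligned loads: safe for any input address.
        __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input + it));
        __m128i v2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input + it + 4));
        sum = _mm_add_epi32(sum, _mm_add_epi32(v1, v2));
    }
    return horizontal_sum(sum);
}
```

In the parallel version, each thread would keep its own __m128i partial sum and the lanes would be reduced once at the end, after the per-thread values have been combined.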
Here is the definition of combinable, from ppl.h:
template<typename _Ty>
class combinable
{
private:
// Disable warning C4324: structure was padded due to __declspec(align())
// This padding is expected and necessary.
#pragma warning(push)
#pragma warning(disable: 4324)
__declspec(align(64))
struct _Node
{
unsigned long _M_key;
_Ty _M_value; // this might not be aligned on 16 bytes
_Node* _M_chain;
_Node(unsigned long _Key, _Ty _InitialValue)
: _M_key(_Key), _M_value(_InitialValue), _M_chain(NULL)
{
}
};
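Sometimes the alignment happens to be fine and the code works, but most of the time it does not.
I have tried the following, but it does not compile: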
union combine
{
unsigned short x[sizeof(__m128i) / sizeof(unsigned int)];
__m128i y;
};
concurrency::combinable<combine> sum_combine;
then auto &sum = sum_combine.local().y;
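Any suggestions on how to fix the alignment problem while still using combinable?
On x64 it works fine because of the default 16-byte alignment. On x86 there is sometimes an alignment problem.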
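A small sketch that may help narrow down where the misalignment comes from. NodeSketch is a hypothetical stand-in for the _Node layout quoted above, not the real ppl.h type. Inside a 64-byte-aligned struct the __m128i member lands on a multiple of 16, so the layout itself looks fine. My guess, which the post does not confirm, is that the node is heap-allocated and that the allocator only guarantees its default alignment, which is commonly 8 bytes on 32-bit Windows builds and 16 bytes on x64.

```cpp
#include <cassert>
#include <cstddef>     // offsetof, std::size_t
#include <cstdint>     // std::uintptr_t
#include <emmintrin.h> // __m128i

// Hypothetical stand-in for the _Node layout shown in the ppl.h excerpt.
struct alignas(64) NodeSketch {
    unsigned long key;
    __m128i value;
    NodeSketch* chain;
};

// Byte offset of the value member from the start of the node.
constexpr std::size_t value_offset() { return offsetof(NodeSketch, value); }

// True when p sits on a 16-byte boundary. std::uintptr_t is wide enough
// for a pointer on both x86 and x64.
inline bool is_aligned_16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Mirrors the ASSERT in the question: fails in debug builds if p is misaligned.
inline void require_aligned_16(const void* p) {
    assert(is_aligned_16(p));
}
```

If value_offset() is a multiple of 16 but is_aligned_16(&sum) is false at run time, the start of the node itself is not 64-byte aligned, which points at the allocation rather than the struct layout.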
Loading sum with an unaligned load:
auto &sum = sum_combine.local();
#if !defined(_M_X64)
if (((unsigned long)&sum & 15) != 0)
{
// just for breakpoint means, sum is unaligned.
int a = 5;
}
auto sum_temp = _mm_loadu_si128(&sum);
sum = _mm_add_epi32(temp, sum_temp);
#else
sum = _mm_add_epi32(temp, sum);
#endif
Because the sum variable used by _mm_add_epi32 is not aligned, you need to load and store sum explicitly with the unaligned load/store intrinsics (_mm_loadu_si128 / _mm_storeu_si128). Change:
sum = _mm_add_epi32(temp, sum);
to:
__m128i v2 = _mm_loadu_si128((__m128i *)&sum);
v2 = _mm_add_epi32(v2, temp);
_mm_storeu_si128((__m128i *)&sum, v2);
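The same load/add/store pattern can be wrapped in a helper and exercised without PPL. This is a minimal sketch: accumulate_unaligned and lane_at are hypothetical names, and the accumulator is addressed through a void* so the helper works for storage at any address.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>     // std::memcpy
#include <emmintrin.h> // SSE2 intrinsics

// Adds the four 32-bit lanes of temp into the accumulator stored at slot.
// slot may sit at any address; no 16-byte alignment is required.
inline void accumulate_unaligned(void* slot, __m128i temp) {
    assert(slot != nullptr);
    __m128i acc = _mm_loadu_si128(static_cast<const __m128i*>(slot)); // unaligned load
    acc = _mm_add_epi32(acc, temp);
    _mm_storeu_si128(static_cast<__m128i*>(slot), acc);               // unaligned store
}

// Reads lane i (0..3) of an accumulator stored at a possibly unaligned address.
inline std::uint32_t lane_at(const void* slot, int i) {
    std::uint32_t lanes[4];
    std::memcpy(lanes, slot, sizeof lanes);
    return lanes[i];
}
```

With combinable, the call would look like accumulate_unaligned(&sum_combine.local(), temp). The store matters as much as the load: a plain assignment to a misaligned __m128i may still be compiled to an aligned store.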