排序32位整型和排序64位整型一样快



我认为这些都是事实:

  • 快速排序应该是相当缓存友好。
  • 一条64字节的缓存行可以包含16个32位整数,也可以包含8个64位整数。

假说:

  • 排序32位整数向量应该比排序a更快

但是当我运行下面的代码时,我得到了结果:

i16 = 7.5168 
i32 = 7.3762 
i64 = 7.5758

为什么我得不到我想要的结果?

c++:

#include <iostream> 
#include <vector>
#include <cstdint>
#include <algorithm>
#include <chrono>

int main() {
  const int vlength = 100'000'000;
  const int maxI = 50'000;
  std::vector<int16_t> v16;
  for (int i = 0; i < vlength; ++i) {
    v16.push_back(int16_t(i%maxI));
  }
  std::random_shuffle(std::begin(v16), std::end(v16));
  std::vector<int32_t> v32;
  std::vector<int64_t> v64;
  for (int i = 0; i < vlength; ++i) {
    v32.push_back(int32_t(v16[i]));
    v64.push_back(int64_t(v16[i]));
  }
  auto t1 = std::chrono::high_resolution_clock::now();
  std::sort(std::begin(v16), std::end(v16));
  auto t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i16 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count() << std :: endl;
  t1 = std::chrono::high_resolution_clock::now();
  std::sort(std::begin(v32), std::end(v32));
  t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i32 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count() << std :: endl;
  t1 = std::chrono::high_resolution_clock::now();
  std::sort(std::begin(v64), std::end(v64));
  t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i64 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count() << std :: endl;
}
编辑:

为了避免缓存友好排序的问题,我还尝试了以下代码:

template <typename T>
inline void function_speed(T& vec) {
  for (auto& i : vec) {
    ++i;
  }
}
int main() {
  const int nIter = 1000;
  std::vector<int16_t> v16(1000000);
  std::vector<int32_t> v32(1000000);
  std::vector<int64_t> v64(1000000);

  auto t1 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < nIter; ++i) {
    function_speed(v16);
  }
  auto t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i16 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count()/double(nIter) << std :: endl;
  t1 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < nIter; ++i) {
    function_speed(v32);
  }
  t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i32 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count()/double(nIter) << std :: endl;
  t1 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < nIter; ++i) {
    function_speed(v64);
  }
  t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i64 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count()/double(nIter) << std :: endl;
}
典型结果:

i16 = 0.00618648
i32 = 0.00617911
i64 = 0.00606275

我知道适当的标杆本身就是一门科学,也许我做错了。

EDIT2: 通过避免溢出,我现在开始得到更有趣的结果:

template <typename T>
inline void function_speed(T& vec) {
  for (auto& i : vec) {
    ++i;
    i %= 1000;
  }
}

给出如下结果:

i16 = 0.0143789
i32 = 0.00958941
i64 = 0.019691

如果我这样做:

template <typename T>
inline void function_speed(T& vec) {
  for (auto& i : vec) {
    i = (i+1)%1000;
  }
}

:

i16 = 0.00939448
i32 = 0.00913768
i64 = 0.019615

错误的假设;所有O(N log N)排序算法对于绝大多数N来说都是缓存不友好的!可能的输入。

此外,我认为优化编译器可以完全删除排序,而未优化的构建当然对基准测试毫无意义。

最新更新