Why does my program fail for numbers larger than 2^29?



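I wrote the program below. It takes a list 'a' of length 2^i (initialized to all 1s) and adds together all the numbers it contains. When i is at least 30, it returns a nonsensical answer. I don't understand why: I use long for everything, and on my machine a long is 8 bytes = 64 bits, so I would expect it to be able to hold integers up to 2^(8 * 8)/2.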

// FOR NOW ONLY WORKS WITH N A POWER OF 2
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>
#include <chrono>
/*
Parallel reduce helper function. When run
with n/2 threads, adds a[n - 1 - i] to a[i]
for i = 0, ..., n/2 - 1.
*/
__global__ void reduce(long* a, long n)
{
    long i = threadIdx.x + blockDim.x * blockIdx.x;
    long stride = gridDim.x * blockDim.x;
    for (long j = i; j < n/2; j += stride)
    {
        a[j] += a[n - 1 - j];
    }
}
/*
For an array a of length n, puts the sum of all elements in a[0]
*/
void parallelReduce(long* a, long n)
{
    // Get some information about the GPU
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int multiProcessors = prop.multiProcessorCount;
    // Repeatedly use the helper function reduce
    while (n > 1) {
        int threadsPerBlock = 256;
        int numberOfBlocks = 32 * multiProcessors;
        reduce<<<numberOfBlocks, threadsPerBlock>>>(a, n);
        cudaDeviceSynchronize();
        n = (n + n % 2) / 2; // Rounds n/2 up.
    }
}
int main()
{
    // Initialize vector with N 1's.
    long N = 2 << 28;
    size_t size = N * sizeof(long);
    long* h_a;
    cudaMallocHost(&h_a, size);
    for (long i = 0; i < N; i++) {
        h_a[i] = 1;
    }
    // Copy to device (can be done asynchronously to hide transfer time, but
    // that messes up the timing of the kernel).
    long* d_a;
    cudaMalloc(&d_a, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    // Calculate the sum sequentially and time it.
    auto tic = std::chrono::high_resolution_clock::now();
    long hostSolution = 0;
    for (long i = 0; i < N; i++)
    {
        hostSolution += h_a[i];
    }
    auto toc = std::chrono::high_resolution_clock::now();
    int duration = std::chrono::duration_cast<std::chrono::milliseconds>(toc - tic).count();
    std::cout << "The sequential function says the answer is " << hostSolution << " this took " << duration
              << " ms." << std::endl;
    // Kernel computation
    tic = std::chrono::high_resolution_clock::now();
    parallelReduce(d_a, N);
    toc = std::chrono::high_resolution_clock::now();
    int parallelDuration = std::chrono::duration_cast<std::chrono::milliseconds>(toc - tic).count();
    // Copy result back to host
    long solution;
    cudaMemcpy(&solution, &d_a[0], sizeof(long), cudaMemcpyDeviceToHost);
    // Print the parallel result and speed up:
    std::cout << "The parallel function says the answer is " << solution << " this took " << parallelDuration
              << " ms." << std::endl;
    std::cout << "This means we have achieved a speed up of " << duration / parallelDuration << std::endl;
}

It can be done; you just need the right type. Compile and run the following code:

#include <iostream>
#include <limits>
int main() {
    // Print the largest value each signed integer type can hold on this platform.
    std::cout << "Max int value: " << std::numeric_limits<int>::max() << '\n';
    std::cout << "Max long value: " << std::numeric_limits<long>::max() << '\n';
    std::cout << "Max long long value: " << std::numeric_limits<long long>::max() << '\n';
}

Your output depends on your IDE, architecture, compiler flags, and other things. For me it looks like this:

Max int value: 2147483647
Max long value: 2147483647
Max long long value: 9223372036854775807

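As for why "when i is at least 30, it returns a nonsensical answer": signed integer overflow is undefined behavior (UB), so you cannot rely on what the compiler will do in that case.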
