CUDA and C++ linking/compiling, program hangs on cudaMalloc

I have two questions I would like to ask you.

I)

I have a .cpp file containing main(). To call the kernel (which is in a .cu file), I use an extern function launch() in the .cu file, which invokes the kernel. The two files (.cu and .cpp) each compile successfully on their own. To bind them together, since I am a CUDA beginner, I tried two things:

1) nvcc -Wno-deprecated-gpu-targets -o final file1.cpp file2.cu, which gives no errors and builds the final program successfully, and

2)

nvcc -Wno-deprecated-gpu-targets -c file2.cu
g++ -c file1.cpp
g++ -o program file1.o file2.o -lcudart -lcurand -lcutil -lcudpp -lcuda

In the second case the -l arguments are not recognized (only -lcuda is), I suppose because I don't know where those libraries are stored and therefore didn't specify their paths. If I skip those -l arguments, the errors are:

$ g++ -o final backpropalgorithm_CUDA_kernel_copy.o backpropalgorithm_CUDA_main_copy.o -lcuda
backpropalgorithm_CUDA_kernel_copy.o: In function `launch':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x185): undefined reference to `cudaConfigureCall'
backpropalgorithm_CUDA_kernel_copy.o: In function `__cudaUnregisterBinaryUtil()':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x259): undefined reference to `__cudaUnregisterFatBinary'
backpropalgorithm_CUDA_kernel_copy.o: In function `__nv_init_managed_rt_with_module(void**)':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x274): undefined reference to `__cudaInitModule'
backpropalgorithm_CUDA_kernel_copy.o: In function `__device_stub__Z21neural_network_kernelPfPiS0_PdS1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_S1_(float*, int*, int*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*)':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x2ac): undefined reference to `cudaSetupArgument'
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x2cf): undefined reference to `cudaSetupArgument'
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x2f2): undefined reference to `cudaSetupArgument'
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x315): undefined reference to `cudaSetupArgument'
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x338): undefined reference to `cudaSetupArgument'
backpropalgorithm_CUDA_kernel_copy.o:tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x35b): more undefined references to `cudaSetupArgument' follow
backpropalgorithm_CUDA_kernel_copy.o: In function `__nv_cudaEntityRegisterCallback(void**)':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x663): undefined reference to `__cudaRegisterFunction'
backpropalgorithm_CUDA_kernel_copy.o: In function `__sti____cudaRegisterAll_69_tmpxft_0000717b_00000000_7_backpropalgorithm_CUDA_kernel_copy_cpp1_ii_43082cd7()':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x67c): undefined reference to `__cudaRegisterFatBinary'
backpropalgorithm_CUDA_kernel_copy.o: In function `cudaError cudaLaunch<char>(char*)':
tmpxft_0000717b_00000000-4_backpropalgorithm_CUDA_kernel_copy.cudafe1.cpp:(.text+0x6c0): undefined reference to `cudaLaunch'
backpropalgorithm_CUDA_main_copy.o: In function `main':
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x92): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0xf8): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x118): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x12c): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x14c): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x160): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x180): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x194): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x1b4): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x1c8): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x1e8): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x1ff): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x21f): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x236): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x256): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x26a): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x28a): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x2a1): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x2c1): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x2d5): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x2f5): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x309): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x329): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x33d): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x35d): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x371): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x391): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x3a5): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x3c5): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x3dc): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x3fc): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x413): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x433): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x44a): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x46a): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x481): undefined reference to `cudaMalloc'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x4a1): undefined reference to `cudaMemcpy'
backpropalgorithm_CUDA_main_copy.cpp:(.text+0x5bf): undefined reference to `cudaDeviceSynchronize'
collect2: error: ld returned 1 exit status

The problem is that in the first case, which "successfully" compiles and links, when I run the program it shows only a blinking cursor in the console (on the line after the command) and nothing else; normally it should compute and display the error of the neural network being trained, using CUDA.

II) I am trying to printf() from the .cu file, but nothing is displayed. I searched around and found that I should perhaps use the cuPrintf() functions. I tried them, but even when I include the headers manually, the functions remain undefined; there is some problem with the include files. I found that I should include a cuPrintf.cu file, whose source code I found on the web.

Unfortunately, when I compile the files separately, the error for the .cu file is

ptxas fatal   : Unresolved extern function '_Z8cuPrintfIjEiPKcT_'

The .cpp file, however, gives no errors.

Why do all these errors occur? Which part is at fault? Why doesn't the program run properly, and why does printf() appear not to work inside the kernel? Why does the program show only a blinking cursor and nothing more? I would appreciate it if anyone could enlighten me on any of these questions.

Thanks a lot in advance! The code of the two files is:

file1.cpp

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string>
#include "/home/user/include_files/cuda-8.0/include/cuda.h"
#include "/home/user/include_files/cuda-8.0/include/cuda_runtime.h"
#include "/home/user/include_files/cuda-8.0/include/cuda_runtime_api.h"
#define datanum 4       // number of training samples
#define InputN 16       // number of neurons in the input layer
#define hn 64           // number of neurons in the hidden layer
#define OutN 1          // number of neurons in the output layer
#define threads_per_block 256


using namespace std;
extern "C"
void launch(float *randData, int *times, int *loop, double *error, double *max, double *min, double *x_out, double *hn_out, double *y_out, double *y, double *w, double *v, double *deltaw, double *deltav, double *hn_delta, double *y_delta, double *alpha, double *beta, double *sumtemp, double *errtemp);
__global__ void neural_network_kernel (float *randData, int *times, int *loop, double *error, double *max, double *min, double *x_out, double *hn_out, double *y_out, double *y, double *w, double *v, double *deltaw, double *deltav, double *hn_delta, double *y_delta, double *alpha, double *beta, double *sumtemp, double *errtemp);
int main(int argc, char *argv[]){
    printf("welcome1\n");
    int times = 100000;
    double sigmoid(double);
    //string result = "";
    char buffer[200];
    printf("welcome2\n");
    double x_out[InputN];       // input layer
    printf("welcome3\n");
    double hn_out[hn];          // hidden layer
    printf("welcome4\n");
    double y_out[OutN];         // output layer
    printf("welcome5\n");
    double y[OutN];             // expected output layer
    printf("welcome6\n");
    double w[InputN][hn];       // weights from input layer to hidden layer
    double v[hn][OutN];         // weights from hidden layer to output layer
    double deltaw[InputN][hn];
    double deltav[hn][OutN];
    printf("welcome7\n");
    double hn_delta[hn];        // delta of hidden layer
    double y_delta[OutN];       // delta of output layer
    //double errlimit = 0.001;
    double alpha = 0.1, beta = 0.1;
    int i, j, m;
    double sumtemp;
    double errtemp;

    /*cudaPrintfInit();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();*/
    printf("Line : main\n");
    // Training

    /*struct{
        double input[InputN];
        double teach[OutN];
    }data[datanum];
    for(m=0; m<datanum; m++){
        for(i=0; i<InputN; i++)
            data[m].input[i] = (double)rand()/32767.0;
        for(i=0;i<OutN;i++)
            data[m].teach[i] = (double)rand()/32767.0;
    }
    // Initialization
    for(i=0; i<InputN; i++){
        for(j=0; j<hn; j++){
            w[i][j] = ((double)rand()/32767.0)*2-1;
            deltaw[i][j] = 0;
        }
    }
    for(i=0; i<hn; i++){
        for(j=0; j<OutN; j++){
            v[i][j] = ((double)rand()/32767.0)*2-1;
            deltav[i][j] = 0;
        }
    }*/

    //curandGenerator_t gen;
    srand (time(NULL));
    float randData[threads_per_block];
    printf("welcome8\n");
    for (int i=0; i<threads_per_block; i++)
    {
        randData[i] = rand()%100;   //Else, without %100, it returns some billions for number!
    }
    printf("welcome9\n");
    /*curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, randData, threads_per_block);*/
    int loop = 0;
    double error;
    double max, min;
    double *max_p_GPU, *min_p_GPU, *error_p_GPU;
    float *randData_p_GPU;
    int *times_p_GPU, *loop_p_GPU, *InputN_p_GPU, *hn_p_GPU, *OutN_p_GPU;
    double *x_out_p_GPU, *hn_out_p_GPU, *y_out_p_GPU, *y_p_GPU, *w_p_GPU, *v_p_GPU, *deltaw_p_GPU, *deltav_p_GPU, *hn_delta_p_GPU;
    double *y_delta_p_GPU, *alpha_p_GPU, *beta_p_GPU, *sumtemp_p_GPU, *errtemp_p_GPU;
    //int blocks = times/threads_per_block;
    printf("welcome10\n");
    cudaMalloc((void **)&randData_p_GPU, threads_per_block*sizeof(float));
    printf("DEBUG1\n");
    cudaMemcpy(randData_p_GPU, randData, threads_per_block*sizeof(float), cudaMemcpyHostToDevice);
    printf("welcome11\n");
    cudaMalloc((void **)&times_p_GPU, sizeof(int));
    printf("welcome12\n");
    cudaMemcpy(times_p_GPU, &times, sizeof(int), cudaMemcpyHostToDevice);
    printf("welcome13\n");
    cudaMalloc((void **)&loop_p_GPU, sizeof(int));
    printf("welcome14\n");
    cudaMemcpy(loop_p_GPU, &loop, sizeof(int), cudaMemcpyHostToDevice);
    printf("welcome15\n");
    cudaMalloc((void **)&error_p_GPU, sizeof(double));
    printf("welcome16\n");
    cudaMemcpy(error_p_GPU, &error, sizeof(double), cudaMemcpyHostToDevice);
    printf("welcome17\n");
    cudaMalloc((void **)&max_p_GPU, sizeof(double));
    printf("welcome18\n");
    cudaMemcpy(max_p_GPU, &max, sizeof(double), cudaMemcpyHostToDevice);
    printf("welcome19\n");
    cudaMalloc((void **)&min_p_GPU, sizeof(double));
    printf("welcome20\n");
    cudaMemcpy(min_p_GPU, &min, sizeof(double), cudaMemcpyHostToDevice);
    printf("welcome21\n");
    /* cudaMalloc((void **)&InputN_p_GPU, sizeof(int));
    cudaMemcpy(InputN_p_GPU, &InputN, sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&hn_p_GPU, sizeof(int));
    cudaMemcpy(hn_p_GPU, &hn, sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&OutN_p_GPU, sizeof(int));
    cudaMemcpy(OutN_p_GPU, &OutN, sizeof(int), cudaMemcpyHostToDevice); */
    /*cudaMalloc((void **)&x_out_p_GPU, sizeof(double)*(threads_per_block*InputN));
    cudaMemcpy(x_out_p_GPU, &x_out, sizeof(double)*InputN, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&hn_out_p_GPU, sizeof(double)*(threads_per_block*hn));
    cudaMemcpy(hn_out_p_GPU, &hn_out, sizeof(double)*hn, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&y_out_p_GPU, sizeof(double)*(threads_per_block*OutN));
    cudaMemcpy(y_out_p_GPU, &y_out, sizeof(double)*OutN, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&hn_delta_p_GPU, sizeof(double)*(threads_per_block*hn));
    cudaMemcpy(hn_delta_p_GPU, &hn_delta, sizeof(double)*hn, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&y_delta_p_GPU, sizeof(double)*(threads_per_block*OutN));
    cudaMemcpy(y_delta_p_GPU, &y_delta, sizeof(double)*OutN, cudaMemcpyHostToDevice);*/
    cudaMalloc((void **)&x_out_p_GPU, sizeof(double)*InputN);
    printf("welcome22\n");
    cudaMemcpy(x_out_p_GPU, &x_out, sizeof(double)*InputN, cudaMemcpyHostToDevice);
    printf("welcome23\n");
    cudaMalloc((void **)&hn_out_p_GPU, sizeof(double)*hn);
    printf("welcome24\n");
    cudaMemcpy(hn_out_p_GPU, &hn_out, sizeof(double)*hn, cudaMemcpyHostToDevice);
    printf("welcome25\n");
    cudaMalloc((void **)&y_out_p_GPU, sizeof(double)*OutN);
    printf("welcome26\n");
    cudaMemcpy(y_out_p_GPU, &y_out, sizeof(double)*OutN, cudaMemcpyHostToDevice);
    printf("welcome27\n");
    cudaMalloc((void **)&hn_delta_p_GPU, sizeof(double)*hn);
    printf("welcome28\n");
    cudaMemcpy(hn_delta_p_GPU, &hn_delta, sizeof(double)*hn, cudaMemcpyHostToDevice);
    printf("welcome29\n");
    cudaMalloc((void **)&y_delta_p_GPU, sizeof(double)*OutN);
    printf("welcome30\n");
    cudaMemcpy(y_delta_p_GPU, &y_delta, sizeof(double)*OutN, cudaMemcpyHostToDevice);
    printf("welcome31\n");
    cudaMalloc((void **)&alpha_p_GPU, sizeof(double));
    cudaMemcpy(alpha_p_GPU, &alpha, sizeof(double), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&beta_p_GPU, sizeof(double));
    cudaMemcpy(beta_p_GPU, &beta, sizeof(double), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&sumtemp_p_GPU, sizeof(double));
    cudaMemcpy(sumtemp_p_GPU, &sumtemp, sizeof(double), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&errtemp_p_GPU, sizeof(double));
    cudaMemcpy(errtemp_p_GPU, &errtemp, sizeof(double), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&w_p_GPU, sizeof(double)*InputN*hn);
    cudaMemcpy(w_p_GPU, &w, sizeof(double)*(InputN*hn), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&v_p_GPU, sizeof(double)*hn*OutN);
    cudaMemcpy(v_p_GPU, &v, sizeof(double)*(hn*OutN), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&deltaw_p_GPU, sizeof(double)*InputN*hn);
    cudaMemcpy(deltaw_p_GPU, &deltaw, sizeof(double)*(InputN*hn), cudaMemcpyHostToDevice);
    cudaMalloc((void **)&deltav_p_GPU, sizeof(double)*hn*OutN);
    cudaMemcpy(deltav_p_GPU, &deltav, sizeof(double)*(hn*OutN), cudaMemcpyHostToDevice);
    printf("welcome40\n");
    launch(randData, times_p_GPU, loop_p_GPU, error_p_GPU, max_p_GPU, min_p_GPU, x_out_p_GPU, hn_out_p_GPU, y_out_p_GPU, y_p_GPU, w_p_GPU, v_p_GPU, deltaw_p_GPU, deltav_p_GPU, hn_delta_p_GPU, y_delta_p_GPU, alpha_p_GPU, beta_p_GPU, sumtemp_p_GPU, errtemp_p_GPU);
    printf("welcome41\n");
    cudaDeviceSynchronize();
    printf("welcome_after_kernel\n");
}

file2.cu

#define w(i,j) w[(i)*(InputN*hn) + (j)]
#define v(i,j) v[(i)*(hn*OutN) + (j)]
#define x_out(i,j) x_out[(i)*(InputN) + (j)]
#define y(i,j) y[(i)*(OutN) + (j)]
#define hn_out(i,j) hn_out[(i)*(hn) + (j)]
#define y_out(i,j) y_out[(i)*(OutN) + (j)]
#define y_delta(i,j) y_delta[(i)*(OutN) + (j)]
#define hn_delta(i,j) hn_delta[(i)*(hn) + (j)]
#define deltav(i,j) deltav[(i)*(hn*OutN) + (j)]
#define deltaw(i,j) deltaw[(i)*(InputN*hn) + (j)]
#define datanum 4       // number of training samples
#define InputN 16       // number of neurons in the input layer
#define hn 64           // number of neurons in the hidden layer
#define OutN 1          // number of neurons in the output layer
#define threads_per_block 256
#define MAX_RAND 100
#define MIN_RAND 10
#include <stdio.h>
#include <math.h>   //for truncf()

// sigmoid serves as activation function
__device__ double sigmoid(double x){
    return(1.0 / (1.0 + exp(-x)));
}

__device__ int rand_kernel(int index, float *randData){
    float myrandf = randData[index];
    myrandf *= (MAX_RAND - MIN_RAND + 0.999999);
    myrandf += MIN_RAND;
    int myrand = (int)truncf(myrandf);
    return myrand;
}

__global__ void neural_network_kernel (float *randData, int *times, int *loop, double *error, double *max, double *min, double *x_out, double *hn_out, double *y_out, double *y, double *w, double *v, double *deltaw, double *deltav, double *hn_delta, double *y_delta, double *alpha, double *beta, double *sumtemp, double *errtemp)
{
    //int i = blockIdx.x;
    //int idx = threadIdx.x;
    //int idy = threadIdx.y
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // training set
    struct{
        double input_kernel[InputN];
        double teach_kernel[OutN];
    }data_kernel[threads_per_block + datanum];
    if (index==0)
    {
        for(int m=0; m<datanum; m++){
            for(int i=0; i<InputN; i++)
                data_kernel[threads_per_block + m].input_kernel[i] = (double)rand_kernel(index, randData)/32767.0;
            for(int i=0;i<OutN;i++)
                data_kernel[threads_per_block + m].teach_kernel[i] = (double)rand_kernel(index, randData)/32767.0;
        }
    }

    // Initialization
    for(int i=0; i<InputN; i++){
        for(int j=0; j<hn; j++){
            w(i,j) = ((double)rand_kernel(index, randData)/32767.0)*2-1;
            deltaw(i,j) = 0;
        }
    }
    for(int i=0; i<hn; i++){
        for(int j=0; j<OutN; j++){
            v(i,j) = ((double)rand_kernel(index, randData)/32767.0)*2-1;
            deltav(i,j) = 0;
        }
    }

    while(loop[index] < *times){
        loop[index]++;
        error[index] = 0.0;
        for(int m=0; m<datanum ; m++){
            // Feedforward
            max[index] = 0.0;
            min[index] = 0.0;
            for(int i=0; i<InputN; i++){
                x_out(index,i) = data_kernel[threads_per_block + m].input_kernel[i];
                if(max[index] < x_out(index,i))
                    max[index] = x_out(index,i);
                if(min[index] > x_out(index,i))
                    min[index] = x_out(index,i);
            }
            for(int i=0; i<InputN; i++){
                x_out(index,i) = (x_out(index,i) - min[index]) / (max[index] - min[index]);
            }
            for(int i=0; i<OutN ; i++){
                y(index,i) = data_kernel[threads_per_block + m].teach_kernel[i];
            }
            for(int i=0; i<hn; i++){
                sumtemp[index] = 0.0;
                for(int j=0; j<InputN; j++)
                    sumtemp[index] += w(j,i) * x_out(index,j);
                hn_out(index,i) = sigmoid(sumtemp[index]);      // sigmoid serves as the activation function
            }
            for(int i=0; i<OutN; i++){
                sumtemp[index] = 0.0;
                for(int j=0; j<hn; j++)
                    sumtemp[index] += v(j,i) * hn_out(index,j);
                y_out(index,i) = sigmoid(sumtemp[index]);
            }
            // Backpropagation
            for(int i=0; i<OutN; i++){
                errtemp[index] = y(index,i) - y_out(index,i);
                y_delta(index,i) = -errtemp[index] * sigmoid(y_out(index,i)) * (1.0 - sigmoid(y_out(index,i)));
                error[index] += errtemp[index] * errtemp[index];
            }
            for(int i=0; i<hn; i++){
                errtemp[index] = 0.0;
                for(int j=0; j<OutN; j++)
                    errtemp[index] += y_delta(index,j) * v(i,j);
                hn_delta(index,i) = errtemp[index] * (1.0 + hn_out(index,i)) * (1.0 - hn_out(index,i));
            }
            // Stochastic gradient descent
            for(int i=0; i<OutN; i++){
                for(int j=0; j<hn; j++){
                    deltav(j,i) = (*alpha) * deltav(j,i) + (*beta) * y_delta(index,i) * hn_out(index,j);
                    v(j,i) -= deltav(j,i);
                }
            }
            for(int i=0; i<hn; i++){
                for(int j=0; j<InputN; j++){
                    deltaw(j,i) = (*alpha) * deltaw(j,i) + (*beta) * hn_delta(index,i) * x_out(index,j);
                    w(j,i) -= deltaw(j,i);
                }
            }
        }
        // Global error
        error[index] = error[index] / 2;
        /*if(loop%1000==0){
            result = "Global Error = ";
            sprintf(buffer, "%f", error);
            result += buffer;
            result += "\r\n";
        }
        if(error < errlimit)
            break;*/
        printf("The %d th training, error: %0.100f\n", loop[index], error[index]);
    }
}

extern "C"
void launch(float *randData, int *times, int *loop, double *error, double *max, double *min, double *x_out, double *hn_out, double *y_out, double *y, double *w, double *v, double *deltaw, double *deltav, double *hn_delta, double *y_delta, double *alpha, double *beta, double *sumtemp, double *errtemp)
{
int blocks = *times/threads_per_block;
neural_network_kernel<<<blocks, threads_per_block>>>(randData, times, loop, error, max, min, x_out, hn_out, y_out, y, w, v, deltaw, deltav, hn_delta, y_delta, alpha, beta, sumtemp, errtemp);
}

UPDATE:

I found some errors regarding the memory allocation of the pointers. I have updated the code above... The main questions now are:

1) Is the linking/compiling correct; is this the way I am supposed to do it? I mean the first way.

2) I found that the blinking cursor appears right at the first cudaMalloc(). Everything up to that point runs normally.

But why does the program hang forever at that first cudaMalloc()?

Before asking for help here, it would be best to use proper CUDA error checking and to run your code with cuda-memcheck. If you don't, you will likely overlook useful error information, wasting your own time as well as that of others trying to help you.
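For example, one commonly used pattern wraps every runtime API call in a checking macro. A minimal sketch (the helper names here are arbitrary, not part of the CUDA API):

#include <stdio.h>
#include <stdlib.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit((int)code);
    }
}

// Wrap every runtime call, e.g.:
// gpuErrchk(cudaMalloc((void **)&randData_p_GPU, threads_per_block*sizeof(float)));
// and check after a kernel launch:
// gpuErrchk(cudaGetLastError());
// gpuErrchk(cudaDeviceSynchronize());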

In the second case the -l arguments are not recognized (only -lcuda is), I suppose because I don't know where those libraries are stored and therefore didn't specify their paths.

You don't want to skip those. nvcc links some of those libraries for you automatically and knows automatically where to find them. When you use g++, you have to tell it where to look and which specific libraries you need. For the code you have shown, you don't need all the libraries you are linking against, so the following should be sufficient:

g++ -o program file1.o file2.o -L/usr/local/cuda/lib64 -lcudart 

for a standard Linux install of CUDA. If you don't have a standard install, you can do which nvcc to find out where nvcc is, and use that to work out the likely location of the libraries (change bin in the path to lib64).
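For example, on a typical installation (output illustrative):

$ which nvcc
/usr/local/cuda/bin/nvcc

which would put the runtime library under /usr/local/cuda/lib64, matching the -L path used above.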

If you really do need some of the other libraries, things like cutil and cudpp will not be available unless you take special steps to install them, in which case you will need to determine their paths.

Regarding cuPrintf, you don't need it if you are compiling for and running on a cc2.0 or newer GPU (which is the minimum compute capability supported by CUDA 8 anyway). Plain printf should work in device code, and if it doesn't (because you have a device code bug; use proper error checking and cuda-memcheck), then cuPrintf won't work any better. So rather than struggling to make it work, revert your code to using printf (and include stdio.h).
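A minimal standalone sketch of device-side printf, assuming a cc2.0+ GPU (compile with nvcc):

#include <stdio.h>

__global__ void hello_kernel()
{
    printf("hello from thread %d\n", threadIdx.x);
}

int main()
{
    hello_kernel<<<1, 4>>>();
    cudaDeviceSynchronize();   // needed: flushes the device-side printf buffer before exit
    return 0;
}

If something like this prints nothing on your machine, the problem is with your setup or GPU, not with printf itself.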

Regarding your program and why it doesn't work, I think you probably have a variety of errors. You may want to learn how to use a debugger. For starters, in your host code, the way you set up and pass randData is illegal.
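Presumably this refers to the kernel launch: launch() is called with the host array randData rather than with the device buffer randData_p_GPU that was allocated and filled just before it, so the kernel ends up dereferencing a host pointer. A sketch of the corrected call, using your own variable names:

launch(randData_p_GPU,   // device buffer, not the host array randData
       times_p_GPU, loop_p_GPU, error_p_GPU, max_p_GPU, min_p_GPU,
       x_out_p_GPU, hn_out_p_GPU, y_out_p_GPU, y_p_GPU, w_p_GPU, v_p_GPU,
       deltaw_p_GPU, deltav_p_GPU, hn_delta_p_GPU, y_delta_p_GPU,
       alpha_p_GPU, beta_p_GPU, sumtemp_p_GPU, errtemp_p_GPU);

Note also that y_p_GPU is declared but never allocated with cudaMalloc in the code shown, so it would need an allocation as well.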

Now that I see you have changed this question several times, turning it into a moving target, I am going to stop.

If you want help, stop moving the target.

Use proper CUDA error checking.
