I am processing a video stream in real time, and I am trying to handle it with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)
Every frame has to be captured and lightly blurred, and whenever possible, I need to run some heavy computation on the 10 latest frames. So I need to capture ALL frames at 30 fps, and I expect the heavy-work results at around 5 fps.
My problem is that I cannot keep the capture running at the right pace: it looks like the heavy computation slows down the capture of frames, either at the CPU level or at the GPU level. I miss some frames...
I tried many solutions. None worked:
- I tried setting up the jobs on 2 streams (picture below):
- the host gets a frame
- the first stream (called stream2): cudaMemcpyAsync copies the frame onto the device, then a first kernel does the basic blur computation. (In the attached image, the blur appears as one short slot at 3.07 s and 3.085 s, then nothing... until the big work has completed.)
- the host checks whether the second stream is available, thanks to a CUDA event, and launches it if possible. In practice, the stream is available about 1 time out of 2.
- the second stream (called stream4): launches the heavy computation in a kernel (kernelCalcul_W2), copies the result out, and records an event.
[Nsight capture]
In practice, I wrote:
cudaStream_t sHigh, sLow;
cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);
cudaEvent_t event_1;
cudaEventCreate(&event_1);
if (frame has arrived)
{
cudaMemcpyAsync(..., sHigh); // HtoD, to upload images in the GPU
blur_Image <<<... , sHigh>>> (...)
if (cudaEventQuery(event_1) == cudaSuccess) hard_work(sLow);
else printf("Event 2 not ready\n");
}
void hard_work( cudaStream_t sLow_)
{
kernelCalcul_W2<<<... , sLow_>>> (...);
cudaMemcpyAsync(... the result..., sLow_); //DtoH
cudaEventRecord(event_1, sLow_);
}
- I tried using only one stream. It is the same code as above, but changing one parameter when launching hard_work (see the sketch right after this list):
- the host gets a frame
- the stream: cudaMemcpyAsync copies the frame onto the device, then the kernel does the basic blur computation. Then, if the CUDA event event_1 is OK, I run the hard work, and I record event_1 again so the next round can get its status. In practice, the stream is always available: I never fall into the "else" part.
This way, while the hard work is running, I expected to "buffer" all the frames to copy, without losing any. But I do lose some: it turns out that each time I get a frame and copy it, event_1 seems OK, so I launch the hard work, and only the next frame arrives late.
- I tried putting the two streams in two different CPU threads (in C). No better (even worse).
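Concretely, the single-stream variant is just the pseudocode above with hard_work launched on the same stream (a sketch only, not the full code):
if (frame has arrived)
{
    cudaMemcpyAsync(..., sHigh); // HtoD, to upload the image to the GPU
    blur_Image <<<... , sHigh>>> (...);
    // same gating as before, but hard_work now runs on the same stream
    if (cudaEventQuery(event_1) == cudaSuccess) hard_work(sHigh);
    else printf("Event 2 not ready\n"); // never reached in practice
}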
So the question is: how can I make sure that the first stream captures ALL the frames? I really have the feeling that the different streams block the CPU.
I display the images with OpenGL. Would that interfere?
Any idea of ways to improve this approach? Thanks a lot!
EDIT: as requested, I put an MCVE here.
There is a parameter you can tune (#define ADJUST) to see what happens. Basically, the main procedure sends CUDA requests in async mode, but it seems to block the main thread anyway. As you will see in the image, I get a "memory access" (i.e., a captured image) every 30 ms, except while the hard work is running (then, I just don't get the images).
One last detail: I am using CUDA 7.5 to run this. I tried installing 8.0, but apparently the compiler is still 7.5.
#define _USE_MATH_DEFINES 1
#define _CRT_SECURE_NO_WARNINGS 1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <Windows.h>
#define ADJUST 400
// adjusting this parameter may make the problem occur.
// Too high => probably the watchdog will stop the kernel
// too low => probably the kernel will run smoothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result = 0.f; // initialize the accumulator (it was left uninitialized)
long a, n, n0;
float input;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2<<<340, 500, 0, s>>>(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s n", cudaGetErrorString(err));
else printf("running okn");
// copy the result back to Host
//printf(" %p %p n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
// storing the images in the Host memory. I know I could optimize with cudaHostAllocate.
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
free(images_as_Output_in_Host);
free(camera);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("nDealing with image %dn", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) * imagePixelSize, s1);
// blur image
blurImage<<<imageHeight - 140, imageWidth - 140, 0, s1>>>(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %dn", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the blurring, and tries to do the hard work
Sleep(30); // to simulate Camera
}
releaseMapFile();
getchar();
}
The main issue here is that cudaMemcpyAsync is only a properly non-blocking asynchronous operation if the host memory involved is pinned, i.e. allocated with cudaHostAlloc. This characteristic is covered in several places, including the API documentation and the relevant programming guide section.
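To see this in isolation, here is a small standalone sketch (my own illustration, separate from your code) that times how long the calling thread spends inside cudaMemcpyAsync for a pageable source versus a pinned one. With a pageable source, the runtime stages the transfer through an internal buffer and the call typically blocks for much of the copy; with a pinned source, it should return almost immediately. The 64 MB size is an arbitrary choice, and exact numbers depend on the driver's staging behavior:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// returns the milliseconds the calling thread spends inside cudaMemcpyAsync
static double timedCopy(void *dst, const void *src, size_t n, cudaStream_t s)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyAsync(dst, src, n, cudaMemcpyHostToDevice, s);
    auto t1 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(s); // drain the stream before the next measurement
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const size_t N = 64 * 1024 * 1024; // 64 MB, arbitrary
    unsigned char *pageable = (unsigned char *)malloc(N);
    unsigned char *pinned = NULL;
    cudaHostAlloc(&pinned, N, cudaHostAllocDefault);
    unsigned char *dev = NULL;
    cudaMalloc(&dev, N);
    cudaStream_t s;
    cudaStreamCreate(&s);
    printf("pageable source: %.2f ms in cudaMemcpyAsync\n", timedCopy(dev, pageable, N, s));
    printf("pinned source:   %.2f ms in cudaMemcpyAsync\n", timedCopy(dev, pinned, N, s));
    cudaStreamDestroy(s);
    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}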
The following modification of your code (run on Linux, which I prefer) demonstrates the behavioral difference:
$ cat t33.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#define ADJUST 400
// adjusting this parameter may make the problem occur.
// Too high => probably the watchdog will stop the kernel
// too low => probably the kernel will run smoothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result = 0.f; // initialize the accumulator (it was left uninitialized)
long a, n, n0;
float input;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
#ifndef QUICK
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2<<<340, 500, 0, s>>>(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s n", cudaGetErrorString(err));
else printf("running okn");
// copy the result back to Host
//printf(" %p %p n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize/2, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
#endif
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
#ifdef USE_HOST_ALLOC
cudaHostAlloc(&camera, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
cudaHostAlloc(&images_as_Unsigned_in_Host, imagePixelSize*sizeof(unsigned short)*totNbOfImages, cudaHostAllocDefault);
cudaHostAlloc(&images_as_Output_in_Host, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
#else
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
#endif
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
cudaEventRecord(event_2, s2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("nDealing with image %dn", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) * imagePixelSize, s1);
// blur image
blurImage<<<imageHeight - 140, imageWidth - 140, 0, s1>>>(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %dn", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the blurring, and tries to do the hard work
usleep(30000); // to simulate Camera
}
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("some CUDA error: %sn", cudaGetErrorString(err));
releaseMapFile();
}
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard work is launched after image 1 is captured ==> running ok
Dealing with image 2
Hard work is launched after image 2 is captured ==> running ok
Dealing with image 3
Hard work is launched after image 3 is captured ==> running ok
Dealing with image 4
Hard work is launched after image 4 is captured ==> running ok
Dealing with image 5
Hard work is launched after image 5 is captured ==> running ok
Dealing with image 6
Hard work is launched after image 6 is captured ==> running ok
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard work is launched after image 8 is captured ==> running ok
Dealing with image 9
Hard work is launched after image 9 is captured ==> running ok
real 0m2.790s
user 0m0.688s
sys 0m0.966s
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu -DUSE_HOST_ALLOC
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard_work still running, so unable to process after image 1
Dealing with image 2
Hard_work still running, so unable to process after image 2
Dealing with image 3
Hard_work still running, so unable to process after image 3
Dealing with image 4
Hard_work still running, so unable to process after image 4
Dealing with image 5
Hard_work still running, so unable to process after image 5
Dealing with image 6
Hard_work still running, so unable to process after image 6
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard_work still running, so unable to process after image 8
Dealing with image 9
Hard_work still running, so unable to process after image 9
real 0m1.721s
user 0m0.028s
sys 0m0.629s
$
In the USE_HOST_ALLOC case above, the launch pattern of the low-priority kernel is intermittent, as expected, and the overall run time is much shorter.
In short, if you want the expected behavior out of cudaMemcpyAsync, make sure all participating host allocations are page-locked.
An example of the effect pinning can have on multi-stream behavior can be seen in this answer.
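If your frames arrive in a buffer you did not allocate yourself (for instance one handed out by the camera driver), note that an existing allocation can also be pinned after the fact with cudaHostRegister. A minimal sketch of that approach (my illustration; the frame size just mirrors your 640x480 images):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 640 * 480 * sizeof(unsigned short); // one frame
    // pretend this buffer came from a camera API we don't control
    unsigned short *frame = (unsigned short *)malloc(bytes);

    // pin the existing allocation so cudaMemcpyAsync can treat it
    // like a cudaHostAlloc'd buffer
    cudaError_t err = cudaHostRegister(frame, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) printf("cudaHostRegister: %s\n", cudaGetErrorString(err));

    unsigned short *devFrame = NULL;
    cudaMalloc(&devFrame, bytes);
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(devFrame, frame, bytes, cudaMemcpyHostToDevice, s); // now truly async
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(devFrame);
    cudaHostUnregister(frame); // unpin before freeing
    free(frame);
    return 0;
}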