How to manage CUDA streams and TensorRT contexts in a multi-threaded GPU application

For a TensorRT .trt file, we load it into an engine and create a TensorRT execution context for that engine. We then run inference by calling context->enqueueV2().

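Do we need to call cudaCreateStream() after creating the TensorRT context? Or do we only need to call SetDevice() after selecting the GPU device? How does TensorRT associate a CUDA stream with a TensorRT context?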

Can we use multiple streams with one TensorRT context?

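In a multi-threaded C++ application, each thread uses one model for inference, and one model might be loaded in more than one thread. Within a single thread, do we need just 1 engine, 1 context, and 1 stream, or multiple streams?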

Do we need to call cudaCreateStream() after creating the TensorRT context?

By cudaCreateStream(), do you mean cudaStreamCreate()?

You can create the streams after you have created the engine and the runtime.

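As a bit of bonus trivia, you do not have to use CUDA streams at all. I tried copying the data from host to device, calling enqueueV2(), and then copying the data from device to host without using a CUDA stream. It worked fine.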

How does TensorRT associate a CUDA stream with a TensorRT context?

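The association is simply that you pass the same CUDA stream as an argument to all of the function calls. The following C++ code illustrates this: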

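#include <cuda_runtime_api.h>
#include <NvInfer.h>
#include <vector>

// Assumption: binding 0 is the input and binding 1 is the output; adjust the indices for your engine.
// The caller allocates hostOutputMemory with room for the output binding.
bool infer(nvinfer1::IExecutionContext& executionContext, std::vector<void*>& deviceMemory,
           const void* hostInputMemory, size_t hostInputMemorySizeBytes,
           void* hostOutputMemory, size_t hostOutputMemorySizeBytes, cudaStream_t cudaStream)
{
    // Copy the input from host to device on the given stream.
    if (cudaMemcpyAsync(deviceMemory.at(0), hostInputMemory, hostInputMemorySizeBytes,
                        cudaMemcpyHostToDevice, cudaStream) != cudaSuccess) { return false; /* handle errors */ }
    // Enqueue inference on the same stream.
    if (!executionContext.enqueueV2(deviceMemory.data(), cudaStream, nullptr)) { return false; /* handle errors */ }
    // Copy the output from device to host, still on the same stream.
    if (cudaMemcpyAsync(hostOutputMemory, deviceMemory.at(1), hostOutputMemorySizeBytes,
                        cudaMemcpyDeviceToHost, cudaStream) != cudaSuccess) { return false; /* handle errors */ }
    // Block until all work queued on the stream has finished.
    return cudaStreamSynchronize(cudaStream) == cudaSuccess;
}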

If you want a complete working C++ example, you can check this repository. The code above is only an illustration.
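To see why passing one stream to every call is enough to tie the steps together, here is a toy model that compiles without a GPU. ToyStream and inferOnStream are hypothetical names invented for this sketch; they are not CUDA or TensorRT types. The model only captures the property that operations issued to one stream run in issue order and that the host must synchronize before reading results. Running the queued work lazily inside synchronize() is a simplification: a real stream starts work asynchronously.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy model of a CUDA stream (not cudaStream_t): operations run strictly in issue order.
class ToyStream {
public:
    void enqueue(std::function<void()> op) { pending_.push(std::move(op)); }

    // Stands in for cudaStreamSynchronize(): returns once every queued operation has run.
    void synchronize() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

private:
    std::queue<std::function<void()>> pending_;
};

// Issues copy-in, placeholder "inference", and copy-out on one stream, then waits.
std::vector<float> inferOnStream(ToyStream& stream, const std::vector<float>& hostInput) {
    std::vector<float> deviceInput, deviceOutput, hostOutput;

    // Plays the role of cudaMemcpyAsync, host to device.
    stream.enqueue([&] { deviceInput = hostInput; });
    // Plays the role of enqueueV2; adding 1 is placeholder math, not real inference.
    stream.enqueue([&] {
        for (float v : deviceInput) deviceOutput.push_back(v + 1.0f);
    });
    // Plays the role of cudaMemcpyAsync, device to host.
    stream.enqueue([&] { hostOutput = deviceOutput; });

    // In this toy model nothing has run yet; with a real stream the output
    // is simply not guaranteed to be ready before synchronizing.
    assert(hostOutput.empty());
    stream.synchronize();
    return hostOutput;
}
```

Because all three operations go to the same queue, the copy-out can never observe the buffers before the inference step has filled them, which is the guarantee you rely on when you hand one cudaStream_t to both cudaMemcpyAsync calls and to enqueueV2().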

Can we use multiple streams with one TensorRT context?

If I understand your question correctly, then according to this document the answer is no.

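In a multi-threaded C++ application, each thread uses one model for inference, and one model might be loaded in more than one thread. Within a single thread, do we need just 1 engine, 1 context, and 1 stream, or multiple streams?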

one model might be loaded in more than 1 thread

This does not sound right.

An engine (nvinfer1::ICudaEngine) is created from a TensorRT engine file. The engine, in turn, creates the execution context that is used for inference.

This section of the TensorRT Developer Guide states which operations are thread safe. Everything else can be considered non-thread-safe.
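The ownership pattern this implies (one engine loaded once, with each thread using its own execution context and its own stream) can be sketched with the standard library alone. Engine, Context, Stream, and runWorkers below are hypothetical stand-ins for nvinfer1::ICudaEngine, nvinfer1::IExecutionContext, and cudaStream_t, chosen so the sketch compiles without a GPU; the multiplication in enqueue() is a placeholder for real inference. The contexts are created on the main thread before the workers start, so the sketch assumes nothing about whether context creation itself is thread safe.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for cudaStream_t: just an id.
struct Stream { int id; };

// Hypothetical stand-in for nvinfer1::IExecutionContext.
struct Context {
    int scale;  // placeholder for per-context state

    // Placeholder for enqueueV2(bindings, stream, nullptr).
    bool enqueue(const std::vector<int>& in, std::vector<int>& out, const Stream&) const {
        out.clear();
        for (int v : in) out.push_back(v * scale);
        return true;
    }
};

// Hypothetical stand-in for nvinfer1::ICudaEngine: loaded once, shared by all threads.
struct Engine {
    int scale;
    Context createExecutionContext() const { return Context{scale}; }
};

// Starts one worker per thread; each worker uses exactly one context and one stream.
std::vector<std::vector<int>> runWorkers(const Engine& engine, std::size_t threadCount) {
    assert(threadCount > 0);

    std::vector<Context> contexts;
    std::vector<Stream> streams;
    for (std::size_t i = 0; i < threadCount; ++i) {
        contexts.push_back(engine.createExecutionContext());  // one context per thread
        streams.push_back(Stream{static_cast<int>(i)});       // one stream per thread
    }

    std::vector<std::vector<int>> outputs(threadCount);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < threadCount; ++i) {
        workers.emplace_back([&, i] {
            const std::vector<int> input{static_cast<int>(i), static_cast<int>(i) + 1};
            // Each thread touches only its own context, stream, and output slot.
            const bool ok = contexts[i].enqueue(input, outputs[i], streams[i]);
            assert(ok);
            (void)ok;
        });
    }
    for (auto& w : workers) w.join();
    return outputs;
}
```

Calling runWorkers(Engine{2}, 3) runs three workers against the same engine. With the real libraries, the equivalent would be a single deserialized engine, one createExecutionContext() result per thread, and one cudaStreamCreate() result per thread.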
