Problem selecting the best available GPU programmatically with OpenCL



I am using the advice given here to select the best GPU for my algorithm: https://stackoverflow.com/a/33488953/5371117

I query the devices on my MacBook Pro with boost::compute::system::devices();, which returns the following list of devices.

Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Intel(R) UHD Graphics 630
AMD Radeon Pro 560X Compute Engine

I want to use the AMD Radeon Pro 560X Compute Engine for my purposes, but when I iterate over the devices to find the one with the maximum rating = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS, I get the following results:

Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, 
freq: 2600, compute units: 12, rating:31200
Intel(R) UHD Graphics 630, 
freq: 1150, units: 24, rating:27600
AMD Radeon Pro 560X Compute Engine, 
freq: 300, units: 16, rating:4800

The AMD GPU ends up with the lowest rating. I also checked the specs, and it looks to me as if CL_DEVICE_MAX_CLOCK_FREQUENCY is not returning the correct values.

According to the AMD chip specs https://www.amd.com/en/products/graphics/radeon-rx-560x, my AMD GPU has a base frequency of 1175 MHz, not 300 MHz.

According to the Intel chip specs https://en.wikichip.org/wiki/intel/uhd_graphics/630, my Intel GPU has a base frequency of 300 MHz, not 1150 MHz, but it does have a boost frequency of 1150 MHz.

std::vector<boost::compute::device> devices = boost::compute::system::devices();
std::pair<boost::compute::device, ai::int64> suitableDevice{};
for(auto& device: devices)
{
    auto rating = device.clock_frequency() * device.compute_units();
    std::cout << device.name() << ", freq: " << device.clock_frequency() << ", units: " << device.compute_units() << ", rating:" << rating << std::endl;
    if(suitableDevice.second < rating) // keep the device with the highest rating so far
    {
        suitableDevice.first = device;
        suitableDevice.second = rating;
    }
}

Am I doing something wrong?

Unfortunately, those properties are only really directly comparable within the same implementation (same hardware manufacturer, same operating system).

My recommendations would be:

  • First, filter out anything whose device type is not CL_DEVICE_TYPE_GPU (unless there are no GPUs available at all, in which case you may want to fall back to the CPU).
  • Check any other device properties that matter to you. For example, if you need support for a particular OpenCL version or extension, or particularly large work groups or local memory, check all remaining devices and filter out any that cannot run your code.
  • Test whether any of the remaining devices return true for the CL_DEVICE_HOST_UNIFIED_MEMORY property. Those are integrated GPUs, which are usually slower than discrete ones, unless you are bound by data transfer speeds, in which case they may be faster. So you will want to prefer one kind over the other.
  • If you still have more than one device left at this point, you can apply your existing heuristic. (A rough sketch of these steps follows below.)
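As a starting point, here is a minimal sketch (mine, not part of the original answer) of these filtering steps using Boost.Compute, as in the question. The helper name pick_device() is made up for illustration; adjust the checks to your own requirements:

#include <boost/compute/system.hpp>
#include <boost/compute/device.hpp>
#include <vector>

boost::compute::device pick_device()
{
    namespace compute = boost::compute;
    std::vector<compute::device> devices = compute::system::devices(); // assumes at least one device exists

    // 1. keep only GPUs, unless there are none, in which case fall back to all devices
    std::vector<compute::device> candidates;
    for(const auto& d : devices)
        if(d.type() & compute::device::gpu) candidates.push_back(d);
    if(candidates.empty()) candidates = devices;

    // 2. here you would also drop devices lacking the OpenCL version, extensions,
    //    work-group sizes or local memory your kernels require

    // 3. prefer discrete GPUs (no host-unified memory), if any exist
    std::vector<compute::device> discrete;
    for(const auto& d : candidates)
        if(!d.get_info<cl_bool>(CL_DEVICE_HOST_UNIFIED_MEMORY)) discrete.push_back(d);
    if(!discrete.empty()) candidates = discrete;

    // 4. finally apply a heuristic, e.g. clock frequency * compute units
    compute::device best = candidates.front();
    for(const auto& d : candidates)
        if(d.clock_frequency() * d.compute_units() > best.clock_frequency() * best.compute_units())
            best = d;
    return best;
}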

This code returns the device with the most floating-point performance:

select_device_with_most_flops(find_devices());

and this one the device with the most memory:

select_device_with_most_memory(find_devices());

First, find_devices() returns a vector of all OpenCL devices in the system. select_device_with_most_memory() is straightforward and uses getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>().

Floating-point performance is given by the equation: FLOPs/s = cores/CU * CUs * IPC * clock frequency

select_device_with_most_flops() is more difficult, because OpenCL only provides the number of compute units (CUs) via getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(). For CPUs this is the number of threads, and for GPUs it has to be multiplied by the number of stream processors / CUDA cores per CU, which differs between Nvidia, AMD and Intel and between their various microarchitectures, and is usually between 4 and 128. Fortunately, the vendor is included in getInfo<CL_DEVICE_VENDOR>(), so based on the vendor and the number of CUs one can figure out the number of cores per CU.

The next part is the FP32 IPC, or instructions per cycle. For most GPUs this is 2, while for recent CPUs it is 32, see https://en.wikipedia.org/wiki/FLOPS?oldformat=true#FLOPs_per_cycle_for_various_processors. There is no way to figure out the IPC directly in OpenCL, so the 32 for CPUs is just a guess; one could use the device name and a lookup table to be more accurate. getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU evaluates to true if the device is a GPU.

The last part is the clock frequency. OpenCL provides the base clock frequency in MHz via getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(). The device can boost to higher frequencies, so this again is an approximation.
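To illustrate how rough this estimate can be with the numbers from the question (this worked example is mine, not part of the original answer): the AMD Radeon Pro 560X has 16 CUs and, as a GCN GPU, 64 cores/CU with an IPC of 2. With the 300 MHz reported by CL_DEVICE_MAX_CLOCK_FREQUENCY this gives 16 * 64 * 2 * 0.3 GHz ≈ 0.6 TFLOPs/s, while with the 1175 MHz base clock from the spec sheet it would be 16 * 64 * 2 * 1.175 GHz ≈ 2.4 TFLOPs/s, so the estimate is only as good as the clock value the driver reports.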

All of this together gives an estimate of the floating-point performance. The full code looks like this:

typedef unsigned int uint;
string trim(const string s) { // removes whitespace characters from beginning and end of string s
	const int l = (int)s.length();
	int a=0, b=l-1;
	char c;
	while(a<l && ((c=s.at(a))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) a++;
	while(b>a && ((c=s.at(b))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) b--;
	return s.substr(a, 1+b-a);
}
bool contains(const string s, const string match) {
	return s.find(match)!=string::npos;
}
vector<Device> find_devices() {
	vector<Platform> platforms; // get all platforms (drivers)
	vector<Device> devices_available;
	vector<Device> devices; // get all devices of all platforms
	Platform::get(&platforms);
	if(platforms.size()==0) print_error("There are no OpenCL devices available. Make sure that the OpenCL 1.2 Runtime for your device is installed. For GPUs it comes by default with the graphics driver, for CPUs it has to be installed separately.");
	for(uint i=0; i<(uint)platforms.size(); i++) {
		devices_available.clear();
		platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &devices_available); // CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU
		if(devices_available.size()==0) continue; // no device of type device_type found in platform i
		for(uint j=0; j<(uint)devices_available.size(); j++) devices.push_back(devices_available[j]);
	}
	print_device_list(devices);
	return devices;
}
Device select_device_with_most_flops(const vector<Device> devices) { // return device with best floating-point performance
	float best_value = 0.0f;
	uint best_i = 0; // index of fastest device
	for(uint i=0; i<(uint)devices.size(); i++) { // find device with highest (estimated) floating point performance
		const Device d = devices[i];
		//const string device_name = trim(d.getInfo<CL_DEVICE_NAME>());
		const string device_vendor = trim(d.getInfo<CL_DEVICE_VENDOR>()); // is either Nvidia, AMD or Intel
		const uint device_compute_units = (uint)d.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
		const bool device_is_gpu = d.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
		const uint device_ipc = device_is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
		const uint nvidia = (uint)(contains(device_vendor, "NVIDIA")||contains(device_vendor, "vidia"))*(device_compute_units<=30u?128u:64u); // Nvidia GPUs usually have 128 cores/CU, except Volta/Turing (>30 CUs) which have 64 cores/CU
		const uint amd = (uint)(contains(device_vendor, "AMD")||contains(device_vendor, "ADVANCED")||contains(device_vendor, "dvanced"))*(device_is_gpu?64u:1u); // AMD GCN GPUs usually have 64 cores/CU, AMD CPUs have 1 core/CU
		const uint intel = (uint)(contains(device_vendor, "INTEL")||contains(device_vendor, "ntel"))*(device_is_gpu?8u:1u); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs have 1 core/CU
		const uint device_cores = device_compute_units*(nvidia+amd+intel);
		const uint device_clock_frequency = (uint)d.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
		const float device_tflops = 1E-6f*(float)device_cores*(float)device_ipc*(float)device_clock_frequency; // estimated device floating point performance in TeraFLOPs/s
		if(device_tflops>best_value) { // device_memory>best_value
			best_value = device_tflops; // best_value = device_memory;
			best_i = i; // find index of fastest device
		}
	}
	return devices[best_i];
}
Device select_device_with_most_memory(const vector<Device> devices) { // return device with largest memory capacity
	float best_value = 0.0f;
	uint best_i = 0; // index of device with most memory
	for(uint i=0; i<(uint)devices.size(); i++) { // find device with largest global memory capacity
		const Device d = devices[i];
		const float device_memory = 1E-3f*(float)(d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>()/1048576ull); // in GB
		if(device_memory>best_value) {
			best_value = device_memory;
			best_i = i; // find index of device with most memory
		}
	}
	return devices[best_i];
}
Device select_device_with_id(const vector<Device> devices, const int id) { // return device
	if(id>=0&&id<(int)devices.size()) {
		return devices[id];
	} else {
		print("Your selected device ID ("+to_string(id)+") is wrong.");
		return devices[0]; // is never executed, just to avoid compiler warnings
	}
}
UPDATE: I have now included an improved version of this in the lightweight OpenCL-Wrapper. This correctly computes the FLOPs for all CPUs and GPUs from the last decade or so: https://github.com/ProjectPhysX/OpenCL-Wrapper
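For reference, a minimal sketch of how that wrapper is typically used, based on its README (the exact interface may differ, so check the repository):

#include "opencl.hpp" // single header from the OpenCL-Wrapper repository
int main() {
	Device device(select_device_with_most_flops()); // automatically picks the fastest available OpenCL device
	// ... allocate Memory<T> buffers and run Kernel objects on this device ...
	return 0;
}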
