C - Time needed to offload a function to Intel Xeon Phi



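Is there a predefined amount of time that an offload call needs in order to transfer a function's data (parameters) from the host to the Intel MIC (Xeon Phi Coprocessor 3120 series)?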


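Specifically, I make an offload call ("#pragma offload target(mic)") for a function that I want to execute on the MIC. The function has 15 parameters (pointers and variables), and I have already confirmed that the parameters are passed correctly to the MIC. To check how long the parameter passing takes, I simplified the code so that it contains only a single "printf()" call. I use "gettimeofday()" from the "sys/time.h" header file to measure the time, as shown in the code below: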

Some hardware information about the host: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz / CentOS release 6.8 / PCI Express Revision 2.0

main.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>
__attribute__ (( target (mic))) unsigned long long ForSolution = 0;
__attribute__ (( target (mic))) unsigned long long sufficientSol = 1;
__attribute__ (( target (mic))) float timer = 0.0;
__attribute__ (( target (mic))) void function(float *grid, float *displ, unsigned long long *li, unsigned long long *repet, float *solution, unsigned long long dim, unsigned long long numOfa, unsigned long long numLoops, unsigned long long numBlock, unsigned long long thread, unsigned long long blockGrid, unsigned long long station, unsigned long long bytesSol, unsigned long long totalSol, volatile unsigned long long *prog);
int main(void)
{
   struct timeval start, end;   /* host-side wall-clock timestamps */
   float    *grid, *displ, *solution;
   unsigned long long   *li, *repet;
   volatile unsigned long long  *prog;
   unsigned long long dim = 10, grid_a = 3, numLoops = 2, numBlock = 0;
   unsigned long long thread = 220, blockGrid = 0, station = 12;
   unsigned long long station_at = 8, bytesSol, totalSol;
   bytesSol = dim*sizeof(float);
   totalSol = ((1024 * 1024 * 1024) / bytesSol) * bytesSol;

   /******** Some memcpy() functions here for the pointers*********/

   gettimeofday(&start, NULL);
   #pragma offload target(mic) \
        in(grid:length(dim * grid_a * sizeof(float))) \
        in(displ:length(station * station_at * sizeof(float))) \
        in(li:length(dim * sizeof(unsigned long long))) \
        in(repet:length(dim * sizeof(unsigned long long))) \
        out(solution:length(totalSol/sizeof(float))) \
        in(dim,grid_a,numLoops,numBlock,thread,blockGrid,station,bytesSol,totalSol) \
        in(prog:length(sizeof(volatile unsigned long long))) \
        inout(ForSolution,sufficientSol,timer)
   {
        function(grid, displ, li, repet, solution, dim, grid_a, numLoops, numBlock, thread, blockGrid, station, bytesSol, totalSol, prog);
   }
   gettimeofday(&end, NULL);
   /* Total offload time minus the time measured inside function() on the MIC */
   printf("Time to tranfer data on Intel Xeon Phi: %f sec\n", (((end.tv_sec - start.tv_sec) * 1000000.0 + (end.tv_usec - start.tv_usec)) / 1000000.0) - timer);
   printf("Time for calculations: %f sec\n", timer);
   return 0;
}

function.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>
#include <omp.h>
/* timer is defined in main.c; it must be visible here and built for the MIC as well */
extern __attribute__ (( target (mic))) float timer;

__attribute__ (( target (mic))) void function(float *grid, float *displ, unsigned long long *li, unsigned long long *repet, float *solution, unsigned long long dim, unsigned long long numOfa, unsigned long long numLoops, unsigned long long numBlock, unsigned long long thread, unsigned long long blockGrid, unsigned long long station, unsigned long long bytesSol, unsigned long long totalSol, volatile unsigned long long *prog)
{
    struct timeval      timer_start, timer_end;
    gettimeofday(&timer_start, NULL);
    printf("Hello World!!!\n");

    gettimeofday(&timer_end, NULL);
    /* Time spent inside the function on the coprocessor, in seconds */
    timer = ((timer_end.tv_sec - timer_start.tv_sec) * 1000000.0 + (timer_end.tv_usec - timer_start.tv_usec)) / 1000000.0 ;
}

Output in the terminal:

Time to tranfer data on Intel Xeon Phi: 3.512706 sec
Time for calculations: 0.000002 sec
Hello World!!!

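The code needs 3.5 seconds to complete the "offload target". Is the above result normal? Is there any way to reduce the large delay of the offload call?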

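Let's look at the steps involved here:

a) For the very first #pragma offload, the MIC is initialized. This probably includes resetting it, booting a stripped-down Linux (and waiting for it to start all CPUs, initialize its memory management, start a pseudo-NIC driver, etc.), and uploading your code to the device. This alone probably takes several seconds.

b) All of the input data is uploaded to the MIC.

c) The function is executed.

d) All of the output data is downloaded from the MIC.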

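For raw data transfers over PCI Express Revision 2.0 (x16), the maximum bandwidth is 8 GB/s; however, you will not get the maximum bandwidth. From what I remember, communication with the Phi involves shared ring buffers and "doorbell" IRQs, with a "pseudo-NIC" driver on both sides (on the host and in the coprocessor's OS), and I would be surprised if you got even half of the maximum bandwidth.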

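I think the total amount of code uploaded, data uploaded, and data downloaded is more than 1 GiB (for example, out(solution:length(totalSol/sizeof(float))) is about 1 GiB by itself). If we assume "about 4 GiB/s", that is at least ~250 ms.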

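My suggestion is to do everything twice, and to measure the difference between the first time (which includes initializing everything) and the second time (when everything is already initialized) to determine how long it takes to initialize the coprocessor. The second measurement (minus the time needed to execute the function) tells you how long the data transfer takes.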
