Intel OpenCL Pipeline (3): Cases Where a Loop Cannot Be Pipelined

Date: 2022-07-22

For single work-item kernels, the most important optimization strategy is to get loops to pipeline, with an initiation interval (II) as close to 1 as possible.

This post summarizes the situations in which a loop cannot be pipelined. They fall roughly into three categories:

  • Unresolving loop exit condition
  • Nonlinear execution
  • Out-of-order loop iterations

Unresolving Loop Exit Condition

Cause

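The loop exit condition involves a memory access or some other complex operation, so the compiler cannot determine the loop's exit bound when the loop starts.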

Example

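In the example below, the exit condition of the outer loop involves a memory access. The compiler cannot determine the exit bound when the loop starts, so pipelining fails.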

#define N 128

__kernel void exitcond( __global unsigned* restrict input,
                        __global unsigned* restrict result )
{

    unsigned i = 0;
    unsigned sum = 0;
    while( input[ i++ ] < N ) {
        for ( unsigned j = 0; j < N; j++ )
            sum += input[i+j];
    }
    *result = sum;
}
Loop Report:


-+ Loop "block1"
|  NOT pipelined due to:
|     Loop exit condition unresolvable at iteration initiation.
|
|-+ Loop "block2"
    Pipelined well. Successive iterations are launched every cycle.

Solution:

Restructure the code so that the loop exit bound does not depend on a complex operation.
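One way to read this advice is to give the outer loop a fixed trip count and turn the data-dependent test into a predicate. The sketch below is a plain C host-side model of that idea, not code from the Best Practices Guide. `MAX_OUTER`, `exitcond_ref`, and `exitcond_bounded` are hypothetical names introduced for illustration, and the rewrite only matches the original if the terminating value appears within the first `MAX_OUTER` elements. Whether the offline compiler then pipelines the loop still has to be confirmed in the loop report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define N 128          /* same constant as in the kernel above */
#define MAX_OUTER 16   /* hypothetical upper bound on outer iterations (assumption) */

/* Model of the original structure: the exit test reads memory every iteration. */
unsigned exitcond_ref(const unsigned *input)
{
    unsigned i = 0, sum = 0;
    while (input[i++] < N) {
        for (unsigned j = 0; j < N; j++)
            sum += input[i + j];
    }
    return sum;
}

/* Restructured model: the outer trip count is fixed at MAX_OUTER and the
 * data-dependent test only clears the 'active' predicate. */
unsigned exitcond_bounded(const unsigned *input)
{
    assert(input != NULL);
    unsigned sum = 0;
    bool active = true;
    for (unsigned k = 0; k < MAX_OUTER; k++) {
        /* input[k] is read only while still active (short-circuit &&) */
        if (active && !(input[k] < N))
            active = false;
        for (unsigned j = 0; j < N; j++) {
            if (active)
                sum += input[k + 1 + j];
        }
    }
    return sum;
}
```

The cost of this form is that the loop always runs `MAX_OUTER` outer iterations, including the ones whose work is predicated off.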

Nonlinear Execution

Cause

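The loop executes nonlinearly. Once you understand how nested for loops execute, it is clear that in this case iterations of the outer loop cannot be interleaved, so the outer loop cannot be pipelined.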

Example

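In the example below, each iteration of the outer loop chooses which inner for loop to run, so iterations of the outer loop cannot be interleaved.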

kernel void structure (global unsigned* restrict output1,
                        global unsigned* restrict output2,
                        int N)
{
    for (unsigned i = 0; i < N; i++) {
        if ((i & 3) == 0) {
            for (unsigned j = 0; j < N; j++) {
                output1[i+j] = i * j;
            }
        } else {
            for (unsigned j = 0; j < N; j++) {
                output2[i+j] = i * j;
            }
        }
    }
}
Loop Report:

 + Loop "Block2" (file test.cl line 5)
 | NOT pipelined due to:
 |
 |   Loop structure: loop contains divergent inner loops.
 |   Making all inner loops unconditional should fix this problem.
 |   See "Loop Structure Does Not Support Linear Execution" section of the Best Practices Guide for more information.
 |   Not pipelining this loop will most likely lead to poor performance.
 |
 |-+ Loop "Block3" (file test.cl line 7)
 |   Pipelined well. Successive iterations are launched every cycle.
 |
 |-+ Loop "Block4" (file test.cl line 11)
     Pipelined well. Successive iterations are launched every cycle.

One possible solution:

for (unsigned i = 0; i < N; i++)
{
    for (unsigned j = 0; j < N; j++) {
        ...
        output1=...
    }
    for (unsigned j = 0; j < N; j++) {
        ...
        output2=...
    }


    if ((i & 3) == 0) {
        output = output1;
    } else {
        output = output2;
    }
}
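The loop report's hint ("Making all inner loops unconditional") can also be met with a single inner loop whose body holds the branch. The following is a plain C host-side sketch of that variant, shown next to a model of the divergent kernel so the two can be compared. The function names are hypothetical, and the sketch only demonstrates that both forms write the same values; it does not prove how the offline compiler will schedule them.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the divergent kernel above: the branch selects which inner loop runs. */
void structure_divergent(unsigned *output1, unsigned *output2, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if ((i & 3) == 0) {
            for (unsigned j = 0; j < n; j++)
                output1[i + j] = i * j;
        } else {
            for (unsigned j = 0; j < n; j++)
                output2[i + j] = i * j;
        }
    }
}

/* Linear variant: one unconditional inner loop, with the branch moved
 * into the loop body so every outer iteration has the same structure. */
void structure_linear(unsigned *output1, unsigned *output2, unsigned n)
{
    assert(output1 != NULL && output2 != NULL);
    for (unsigned i = 0; i < n; i++) {
        const int use_first = ((i & 3) == 0);  /* loop-invariant for the inner loop */
        for (unsigned j = 0; j < n; j++) {
            if (use_first)
                output1[i + j] = i * j;
            else
                output2[i + j] = i * j;
        }
    }
}
```

Both buffers need at least 2n - 1 elements, since the largest index written is (n - 1) + (n - 1).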

Out-of-Order Loop Iterations

Cause

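This is a fairly common case that usually shows up in nested loops, typically because the trip count of the inner loop varies from one outer-loop iteration to the next. As a result, none of the enclosing outer loops can be pipelined.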

The inner loop's trip count can vary for many reasons, for example:

  • The loop bound is a variable
for(unsigned i=0; i<N; i++){
  for(unsigned j=0; j<i; j++){

  }
}
  • The loop sits inside an if statement whose condition affects how the inner loop iterates
for(unsigned i=0; i<N; i++){
  if(i>3){
    for(unsigned j=0; j<i; j++){

    }
  }
}
  • The loop body contains a break statement
for(unsigned i=0; i<N; i++){
  for(unsigned j=0; j<i; j++){
    ...
    if(){
      break; // note: avoid break where possible
    }
  }
}

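All of these cases cause pipelining of the outer loop to fail, because the compiler cannot interleave outer-loop iterations while the inner loop is being pipelined.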

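Under normal conditions, the pipeline should look like this:

[Figure: pipeline-31]

But when out-of-order loop iterations occur, execution turns into the following, and performance suffers badly:

[Figure: pipeline-32]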

Example

The example below falls under the first case above: the loop bound is a variable.

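Here the inner loop's bound is i: when i=0 the inner loop runs 0 times, when i=1 it runs once, when i=2 it runs twice, and so on, a different count each time. As a result, the outer loop cannot be pipelined.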

kernel void order( global unsigned* restrict input,
                    global unsigned* restrict output,
                    int N )
{
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < i; j++) {
            sum += input[i+j];
        }
    }
    output[0] = sum;
}
Loop Report:


-+ Loop "block1"
|  NOT pipelined due to:
|     Loop iteration ordering, iterations may get out of order with respect to:
|
|         Loop "block2"
|
|-+ Loop "block2"
   Pipelined well. Successive iterations are launched every cycle.

Solution

Change the algorithm and reorganize the code, for example:

  • Try to make the loop bound a fixed value
for(unsigned i=0; i<N; i++){
  for(unsigned j=0; j<N; j++){
    if(j<i){
      ...
    }else{
      // empty
    }
  }
}
  • Try to move the if condition inside the for loop
for(unsigned i=0; i<N; i++){
  for(unsigned j=0; j<N; j++){
    if(){
      ...
    }
  }
}
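Applied to the `order` kernel from the example, the first rewrite might look like the plain C host-side sketch below. The function names are hypothetical, and the sketch only checks that the fixed-bound form computes the same sum as the triangular form; whether the outer loop is then reported as pipelined should be confirmed in the loop report.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the original kernel: the inner trip count depends on i. */
unsigned order_ref(const unsigned *input, unsigned n)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < i; j++)
            sum += input[i + j];
    }
    return sum;
}

/* Fixed-bound variant: the inner loop always runs n times and the
 * original bound j < i becomes a predicate inside the body. */
unsigned order_fixed(const unsigned *input, unsigned n)
{
    assert(input != NULL);
    unsigned sum = 0;
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < n; j++) {
            if (j < i)
                sum += input[i + j];  /* only reached for the original iterations */
        }
    }
    return sum;
}
```

The trade-off is extra work: the fixed-bound form runs n * n inner iterations instead of n * (n - 1) / 2, so it pays off only when pipelining the outer loop recovers more than the padding costs.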

References

[Intel FPGA SDK for OpenCL Best Practices Guide]