Intel OpenCL Pipeline (3): Possible Reasons a Loop Cannot Be Pipelined

For single work-item kernels, the most important optimization strategy is to get loops pipelined, with an initiation interval (II) as close to 1 as possible.

This post summarizes the situations in which a loop cannot be pipelined. They fall roughly into three categories: unresolvable loop exit condition, nonlinear execution, and out-of-order loop iterations.

Unresolvable Loop Exit Condition

Cause

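The loop exit condition involves a memory access or another complex operation, so the compiler cannot determine the loop's exit bound at the point where an iteration is launched.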

Example

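In the example below, the exit condition of the outer loop involves a memory access. The compiler cannot determine the exit bound when the loop starts, so pipelining fails.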

#define N 128

__kernel void exitcond( __global unsigned* restrict input,
                        __global unsigned* restrict result )
{

    unsigned i = 0;
    unsigned sum = 0;
    while( input[ i++ ] < N ) {
        for ( unsigned j = 0; j < N; j++ )
            sum += input[i+j];
    }
    *result = sum;
}

Loop Report:


-+ Loop "block1"
|  NOT pipelined due to:
|     Loop exit condition unresolvable at iteration initiation.
|
|-+ Loop "block2"
    Pipelined well. Successive iterations are launched every cycle.

Solution:

Restructure the code so that the loop exit condition does not involve complex operations.
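One way to apply this advice is to give the outer loop a fixed trip count and fold the memory-dependent test into a flag. The sketch below is plain C, so it can be compiled and checked on a host machine; it models the kernel body and is not an OpenCL kernel. The function names, the parameter `n` (standing in for the macro N), and the upper bound `max_iter` are assumptions made for this illustration and are not part of the original kernel. The rewrite matches the original only if the terminating element appears within the first `max_iter` elements.

```c
/* Reference behaviour: the exit test of the outer loop reads memory. */
unsigned sum_until_sentinel(const unsigned *input, unsigned n)
{
    unsigned i = 0;
    unsigned sum = 0;
    while (input[i++] < n) {
        for (unsigned j = 0; j < n; j++)
            sum += input[i + j];
    }
    return sum;
}

/* Restructured sketch: the outer loop has a fixed trip count (max_iter,
 * a hypothetical bound chosen by the caller) and the data-dependent
 * test only clears a flag. The inner loop always runs n times and is
 * predicated on the flag instead of being skipped. */
unsigned sum_bounded(const unsigned *input, unsigned n, unsigned max_iter)
{
    unsigned sum = 0;
    int active = 1;
    for (unsigned i = 0; i < max_iter; i++) {
        if (active && input[i] >= n)
            active = 0;                    /* the original loop would exit here */
        for (unsigned j = 0; j < n; j++) {
            if (active)
                sum += input[i + 1 + j];   /* i + 1 mirrors the post-increment */
        }
    }
    return sum;
}
```

Whether the offline compiler then pipelines the outer loop, and with what II, still has to be confirmed in the loop report. The price of this form is that the loop always runs `max_iter` iterations, even after the flag has been cleared.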

Nonlinear Execution

Cause

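The loop does not execute linearly. Once you understand how nested for loops are executed, it becomes clear that in this case the compiler cannot insert new outer-loop iterations into the pipeline, so the outer loop cannot be pipelined.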

Example

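In the example below, each outer-loop iteration chooses which of its inner for loops to execute, so outer-loop iterations cannot be inserted into the pipeline.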

kernel void structure (global unsigned* restrict output1,
                        global unsigned* restrict output2,
                        int N)
{
    for (unsigned i = 0; i < N; i++) {
        if ((i & 3) == 0) {
            for (unsigned j = 0; j < N; j++) {
                output1[i+j] = i * j;
            }
        } else {
            for (unsigned j = 0; j < N; j++) {
                output2[i+j] = i * j;
            }
        }
    }
}

Loop Report:

 + Loop "Block2" (file test.cl line 5)
 | NOT pipelined due to:
 |
 |   Loop structure: loop contains divergent inner loops.
 |   Making all inner loops unconditional should fix this problem.
 |   See "Loop Structure Does Not Support Linear Execution" section of the Best Practices Guide for more information.
 |   Not pipelining this loop will most likely lead to poor performance.
 |
 |-+ Loop "Block3" (file test.cl line 7)
 |   Pipelined well. Successive iterations are launched every cycle.
 |
 |-+ Loop "Block4" (file test.cl line 11)
     Pipelined well. Successive iterations are launched every cycle.

One possible solution:

for (unsigned i = 0; i < N; i++)
{
    for (unsigned j = 0; j < N; j++) {
        ...
        output1=...
    }
    for (unsigned j = 0; j < N; j++) {
        ...
        output2=...
    }

    if ((i & 3) == 0) {
        output = output1;
    } else {
        output = output2;
    }
}
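A variant of the same idea merges the two inner loops into a single unconditional one and moves the branch inside the loop body. The sketch below is a plain C model of the `structure` kernel (not OpenCL), written so that the two forms can be compared on a host machine. The function names and the parameter `n` are assumptions for this illustration, and this is one possible restructuring rather than the form prescribed by the Best Practices Guide.

```c
/* Model of the original kernel: the outer loop contains two
 * divergent inner loops, only one of which runs per iteration. */
void structure_divergent(unsigned *output1, unsigned *output2, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if ((i & 3) == 0) {
            for (unsigned j = 0; j < n; j++)
                output1[i + j] = i * j;
        } else {
            for (unsigned j = 0; j < n; j++)
                output2[i + j] = i * j;
        }
    }
}

/* Restructured sketch: a single inner loop that always executes;
 * the condition only selects the destination of each store. */
void structure_linear(unsigned *output1, unsigned *output2, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        int to_first = ((i & 3) == 0);
        for (unsigned j = 0; j < n; j++) {
            unsigned value = i * j;
            if (to_first)
                output1[i + j] = value;
            else
                output2[i + j] = value;
        }
    }
}
```

Both functions write the same values to the same locations, so the change is purely structural. Whether the outer loop is then reported as pipelined should be checked in the loop report.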

Out-of-Order Loop Iterations

Cause

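This is a fairly common case and usually occurs in nested loops. It is typically caused by the inner loop's trip count changing from one outer-loop iteration to the next. As a result, none of the enclosing outer loops can be pipelined.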

There are many situations in which the inner loop's trip count is not fixed, for example:

The inner loop bound is a variable that changes between outer-loop iterations

for(unsigned i=0; i<N; i++){
  for(unsigned j=0; j<i; j++){

  }
}

The loop is inside an if statement whose condition affects whether the inner loop iterates

for(unsigned i=0; i<N; i++){
  if(i>3){
    for(unsigned j=0; j<i; j++){

    }
  }
}

The loop body contains a break statement

for(unsigned i=0; i<N; i++){
  for(unsigned j=0; j<i; j++){
    ...
    if(){
      break; // note: avoid break where possible
    }
  }
}

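All of the cases above cause pipelining of the outer loop to fail, because the compiler cannot insert outer-loop iterations while the inner loop is being pipelined.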

Normally, the pipeline should look like this:

pipeline-31

But when out-of-order loop iterations occur, execution turns into the following, which severely degrades performance:

pipeline-32

Example

The following example belongs to the first case above, where the loop bound is a variable.

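Notice that the inner loop's bound is i: when i = 0 the inner loop runs 0 times, when i = 1 it runs once, when i = 2 it runs twice, and so on, so the trip count is different every time. As a result, the outer loop cannot be pipelined.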

kernel void order( global unsigned* restrict input,
                    global unsigned* restrict output,
                    int N )
{
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < i; j++) {
            sum += input[i+j];
        }
    }
    output[0] = sum;
}

Loop Report:


-+ Loop "block1"
|  NOT pipelined due to:
|     Loop iteration ordering, iterations may get out of order with respect to:
|
|         Loop "block2"
|
|-+ Loop "block2"
   Pipelined well. Successive iterations are launched every cycle.

Solution

Modify the algorithm and reorganize the code structure, for example:

Try to make the loop bound a fixed value

for(unsigned i=0; i<N; i++){
  for(unsigned j=0; j<N; j++){
    if(j<i){
      ...
    }else{
      // empty
    }
  }
}
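Applied to the `order` example, the fixed-bound pattern can be sketched as follows. This is a plain C model of the kernel body (not OpenCL), so that the original triangular loop and the rewritten one can be compared on a host machine. The function names and the parameter `n` are assumptions for this illustration.

```c
/* Model of the original kernel: the inner trip count depends on i. */
unsigned order_triangular(const unsigned *input, unsigned n)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < i; j++)
            sum += input[i + j];
    }
    return sum;
}

/* Restructured sketch: the inner loop always runs n times and the
 * original bound j < i becomes a predicate inside the body. */
unsigned order_fixed_bound(const unsigned *input, unsigned n)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < n; j++) {
            if (j < i)
                sum += input[i + j];
        }
    }
    return sum;
}
```

The trade-off is that the inner body is now launched n * n times instead of n * (n - 1) / 2 times, so roughly half of the iterations do no useful work. The rewrite pays off only if the pipelined outer loop gains more than those idle iterations cost.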

Try to move the if condition inside the for loop

for(unsigned i=0; i<N; i++){
  for(unsigned j=0; j<N; j++){
    if(){
      ...
    }
  }
}
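The same predicate idea can be tried for the break case listed earlier: let the inner loop run its full trip count and use a flag to ignore everything after the point where the break would have happened. The sketch below is plain C rather than OpenCL, and the task it performs (finding the first occurrence of `key` in each row of an n-by-n array) is a hypothetical example chosen for illustration, as are all the names.

```c
/* Version with break: the inner trip count depends on the data. */
void first_match_break(const unsigned *data, unsigned n, unsigned key,
                       unsigned *first)
{
    for (unsigned i = 0; i < n; i++) {
        first[i] = n;                      /* n means "not found" */
        for (unsigned j = 0; j < n; j++) {
            if (data[i * n + j] == key) {
                first[i] = j;
                break;
            }
        }
    }
}

/* Sketch without break: the inner loop always runs n times and a
 * flag suppresses updates once the first match has been recorded. */
void first_match_flag(const unsigned *data, unsigned n, unsigned key,
                      unsigned *first)
{
    for (unsigned i = 0; i < n; i++) {
        unsigned idx = n;                  /* n means "not found" */
        int found = 0;
        for (unsigned j = 0; j < n; j++) {
            int hit = (data[i * n + j] == key);
            if (!found && hit) {
                idx = j;
                found = 1;
            }
        }
        first[i] = idx;
    }
}
```

As with the fixed-bound rewrite, the loop does extra iterations after the match, and the resulting loop report should be checked to confirm that the outer loop is actually pipelined.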

References

[Intel FPGA SDK for OpenCL Best Practices Guide]