SDAccel矩阵乘法优化（四）

从一个矩阵乘法的例子一步一步进行功能设计与性能优化。

mmult实现及优化步骤

矩阵乘法优化步骤

步骤

实现功能

关键概念/ Keywords

1、cpu实现

即在host端实现简单的矩阵乘法，便于比对数据与性能对比

---

2、OpenCL实现

在device端实现基于OpenCL的FPGA矩阵乘法硬件设计.

OpenCL接口函数

3、加入Local Memory

采用 Local Memory 减少数据访存次数

内核优化

局部内存

4、实现读写的突发传输

采用突发传输的方式更好的实现DDR与 Local Memory数据的读写访问

内核优化

突发读/写

5、数组分割

通过循环展开与数组分割的方式，实现更好的计算性能

数组分割

循环展开

流水线设计

方案分析及优化思路三（指令优化）

现在经过前面两次优化后，代码的组织结构没有什么问题了，现在的关键问题是：矩阵运算的嵌套for循环仅仅实现了内层的pipeline，因为外层for循环无法对内部的for循环flatten，所以外面两层的for循环没有实现pipeline。要解决这个问题，最直接的思路就是将最内层的for循环直接进行循环展开，进一步提高计算过程的并行度。但是在进行循环展开的过程中，需要将内层用到的数组进行切割，否则将无法进行unroll。因此，我们将用到的指令有三个：内层for循环要进行循环展开（unroll），并行计算用到的数组要进行数组切割（array partition），次外层的for循环要流水起来（pipeline）。

代码实现

#define MAX_SIZE 64

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult( __global int* in1,  //Read-only input matrix1
            __global int* in2,  //Read-only input matrix2
            __global int* out,  //Output matrix
            int dim             //One dimension of the matrix
          )
{
    //Local memory to store input matrices
    //Local memory is implemented as BRAM memory blocks
    //Complete partition done on dim 2 for in1, on dim 1 for in2 and on dim 2 for out
    __local int local_in1[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 2)));
    __local int local_in2[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 1)));
    __local int local_out[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 2)));

    //Burst reads on input matrices from DDR memory
    //Burst read for matrix local_in1 and local_in2
    read_in1: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        local_in1[i][j] = in1[iter];
    }
    read_in2: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        local_in2[i][j] = in2[iter];
    }

    //Based on the functionality the number of iterations
    //to be executed for "loop_3" must be "dim" size.
    //But for the pipeline to happen in the "loop_2" the
    //"loop_3" must be unrolled, to unroll the size cannot be dynamic.
    //It gives better throughput with usage of additional resources.
    loop_1: for(int i = 0; i < dim; i++){
        __attribute__((xcl_pipeline_loop))
        loop_2: for(int j = 0; j < dim; j++){
            local_out[i][j] = 0;
            __attribute__((opencl_unroll_hint))
            loop_3: for(int k = 0; k < MAX_SIZE; k++){
                local_out[i][j] += local_in1[i][k] * local_in2[k][ j];
            }
        }
    }

    //Burst write from local_out to DDR memory
    write_out: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        out[iter] = local_out[i][j];
    }
}

xcl_pipeline_loop

循环流水（Loop pipelining）是在一个迭代周期内的多个操作可实现并行处理。下图中的A展示了默认情况下的顺序执行操作，每次读操作之间相差3个时钟周期（II=3），离最后一次写操作相差8个时钟周期。图中的B展示了加入循环流水的示意图，每次读操作之间相差1个周期(II=1)，离最后一次写操作相差4个时钟周期，在使用同样的资源下，提高了流水线的启动间隔和延迟。

xcl_array_partition

下图所示的是数组分隔的三种方式：block、cyclic和complete。block和cyclic是根据factor参数的设置来决定划分多少个独立的RAM存储（如图中的factor=2）。按块划为block是在原数组上按连续存储元素的方式划分成factor个块单元；按轮询cyclic是在原数组上按交叉存储元素的方式划分成factor个块单元；按全部展开complete是把原数组划分为一个个独立的元素（寄存器）。

对于多维数组的分割，可按照数组的维度来划分：

实验结果分析

vivado hls log文件分析

WARNING: [XFORM 203-104] Completely partitioning array 'local_in1' accessed through non-constant indices on dimension 2 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:170:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_in1' in dimension 2 completely.
WARNING: [XFORM 203-104] Completely partitioning array 'local_in2' accessed through non-constant indices on dimension 1 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:174:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_in2' in dimension 1 completely.
WARNING: [XFORM 203-104] Completely partitioning array 'local_out' accessed through non-constant indices on dimension 2 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:196:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_out' in dimension 2 completely.
INFO: [XFORM 203-11] Balancing expressions in function 'mmult' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:44:41)...125 expression(s) balanced.
INFO: [HLS 200-111] Finished Pre-synthesis Time (s): cpu = 00:00:06 ; elapsed = 00:00:06 . Memory (MB): peak = 494.324 ; gain = 156.758 ; free physical = 19781 ; free virtual = 45173
INFO: [XFORM 203-541] Flattening a loop nest 'loop_1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:182:42) in function 'mmult'.
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:170:9).
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:174:9).
INFO: [XFORM 203-811] Inferring bus burst write of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:196:9).
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:07 ; elapsed = 00:00:07 . Memory (MB): peak = 494.324 ; gain = 156.758 ; free physical = 19781 ; free virtual = 45174
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'read_in1'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'read_in2'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'loop_1_loop_2'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 12.
INFO: [SCHED 204-61] Pipelining loop 'write_out'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 5.
INFO: [SCHED 204-11] Finished scheduling.

HLS Report

综合结果分析

＊首先，硬件代码没有优化指令，log文件中首先将三个数组进行了对应维度的切割，然后也成功的对最内层的循环进行了unroll处理。＊然后，相比于Burst Read/Write版本的矩阵乘法实现，该版本主要是加上了两个优化指令，实现内层循环的并行化。从Pipleline的角度考虑：第一段for循环pipeline成功；第二段的for循环嵌套中write_data对应的for循环并行展开，接着对其外层的for循环进行flatten,最终，整体实现pipeline；第三段for循环pipeline成功。＊从pipeline成功后的II角度考虑:所有for循环pipeline后的II=1。

硬件仿真结果

硬件实现结果

mmult1

mmult2

mmult3

mmult4

总结

到此为止，我们已经完整的分析了一个SDAceel的例子。从CPU实现，再到FPGA实现，并一步一步进行优化设计。在FPGA的优化中主要包括两种优化方向：其一是基于带宽（Bandwidth）和数据吞吐率(Throughput)的优化；其二是基于计算性能(Performance)的优化。通常在设计的过程中，新手往往只会考虑到计算性能的优化，不断地提高并行度，但是在提高并行度的过程中，往往会导致带宽严重不足，数据吞吐率低，进一步使得并行的效果受到带宽的限制，以至于无法实现并行的效果。在实现矩阵乘法的例子中，我们首先做了基本的功能实现，紧接着对于Local Memory和Burst Write/Read的优化就是在提高访存效率，进而提高数据吞吐率。最后关于指令的优化，才是对于计算性能上的优化。结果对比表格如下所示：

方案	结果(ns)	提高倍数
未优化(mmult1)	365459518	1
Local Memory(mmult2)	2106545	173.5
Burst R/W(mmult3)	1616897	226.0
Pipeline(mmult4)	127392	2868.8

参考

xilinx github Xilinx/SDAccel_Examples/cpu_to_fpga ug1253 SDx Pragma Reference Guide 2017.2 ug1207 SDAccel Environment Optmizaton Guide