SDAccel矩阵乘法优化(四)

时间:2022-07-22
本文章向大家介绍SDAccel矩阵乘法优化(四),主要内容包括其使用实例、应用技巧、基本知识点总结和需要注意事项,具有一定的参考价值,需要的朋友可以参考一下。

从一个矩阵乘法的例子一步一步进行功能设计与性能优化。

mmult实现及优化步骤

矩阵乘法优化步骤

步骤

实现功能

关键概念/ Keywords

1、cpu实现

即在host端实现简单的矩阵乘法,便于比对数据与性能对比

---

2、OpenCL实现

在device端实现基于OpenCL的FPGA矩阵乘法硬件设计.

OpenCL接口函数

3、加入Local Memory

采用 Local Memory 减少数据访存次数

内核优化

局部内存

4、实现读写的突发传输

采用突发传输的方式更好的实现DDR与 Local Memory数据的读写访问

内核优化

突发读/写

5、数组分割

通过循环展开与数组分割的方式,实现更好的计算性能

数组分割

循环展开

流水线设计

方案分析及优化思路三(指令优化)

现在经过前面两次优化后,代码的组织结构没有什么问题了,现在的关键问题是:矩阵运算的嵌套for循环仅仅实现了内层的pipeline,因为外层for循环无法对内部的for循环flatten,所以外面两层的for循环没有实现pipeline。要解决这个问题,最直接的思路就是将最内层的for循环直接进行循环展开,进一步提高计算过程的并行度。但是在进行循环展开的过程中,需要将内层用到的数组进行切割,否则将无法进行unroll。因此,我们将用到的指令有三个:内层for循环要进行循环展开(unroll),并行计算用到的数组要进行数组切割(array partition),次外层的for循环要流水起来(pipeline)。

代码实现

#define MAX_SIZE 64

kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult( __global int* in1,  //Read-only input matrix1
            __global int* in2,  //Read-only input matrix2
            __global int* out,  //Output matrix
            int dim             //One dimension of the matrix
          )
{
    //Local memory to store input matrices
    //Local memory is implemented as BRAM memory blocks
    //Complete partition done on dim 2 for in1, on dim 1 for in2 and on dim 2 for out
    __local int local_in1[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 2)));
    __local int local_in2[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 1)));
    __local int local_out[MAX_SIZE][MAX_SIZE] __attribute__((xcl_array_partition(complete, 2)));

    //Burst reads on input matrices from DDR memory
    //Burst read for matrix local_in1 and local_in2
    read_in1: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        local_in1[i][j] = in1[iter];
    }
    read_in2: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        local_in2[i][j] = in2[iter];
    }

    //Based on the functionality the number of iterations
    //to be executed for "loop_3" must be "dim" size.
    //But for the pipeline to happen in the "loop_2" the
    //"loop_3" must be unrolled, to unroll the size cannot be dynamic.
    //It gives better throughput with usage of additional resources.
    loop_1: for(int i = 0; i < dim; i++){
        __attribute__((xcl_pipeline_loop))
        loop_2: for(int j = 0; j < dim; j++){
            local_out[i][j] = 0;
            __attribute__((opencl_unroll_hint))
            loop_3: for(int k = 0; k < MAX_SIZE; k++){
                local_out[i][j] += local_in1[i][k] * local_in2[k][ j];
            }
        }
    }

    //Burst write from local_out to DDR memory
    write_out: for(int iter = 0, i = 0, j = 0; iter < dim * dim; iter++, j++){
        if(j == dim){ j = 0; i++; }
        out[iter] = local_out[i][j];
    }
}
  • xcl_pipeline_loop

循环流水(Loop pipelining)是在一个迭代周期内的多个操作可实现并行处理。下图中的A展示了默认情况下的顺序执行操作,每次读操作之间相差3个时钟周期(II=3),离最后一次写操作相差8个时钟周期。图中的B展示了加入循环流水的示意图,每次读操作之间相差1个周期(II=1),离最后一次写操作相差4个时钟周期,在使用同样的资源下,提高了流水线的启动间隔和延迟。

  • xcl_array_partition

下图所示的是数组分隔的三种方式:blockcycliccompleteblockcyclic是根据factor参数的设置来决定划分多少个独立的RAM存储(如图中的factor=2)。按块划为block是在原数组上按连续存储元素的方式划分成factor个块单元;按轮询cyclic是在原数组上按交叉存储元素的方式划分成factor个块单元;按全部展开complete是把原数组划分为一个个独立的元素(寄存器)。

对于多维数组的分割,可按照数组的维度来划分:

实验结果分析

  • vivado hls log文件分析
WARNING: [XFORM 203-104] Completely partitioning array 'local_in1' accessed through non-constant indices on dimension 2 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:170:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_in1' in dimension 2 completely.
WARNING: [XFORM 203-104] Completely partitioning array 'local_in2' accessed through non-constant indices on dimension 1 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:174:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_in2' in dimension 1 completely.
WARNING: [XFORM 203-104] Completely partitioning array 'local_out' accessed through non-constant indices on dimension 2 (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:196:9), which may result in long runtime and suboptimal QoR due to large multiplexers. Please consider wrapping the array access into a function or using a register file core instead.
INFO: [XFORM 203-101] Partitioning array 'local_out' in dimension 2 completely.
INFO: [XFORM 203-11] Balancing expressions in function 'mmult' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:44:41)...125 expression(s) balanced.
INFO: [HLS 200-111] Finished Pre-synthesis Time (s): cpu = 00:00:06 ; elapsed = 00:00:06 . Memory (MB): peak = 494.324 ; gain = 156.758 ; free physical = 19781 ; free virtual = 45173
INFO: [XFORM 203-541] Flattening a loop nest 'loop_1' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:182:42) in function 'mmult'.
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:170:9).
INFO: [XFORM 203-811] Inferring bus burst read of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:174:9).
INFO: [XFORM 203-811] Inferring bus burst write of variable length on port 'gmem' (/home/lab611/workspace/xuke/mmult_test/src/mmult.cl:196:9).
INFO: [HLS 200-111] Finished Architecture Synthesis Time (s): cpu = 00:00:07 ; elapsed = 00:00:07 . Memory (MB): peak = 494.324 ; gain = 156.758 ; free physical = 19781 ; free virtual = 45174
INFO: [HLS 200-10] Starting hardware synthesis ...
INFO: [HLS 200-10] Synthesizing 'mmult' ...
WARNING: [SYN 201-107] Renaming port name 'mmult/out' to 'mmult/out_r' to avoid the conflict with HDL keywords or other object names.
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [HLS 200-42] -- Implementing module 'mmult'
INFO: [HLS 200-10] ----------------------------------------------------------------
INFO: [SCHED 204-11] Starting scheduling ...
INFO: [SCHED 204-61] Pipelining loop 'read_in1'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'read_in2'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 3.
INFO: [SCHED 204-61] Pipelining loop 'loop_1_loop_2'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 12.
INFO: [SCHED 204-61] Pipelining loop 'write_out'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 5.
INFO: [SCHED 204-11] Finished scheduling.
  • HLS Report
  • 综合结果分析

* 首先,硬件代码没有优化指令,log文件中首先将三个数组进行了对应维度的切割,然后也成功的对最内层的循环进行了unroll处理。 * 然后,相比于Burst Read/Write版本的矩阵乘法实现,该版本主要是加上了两个优化指令,实现内层循环的并行化。从Pipleline的角度考虑:第一段for循环pipeline成功;第二段的for循环嵌套中write_data对应的for循环并行展开,接着对其外层的for循环进行flatten,最终,整体实现pipeline;第三段for循环pipeline成功。 * 从pipeline成功后的II角度考虑:所有for循环pipeline后的II=1

  • 硬件仿真结果
  • 硬件实现结果

mmult1

mmult2

mmult3

mmult4

总结

到此为止,我们已经完整的分析了一个SDAceel的例子。从CPU实现,再到FPGA实现,并一步一步进行优化设计。在FPGA的优化中主要包括两种优化方向:其一是基于带宽(Bandwidth)和数据吞吐率(Throughput)的优化;其二是基于计算性能(Performance)的优化。通常在设计的过程中,新手往往只会考虑到计算性能的优化,不断地提高并行度,但是在提高并行度的过程中,往往会导致带宽严重不足,数据吞吐率低,进一步使得并行的效果受到带宽的限制,以至于无法实现并行的效果。在实现矩阵乘法的例子中,我们首先做了基本的功能实现,紧接着对于Local MemoryBurst Write/Read的优化就是在提高访存效率,进而提高数据吞吐率。最后关于指令的优化,才是对于计算性能上的优化。结果对比表格如下所示:

方案

结果(ns)

提高倍数

未优化(mmult1)

365459518

1

Local Memory(mmult2)

2106545

173.5

Burst R/W(mmult3)

1616897

226.0

Pipeline(mmult4)

127392

2868.8

参考

xilinx github Xilinx/SDAccel_Examples/cpu_to_fpga ug1253 SDx Pragma Reference Guide 2017.2 ug1207 SDAccel Environment Optmizaton Guide