#### Morpher: An Open-Source Integrated Compilation and Simulation Framework for CGRA

Dhananjaya Wijerathne<sup>\*</sup>, Zhaoying Li<sup>\*</sup>, Manupa Karunaratne, Li-Shiuan Peh<sup>\*</sup>, Tulika Mitra<sup>\*</sup> School of Computing, National University of Singapore<sup>\*</sup>, Advanced Micro Devices, Inc {dmd, zhaoying, peh, tulika}@comp.nus.edu.sg, Manupa.Karunaratne@amd.com



#### CGRA: Coarse Grained Reconfigurable Array

- Power-efficient reconfigurable accelerator
  - Word level reconfigurable
- Simple hardware architecture
  - Array of Processing Elements (PE) interconnected through a reconfigurable network
  - Each PE has ALU, register file, multiplexers and configuration memory
- Fully software controlled
  - Compiler statically generates the configurations
- Commercially available



Samsung Reconfigurable Processor [FPT '12]



Renessas DRP [VLSI ' 18]



[ISCA'17]



NUS HyCUBE [DAC '17, A-SSCC' 19]





#### CGRA Architecture

### Application mapping on CGRA

- Target: a loop kernel from applications
- Mapping the dataflow graph (DFG) of the loop body on to the CGRA
  - Placement: assigning DFG operations to ALUs
  - Routing: mapping data signals using wires and registers





N3

F4

DFG

2x2 CGRA

F3

### Application mapping on CGRA

- Software pipelined schedule
- Goal: Mapping with minimum initiation interval
- Initiation interval (II) = cycle difference between initiation of consecutive iterations
- Low II -> High performance



Low performance mapping with II = 2 cycles

High performance mapping with II = 1 cycles



#### Motivation for Morpher

- CGRA has become popular due to its excellent balance between performance, power efficiency and flexibility
- Researchers are exploring the design space to find a better mix of performance, power efficiency, and flexibility targeting diverse workloads
  - Large architecture design space
  - Workloads have complex kernels
- CGRA design frameworks that can only map simple kernels on a fixed architecture are not suitable for CGRA design space exploration
- Functional verification is important to validate the correctness



### What is missing?

#### Not end-to-end

- No open-source DFG generators (Pillars)
- No open-source simulators (CGRA-ME)
- No open-source architecture adaptive compiler (OpenCGRA, CCF)
- Cannot support diverse CGRA architectures:
  - Does not support architecture adaptive compilation (Open-CGRA, CCF)
  - Only covers basic architectural features in PE array (CGRA-ME)
- Cannot support complex kernels
  - Does not support kernels with control divergence and loops with recurrence edges (CGRA-ME, Pillars)
- Need to manually create test benches and test data for each application kernel

|                         | Features                         | CGRA-ME      | Pillars      | Open-<br>CGRA | CCF          |
|-------------------------|----------------------------------|--------------|--------------|---------------|--------------|
| DFG Generation          | Models control divergence        | Х            | X            | $\checkmark$  | $\checkmark$ |
|                         | Recurrence edges                 | Х            | Х            | $\checkmark$  | $\checkmark$ |
|                         | Adapt user defined architectures | $\checkmark$ | $\checkmark$ | $\checkmark$  | Х            |
| Architecture Modelling  | Multi-hop connections            | X            | Х            | Х             | Х            |
|                         | Different memory organizations   | Х            | Х            | $\checkmark$  | Х            |
|                         | Architecture adaptive mapping    | $\checkmark$ | $\checkmark$ | Х             | Х            |
| P&R Mapper              | Data layout aware mapping        | X            | Х            | Х             | Х            |
|                         | Recurrence aware mapping         | Х            | Х            | $\checkmark$  | $\checkmark$ |
|                         | Cycle accurate simulation        | Х            | $\checkmark$ | $\checkmark$  | $\checkmark$ |
| Simulation & validation | Test data generation             | Х            | Х            | Х             | Х            |
|                         | Validation against test data     | Х            | Х            | Х             | Х            |



# Morpher: An Integrated Compilation and Simulation Framework for CGRA

- Fully automated end-to-end CGRA compilation and simulation framework
- Flexible architecture specification language
  - Allows the user to define complex CGRA architectures
- Supports kernels with control divergence and recurrence edges
  - Allows compiling complex application kernels
- Efficient mapping algorithms
  - Higher mapping quality at shorter compilation time
- Cycle-accurate simulation to validate the compilation results
  - Automatically extract the test data from the target application
- Fully open-source with easily modifiable modular code base:
  - <u>https://github.com/ecolab-nus/morpher</u>

|            | F                       | eatures                          | CGRA-ME      | Pillars      | Open-CGRA    | CCF          | Morpher      |
|------------|-------------------------|----------------------------------|--------------|--------------|--------------|--------------|--------------|
|            | DFG Generation          | Models control divergence        | Х            | Х            | $\checkmark$ | $\checkmark$ | $\checkmark$ |
|            | Di o ocneration         | Recurrence edges                 | Х            | Х            | $\checkmark$ | $\checkmark$ | $\checkmark$ |
|            |                         | Adapt user defined architectures | $\checkmark$ | $\checkmark$ | $\checkmark$ | X            | $\checkmark$ |
|            | Architecture Modelling  | Multi-hop connections            | Х            | Х            | X            | Х            | $\checkmark$ |
| Í          | Architecture modelling  | Different memory organizations   | Х            | Х            | $\checkmark$ | Х            | $\checkmark$ |
|            | P&R Mapper              | Architecture adaptive mapping    | $\checkmark$ | $\checkmark$ | X            | Х            | $\checkmark$ |
| د          |                         | Data layout aware mapping        | Х            | Х            | ×            | Х            | $\checkmark$ |
| -          | i di i mappei           | Recurrence aware mapping         | Х            | Х            | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| Simulation |                         | Cycle accurate simulation        | Х            | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
|            | Simulation & validation | Test data generation             | Х            | Х            | X            | Х            | $\checkmark$ |
|            |                         | Validation against test data     | Х            | Х            | Х            | Х            | $\checkmark$ |





#### Overview #define SIZE 20 int A[SIZE], B[SIZE], C[SIZE]; \_\_attribute\_\_((noinline)) 5 void array\_add() { for (int i=0;i<SIZE; i++) {</pre> #ifdef CGRA\_COMPILER please\_map\_me(); Dataflow Target loop #endif Graph C[i] = A[i] + B[i];(2) 10 DFG 11 Generation Application SPM Data 12 source code Layout 13 with annotated 14 int main() { kernel for (int i=0;i<SIZE; i++) {</pre> 15 A[i] = i + 2 + 5;16 3 Data Test data B[i] = i + 3;Generation C[i] = 0;array\_add(); (6)for (int i=0;i<SIZE; i++) printf("%d\n", C[i]);</pre> -Simul22 return 0;







G















#### DFG and Data Layout Generation



- DFG generation: supports DFG generation for multiple execution paradigms
  - Control-flow handling techniques
    - Partial predication, Full predication, Dual-issue
  - Load store address generation techniques
    - On array address generation, decoupled access/execute address generation
  - Memory models
    - Memory-mapped, tightly coupled
- Data layout generation: creates multi-bank data layout by allocating live-in/live-out variables (scalar and arrays) in the CGRA data memory
  - Supports simple data placement policies



#### Abstract Architecture Specifications

- Morpher abstract Architecture Description Language (ADL) is designed to cover diverse CGRA architectures
- Three main components
  - Modules: Represent hardware blocks (PEs, RFs, ALUs, LSUs, and Memories)
    - Primitive modules:
      - Functional Units (FU)
      - Register Files (RF)
      - Memory Units (MU)
  - Ports: Entry and exit points of the modules that carry data
  - Connections: Describe the connectivity among ports
- Multiplexers are automatically inferred through the connections



Architecture nterpreter, au generator Verilog RTL

FPGA Emulation

5

Verilator

(Simulation

#### Abstract Architecture Specifications

 Morpher abstract Architecture Description Language (ADL) is designed to cover diverse CGRA architectures



PE described in Morpher ADL



Scala-based Architecture

(5) CGRA

Verilator

(Simulation

Architecture

nterpreter, aut generator Verilog RTL

FPGA Emulation

#### Abstract Architecture Specifications

- Morpher ADL provides a special syntax to connect the PEs to a given interconnection pattern automatically
  - Eliminates the need for the user to specify all the connections

| PE | PE PE | PE | PE |
|----|-------|----|----|
|    |       |    |    |
| PE | PE -  | PE | PE |
|    |       |    |    |
| PE | PE -  | PE | PE |
|    |       |    |    |
| PE | PE -  | PE | PE |

PEs connected in a 2-D "GRID" Pattern

| "CGRA" :{                                                                                                       |
|-----------------------------------------------------------------------------------------------------------------|
| "SUBMODS" :[                                                                                                    |
|                                                                                                                 |
| { "PATTERN" : "GRID",                                                                                           |
| "DIMS" : {"X" : 4,"Y" : 4},                                                                                     |
| "MODS" : [                                                                                                      |
| {"X":0,"Y":0,"MOD":"PE_MEM"}, {"X":1,"Y":0,"MOD":"PE"}, {"X":2,"Y":0,"MOD":"PE"}, {"X":3,"Y":0,"MOD":"PE"},     |
| {"X":0,"Y":1,"MOD":"PE_MEM"}, {"X":1,"Y":1,"MOD":"PE"}, {"X":2,"Y":1,"MOD":"PE"}, {"X":3,"Y":1,"MOD":"PE"},     |
| {"X":0,"Y":2,"MOD":"PE_MEM"}, {"X":1,"Y":2,"MOD":"PE"}, {"X":2,"Y":2,"MOD":"PE"}, {"X":3,"Y":2,"MOD":"PE"},     |
| {"X":0,"Y":3,"MOD":"PE_MEM"}, {"X":1,"Y":3,"MOD":"PE"}, {"X":2,"Y":3,"MOD":"PE"}, {"X":3,"Y":3,"MOD":"PE"}      |
| ],                                                                                                              |
| "CONNECTIONS" : [                                                                                               |
| {"FROM_X" : "X", "FROM_Y" : "Y", "FROM_PORT" : "NORTH_0", "TO_X" : "X", "TO_Y" : "Y-1", "TO_PORT" : "SOUTH_I"}, |
| {"FROM_X" : "X", "FROM_Y" : "Y", "FROM_PORT" : "EAST_O", "TO_X" : "X+1", "TO_Y" : "Y", "TO_PORT" : "WEST_I"},   |
| {"FROM_X" : "X", "FROM_Y" : "Y", "FROM_PORT" : "WEST_O", "TO_X" : "X-1", "TO_Y" : "Y", "TO_PORT" : "EAST_I"},   |
| {"FROM_X" : "X", "FROM_Y" : "Y", "FROM_PORT" : "SOUTH_O", "TO_X" : "X", "TO_Y" : "Y+1", "TO_PORT" : "NORTH_I"}  |
| ]                                                                                                               |
| }                                                                                                               |
| ]                                                                                                               |
|                                                                                                                 |

Syntax for the "GRID" Pattern (https://github.com/ecolab-nus/morpher/Morpher\_CGRA\_Mapper/json\_arch/stdnoc.json)



Architecture

nterpreter, aut generator Verilog RTL

FPGA Emulation

3

(5) CGRA

Verilator

(Simulation

#### CGRA Mapper

- Mapper generates the configurations of hardware elements (ALU, Mux, RF read/write)
  - Goal of the mapper is to generate a mapping with the lowest possible II (Minimum II)
- Algorithm:
  - Minimum II is calculated
    - based on the CGRA array size, number of DFG nodes, and the recurrence dependencies in the kernel
  - Create Modulo Routing Resource Graph (MRRG) with MII
  - Create the initial mapping allowing resource overuse:
    - Create a scheduling list: sort the nodes of the DFG in a topological ordering
    - Map each DFG node to a MRRG node with the least accumulated routing cost
  - Iteratively resolves the resource overuse
  - Increase II and redo the mapping till a feasible schedule is obtained





#### CGRA Mapper



- Morpher currently supports three approaches to resolve resource overuse
  - Adaptive-heuristic-based approach inspired by SPR
    - Cost of the over-subscribed ports is increased for the next mapping iterations
    - Resources with more demand would be available for routing the dependencies with fewer options
  - Simulated Annealing (SA) based approach
    - Node placement is changed based on an SA-based cooling schedule
  - Learning-induced (LISA) approach [1]
    - Node placement is guided by labels inferred from a trained Graph Neural Network (GNN) model
- Our modular code base allows researchers to add their mapping methods
  - In the future, we plan to incorporate hierarchical mapping approaches for better scalability [2, 3]

[3] D. Wijerathne, Z. Li, T. K. Bandara, and T. Mitra, "PANORAMA: Divide-and-Conquer Approach for Mapping Complex Loop Kernels on CGRA," in Proceedings of the 59th Annual Design Automation Conference 2022



<sup>[1]</sup> Z. Li, D. Wu, D. Wijerathne, and T. Mitra, "LISA: Graph Neural Network based Portable Mapping on Spatial Accelerators," in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

<sup>[2]</sup> D. Wijerathne, Z. Li, A. Pathania, T. Mitra, and L. Thiele, "HiMap: Fast and scalable high-quality mapping on CGRA via hierarchical abstraction," in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE).

#### **Test Data Generation**



#### Instrument, i.e., insert data recording functions to the C source code to



#### Simulation and Verification



- Morpher Simulator is a model of the CGRA composed of Fus, registers, multiplexers, and memories
- Currently Morpher simulator only models the variations of HyCUBE CGRA architecture
- CGRA model acts as a memory-mapped slave device to a host processor
  - The live-in variables are loaded for each memory unit
  - Simulator executes operations mapped on FUs, multiplexes the data, and writes data to registers/memories on a cycle basis according to the mapping configurations
  - Post-simulation memory content is validated against the expected results



#### **Open-source** Artifact Demonstration





#### Experimental Study

- 1. Mapping quality and compilation time
- 2. Map real applications kernels with automated verification
- 3. The ability to model diverse CGRA architectures



### Mapping Quality and Compilation Time

- Comparison with CGRA-ME, the only open-source framework available for architecture-adaptive compilation
- Target architecture: 4x4 CGRA
  - Interconnected in N2N connections
  - Each PE has RF with four registers
- Kernels: Polybench benchmark suite
- Results:
  - Morpher achieves the minimum possible II for all kernels, while CGRA-ME failed to do so for five kernels.
  - Morpher LISA mapping is 13.3x faster than CGRA-ME on average.



Quality of Mapping (min II/ Achieved II)



Compilation Time (s)



### Accelerating Real Applications on CGRA

- This study demonstrates
  - How Morpher can be used to accelerate real application kernels on CGRA-based systems
  - The importance of supporting control divergence and recurrence edges
- Target CGRA: 4x4 HyCUBE with a partial predicationbased execution model
- Target application: Microspeech (wake-word detection) from tinyML benchmark suite
  - Target kernel: Convolution layer



HyCUBE CGRA





### Accelerating Real Applications on CGRA

- This study demonstrates
  - How Morpher can be used to accelerate real application kernels on CGRA-based systems
  - The importance of supporting control divergence and recurrence edges
- Target CGRA: 4x4 HyCUBE with a partial predicationbased execution model
- Target application: Microspeech (wake-word detection) from tinyML benchmark suite
  - Target kernel: Convolution layer
  - Convolution layer is lowered to GEMM kernel



Neural Network Model



#### Accelerating Real Applications on CGRA

- We consider three variations of the GEMM kernel
  - GEMM
  - GEMM-U: unrolled four times
  - GEMM-U-F: unrolled and flattened
    - Flattening introduces conditional statements inside the loop body and adds a long recurrence edge
    - Flattened kernel does not have invocation overheads
- Results:
  - Morpher can successfully compile all three kernels
  - GEMM-U kernel reduces the compute time by nearly half compared to the GEMM
  - GEMM-U-F offers the best total execution time
  - Automatically verifies the compiled kernels by running simulations using pre-processed audio data extracted from the application

| O[i][j] += W[i][k] * I[k][j];                                                                   |
|-------------------------------------------------------------------------------------------------|
|                                                                                                 |
| for (i=0;i <r1; i++)<="" th=""></r1;>                                                           |
| <pre>for (j=0; j<c2; j++)<="" pre=""></c2;></pre>                                               |
| <pre>for (k=0;k<c1; (gemm-u)<="" k="k+4):" map="" pre="" this=""></c1;></pre>                   |
| O[i][j] += W[i][k] * I[k][j]                                                                    |
| + W[i][k+1]* I[k+1][j]                                                                          |
| + W[i][k+2]* I[k+2][j]                                                                          |
| + W[i][k+3]* I[k+3][j];                                                                         |
|                                                                                                 |
| <pre>for (n=0;n<r1*c2*c1; (gemm-u-f)<="" map="" n++)="" pre="" this="" {:=""></r1*c2*c1;></pre> |
| O[i][j] += W[i][k] * I[k][j]                                                                    |
| + W[i][k+1]* I[k+1][j]                                                                          |
| + W[i][k+2]* I[k+2][j]                                                                          |
| + W[i][k+3]* I[k+3][j];                                                                         |
| k+=4;                                                                                           |
| <pre>if(k+1&gt;=C1) {k=0; ++j;}</pre>                                                           |
| <pre>if(j==C2) {j=0; ++i;}</pre>                                                                |
| }                                                                                               |
| GEMM kernels                                                                                    |
| GEIMIM Kernels                                                                                  |
|                                                                                                 |
|                                                                                                 |

or (k=0;k<C1; k++): //map this (GEMM)

2 for (j=0; j<C2; j++)</pre>

| Kernel   | Nodes | II (MII) | Compute<br>time (ms) | Data<br>transfer<br>time (ms) | Total<br>execution<br>time (ms) |
|----------|-------|----------|----------------------|-------------------------------|---------------------------------|
| GEMM     | 26    | 4 (4)    | 2.70                 | 3.39                          | 6.09                            |
| GEMM-U   | 58    | 6 (4)    | 1.17                 | 3.39                          | 4.56                            |
| GEMM-U-F | 79    | 8 (8)    | 1.31                 | 1.79                          | 3.10                            |

#### **Execution time**





#### Modelling Complex CGRAs

|                                  |        | а  | dpcm                |    | aes                 |    | dct                 |    | fft                 |    | gsm                 |
|----------------------------------|--------|----|---------------------|----|---------------------|----|---------------------|----|---------------------|----|---------------------|
| MII                              |        |    | 7 10                |    | 9                   |    | 4                   |    | 6                   |    |                     |
| Architecture case studies        |        | II | Compile<br>Time (m) |
| Generic CGRA                     |        | 17 | 323                 | 24 | 410                 | 23 | 367                 | 11 | 467                 | 10 | 409                 |
| ALU-Independent Routing          |        | 9  | 14                  | 19 | 79                  | 15 | 22                  | 5  | 16                  | 8  | 88                  |
| Multi-hop Routing                | 2-Hops | 9  | 8                   | 15 | 17                  | 12 | 12                  | 5  | 2                   | 8  | 96                  |
|                                  | 3-Hops | 9  | 4                   | 13 | 5                   | 10 | 8                   | 5  | 10                  | 7  | 39                  |
|                                  | 4-Hops | 8  | 1                   | 13 | 3                   | 11 | 4                   | 5  | 10                  | 10 | 175                 |
| Multi-Hop Scattered<br>Registers | 4-Hops | 8  | 1                   | 12 | 2                   | 11 | 3                   | 5  | 7                   | 7  | 18                  |



#### Conclusion

- Morpher CGRA compilation and simulation framework
  - Flexible to model modern CGRA architectures
  - Map complex workloads with a higher mapping quality at a shorter compilation time
  - Automatically validate the correctness of the mapping results through cycleaccurate simulation
- Fully open-source
  - Modular codebase
  - Easy to modify
- Open source repository:
  - https://github.com/ecolab-nus/morpher









## Thank you!

