### TAPAS: Generating Parallel Accelerators from Parallel Programs

Steven Margerm<sup>1</sup>, **Amirali Sharifian<sup>1</sup>**, Apala Guha<sup>1</sup>, Gilles Pokam<sup>2</sup>, Arrvindh Shriraman<sup>1</sup>



Simon Fraser University<sup>1</sup>, Intel Corp.<sup>2</sup>



# Motivation



#### FPGAs are everywhere

- Lots of parallelism
  - \$150 Cyclone V SoC: 60 stencil tasks
- 10s of cycles for invoking a hardware "task"
- Fine-grain parallelism
  - Cyclone V: 512 arithmetic ops

# **High Level Synthesis**



- Mixes schedule and algorithm
  - `#pragma`
- Static schedule
  - limited concurrency control
- Domain-specific templates
  - generalizable?

### **TAPAS:** Auto generating **Parallel Dataflow Accelerator**



MIT's parallel compiler (Tapir)



- Hardware component library:
  - like UCB Rocket, but for accelerators



- Generator
  - synthesizing RTL from compiler IR





- HLS Challenge: Static Parallelism
- TAPAS: modular high-level synthesis
- TAPAS: generating task units



## HLS Challenge: Static Parallelism

### **Unrolled Program**

    #pragma UNROLL 2
    for(i = 0 until n){
      if(node[i].valid){
        compute(&node[i]);
      }
    }



### Worst-case schedule -> Low utilization

### Our Approach: Dynamic Parallelism



### Run-time schedule -> High utilization
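The contrast between the two schedules can be sketched with a toy cycle model. This is only an illustration under stated assumptions, not how TAPAS or any HLS tool computes its schedule: the per-iteration costs (10 cycles for a valid node, 1 cycle for an invalid one), the two-lane setup, and the function names are all hypothetical.

```python
import heapq

def static_unrolled_cycles(costs, lanes=2):
    """Lockstep schedule: each group of `lanes` iterations takes as long as its slowest member."""
    return sum(max(costs[i:i + lanes]) for i in range(0, len(costs), lanes))

def dynamic_cycles(costs, lanes=2):
    """Run-time schedule: the next iteration goes to whichever lane frees up first."""
    free_at = [0] * lanes              # cycle at which each lane becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)

def utilization(costs, total_cycles, lanes=2):
    """Fraction of lane-cycles spent doing useful work."""
    return sum(costs) / (lanes * total_cycles)

# Hypothetical workload: valid nodes cost 10 cycles, invalid ones 1 cycle.
valid = [True, False, False, True, True, False, False, True]
costs = [10 if v else 1 for v in valid]

static_total = static_unrolled_cycles(costs)
dynamic_total = dynamic_cycles(costs)
print(static_total, utilization(costs, static_total))
print(dynamic_total, utilization(costs, dynamic_total))
```

In this toy model every unrolled pair contains one slow iteration, so the static schedule always pays the worst case, while the run-time schedule lets the idle lane pick up the next iteration.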











    cilk_for(i = 0 until n){
      cilk_for(j = 0 until n){
        c[i][j] = a[i][j]+b[i][j];
      }
    }

- Parallel Compiler: captures <u>Spawn</u> and <u>Sync</u> from IR
- Static Task Graph: for\_i, for\_j, body
- Task Extractor: wraps each task in a first-class entity
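One way to picture the static task graph for the nested loop is as a small tree of task objects connected by spawn edges. The sketch below is a hypothetical illustration, not the TAPAS or Tapir representation: the `Task` class, its `spawns` field, and the helper functions are assumed names; only the task names `for_i`, `for_j`, and `body` come from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A task as a first-class entity: a name plus the tasks it may spawn."""
    name: str
    spawns: list = field(default_factory=list)

def build_task_graph():
    """Static task graph of the nested cilk_for: for_i spawns for_j, which spawns body."""
    body = Task("body")
    for_j = Task("for_j", spawns=[body])
    for_i = Task("for_i", spawns=[for_j])
    return for_i

def spawn_edges(task):
    """List the (parent, child) spawn edges reachable from `task`, depth first."""
    edges = []
    for child in task.spawns:
        edges.append((task.name, child.name))
        edges.extend(spawn_edges(child))
    return edges

print(spawn_edges(build_task_graph()))
```

The graph is static because it is fixed at compile time; how many instances of each task exist at run time depends on `n`.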

## **Task-Level Architecture**



**Heterogeneous!** 

**Nested Parallel!** Asynchronous!
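Nested, asynchronous task execution can be mimicked in software with a single ready queue. The sketch below is a loose software analogy under stated assumptions, not a model of the generated hardware: the queue, the `executed` counters, and the function name are hypothetical, and "sync" is approximated by the queue draining.

```python
from collections import deque

def matrix_add_tasks(a, b):
    """Run c = a + b as nested tasks: for_i spawns for_j tasks, which spawn body tasks."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    ready = deque([("for_i",)])                    # tasks waiting to execute
    executed = {"for_i": 0, "for_j": 0, "body": 0}
    while ready:                                   # an empty queue stands in for sync
        kind, *args = ready.popleft()
        executed[kind] += 1
        if kind == "for_i":
            for i in range(n):
                ready.append(("for_j", i))         # spawn one inner-loop task per row
        elif kind == "for_j":
            (i,) = args
            for j in range(n):
                ready.append(("body", i, j))       # spawn one body task per element
        else:
            i, j = args
            c[i][j] = a[i][j] + b[i][j]
    return c, executed

c, executed = matrix_add_tasks([[1, 2], [3, 4]], [[10, 20], [30, 40]])
print(c, executed)
```

Each task kind here is handled by different code, which loosely mirrors the heterogeneous task units named on the slide.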





















• What does task pipelining look like in TAPAS?

#### Dedup








What are the elements inside each TXU?



c[i][j] = a[i][j]+b[i][j];






# Experiment

- Board: Arria 10 SoC
- CPU: Intel Core i7



- Execution time reported
  - Number of cycles
- Goal:
  - Performance/watt improvement
  - Reducing overhead of spawning tasks with few instructions

### How does performance scale with workload size?

Unlike on a CPU, performance on the FPGA scales with the number of TXUs (**#TXU**) even for *fine-grained* parallelism.



### Does performance scale with recursion?

Performance scales with recursive algorithms



### How does performance compare to CPU?

Performance gain compared to an Intel Core i7



#### How does Performance/Watt compare to CPU?

Performance/Watt improves significantly



### What is the overhead of the task controller?

#### ALM Utilization by Sub-block



# Available now https://github.com/sfu-arch/tapas

# Thanks Chisel and Tapir folks

### Shout out to related...

- An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware (MICRO-51)
- Dynamically Scheduled High-Level Synthesis (FPGA '18)

### Parametrization and Configuration

- A TAPAS-generated accelerator is <u>Parametrizable</u> and <u>Configurable</u>.
  - The number of TXUs can be set individually for each task, based on different criteria.
  - Datapath width can be set at this phase, which also supports mixed precision.
  - Memory modules within each Task Unit are configurable, e.g. scratchpads, network, and cache.
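The knobs listed above could be captured in a per-task configuration record. The sketch below is hypothetical and does not reflect the actual TAPAS generator interface: the class name, field names, and example values are assumptions chosen only to mirror the three bullets.

```python
from dataclasses import dataclass

# Memory options named on the slide; the string spellings are assumptions.
MEMORY_KINDS = {"scratchpad", "network", "cache"}

@dataclass(frozen=True)
class TaskUnitConfig:
    """Hypothetical per-task parameters: TXU count, datapath width, memory module."""
    task: str
    num_txus: int
    datapath_width_bits: int
    memory: str

    def __post_init__(self):
        if self.num_txus < 1:
            raise ValueError("each task needs at least one TXU")
        if self.datapath_width_bits < 1:
            raise ValueError("datapath width must be positive")
        if self.memory not in MEMORY_KINDS:
            raise ValueError(f"unknown memory kind: {self.memory}")

# Example values are illustrative only: more TXUs and a narrower datapath for the leaf task.
configs = [
    TaskUnitConfig("for_i", num_txus=1, datapath_width_bits=32, memory="cache"),
    TaskUnitConfig("for_j", num_txus=2, datapath_width_bits=32, memory="cache"),
    TaskUnitConfig("body", num_txus=4, datapath_width_bits=16, memory="scratchpad"),
]
total_txus = sum(cfg.num_txus for cfg in configs)
print(total_txus)
```

Keeping the configuration per task is what allows mixed precision: each task can carry its own datapath width.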