

INSTITUT NATIONAL DES SCIENCES APPLIQUÉES **RENNES** 

# DATAFLOW CODE GENERATION FOR FPGA

Alexandre Honorat, <u>Mickaël Dardaillon</u>, Jean-François Nezan INSA Rennes / IETR, CNRS UMR 6164

Work in progress





- High throughput / low latency
- Low energy



2



- High throughput / low latency
- Low energy



CPU



3





- High throughput / low latency
- Low energy







- High throughput / low latency
- Low energy







- Natural expression for signal and image processing







- Natural expression for signal and image processing









- Natural expression for signal and image processing









Natural expression for signal and image processing -





UNIVERSIT BRETAGN LOIRE

9



- Natural expression for signal and image processing





UNIVERSITE BRETAGNE LOIRE



- Natural expression for signal and image processing





- High throughput
- High memory usage
- High bandwidth usage
- Long latency

| GaussianBlur1 | GaussianBlur1 | GaussianBlur1 |               |
|---------------|---------------|---------------|---------------|
|               | GaussianBlur2 | GaussianBlur2 | GaussianBlur2 |
|               |               | Difference    | Difference    |





- Natural expression for signal and image processing









- Natural expression for signal and image processing





13

UNIVERSITE BRETAGNE LOIRE



- Natural expression for signal and image processing







- Natural expression for signal and image processing





15

UNIVERSITE BRETAGNE LOIRE



- Natural expression for signal and image processing
- Pipeline implementation good match for FPGA





- Natural expression for signal and image processing
- Pipeline implementation good match for FPGA
- Native support in Vitis HLS

void top\_graph(hls::stream<int> &input, hls::stream<int> &output) {
 // FIFOs

static hls::stream<int> GaussianBlur1ToDuplicate;
#pragma HLS stream variable=GaussianBlur1ToDuplicate depth=2

// Kernels

#pragma HLS DATAFLOW

GaussianBlur1(input, GaussianBlur1ToDuplicate); Duplicate(GaussianBlur1ToDuplicate, DuplicateToGaussianBlur2,

#### DuplicateToDifference);

GaussianBlur2(DuplicateToGaussianBlur2, GaussianBlur2ToDifference); Difference(GaussianBlur2ToDifference, DuplicateToDifference, output);

```
}
```

```
BRETAGNE
LOIRE
```



# **Challenge for design productivity**

- Express dataflow at high level
- Optimized hardware implementation



18



# **Challenge for design productivity**

- Express dataflow at high level
- Optimized hardware implementation

# Contributions

- Dataflow code generation for FPGA using HLS
- Automatic scheduling and buffer sizing
- Open source implementation in PREESM







# DATAFLOW CODE GENERATION FOR FPGA USING HLS











- Parameterized
- Interfaced
- Synchronous Dataflow







- Parameterized
- Interfaced
- Synchronous Dataflow
  - Multirate







- Parameterized
- Interfaced
- Synchronous Dataflow
  - Multirate
  - Repetition factor
  - Deadlock free
  - Automatic scheduling and buffer sizing















#### User provided

- Kernels
  - HLS
  - FIFO interface (stream<T>)



26





Kernels + Graph

#### **User provided**

- Kernels
  - HLS
  - FIFO interface (stream<T>)

# **Code generation**

- Graph
  - Multirate
  - Cyclic
  - Self-scheduled ASAP









Input

Kernels + Graph

Output

28

# User provided

- Kernels
  - HLS
  - FIFO interface (stream<T>)

# **Code generation**

- Graph
  - Multirate
  - Cyclic
  - Self-scheduled ASAP

# - Input / output

- Array interface (T\*)
- Batch transfer from/to RAM
- Scheduled by host







# Input

#### Kernels + Graph

#### Output

# **User provided**

- Kernels
  - HLS
  - FIFO interface (stream<T>)

# **Code generation**

- Graph
  - Multirate
  - Cyclic
  - Self-scheduled ASAP

# - Input / output

- Array interface (T\*)
- Batch transfer from/to RAM
- Scheduled by host
- Host code
  - OpenCL
  - PYNQ
  - Bare metal
  - Testbench













































# AUTOMATIC SCHEDULING AND BUFFER SIZING













#### Acquire – Release

• Shared memory sync.







#### **Acquire – Release**

Shared memory sync. 







# **Problem: matching actor and hardware execution model Acquire – Release**

• Shared memory sync.



40





Acquire – Release

Shared memory sync.



SDF Access Pattern [Tripakis et al. 11]

- Cycle accurate info
- Scalability issues





Acquire – Release

Shared memory sync.



SDF Access Pattern [Tripakis et al. 11]

- Cycle accurate info
- Scalability issues







- $\tau_p = 4$  tokens per execution
- *II* = 8 *cycles*
- $a_p = 0.5 token/cycle$





- $\tau_p = 4$  tokens per execution
- II = 8 cycles
- $a_p = 0.5 token/cycle$







- $\tau_p = 4$  tokens per execution
- *II* = 8 *cycles*
- $a_p = 0.5 token/cycle$







Affine approximation

#### **Metrics:**

- $\tau_p = 4$  tokens per execution
- II = 8 cycles
- $a_p = 0.5 token/cycle$

 $cc - \lambda p l \lambda \lambda \lambda p l pp \lambda p l l l \lambda p l$   $\leq pp c cc c \leq a p aa a p pp a p$   $\times cc + \lambda p u \lambda \lambda \lambda p u pp \lambda p u uu \lambda$  p u11



LOIRE



#### **Metrics:**

- $\tau_p = 4$  tokens per execution
- *II* = 8 *cycles*
- $a_p = 0.5 token/cycle$



47



- $\tau_p = 4$  tokens per execution
- II = 8 cycles
- $a_p = 0.5 token/cycle$







Affine approximation

#### **Metrics:**

- $\tau_p = 4$  tokens per execution
- II = 8 cycles
- $a_p = 0.5 token/cycle$

 $\tau p c - \lambda p \leq p c \leq a p \times c + \lambda p$   $p p p p p p p p p p \tau p 1 - \tau p$   $II 11 - \tau p II \tau p \tau \tau \tau p p \tau p$   $\tau p II IIII \tau p II 1 - \tau p II$ 





LOIR



**ILP constraints:** 

- : FIFO size
- $\theta$ : delays
- *φ*: phase





(1) 
$$\theta_{e_{p\to c}} + a_p \frac{\varphi_{p\to c}}{n} \ge \lambda_c^u + \lambda_p^l + a_p C_{under}$$

- **ILP constraints:**
- 1. Underflow

- : FIFO size
- $\theta$ : delays
- *φ*: phase





(1) 
$$\theta_{e_{p \to c}} + a_p \frac{\varphi_{p \to c}}{n} \ge \lambda_c^u + \lambda_p^l + a_p C_{under}$$

(2) 
$$\theta_{e_{p\to c}} + a_p \frac{\varphi_{p\to c}}{d} \le \delta_{p\to c} - \lambda_c^u - \lambda_p^l - a_p C_{over}$$

- **ILP constraints:**
- 1. Underflow
- 2. Overflow

- : FIFO size
- θ: delays
- *φ*: phase





**ILP constraints:** 

- 1. Underflow
- 2. Overflow
- 3. Cycles

(1) 
$$\theta_{e_{p\to c}} + a_p \frac{\varphi_{p\to c}}{n} \ge \lambda_c^u + \lambda_p^l + a_p C_{under}$$

(2) 
$$\theta_{e_{p\to c}} + a_p \frac{\varphi_{p\to c}}{d} \le \delta_{p\to c} - \lambda_c^u - \lambda_p^l - a_p C_{over}$$

 $\frac{n}{d} = \frac{\tau_p}{II_p} \times \frac{II_c}{\tau_c}$ 

(3) 
$$\sum_{i=1}^{k} \left( \prod_{l=1}^{i-1} d_l \right) \left( \prod_{l=i+1}^{k} n_l \right) \varphi_i = 0$$

- : FIFO size
- θ: delays
- *φ*: phase





(1)  $\theta_{e_{p\to c}} + a_p \frac{\varphi_{p\to c}}{n} \ge \lambda_c^u + \lambda_p^l + a_p C_{under}$ 

(2) 
$$\theta_{e_{p\to c}} + a_p \frac{\varphi_{p\to c}}{d} \le \delta_{p\to c} - \lambda_c^u - \lambda_p^l - a_p C_{over}$$

1. Underflow

**ILP constraints:** 

- 2. Overflow
- 3. Cycles

# Formally proven (A. Bouakaz thesis)

- $\delta$ : FIFO size
- θ: delays
- *φ*: phase

$$\frac{n}{d} = \frac{\tau_p}{II_p} \times \frac{II_c}{\tau_c}$$

(3) 
$$\sum_{i=1}^{k} \left( \prod_{l=1}^{i-1} d_l \right) \left( \prod_{l=i+1}^{k} n_l \right) \varphi_i = 0$$





(1)  $\theta_{e_{p\to c}} + a_p \frac{\varphi_{p\to c}}{n} \ge \lambda_c^u + \lambda_p^l + a_p C_{under}$ 

(2) 
$$\theta_{e_{p\to c}} + a_p \frac{\varphi_{p\to c}}{d} \le \delta_{p\to c} - \lambda_c^u - \lambda_p^l - a_p C_{over}$$

1. Underflow

ILP constraints:

2. Overflow

3. Cycles

# Formally proven (A. Bouakaz thesis)

- $\delta$ : FIFO size
- $\theta$ : delays
- *φ*: phase

# Fast for ILP formulation (3 variables per edge) Guarantee optimal throughput (no push back by overflow)

(3) 
$$\sum_{i=1}^{k} \left( \prod_{l=1}^{i-1} d_l \right) \left( \prod_{l=i+1}^{k} n_l \right) \varphi_i = 0$$

 $\frac{n}{d} = \frac{\tau_p}{II_n} \times \frac{II_c}{\tau_c}$ 





# **EXPERIMENTAL RESULTS**













- Vitis Library kernels + graph
- PREESM computed FIFO sizes



58





# Setup:

- Vitis Library kernels + graph
- PREESM computed FIFO sizes

|                        | Vitis Library | PREESM             |
|------------------------|---------------|--------------------|
| [Undirected cycle]     | 15360         | 8580               |
| [FIFO]                 | 2             | 1276 - 7022        |
| BRAM                   | 18            | 35 (+49%)          |
| Latency (Synthesis)    | 19914         | 19914              |
| Latency (Cosimulation) | 20870         | 19827 <b>(-5%)</b> |





# **Gaussian Difference**

Reduce kernel latency

| RELOW |
|-------|
| 2541  |
| 2540  |
| 12    |
|       |







# **Gaussian Difference**

Reduce kernel latency

ROWS

|   |                | Vitis + FIFO | PREESM |
|---|----------------|--------------|--------|
| у | [Undir. cycle] | 8580         | 2541   |
|   | [FIFO]         | 1276 - 7022  | 2540   |
|   | BRAM           | 35           | 12     |
|   |                |              |        |



# **Color Filter**

COLS

 Move downsampling to small kernel

|                | PREESM     | Optimized |
|----------------|------------|-----------|
| [Undir. cycle] | 808011     | 38002     |
| [FIFO]         | 2 - 520936 | 2 - 5082  |
| BRAM           | 790        | 61        |







### **2D Wavelet Transform**

 Reduce actor granularity

|        | PREESM     | Optimized |
|--------|------------|-----------|
| [FIFO] | 12 - 50897 | 12 - 1510 |
| BRAM   | 368        | 108       |





# CONCLUSION





#### **Contributions**

- Dataflow code generation for FPGA using HLS
- Automatic scheduling and buffer sizing
- Open source implementation in PREESM







#### Contributions

- Dataflow code generation for FPGA using HLS
- Automatic scheduling and buffer sizing
- Open source implementation in PREESM

# Future work

- Improve buffer sizing
  - Data dependency
  - Actor internal scheduling







### Contributions

- Dataflow code generation for FPGA using HLS
- Automatic scheduling and buffer sizing
- Open source implementation in PREESM

# Future work

- Improve buffer sizing
  - Data dependency
  - Actor internal scheduling

# - Target heterogeneous platform

- FPGA + CPU
- Mapping + Scheduling + Code generation





67

### Contributions

- Dataflow code generation for FPGA using HLS
- Automatic scheduling and buffer sizing
- Open source implementation in PREESM

# Future work

- Improve buffer sizing
  - Data dependency
  - Actor internal scheduling
- Target heterogeneous platform
  - FPGA + CPU
  - Mapping + Scheduling + Code generation
- Design Space Exploration
  - Constraints based optimization
  - Memory optimization

