



A Software Developer's Journey into a Deeply Heterogeneous World

Tomas Evensen, CTO Embedded Software, Xilinx

### Embedded Development: Then

- Simple single CPU
- Most code developed internally
  - Tens of thousands of lines of code in C and assembly
- Single Real-time Operating System
- JTAG/BDM debugger
- Simple I/O



[Figure: debugger screenshot in mixed source/disassembly view. Source lines #54 to #66 of a prime sieve (labels `sieve2` and `sieve3`, with comments such as "main loop of sieve", "is this a prime?", "jump if prime" and "are we done?") are interleaved with the corresponding machine instructions (`lbzx`, `andi.`, `bne`, `addic`, `extsh`, `cmpi`, `lhz`) at addresses starting at 0x02000204.]

© Copyright 2016 Xilinx

XILINX ➤ ALL PROGRAMMABLE.

## Embedded Development: Now

- Multiple heterogeneous CPUs
- Multiple accelerators and programmable logic
- Millions of lines of code
  - Mostly from other places, like open source
- Multiple Operating Systems (e.g. Linux + RTOS)
- JTAG debugger
- Safety and Security concerns




Xilinx Zynq MPSoC

## **Dedicated Hardware is Energy Efficient**



Based on published results at ISSCC conferences.

### Heterogeneous Example: IIoT Gateway



#### Expertise Needed All the Way from a System Level to Cloud Connectivity

### FPGA – The "Chameleon" Chip



### FPGA – Reaching New Developers

- Limited pool of FPGA developers
  - Need to reach software developers
  - Software developers are different!
- Key to reaching software developers
  1. Create libraries so they can use accelerators written by others
  2. Create tools so they can use the FPGA without RTL




## Heterogeneous Software Development



### Mapping Applications to Heterogeneous Systems




### Components for Heterogeneous SW Development

#### Accelerated libraries and frameworks for common functions

- E.g. OpenCV, CNN, ...

#### Support for multiple types of Operating Environments

- Solid Linux support, bare metal, FreeRTOS, 3rd party RTOS, Windows EC
- Mixing of OSes through AMP and hypervisors

#### System debugger – Unifying debug/profile

- Debug across cores and FPGA, including profiling and trace

#### FPGA Compiler – SDSoC

- Write code for FPGA using C/C++/OpenCL
- Automate the "glue" between execution engines

#### Other

- Virtual Prototyping for complete system



### Framework Programming: Deep Learning

- Many embedded problems are being converted to use deep learning
  - Embedded vision, speech, ...
  - Using neural networks of different kinds, e.g. CNN, ...
- Neural networks are "programmed" through learning
- Neural networks are typically controlled by frameworks
  - Caffe, TensorFlow, Torch, Theano, ...
- Neural networks are very computationally intensive
- FPGAs can be very efficient for neural networks
  - Combination of fixed point, flexible routing, memory hierarchies and DSPs
  - By supporting existing frameworks, programmers can avoid RTL
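To make the fixed-point bullet concrete, here is a minimal Python sketch of a multiply-accumulate carried out in 8-bit fixed point. The Q1.7 format, the function names and the sample values are illustrative assumptions; they are not taken from any Xilinx tool or deep learning framework.

```python
# Illustrative sketch only: the Q1.7 format and all names are assumptions.

FRAC_BITS = 7  # Q1.7: 1 sign bit, 7 fractional bits


def to_fixed(values, frac_bits=FRAC_BITS, total_bits=8):
    """Quantize floats to signed fixed-point integers, saturating on overflow."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return [max(lo, min(hi, round(v * scale))) for v in values]


def fixed_mac(weights_q, inputs_q, frac_bits=FRAC_BITS):
    """Integer multiply-accumulate; the wide accumulator is rescaled once at the end."""
    acc = sum(w * x for w, x in zip(weights_q, inputs_q))
    return acc / float(1 << (2 * frac_bits))


weights = [0.5, -0.25, 0.75]
inputs = [0.5, 0.5, 0.25]
result = fixed_mac(to_fixed(weights), to_fixed(inputs))
print(result)  # 0.3125, identical to the floating-point dot product for these values
```

The inner loop uses only narrow integer multiplies and one wide accumulator, which is the kind of arithmetic that maps naturally onto DSP blocks.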



| Layer | Output feature map rows | Output feature map cols | Output feature map depth | Filter dim | Filter depth | MACs        |
|-------|-------------------------|-------------------------|--------------------------|------------|--------------|-------------|
| conv1 | 55                      | 55                      | 64                       | 11         | 3            | 70,276,800  |
| conv2 | 27                      | 27                      | 192                      | 5          | 64           | 223,948,800 |
| conv3 | 13                      | 13                      | 384                      | 3          | 192          | 112,140,288 |
| conv4 | 13                      | 13                      | 256                      | 3          | 384          | 149,520,384 |
| conv5 | 13                      | 13                      | 256                      | 3          | 256          | 99,680,256  |
| fc6   | 6                       | 6                       | 256                      |            | 4096         | 37,748,736  |
| fc7   | 1                       | 1                       | 4096                     |            | 4096         | 16,777,216  |
| fc8   | 1                       | 1                       | 4096                     |            | 1000         | 4,096,000   |
| Total |                         |                         |                          |            |              | 714,188,480 |


**AlexNet Calculations** 
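The MAC counts in the table can be reproduced as rows × cols × depth × dim² × filter depth. The short Python sketch below does that arithmetic; the layer shapes are copied from the table, while the function name and the choice to treat the blank filter dimension of the fully connected rows as 1 are assumptions made for illustration.

```python
# (layer, out_rows, out_cols, out_depth, filter_dim, filter_depth), from the table.
# Assumption: the blank filter dimension of the fc rows is treated as 1.
LAYERS = [
    ("conv1", 55, 55, 64, 11, 3),
    ("conv2", 27, 27, 192, 5, 64),
    ("conv3", 13, 13, 384, 3, 192),
    ("conv4", 13, 13, 256, 3, 384),
    ("conv5", 13, 13, 256, 3, 256),
    ("fc6", 6, 6, 256, 1, 4096),
    ("fc7", 1, 1, 4096, 1, 4096),
    ("fc8", 1, 1, 4096, 1, 1000),
]


def layer_macs(rows, cols, depth, dim, filter_depth):
    """Multiply-accumulates for one layer."""
    return rows * cols * depth * dim * dim * filter_depth


macs = {name: layer_macs(*shape) for name, *shape in LAYERS}
total = sum(macs.values())

for name, count in macs.items():
    print(f"{name:6s} {count:>12,}")
print(f"{'Total':6s} {total:>12,}")  # Total  714,188,480
```

Roughly 0.7 billion MACs for a single image is what makes the workload "very computation intensive".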

### **OpenAMP: A Standard for Multi-OS Systems**

- What is OpenAMP?
  - A standard for mixing embedded Operating Systems
  - An Open Source project
- Trend to combine Operating Systems
  - Linux is used in the majority of use cases
  - Many free and commercial RTOSes are being used
  - Bare metal (no OS) is common on smaller cores
- Why multiple Operating Systems?
  - Heterogeneous cores
  - Different needs
    - Real-time vs. general purpose
    - Different Safety/Security levels
    - Legacy
    - GPL avoidance
- Safety and Security issues are common
  - They affect boot order, messaging implementation, ...
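Messaging between operating systems in an AMP system is typically carried over buffers in memory that both sides can see. The toy Python ring below sketches that idea only. It is not the OpenAMP API: the class, its methods and the message contents are hypothetical, and a real implementation also needs inter-processor interrupts, cache management and memory barriers.

```python
class MessageRing:
    """Toy fixed-size, single-producer/single-consumer ring of message slots.

    Hypothetical illustration of a shared-memory mailbox, not OpenAMP code.
    """

    def __init__(self, slots):
        self._buf = [None] * slots
        self._head = 0   # next slot the producer writes
        self._tail = 0   # next slot the consumer reads
        self._count = 0  # number of occupied slots

    def send(self, payload):
        """Copy a message into the ring; return False if the ring is full."""
        if self._count == len(self._buf):
            return False
        self._buf[self._head] = bytes(payload)
        self._head = (self._head + 1) % len(self._buf)
        self._count += 1
        return True

    def receive(self):
        """Return the oldest message, or None if the ring is empty."""
        if self._count == 0:
            return None
        payload = self._buf[self._tail]
        self._tail = (self._tail + 1) % len(self._buf)
        self._count -= 1
        return payload


# One ring per direction, e.g. Linux -> RTOS.
linux_to_rtos = MessageRing(slots=2)
linux_to_rtos.send(b"start")
linux_to_rtos.send(b"stop")
print(linux_to_rtos.receive())  # b'start'
```

The safety and security bullet shows up even in a toy like this: which side creates the ring, and which side is allowed to write to it, has to be decided at boot.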







- Examples of OpenAMP applications



### **OpenAMP** Capabilities



## Software Development Tools (SDK)

#### 2015 UBM Electronics Embedded Markets Study

#### What are the most important factors in choosing a processor?

#### Complete system visibility needed

- Heterogeneous debugging and analysis is very hard!
- Especially timing-related problems

#### Tool features:

- Heterogeneous system-level debugging
  - Visibility into both CPUs and FPGA
- Integrated performance profiling
  - Which parts of the chip are busy?
  - Measure processor and bus activities
  - Integrated traffic generator
- System event trace
  - What is happening in the chip over time?
  - Combined timeline for SW and HW events
- Based on standards
  - Open source Eclipse, TCF
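A combined timeline for software and hardware events amounts to merging separately captured, time-stamped streams into one ordered view. The Python sketch below shows the idea with `heapq.merge` from the standard library; the event names, timestamps and tuple layout are made up for illustration and do not reflect the trace format of any particular tool.

```python
import heapq

# Hypothetical events as (timestamp_ns, source, description), each stream already sorted.
sw_events = [
    (100, "SW", "ISR enter"),
    (250, "SW", "DMA start requested"),
    (900, "SW", "ISR exit"),
]
hw_events = [
    (260, "HW", "AXI burst begin"),
    (700, "HW", "AXI burst end"),
]


def combined_timeline(*streams):
    """Merge sorted event streams into a single list ordered by timestamp."""
    return list(heapq.merge(*streams, key=lambda event: event[0]))


timeline = combined_timeline(sw_events, hw_events)
for timestamp, source, description in timeline:
    print(f"{timestamp:>6} ns  {source}  {description}")
```

The merge only makes sense if both streams share a time base, which is one reason this kind of visibility needs tool support rather than ad hoc logging.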



71% - Software development tools

#### Strong system level tools are critical for heterogeneous development


#### Live tables

- ARM performance registers
  - Cache misses, IPC, ...
- AXI performance registers
  - Transactions, latency, ...
- Non-intrusive JTAG profiling

#### Timeline

- Correlate performance
  - Cache, bus ...
- Examples:
  - How does AC...

[Figure: SDK performance analysis screenshot with live counter tables and graphs of CPU utilization (%), CPU instructions per cycle and cache accesses for CPU0 and CPU1, plotted at 50 ms intervals (about 33 million ARM processor clock cycles).]

| ARM performance counter                | CPU0          | CPU1 |
|----------------------------------------|---------------|------|
| L1 Data Cache Miss Rate (%)            | 0.02114421    | 0.0  |
| L1 Data Cache Access                   | 5.013177678E9 | 0.0  |
| L2 Data Cache Miss Rate (%)            | 87.19937      | -    |
| L2 Data Cache Access                   | 859520.0      | -    |
| CPU Stall Cycles Per Write Instruction | 0.200464      | 0.0  |
| CPU Stall Cycles Per Read Instruction  | 0.0498987     | 0.0  |

| AXI performance counter   | HP0          | HP1         | HP2          | HP3          | ACP          |
|---------------------------|--------------|-------------|--------------|--------------|--------------|
| Write Transactions        | 9.8948214E7  | 9.8948226E7 | 1.03853698E8 | 1.03853682E8 | 1.16285522E8 |
| Average Write Latency     | 197.4663     |             |              |              |              |
| Write Latency - Std Dev   | 1.945468     |             |              |              |              |
| Write Throughput (MB/sec) | 96.2568      |             |              |              |              |
| Read Transactions         | 1.08894021E8 |             |              |              |              |
| Average Read Latency      | 175.2481     |             |              |              |              |
| Read Latency - Std Dev    | 1.609473     |             |              |              |              |
| Read Throughput (MB/sec)  | 105.9321     |             |              |              |              |
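The derived figures in a performance view like this are simple ratios of raw counters: the graph descriptions in the screenshot define instructions per cycle as instructions divided by active clock cycles, and CPU utilization as active clock cycles divided by total clock cycles. The Python sketch below spells those ratios out. The function names and sample counter values are hypothetical, and the miss-rate formula (misses divided by accesses) is the conventional definition, assumed here rather than taken from the slide.

```python
def instructions_per_cycle(instructions, active_cycles):
    """IPC: instructions retired divided by active clock cycles."""
    return instructions / active_cycles if active_cycles else 0.0


def cpu_utilization_pct(active_cycles, total_cycles):
    """Share of the sampling interval in which the CPU clock was active."""
    return 100.0 * active_cycles / total_cycles if total_cycles else 0.0


def miss_rate_pct(misses, accesses):
    """Cache miss rate in percent (assumed definition: misses / accesses)."""
    return 100.0 * misses / accesses if accesses else 0.0


# Hypothetical raw counters for one 50 ms sampling interval.
total_cycles = 33_000_000
active_cycles = 16_500_000
instructions = 24_750_000

print(cpu_utilization_pct(active_cycles, total_cycles))     # 50.0
print(instructions_per_cycle(instructions, active_cycles))  # 1.5
print(miss_rate_pct(1_000, 8_000))                          # 12.5
```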


#### Generate Traffic Pattern

- Pre-defined bitstream
- Configurable to emulate traffic patterns on multiple ports
- Simultaneous CPU loading
- Configurable app types for pre-porting eval

[Figure: "Edit Configuration" dialog, ATG Configuration tab, with the port settings below.]

| Port Location | Template Id | Operation | Address Start | Address Next | Beats/tranx | Tranx interval | Est. Throughput |
|---------------|-------------|-----------|---------------|--------------|-------------|----------------|-----------------|
| atg_acp       | `<none>`    |           |               |              |             |                |                 |
| atg_acp       | `<none>`    |           |               |              |             |                |                 |
| atg_hp0       | `<custom>`  | RD        | ddr0          | increment    | 16          | 34             | 376             |
| atg_hp0       | `<none>`    |           |               |              |             |                |                 |
| atg_hp1       | `<none>`    |           |               |              |             |                |                 |
| atg_hp1       | `<custom>`  | WR        | ddr1          | increment    | 16          | 34             | 376             |
| atg_hp2       | `<custom>`  | RD        | ddr2          | increment    | 16          | 34             | 376             |
| atg_hp2       | `<none>`    |           |               |              |             |                |                 |
| atg_hp3       | `<none>`    |           |               |              |             |                |                 |
| atg_hp3       | `<custom>`  | WR        | ddr3          | increment    | 16          | 34             | 376             |



### **Event Trace to Dissect Timing Issues**




# SDSoC: FPGA Development through Software



### FPGA Productivity with Technology Advancement





## Typical Zynq Development Flow







### Need to modify multiple levels of design entry


### After SDSoC:



- Remove the manual design of SW drivers and HW connectivity
- Use the C/C++ end application as the input, calling the user algorithm IPs as function calls
- Partition a set of functions to Programmable Logic with a single click



### After SDSoC: Automatic System Generation



### C/C++ to System in hours, days






Image processing on the video I/Os via DDR3 memory

### How to Call Accelerators - Programming Paradigms

#### Explicit Message Passing APIs

- Generic API to transfer data (send/receive, set/get)
- Tasks written in C/C++ (SW) and/or VHDL/Verilog (HW)
- Mental model: Threads communicating with each other
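As a rough illustration of the message-passing mental model, the Python sketch below runs a stand-in "accelerator" as a thread that only communicates through two queues acting as ports. Everything here is hypothetical: the names, the doubling operation and the use of `None` as a shutdown message are assumptions, and a real system would use the transfer API of its platform.

```python
import queue
import threading


def accelerator_task(inbox, outbox):
    """Stand-in for an accelerator: receive a block, process it, send the result."""
    while True:
        block = inbox.get()
        if block is None:  # assumed convention: None means "shut down"
            break
        outbox.put([2 * x for x in block])


def run_message_passing(blocks):
    """Send each block to the accelerator task and collect the replies in order."""
    to_accel = queue.Queue()
    from_accel = queue.Queue()
    worker = threading.Thread(target=accelerator_task, args=(to_accel, from_accel))
    worker.start()
    for block in blocks:
        to_accel.put(block)   # "send"
    to_accel.put(None)
    results = [from_accel.get() for _ in blocks]  # "receive"
    worker.join()
    return results


print(run_message_passing([[1, 2], [3]]))  # [[2, 4], [6]]
```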

#### Function call paradigm

- Standard function call paradigm
  - Synchronous or asynchronous
- Mental model: Call an accelerator that returns result
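The difference between a synchronous and an asynchronous accelerator call can be sketched in Python with `concurrent.futures`. The `fir_accel` function is a software stand-in invented for this example (a small FIR filter); in a real flow the call would be routed to hardware by the tools, which this sketch does not attempt to model.

```python
from concurrent.futures import ThreadPoolExecutor


def fir_accel(samples, taps):
    """Hypothetical stand-in for an accelerated function: a direct-form FIR filter."""
    return [
        sum(taps[k] * samples[i - k] for k in range(len(taps)) if i - k >= 0)
        for i in range(len(samples))
    ]


# Synchronous: the caller blocks until the result is returned.
sync_result = fir_accel([1, 2, 3], [1, 1])

# Asynchronous: start the call, do other work, then wait for the result.
with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(fir_accel, [1, 2, 3], [1, 1])
    other_work = sum(range(10))      # the caller stays busy in the meantime
    async_result = future.result()   # wait for completion

print(sync_result, async_result)  # [1, 3, 5] [1, 3, 5]
```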

#### Enqueue work items (OpenCL)

- Compile OpenCL host and kernels
- Kernels compiled to CPU/Neon or FPGA
- Mental model: Enqueue tasks to next available exec unit
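The "enqueue to the next available execution unit" model can be imitated in plain Python with a shared work queue and one thread per execution unit. This is not OpenCL and uses no OpenCL API; the unit names, the function name and the kernel are assumptions chosen only to show the scheduling idea.

```python
import queue
import threading


def run_work_items(kernel, items, units=("cpu", "fpga")):
    """Run kernel(item) for every item; each item goes to whichever unit is free next."""
    work = queue.Queue()
    for index, item in enumerate(items):
        work.put((index, item))

    results = [None] * len(items)
    executed_by = [None] * len(items)

    def unit_loop(name):
        # Each execution unit keeps pulling work until the queue is drained.
        while True:
            try:
                index, item = work.get_nowait()
            except queue.Empty:
                return
            results[index] = kernel(item)
            executed_by[index] = name

    threads = [threading.Thread(target=unit_loop, args=(name,)) for name in units]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, executed_by


squares, placement = run_work_items(lambda x: x * x, [1, 2, 3, 4])
print(squares)  # [1, 4, 9, 16]
```

Which unit handles which item (`placement`) can differ from run to run, while the results stay the same; the host code only states what work exists, not where it runs.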

#### High-level modeling

- MathWorks MATLAB/Simulink
- National Instruments LabVIEW



send i(port1, A, ...);



### No "right" way of doing this – Depends on application


### Summary

#### Heterogeneous systems are here to stay

- And they will be increasingly complex

#### Developing for heterogeneous systems is hard

- Each component might have its own language and operating environment
- Parallel programming is hard to get right

#### New standards, tools, frameworks and APIs are here to help

- Hiding the complexity and unifying the environments

#### Don't get stuck in old ways

- Embedded developers are conservative
- Never a good time to try new methodologies
- "Boiling frog" syndrome...





