Multiple Kernel Execution on Alveo U50

Article by Akshay Kamath. Last updated on February 23, 2022.

Kernel HLS C Code

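The kernel below performs element-wise addition of two vectors of unsigned integers.

Observe that the two input operands use separate AXI master bundles (aximm1 and aximm2), so they can be mapped to two different memory banks and read simultaneously. The output shares bundle aximm1 with in1.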

extern "C" {
	void vadd(
	        const unsigned int *in1, // Read-Only Vector 1
	        const unsigned int *in2, // Read-Only Vector 2
	        unsigned int *out,       // Output Result
	        int size                 // Number of elements in each vector
	        )
	{
#pragma HLS INTERFACE m_axi port=in1 bundle=aximm1
#pragma HLS INTERFACE m_axi port=in2 bundle=aximm2
#pragma HLS INTERFACE m_axi port=out bundle=aximm1

	    for(int i = 0; i < size; ++i)
	    {
	        out[i] = in1[i] + in2[i];
	    }
	}
}
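
A software reference model is handy for checking the kernel's results on the host. The following is a minimal sketch, not part of the original tutorial; the function name vadd_reference is hypothetical, and it assumes that unsigned int is 32 bits wide in the HLS kernel, so that sums wrap modulo 2**32.

```python
def vadd_reference(in1, in2):
    """Software model of the vadd kernel: element-wise unsigned 32-bit addition."""
    if len(in1) != len(in2):
        raise ValueError("input vectors must have the same length")
    # Assumption: 'unsigned int' is 32 bits, so each sum wraps modulo 2**32.
    return [(a + b) & 0xFFFFFFFF for a, b in zip(in1, in2)]


# Small usage example, including one sum that overflows 32 bits.
print(vadd_reference([1, 2, 0xFFFFFFFF], [10, 20, 1]))
```

The result of such a model can be compared against the buffer read back from the device.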

Source code reference:

Vitis-Tutorials/Getting_Started/Vitis at 2021.2 · Xilinx/Vitis-Tutorials

Device Binary File Generation

Configuration File

In addition to the kernel source code, a configuration file describing kernel settings is required to generate the xclbin file.

The following config file creates two instances (compute units) of the kernel, vadd_1 and vadd_2. Each instance is connected to its own pair of U50 HBM banks: in1 and out share one bank, and in2 uses the other.

debug=1
save-temps=1

[connectivity]
nk=vadd:2:vadd_1.vadd_2
sp=vadd_1.in1:HBM[0]
sp=vadd_1.in2:HBM[1]
sp=vadd_1.out:HBM[0]
sp=vadd_2.in1:HBM[2]
sp=vadd_2.in2:HBM[3]
sp=vadd_2.out:HBM[2]

[profile]
data=all:all:all
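
The bank assignment above follows a regular pattern: compute unit i uses HBM banks 2i and 2i+1. As a rough sketch of how the [connectivity] section could be generated for a different number of compute units, consider the helper below. The function name connectivity_section and the num_banks default of 32 are assumptions made for illustration and do not come from the original article.

```python
def connectivity_section(kernel, num_cus, num_banks=32):
    """Build a [connectivity] section giving each compute unit its own pair of HBM banks."""
    # Assumption: the platform exposes num_banks HBM banks (32 assumed here for the U50).
    if 2 * num_cus > num_banks:
        raise ValueError("not enough HBM banks for one pair per compute unit")
    cu_names = [f"{kernel}_{i + 1}" for i in range(num_cus)]
    lines = ["[connectivity]", f"nk={kernel}:{num_cus}:" + ".".join(cu_names)]
    for i, cu in enumerate(cu_names):
        first, second = 2 * i, 2 * i + 1
        lines.append(f"sp={cu}.in1:HBM[{first}]")   # in1 and out share a bank
        lines.append(f"sp={cu}.in2:HBM[{second}]")  # in2 gets the other bank
        lines.append(f"sp={cu}.out:HBM[{first}]")
    return "\n".join(lines)


print(connectivity_section("vadd", 2))
```

Called with ("vadd", 2), this reproduces the connectivity lines of the config file shown above.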

Compiling Source Code

v++ -c compiles the source code for the vector-add accelerator into a compiled kernel object (.xo file).

v++ -c -t hw --platform xilinx_u50_gen3x16_xdma_201920_3 --config ./u50_dual.cfg -k vadd -I../src ../src/vadd.cpp -o vadd.xo

Sample Console Output

Option Map File Used: '/tools/reconfig/xilinx/Vitis/2021.1/Vitis/2021.1/data/vitis/vpp/optMap.xml'

****** v++ v2021.1.1 (64-bit)
  **** SW Build 3278995 on 2021-07-20-20:33:48
    ** Copyright 1986-2020 Xilinx, Inc. All Rights Reserved.

INFO: [v++ 60-1306] Additional information associated with this v++ compile can be found at:
	Reports: /nethome/akamath47/tutorial/dual_hw/_x/reports/vadd
	Log files: /nethome/akamath47/tutorial/dual_hw/_x/logs/vadd
Running Dispatch Server on port: 44187
INFO: [v++ 60-1548] Creating build summary session with primary output /nethome/akamath47/tutorial/dual_hw/vadd.xo.compile_summary, at Fri Jan 21 10:36:26 2022
INFO: [v++ 60-1316] Initiating connection to rulecheck server, at Fri Jan 21 10:36:26 2022
Running Rule Check Server on port:38045
INFO: [v++ 60-1315] Creating rulecheck session with output '/nethome/akamath47/tutorial/dual_hw/_x/reports/vadd/v++_compile_vadd_guidance.html', at Fri Jan 21 10:36:28 2022
INFO: [v++ 60-895]   Target platform: /opt/xilinx/platforms/xilinx_u50_gen3x16_xdma_201920_3/xilinx_u50_gen3x16_xdma_201920_3.xpfm
INFO: [v++ 60-1578]   This platform contains Xilinx Shell Archive '/opt/xilinx/platforms/xilinx_u50_gen3x16_xdma_201920_3/hw/hw.xsa'
INFO: [v++ 74-78] Compiler Version string: 2021.1
INFO: [v++ 60-1302] Platform 'xilinx_u50_gen3x16_xdma_201920_3.xpfm' has been explicitly enabled for this release.
INFO: [v++ 60-585] Compiling for hardware target
INFO: [v++ 60-423]   Target device: xilinx_u50_gen3x16_xdma_201920_3
INFO: [v++ 60-242] Creating kernel: 'vadd'

===>The following messages were generated while  performing high-level synthesis for kernel: vadd Log file: /nethome/akamath47/tutorial/dual_hw/_x/vadd/vadd/vitis_hls.log :
INFO: [v++ 204-61] Pipelining loop 'VITIS_LOOP_13_1'.
INFO: [v++ 200-1470] Pipelining result : Target II = NA, Final II = 1, Depth = 4, loop 'VITIS_LOOP_13_1'
INFO: [v++ 200-790] **** Loop Constraint Status: All loop constraints were satisfied.
INFO: [v++ 200-789] **** Estimated Fmax: 411.00 MHz
INFO: [v++ 60-594] Finished kernel compilation
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: /nethome/akamath47/tutorial/dual_hw/_x/reports/vadd/system_estimate_vadd.xtxt
INFO: [v++ 60-586] Created vadd.xo
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command. 
    vitis_analyzer /nethome/akamath47/tutorial/dual_hw/vadd.xo.compile_summary 
INFO: [v++ 60-791] Total elapsed time: 0h 1m 7s
INFO: [v++ 60-1653] Closing dispatch client.

Linking Object File with Target Platform

v++ -l links the compiled kernel with the target platform and generates the FPGA binary (.xclbin file).

v++ -l -t hw --platform xilinx_u50_gen3x16_xdma_201920_3 --config ./u50_dual.cfg ./vadd.xo -o vadd.xclbin
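
Since the compile and link steps share most of their options, the two command lines can be assembled from a script. The sketch below only builds the argument lists and prints them; the function names and the PLATFORM constant are hypothetical conveniences, and the flags are exactly those used in the two commands shown in this article.

```python
import shlex

PLATFORM = "xilinx_u50_gen3x16_xdma_201920_3"


def vpp_compile_cmd(kernel, source, config, include_dir, target="hw"):
    """Argument list for the 'v++ -c' step that produces <kernel>.xo."""
    return ["v++", "-c", "-t", target, "--platform", PLATFORM,
            "--config", config, "-k", kernel, f"-I{include_dir}",
            source, "-o", f"{kernel}.xo"]


def vpp_link_cmd(kernel, config, target="hw"):
    """Argument list for the 'v++ -l' step that produces <kernel>.xclbin."""
    return ["v++", "-l", "-t", target, "--platform", PLATFORM,
            "--config", config, f"./{kernel}.xo", "-o", f"{kernel}.xclbin"]


compile_cmd = vpp_compile_cmd("vadd", "../src/vadd.cpp", "./u50_dual.cfg", "../src")
link_cmd = vpp_link_cmd("vadd", "./u50_dual.cfg")
print(shlex.join(compile_cmd))
print(shlex.join(link_cmd))
# To actually run a step: subprocess.run(compile_cmd, check=True)
```

Keeping the commands as lists avoids shell quoting problems if they are later passed to subprocess.run.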

Sample Console Output

Option Map File Used: '/tools/reconfig/xilinx/Vitis/2021.1/Vitis/2021.1/data/vitis/vpp/optMap.xml'

****** v++ v2021.1.1 (64-bit)
  **** SW Build 3278995 on 2021-07-20-20:33:48
    ** Copyright 1986-2020 Xilinx, Inc. All Rights Reserved.

INFO: [v++ 60-1306] Additional information associated with this v++ link can be found at:
	Reports: /nethome/akamath47/tutorial/dual_hw/_x/reports/link
	Log files: /nethome/akamath47/tutorial/dual_hw/_x/logs/link
Running Dispatch Server on port: 33729
INFO: [v++ 60-1548] Creating build summary session with primary output /nethome/akamath47/tutorial/dual_hw/vadd.xclbin.link_summary, at Fri Jan 21 10:43:14 2022
INFO: [v++ 60-1316] Initiating connection to rulecheck server, at Fri Jan 21 10:43:14 2022
Running Rule Check Server on port:42785
INFO: [v++ 60-1315] Creating rulecheck session with output '/nethome/akamath47/tutorial/dual_hw/_x/reports/link/v++_link_vadd_guidance.html', at Fri Jan 21 10:43:16 2022
INFO: [v++ 60-895]   Target platform: /opt/xilinx/platforms/xilinx_u50_gen3x16_xdma_201920_3/xilinx_u50_gen3x16_xdma_201920_3.xpfm
INFO: [v++ 60-1578]   This platform contains Xilinx Shell Archive '/opt/xilinx/platforms/xilinx_u50_gen3x16_xdma_201920_3/hw/hw.xsa'
INFO: [v++ 74-78] Compiler Version string: 2021.1
INFO: [v++ 60-1302] Platform 'xilinx_u50_gen3x16_xdma_201920_3.xpfm' has been explicitly enabled for this release.
INFO: [v++ 60-629] Linking for hardware target
INFO: [v++ 60-423]   Target device: xilinx_u50_gen3x16_xdma_201920_3
INFO: [v++ 60-1332] Run 'run_link' status: Not started
INFO: [v++ 60-1443] [10:43:20] Run run_link: Step system_link: Started
INFO: [v++ 60-1453] Command Line: system_link --xo /nethome/akamath47/tutorial/dual_hw/vadd.xo -keep --config /nethome/akamath47/tutorial/dual_hw/_x/link/int/syslinkConfig.ini --xpfm /opt/xilinx/platforms/xilinx_u50_gen3x16_xdma_201920_3/xilinx_u50_gen3x16_xdma_201920_3.xpfm --target hw --output_dir /nethome/akamath47/tutorial/dual_hw/_x/link/int --temp_dir /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link
INFO: [v++ 60-1454] Run Directory: /nethome/akamath47/tutorial/dual_hw/_x/link/run_link
INFO: [SYSTEM_LINK 60-1316] Initiating connection to rulecheck server, at Fri Jan 21 10:43:23 2022
INFO: [SYSTEM_LINK 82-70] Extracting xo v3 file /nethome/akamath47/tutorial/dual_hw/vadd.xo
INFO: [SYSTEM_LINK 82-53] Creating IP database /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/_sysl/.cdb/xd_ip_db.xml
INFO: [SYSTEM_LINK 82-38] [10:43:24] build_xd_ip_db started: /tools/reconfig/xilinx/Vitis/2021.1/Vitis/2021.1/bin/build_xd_ip_db -ip_search 0  -sds-pf /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/hw.hpfm -clkid 0 -ip /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/iprepo/xilinx_com_hls_vadd_1_0,vadd -o /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/_sysl/.cdb/xd_ip_db.xml
INFO: [SYSTEM_LINK 82-37] [10:43:31] build_xd_ip_db finished successfully
Time (s): cpu = 00:00:09 ; elapsed = 00:00:07 . Memory (MB): peak = 2017.664 ; gain = 0.000 ; free physical = 147493 ; free virtual = 228522
INFO: [SYSTEM_LINK 82-51] Create system connectivity graph
INFO: [SYSTEM_LINK 82-102] Applying explicit connections to the system connectivity graph: /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/cfgraph/cfgen_cfgraph.xml
INFO: [SYSTEM_LINK 82-38] [10:43:31] cfgen started: /tools/reconfig/xilinx/Vitis/2021.1/Vitis/2021.1/bin/cfgen  -nk vadd:2:vadd_1.vadd_2 -sp vadd_1.in1:HBM[0] -sp vadd_1.in2:HBM[1] -sp vadd_1.out:HBM[0] -sp vadd_2.in1:HBM[2] -sp vadd_2.in2:HBM[3] -sp vadd_2.out:HBM[2] -dmclkid 0 -r /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/_sysl/.cdb/xd_ip_db.xml -o /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/cfgraph/cfgen_cfgraph.xml
INFO: [CFGEN 83-0] Kernel Specs: 
INFO: [CFGEN 83-0]   kernel: vadd, num: 2  {vadd_1 vadd_2}
INFO: [CFGEN 83-0] Port Specs: 
INFO: [CFGEN 83-0]   kernel: vadd_1, k_port: in1, sptag: HBM[0]
INFO: [CFGEN 83-0]   kernel: vadd_1, k_port: in2, sptag: HBM[1]
INFO: [CFGEN 83-0]   kernel: vadd_1, k_port: out, sptag: HBM[0]
INFO: [CFGEN 83-0]   kernel: vadd_2, k_port: in1, sptag: HBM[2]
INFO: [CFGEN 83-0]   kernel: vadd_2, k_port: in2, sptag: HBM[3]
INFO: [CFGEN 83-0]   kernel: vadd_2, k_port: out, sptag: HBM[2]
INFO: [CFGEN 83-2228] Creating mapping for argument vadd_1.in1 to HBM[0] for directive vadd_1.in1:HBM[0]
INFO: [CFGEN 83-2228] Creating mapping for argument vadd_1.in2 to HBM[1] for directive vadd_1.in2:HBM[1]
INFO: [CFGEN 83-2228] Creating mapping for argument vadd_1.out to HBM[0] for directive vadd_1.out:HBM[0]
INFO: [CFGEN 83-2228] Creating mapping for argument vadd_2.in1 to HBM[2] for directive vadd_2.in1:HBM[2]
INFO: [CFGEN 83-2228] Creating mapping for argument vadd_2.in2 to HBM[3] for directive vadd_2.in2:HBM[3]
INFO: [CFGEN 83-2228] Creating mapping for argument vadd_2.out to HBM[2] for directive vadd_2.out:HBM[2]
INFO: [SYSTEM_LINK 82-37] [10:43:35] cfgen finished successfully
Time (s): cpu = 00:00:04 ; elapsed = 00:00:04 . Memory (MB): peak = 2017.664 ; gain = 0.000 ; free physical = 147473 ; free virtual = 228502
INFO: [SYSTEM_LINK 82-52] Create top-level block diagram
INFO: [SYSTEM_LINK 82-38] [10:43:35] cf2bd started: /tools/reconfig/xilinx/Vitis/2021.1/Vitis/2021.1/bin/cf2bd  --linux --trace_buffer 1024 --input_file /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/cfgraph/cfgen_cfgraph.xml --ip_db /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/_sysl/.cdb/xd_ip_db.xml --cf_name dr --working_dir /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/_sysl/.xsd --temp_dir /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link --output_dir /nethome/akamath47/tutorial/dual_hw/_x/link/int --target_bd ulp.bd
INFO: [CF2BD 82-31] Launching cf2xd: cf2xd -linux -trace-buffer 1024 -i /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/cfgraph/cfgen_cfgraph.xml -r /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/_sysl/.cdb/xd_ip_db.xml -o dr.xml
INFO: [CF2BD 82-28] cf2xd finished successfully
INFO: [CF2BD 82-31] Launching cf_xsd: cf_xsd -disable-address-gen -bd ulp.bd -dn dr -dp /nethome/akamath47/tutorial/dual_hw/_x/link/sys_link/_sysl/.xsd
INFO: [CF2BD 82-28] cf_xsd finished successfully
INFO: [SYSTEM_LINK 82-37] [10:43:39] cf2bd finished successfully
Time (s): cpu = 00:00:03 ; elapsed = 00:00:04 . Memory (MB): peak = 2017.664 ; gain = 0.000 ; free physical = 147479 ; free virtual = 228513
INFO: [v++ 60-1441] [10:43:39] Run run_link: Step system_link: Completed
Time (s): cpu = 00:00:19 ; elapsed = 00:00:18 . Memory (MB): peak = 1909.562 ; gain = 0.000 ; free physical = 147517 ; free virtual = 228550
INFO: [v++ 60-1443] [10:43:39] Run run_link: Step cf2sw: Started
INFO: [v++ 60-1453] Command Line: cf2sw -sdsl /nethome/akamath47/tutorial/dual_hw/_x/link/int/sdsl.dat -rtd /nethome/akamath47/tutorial/dual_hw/_x/link/int/cf2sw.rtd -nofilter /nethome/akamath47/tutorial/dual_hw/_x/link/int/cf2sw_full.rtd -xclbin /nethome/akamath47/tutorial/dual_hw/_x/link/int/xclbin_orig.xml -o /nethome/akamath47/tutorial/dual_hw/_x/link/int/xclbin_orig.1.xml
INFO: [v++ 60-1454] Run Directory: /nethome/akamath47/tutorial/dual_hw/_x/link/run_link
INFO: [v++ 60-1441] [10:43:43] Run run_link: Step cf2sw: Completed
Time (s): cpu = 00:00:03 ; elapsed = 00:00:04 . Memory (MB): peak = 1909.562 ; gain = 0.000 ; free physical = 147507 ; free virtual = 228541
INFO: [v++ 60-1443] [10:43:43] Run run_link: Step rtd2_system_diagram: Started
INFO: [v++ 60-1453] Command Line: rtd2SystemDiagram
INFO: [v++ 60-1454] Run Directory: /nethome/akamath47/tutorial/dual_hw/_x/link/run_link
INFO: [v++ 60-1441] [10:43:43] Run run_link: Step rtd2_system_diagram: Completed
Time (s): cpu = 00:00:00.01 ; elapsed = 00:00:00.43 . Memory (MB): peak = 1909.562 ; gain = 0.000 ; free physical = 147513 ; free virtual = 228547
INFO: [v++ 60-1443] [10:43:43] Run run_link: Step vpl: Started
INFO: [v++ 60-1453] Command Line: vpl -t hw -f xilinx_u50_gen3x16_xdma_201920_3 -g --remote_ip_cache /nethome/akamath47/tutorial/dual_hw/.ipcache -s --output_dir /nethome/akamath47/tutorial/dual_hw/_x/link/int --log_dir /nethome/akamath47/tutorial/dual_hw/_x/logs/link --report_dir /nethome/akamath47/tutorial/dual_hw/_x/reports/link --config /nethome/akamath47/tutorial/dual_hw/_x/link/int/vplConfig.ini -k /nethome/akamath47/tutorial/dual_hw/_x/link/int/kernel_info.dat --webtalk_flag Vitis --temp_dir /nethome/akamath47/tutorial/dual_hw/_x/link --no-info --iprepo /nethome/akamath47/tutorial/dual_hw/_x/link/int/xo/ip_repo/xilinx_com_hls_vadd_1_0 --messageDb /nethome/akamath47/tutorial/dual_hw/_x/link/run_link/vpl.pb /nethome/akamath47/tutorial/dual_hw/_x/link/int/dr.bd.tcl
INFO: [v++ 60-1454] Run Directory: /nethome/akamath47/tutorial/dual_hw/_x/link/run_link

****** vpl v2021.1.1 (64-bit)
  **** SW Build 3278995 on 2021-07-20-20:33:48
    ** Copyright 1986-2020 Xilinx, Inc. All Rights Reserved.

INFO: [VPL 60-839] Read in kernel information from file '/nethome/akamath47/tutorial/dual_hw/_x/link/int/kernel_info.dat'.
INFO: [VPL 74-78] Compiler Version string: 2021.1
INFO: [VPL 60-423]   Target device: xilinx_u50_gen3x16_xdma_201920_3
INFO: [VPL 60-1032] Extracting hardware platform to /nethome/akamath47/tutorial/dual_hw/_x/link/vivado/vpl/.local/hw_platform
[10:44:17] Run vpl: Step create_project: Started
Creating Vivado project.
[10:44:21] Run vpl: Step create_project: Completed
[10:44:21] Run vpl: Step create_bd: Started
[10:45:40] Run vpl: Step create_bd: RUNNING...
[10:45:49] Run vpl: Step create_bd: Completed
[10:45:49] Run vpl: Step update_bd: Started
[10:46:00] Run vpl: Step update_bd: Completed
[10:46:00] Run vpl: Step generate_target: Started
[10:47:17] Run vpl: Step generate_target: RUNNING...
[10:48:07] Run vpl: Step generate_target: Completed
[10:48:07] Run vpl: Step config_hw_runs: Started
[10:49:09] Run vpl: Step config_hw_runs: Completed
[10:49:09] Run vpl: Step synth: Started
[10:49:41] Block-level synthesis in progress, 0 of 123 jobs complete, 8 jobs running.
[10:50:12] Block-level synthesis in progress, 0 of 123 jobs complete, 8 jobs running.
[10:50:43] Block-level synthesis in progress, 4 of 123 jobs complete, 4 jobs running.
[10:51:15] Block-level synthesis in progress, 8 of 123 jobs complete, 7 jobs running.
[10:51:46] Block-level synthesis in progress, 9 of 123 jobs complete, 8 jobs running.
[10:52:17] Block-level synthesis in progress, 12 of 123 jobs complete, 5 jobs running.
[10:52:48] Block-level synthesis in progress, 16 of 123 jobs complete, 4 jobs running.
[10:53:19] Block-level synthesis in progress, 18 of 123 jobs complete, 6 jobs running.
[10:53:50] Block-level synthesis in progress, 20 of 123 jobs complete, 6 jobs running.
[10:54:21] Block-level synthesis in progress, 21 of 123 jobs complete, 7 jobs running.
[10:54:53] Block-level synthesis in progress, 23 of 123 jobs complete, 7 jobs running.
[10:55:24] Block-level synthesis in progress, 27 of 123 jobs complete, 4 jobs running.
[10:55:55] Block-level synthesis in progress, 30 of 123 jobs complete, 6 jobs running.
[10:56:26] Block-level synthesis in progress, 33 of 123 jobs complete, 6 jobs running.
[10:56:57] Block-level synthesis in progress, 38 of 123 jobs complete, 5 jobs running.
[10:57:29] Block-level synthesis in progress, 40 of 123 jobs complete, 7 jobs running.
[10:58:00] Block-level synthesis in progress, 42 of 123 jobs complete, 7 jobs running.
[10:58:31] Block-level synthesis in progress, 46 of 123 jobs complete, 4 jobs running.
[10:59:03] Block-level synthesis in progress, 48 of 123 jobs complete, 6 jobs running.
[10:59:34] Block-level synthesis in progress, 49 of 123 jobs complete, 7 jobs running.
[11:00:05] Block-level synthesis in progress, 53 of 123 jobs complete, 4 jobs running.
[11:00:37] Block-level synthesis in progress, 56 of 123 jobs complete, 6 jobs running.
[11:01:08] Block-level synthesis in progress, 58 of 123 jobs complete, 6 jobs running.
[11:01:39] Block-level synthesis in progress, 63 of 123 jobs complete, 3 jobs running.
[11:02:11] Block-level synthesis in progress, 67 of 123 jobs complete, 4 jobs running.
[11:02:42] Block-level synthesis in progress, 71 of 123 jobs complete, 4 jobs running.
[11:03:13] Block-level synthesis in progress, 74 of 123 jobs complete, 6 jobs running.
[11:03:45] Block-level synthesis in progress, 78 of 123 jobs complete, 4 jobs running.
[11:04:16] Block-level synthesis in progress, 80 of 123 jobs complete, 7 jobs running.
[11:04:48] Block-level synthesis in progress, 83 of 123 jobs complete, 6 jobs running.
[11:05:19] Block-level synthesis in progress, 88 of 123 jobs complete, 6 jobs running.
[11:05:51] Block-level synthesis in progress, 90 of 123 jobs complete, 7 jobs running.
[11:06:23] Block-level synthesis in progress, 93 of 123 jobs complete, 5 jobs running.
[11:06:54] Block-level synthesis in progress, 97 of 123 jobs complete, 4 jobs running.
[11:07:26] Block-level synthesis in progress, 99 of 123 jobs complete, 6 jobs running.
[11:07:57] Block-level synthesis in progress, 102 of 123 jobs complete, 5 jobs running.
[11:08:29] Block-level synthesis in progress, 107 of 123 jobs complete, 4 jobs running.
[11:09:01] Block-level synthesis in progress, 112 of 123 jobs complete, 4 jobs running.
[11:09:32] Block-level synthesis in progress, 113 of 123 jobs complete, 3 jobs running.
[11:10:04] Block-level synthesis in progress, 116 of 123 jobs complete, 0 jobs running.
[11:10:36] Block-level synthesis in progress, 123 of 123 jobs complete, 0 jobs running.
[11:11:08] Top-level synthesis in progress.
[11:11:39] Top-level synthesis in progress.
[11:12:11] Top-level synthesis in progress.
[11:12:30] Run vpl: Step synth: Completed
[11:12:30] Run vpl: Step impl: Started
[11:19:25] Finished 2nd of 6 tasks (FPGA linking synthesized kernels to platform). Elapsed time: 00h 35m 39s 

[11:19:25] Starting logic optimization..
[11:19:57] Phase 1 Retarget
[11:20:29] Phase 2 Constant propagation
[11:20:29] Phase 3 Sweep
[11:21:32] Phase 4 BUFG optimization
[11:21:32] Phase 5 Shift Register Optimization
[11:21:32] Phase 6 Post Processing Netlist
[11:23:39] Finished 3rd of 6 tasks (FPGA logic optimization). Elapsed time: 00h 04m 14s 

[11:23:39] Starting logic placement..
[11:24:11] Phase 1 Placer Initialization
[11:24:11] Phase 1.1 Placer Initialization Netlist Sorting
[11:25:47] Phase 1.2 IO Placement/ Clock Placement/ Build Placer Device
[11:26:50] Phase 1.3 Build Placer Netlist Model
[11:27:54] Phase 1.4 Constrain Clocks/Macros
[11:27:54] Phase 2 Global Placement
[11:27:54] Phase 2.1 Floorplanning
[11:28:58] Phase 2.1.1 Partition Driven Placement
[11:28:58] Phase 2.1.1.1 PBP: Partition Driven Placement
[11:28:58] Phase 2.1.1.2 PBP: Clock Region Placement
[11:30:01] Phase 2.1.1.3 PBP: Compute Congestion
[11:30:01] Phase 2.1.1.4 PBP: UpdateTiming
[11:30:01] Phase 2.1.1.5 PBP: Add part constraints
[11:30:01] Phase 2.2 Physical Synthesis After Floorplan
[11:30:33] Phase 2.3 Update Timing before SLR Path Opt
[11:30:33] Phase 2.4 Post-Processing in Floorplanning
[11:30:33] Phase 2.5 Global Placement Core
[11:36:23] Phase 2.5.1 Physical Synthesis In Placer
[11:38:31] Phase 3 Detail Placement
[11:38:31] Phase 3.1 Commit Multi Column Macros
[11:38:31] Phase 3.2 Commit Most Macros & LUTRAMs
[11:39:34] Phase 3.3 Small Shape DP
[11:39:34] Phase 3.3.1 Small Shape Clustering
[11:40:06] Phase 3.3.2 Flow Legalize Slice Clusters
[11:40:06] Phase 3.3.3 Slice Area Swap
[11:41:10] Phase 3.4 Place Remaining
[11:41:10] Phase 3.5 Re-assign LUT pins
[11:41:10] Phase 3.6 Pipeline Register Optimization
[11:41:10] Phase 3.7 Fast Optimization
[11:41:42] Phase 4 Post Placement Optimization and Clean-Up
[11:41:42] Phase 4.1 Post Commit Optimization
[11:42:45] Phase 4.1.1 Post Placement Optimization
[11:42:45] Phase 4.1.1.1 BUFG Insertion
[11:42:45] Phase 1 Physical Synthesis Initialization
[11:43:17] Phase 4.1.1.2 BUFG Replication
[11:43:17] Phase 4.1.1.3 Post Placement Timing Optimization
[11:43:17] Phase 4.1.1.4 Replication
[11:44:21] Phase 4.2 Post Placement Cleanup
[11:44:21] Phase 4.3 Placer Reporting
[11:44:21] Phase 4.3.1 Print Estimated Congestion
[11:44:21] Phase 4.4 Final Placement Cleanup
[11:46:28] Finished 4th of 6 tasks (FPGA logic placement). Elapsed time: 00h 22m 48s 

[11:46:28] Starting logic routing..
[11:47:32] Phase 1 Build RT Design
[11:48:35] Phase 2 Router Initialization
[11:48:35] Phase 2.1 Fix Topology Constraints
[11:49:07] Phase 2.2 Pre Route Cleanup
[11:49:39] Phase 2.3 Global Clock Net Routing
[11:49:39] Phase 2.4 Update Timing
[11:51:15] Phase 2.5 Update Timing for Bus Skew
[11:51:15] Phase 2.5.1 Update Timing
[11:51:46] Phase 3 Initial Routing
[11:51:46] Phase 3.1 Global Routing
[11:52:50] Phase 4 Rip-up And Reroute
[11:52:50] Phase 4.1 Global Iteration 0
[11:56:33] Phase 4.2 Global Iteration 1
[11:58:09] Phase 4.3 Global Iteration 2
[11:59:44] Phase 5 Delay and Skew Optimization
[11:59:44] Phase 5.1 Delay CleanUp
[11:59:44] Phase 5.1.1 Update Timing
[12:00:16] Phase 5.2 Clock Skew Optimization
[12:00:16] Phase 6 Post Hold Fix
[12:00:16] Phase 6.1 Hold Fix Iter
[12:00:16] Phase 6.1.1 Update Timing
[12:00:48] Phase 7 Leaf Clock Prog Delay Opt
[12:01:20] Phase 8 Route finalize
[12:01:20] Phase 9 Verifying routed nets
[12:01:52] Phase 10 Depositing Routes
[12:01:52] Phase 11 Post Router Timing
[12:02:24] Finished 5th of 6 tasks (FPGA routing). Elapsed time: 00h 15m 55s 

[12:02:24] Starting bitstream generation..
[12:09:50] Creating bitmap...
[12:15:08] Writing bitstream ./level0_i_ulp_my_rm_partial.bit...
[12:15:08] Finished 6th of 6 tasks (FPGA bitstream generation). Elapsed time: 00h 12m 44s 
[12:15:49] Run vpl: Step impl: Completed
[12:15:50] Run vpl: FINISHED. Run Status: impl Complete!
INFO: [v++ 60-1441] [12:15:52] Run run_link: Step vpl: Completed
Time (s): cpu = 00:02:00 ; elapsed = 01:32:09 . Memory (MB): peak = 1909.562 ; gain = 0.000 ; free physical = 143485 ; free virtual = 225934
INFO: [v++ 60-1443] [12:15:52] Run run_link: Step rtdgen: Started
INFO: [v++ 60-1453] Command Line: rtdgen
INFO: [v++ 60-1454] Run Directory: /nethome/akamath47/tutorial/dual_hw/_x/link/run_link
INFO: [v++ 60-991] clock name 'clk_kernel2_in' (clock ID '1') is being mapped to clock name 'KERNEL_CLK' in the xclbin
INFO: [v++ 60-991] clock name 'clk_kernel_in' (clock ID '0') is being mapped to clock name 'DATA_CLK' in the xclbin
INFO: [v++ 60-991] clock name 'hbm_aclk' (clock ID '') is being mapped to clock name 'hbm_aclk' in the xclbin
INFO: [v++ 60-1230] The compiler selected the following frequencies for the runtime controllable kernel clock(s) and scalable system clock(s): System (SYSTEM) clock: hbm_aclk = 450, Kernel (KERNEL) clock: clk_kernel2_in = 500, Kernel (DATA) clock: clk_kernel_in = 300
INFO: [v++ 60-1453] Command Line: cf2sw -a /nethome/akamath47/tutorial/dual_hw/_x/link/int/address_map.xml -sdsl /nethome/akamath47/tutorial/dual_hw/_x/link/int/sdsl.dat -xclbin /nethome/akamath47/tutorial/dual_hw/_x/link/int/xclbin_orig.xml -rtd /nethome/akamath47/tutorial/dual_hw/_x/link/int/vadd.rtd -o /nethome/akamath47/tutorial/dual_hw/_x/link/int/vadd.xml
INFO: [v++ 60-1652] Cf2sw returned exit code: 0
INFO: [v++ 60-1441] [12:15:55] Run run_link: Step rtdgen: Completed
Time (s): cpu = 00:00:03 ; elapsed = 00:00:03 . Memory (MB): peak = 1909.562 ; gain = 0.000 ; free physical = 143460 ; free virtual = 225909
INFO: [v++ 60-1443] [12:15:55] Run run_link: Step xclbinutil: Started
INFO: [v++ 60-1453] Command Line: xclbinutil --add-section DEBUG_IP_LAYOUT:JSON:/nethome/akamath47/tutorial/dual_hw/_x/link/int/debug_ip_layout.rtd --add-section BITSTREAM:RAW:/nethome/akamath47/tutorial/dual_hw/_x/link/int/partial.bit --force --target hw --key-value SYS:dfx_enable:true --add-section :JSON:/nethome/akamath47/tutorial/dual_hw/_x/link/int/vadd.rtd --append-section :JSON:/nethome/akamath47/tutorial/dual_hw/_x/link/int/appendSection.rtd --add-section CLOCK_FREQ_TOPOLOGY:JSON:/nethome/akamath47/tutorial/dual_hw/_x/link/int/vadd_xml.rtd --add-section BUILD_METADATA:JSON:/nethome/akamath47/tutorial/dual_hw/_x/link/int/vadd_build.rtd --add-section EMBEDDED_METADATA:RAW:/nethome/akamath47/tutorial/dual_hw/_x/link/int/vadd.xml --add-section SYSTEM_METADATA:RAW:/nethome/akamath47/tutorial/dual_hw/_x/link/int/systemDiagramModelSlrBaseAddress.json --key-value SYS:PlatformVBNV:xilinx_u50_gen3x16_xdma_201920_3 --output /nethome/akamath47/tutorial/dual_hw/vadd.xclbin
INFO: [v++ 60-1454] Run Directory: /nethome/akamath47/tutorial/dual_hw/_x/link/run_link
XRT Build Version: 2.11.634 (2021.1)
       Build Date: 2021-06-08 22:10:49
          Hash ID: 5ad5998d67080f00bca5bf15b3838cf35e0a7b26
Creating a default 'in-memory' xclbin image.

Section: 'DEBUG_IP_LAYOUT'(9) was successfully added.
Size   : 1304 bytes
Format : JSON
File   : '/nethome/akamath47/tutorial/dual_hw/_x/link/int/debug_ip_layout.rtd'

Section: 'BITSTREAM'(0) was successfully added.
Size   : 32492530 bytes
Format : RAW
File   : '/nethome/akamath47/tutorial/dual_hw/_x/link/int/partial.bit'

Section: 'MEM_TOPOLOGY'(6) was successfully added.
Format : JSON
File   : 'mem_topology'

Section: 'IP_LAYOUT'(8) was successfully added.
Format : JSON
File   : 'ip_layout'

Section: 'CONNECTIVITY'(7) was successfully added.
Format : JSON
File   : 'connectivity'

Section: 'CLOCK_FREQ_TOPOLOGY'(11) was successfully added.
Size   : 410 bytes
Format : JSON
File   : '/nethome/akamath47/tutorial/dual_hw/_x/link/int/vadd_xml.rtd'

Section: 'BUILD_METADATA'(14) was successfully added.
Size   : 2488 bytes
Format : JSON
File   : '/nethome/akamath47/tutorial/dual_hw/_x/link/int/vadd_build.rtd'

Section: 'EMBEDDED_METADATA'(2) was successfully added.
Size   : 4504 bytes
Format : RAW
File   : '/nethome/akamath47/tutorial/dual_hw/_x/link/int/vadd.xml'

Section: 'SYSTEM_METADATA'(22) was successfully added.
Size   : 14440 bytes
Format : RAW
File   : '/nethome/akamath47/tutorial/dual_hw/_x/link/int/systemDiagramModelSlrBaseAddress.json'

Section: 'PARTITION_METADATA'(20) was successfully appended to.
Format : JSON
File   : 'partition_metadata'

Section: 'IP_LAYOUT'(8) was successfully appended to.
Format : JSON
File   : 'ip_layout'
Successfully wrote (32544464 bytes) to the output file: /nethome/akamath47/tutorial/dual_hw/vadd.xclbin
Leaving xclbinutil.
INFO: [v++ 60-1441] [12:15:56] Run run_link: Step xclbinutil: Completed
Time (s): cpu = 00:00:00.21 ; elapsed = 00:00:00.38 . Memory (MB): peak = 1909.562 ; gain = 0.000 ; free physical = 143445 ; free virtual = 225925
INFO: [v++ 60-1443] [12:15:56] Run run_link: Step xclbinutilinfo: Started
INFO: [v++ 60-1453] Command Line: xclbinutil --quiet --force --info /nethome/akamath47/tutorial/dual_hw/vadd.xclbin.info --input /nethome/akamath47/tutorial/dual_hw/vadd.xclbin
INFO: [v++ 60-1454] Run Directory: /nethome/akamath47/tutorial/dual_hw/_x/link/run_link
INFO: [v++ 60-1441] [12:15:56] Run run_link: Step xclbinutilinfo: Completed
Time (s): cpu = 00:00:00.52 ; elapsed = 00:00:00.64 . Memory (MB): peak = 1909.562 ; gain = 0.000 ; free physical = 143444 ; free virtual = 225924
INFO: [v++ 60-1443] [12:15:56] Run run_link: Step generate_sc_driver: Started
INFO: [v++ 60-1453] Command Line: 
INFO: [v++ 60-1454] Run Directory: /nethome/akamath47/tutorial/dual_hw/_x/link/run_link
INFO: [v++ 60-1441] [12:15:56] Run run_link: Step generate_sc_driver: Completed
Time (s): cpu = 00:00:00 ; elapsed = 00:00:00.01 . Memory (MB): peak = 1909.562 ; gain = 0.000 ; free physical = 143444 ; free virtual = 225924
WARNING: [v++ 60-2336] Parameter compiler.enableSlrComputeUnitDrc was set to true, but no SLRs were specified via the command line.
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: /nethome/akamath47/tutorial/dual_hw/_x/reports/link/system_estimate_vadd.xtxt
INFO: [v++ 60-586] Created /nethome/akamath47/tutorial/dual_hw/vadd.ltx
INFO: [v++ 60-586] Created vadd.xclbin
INFO: [v++ 60-1307] Run completed. Additional information can be found in:
	Guidance: /nethome/akamath47/tutorial/dual_hw/_x/reports/link/v++_link_vadd_guidance.html
	Timing Report: /nethome/akamath47/tutorial/dual_hw/_x/reports/link/imp/impl_1_hw_bb_locked_timing_summary_routed.rpt
	Vivado Log: /nethome/akamath47/tutorial/dual_hw/_x/logs/link/vivado.log
	Steps Log File: /nethome/akamath47/tutorial/dual_hw/_x/logs/link/link.steps.log

INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command. 
    vitis_analyzer /nethome/akamath47/tutorial/dual_hw/vadd.xclbin.link_summary 
INFO: [v++ 60-791] Total elapsed time: **1h 32m 53s**
INFO: [v++ 60-1653] Closing dispatch client.

🤷‍♂️ xclbin generation takes about 1.5 hours, even for such a simple kernel.
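
To see where the link time goes, the "Finished Nth of 6 tasks" lines in the log above can be summarized with a few lines of Python. This is a minimal sketch that assumes the log keeps the exact line format shown above; the function name task_durations is hypothetical, and the sample lines are copied from the console output.

```python
import re

# Matches e.g. "Finished 3rd of 6 tasks (FPGA logic optimization). Elapsed time: 00h 04m 14s"
TASK_RE = re.compile(
    r"Finished (\d+)(?:st|nd|rd|th) of (\d+) tasks \((.+?)\)\. "
    r"Elapsed time: (\d+)h (\d+)m (\d+)s"
)


def task_durations(log_text):
    """Return a list of (task description, seconds) for each finished task."""
    result = []
    for match in TASK_RE.finditer(log_text):
        hours, minutes, seconds = (int(match.group(i)) for i in (4, 5, 6))
        result.append((match.group(3), hours * 3600 + minutes * 60 + seconds))
    return result


sample = """\
[11:19:25] Finished 2nd of 6 tasks (FPGA linking synthesized kernels to platform). Elapsed time: 00h 35m 39s
[11:23:39] Finished 3rd of 6 tasks (FPGA logic optimization). Elapsed time: 00h 04m 14s
[11:46:28] Finished 4th of 6 tasks (FPGA logic placement). Elapsed time: 00h 22m 48s
[12:02:24] Finished 5th of 6 tasks (FPGA routing). Elapsed time: 00h 15m 55s
[12:15:08] Finished 6th of 6 tasks (FPGA bitstream generation). Elapsed time: 00h 12m 44s
"""

for name, secs in task_durations(sample):
    print(f"{secs / 60:6.1f} min  {name}")
```

In this run, synthesis and linking of the kernels to the platform is the largest single step, followed by placement.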

U50 Demo

PYNQ on XRT Platforms - Python productivity for Zynq (Pynq)

The following code snippets are executed on CRNCH Lab’s Flubber1 server inside an Anaconda3 environment.

Print Device Platform Info

>>> import pynq
>>> for i in range(len(pynq.Device.devices)):
...     print("{}) {}".format(i, pynq.Device.devices[i].name) + "; Tag: " + pynq.Device.devices[i].tag)
0) xilinx_u50_gen3x16_xdma_201920_3; Tag: xrt0.8404
1) xilinx_u50_gen3x16_xdma_201920_3; Tag: xrt1.8404

>>> print("Active Device Name: " + pynq.Device.active_device.name + "; Tag: " + pynq.Device.active_device.tag)
Active Device Name: xilinx_u50_gen3x16_xdma_201920_3; Tag: xrt0.8404

Print Overlay Attributes

>>> import pynq
>>> ol = pynq.Overlay('vadd.xclbin')
>>> print(ol.__doc__)
Default documentation for overlay vadd.xclbin. The following
    attributes are available on this overlay:
    
    IP Blocks
    ----------
    vadd_1               : pynq.overlay.DefaultIP
    vadd_2               : pynq.overlay.DefaultIP
    
    Hierarchies
    -----------
    None
    
    Interrupts
    ----------
    None
    
    GPIO Outputs
    ------------
    None
    
    Memories
    ------------
    HBM0                 : Memory
    HBM1                 : Memory
    HBM2                 : Memory
    HBM3                 : Memory

Print Kernel Execution Time

import pynq
import numpy as np
import timeit

# Program the FPGA fabric
ol = pynq.Overlay('vadd.xclbin')

# Kernel handles
vadd_A = ol.vadd_1
vadd_B = ol.vadd_2

# Set buffer size
vec_len = 4096

# Allocate buffers
A_in1 = pynq.allocate((vec_len,), 'u4', target=ol.HBM0)
A_in2 = pynq.allocate((vec_len,), 'u4', target=ol.HBM1)
A_out = pynq.allocate((vec_len,), 'u4', target=ol.HBM0)

B_in1 = pynq.allocate((vec_len,), 'u4', target=ol.HBM2)
B_in2 = pynq.allocate((vec_len,), 'u4', target=ol.HBM3)
B_out = pynq.allocate((vec_len,), 'u4', target=ol.HBM2)

# Initialize buffers
A_in1[:] = np.random.randint(low=0, high=100, size=(vec_len,), dtype='u4')
A_in2[:] = 100

B_in1[:] = np.random.randint(low=0, high=100, size=(vec_len,), dtype='u4')
B_in2[:] = 1000

#------------------------------------------------------
# Execute single kernel and print execution time
#------------------------------------------------------
def single_kernel(in1, in2, out, vec_len):
    in1.sync_to_device()
    in2.sync_to_device()
    vadd_A.call(in1, in2, out, vec_len)
    out.sync_from_device()

t1 = timeit.timeit(lambda: single_kernel(A_in1, A_in2, A_out, vec_len), number=100)

if(np.array_equal(A_in1 + A_in2, A_out)):
    print("Single kernel verification SUCCESSFUL!")
else:
    print("Single kernel verification FAILED.")

print("Average execution time for single kernel = " + str(t1/100 * 1000000) + " us\n")

#----------------------------------------------------------------
# Execute the two kernels sequentially and print execution time
#----------------------------------------------------------------
def double_kernel_sequential(A_in1, A_in2, B_in1, B_in2, A_out, B_out):
    A_in1.sync_to_device()
    A_in2.sync_to_device()
    B_in1.sync_to_device()
    B_in2.sync_to_device()
    vadd_A.call(A_in1, A_in2, A_out, vec_len)
    vadd_B.call(B_in1, B_in2, B_out, vec_len)
    A_out.sync_from_device()
    B_out.sync_from_device()

t2 = timeit.timeit(lambda: double_kernel_sequential(A_in1, A_in2, B_in1, B_in2, A_out, B_out), number=100)

if(np.array_equal(A_in1 + A_in2, A_out) and np.array_equal(B_in1 + B_in2, B_out)):
    print("Double kernel (sequential) verification SUCCESSFUL!")
else:
    print("Double kernel (sequential) verification FAILED.")

print("Average execution time for double kernel (sequential) = " + str(t2/100 * 1000000) + " us\n")

#----------------------------------------------------------------
# Execute the two kernels in parallel and print execution time
#----------------------------------------------------------------
def double_kernel_parallel(A_in1, A_in2, B_in1, B_in2, A_out, B_out):
    A_in1.sync_to_device()
    A_in2.sync_to_device()
    B_in1.sync_to_device()
    B_in2.sync_to_device()
    handle_A = vadd_A.start(A_in1, A_in2, A_out, vec_len)
    handle_B = vadd_B.start(B_in1, B_in2, B_out, vec_len)
    handle_A.wait()
    handle_B.wait()
    A_out.sync_from_device()
    B_out.sync_from_device()

t3 = timeit.timeit(lambda: double_kernel_parallel(A_in1, A_in2, B_in1, B_in2, A_out, B_out), number=100)

if(np.array_equal(A_in1 + A_in2, A_out) and np.array_equal(B_in1 + B_in2, B_out)):
    print("Double kernel (parallel) verification SUCCESSFUL!")
else:
    print("Double kernel (parallel) verification FAILED.")

print("Average execution time for double kernel (parallel) = " + str(t3/100 * 1000000) + " us\n")

Sample Console Output

Single kernel verification SUCCESSFUL!
Average execution time for single kernel = 396.3544103316963 us

Double kernel (sequential) verification SUCCESSFUL!
Average execution time for double kernel (sequential) = 813.8596499338746 us

Double kernel (parallel) verification SUCCESSFUL!
Average execution time for double kernel (parallel) = 573.2781707774848 us
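
The parallel variant relies on one pattern: start every kernel first, then wait on all the returned handles. The sketch below illustrates that pattern without hardware. FakeKernel and FakeHandle are hypothetical stand-ins that only mimic the start(), call() and wait() methods used in the script above by sleeping in a background thread; they are not PYNQ classes.

```python
import threading
import time


class FakeHandle:
    """Hypothetical stand-in for the handle returned by start()."""
    def __init__(self, thread):
        self._thread = thread

    def wait(self):
        self._thread.join()


class FakeKernel:
    """Hypothetical stand-in for a kernel object; sleeps instead of computing."""
    def __init__(self, seconds):
        self.seconds = seconds

    def start(self, *args):
        thread = threading.Thread(target=time.sleep, args=(self.seconds,))
        thread.start()
        return FakeHandle(thread)

    def call(self, *args):
        # Blocking call: start, then wait immediately.
        self.start(*args).wait()


def run_sequential(jobs):
    """Run each (kernel, args) job to completion before starting the next."""
    for kernel, args in jobs:
        kernel.call(*args)


def run_parallel(jobs):
    """Start every (kernel, args) job first, then wait on all handles."""
    handles = [kernel.start(*args) for kernel, args in jobs]
    for handle in handles:
        handle.wait()


jobs = [(FakeKernel(0.1), ()), (FakeKernel(0.1), ())]
for runner in (run_sequential, run_parallel):
    begin = time.perf_counter()
    runner(jobs)
    print(runner.__name__, "took", round(time.perf_counter() - begin, 3), "s")
```

With real kernel objects, the same run_parallel helper would accept any number of (kernel, args) pairs, provided each kernel's buffers were synced to the device beforehand.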

Summary

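Sequential execution of two kernels takes about twice as long as a single kernel (roughly 814 us versus 396 us, a ratio of about 2.05), as expected.

Parallel execution takes about 45% more time than a single kernel (roughly 573 us versus 396 us) but about 30% less than sequential execution. The overhead relative to a single kernel could be attributed to the additional memory copies to and from the device for the second kernel, which are included in the measured time.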
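
The comparison can be reproduced from the measured averages. The short sketch below uses the three times from the sample output; the function name timing_ratios is hypothetical.

```python
def timing_ratios(t_single, t_seq, t_par):
    """Relate the sequential and parallel times to the single-kernel time."""
    return {
        "sequential_vs_single": t_seq / t_single,
        "parallel_vs_single": t_par / t_single,
        "parallel_speedup_over_sequential": t_seq / t_par,
    }


# Average execution times in microseconds, taken from the sample output.
ratios = timing_ratios(396.3544103316963, 813.8596499338746, 573.2781707774848)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

These are averages over 100 runs of a single measurement session, so the exact ratios will vary from run to run.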

This successfully demonstrates parallel execution of kernels on U50!