6.3. Quar cluster installation
La Quar is a small town and municipality located in the comarca of Berguedà, in Catalonia.
It is also the name of an Intel machine containing two Xilinx Alveo U200 accelerator cards.
The OmpSs-2@FPGA releases are automatically installed in the Quar cluster. They are available through a module file for each target architecture. This document describes how to load and use the modules to compile an example application. Once the modules are loaded, the workflow in the Quar cluster should be the same as in the Docker images.
6.3.1. General remarks
- The OmpSs@FPGA toolchain is installed in a version folder under the /opt/bsc/ directory.
- Third-party libraries required to run some programs are installed in the corresponding folder under the /opt/lib/ directory.
- The rest of the software (Xilinx toolchain, slurm, modules, etc.) is installed under the /tools/ directory.
- During updates, the installation is not available to users.
- An update usually takes about 30 minutes.
- After the installation, an informative email will be sent.
6.3.2. Node specifications
- CPU: Intel Xeon Silver 4208
- Main memory: 64GB DDR4-3200
- FPGA: Xilinx Alveo U200
6.3.3. Logging into Quar
Quar is accessible from HCA (ssh.hca.bsc.es). Alternatively, it can be accessed through port 4819 on HCA; the ssh connection will be redirected to the actual host:
ssh -p 4819 ssh.hca.bsc.es
This can also be automated by adding a quar host entry to the ssh config:
Host quar
    HostName ssh.hca.bsc.es
    Port 4819
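With this entry in place, the machine can then be reached directly with:
ssh quar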
6.3.4. Module structure
The ompss-2 modules are:
ompss-2/x86_64/[release version]
Loading one of these modules automatically loads the default Vivado version, although an arbitrary Vivado version can be loaded before the ompss-2 module:
module load vivado/2023.2 ompss-2/x86_64/git
To list all available modules in the system run:
module avail
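The listing can also be filtered by passing a name prefix, e.g. to show only the OmpSs-2 modules:
module avail ompss-2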
6.3.5. Build applications
To generate an application binary and bitstream, you can refer to Compile OmpSs-2@FPGA programs, as the steps there are general enough.
Note that the appropriate modules need to be loaded. See Module structure.
6.3.6. Running applications
6.3.6.1. Get access to an installed FPGA
The Quar cluster uses SLURM to manage access to computation resources. Therefore, to use the resources of an FPGA, an allocation in one of the partitions has to be made.
There is one partition in the cluster:
fpga: two Alveo U200 boards
The easiest way to allocate an FPGA is to run bash through srun with the --gres option:
srun --gres=fpga:BOARD:N --pty bash
Where BOARD is the board to allocate, in this case alveo_u200, and N is the number of FPGAs to allocate, either 1 or 2.
For instance, the command:
srun --gres=fpga:alveo_u200:2 --pty bash
will allocate both FPGAs and run an interactive bash with the required tools and file permissions already set by slurm. To get information about the active slurm jobs, run:
squeue
The output should look similar to this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1312 fpga bash afilguer R 17:14 1 quar
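When the work is finished, exiting the interactive shell ends the slurm job and releases the allocated FPGAs:
exit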
6.3.6.2. Loading bitstreams
The FPGA bitstream needs to be loaded before the application can run.
The load_bitstream utility is provided in order to simplify the FPGA configuration:
load_bitstream bitstream.bit [index]
The utility accepts a second, optional parameter indicating which of the allocated FPGAs to program; the default behavior is to program all the allocated FPGAs with the bitstream.
To know which FPGA indices have been allocated, run load_bitstream with the help (-h) option. The output should be similar to this:
Usage load_bitstream bitstream.bit [index]
Available devices:
index: jtag pcie usb
0: 21290594G00LA 0000:b3:00.0 001:002
1: 21290594G00EA 0000:65:00.0 001:003
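For example, to program only the FPGA with index 1 from the listing above (the bitstream file name here is just illustrative):
load_bitstream matmul.bit 1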
6.3.6.3. Set up qdma queues
Note
This step is already performed by the load_bitstream script, which creates a single bidirectional memory-mapped queue. Setting up queues manually is only needed if a different configuration is required.
For DMA transfers to be performed between system main memory and the FPGA memory, qdma queues have to be set up by the user prior to any execution. The dmactl tool is used for this purpose.
For instance, to create and start a memory-mapped qdma queue with index 1, run:
dmactl qdmab3000 q add idx 1 mode mm dir bi
dmactl qdmab3000 q start idx 1 mode mm dir bi
The OmpSs runtime system expects a memory-mapped queue at index 1, which can be created with the commands listed above.
In the same fashion, these queues can also be removed:
dmactl qdmab3000 q stop idx 1 mode mm dir bi
dmactl qdmab3000 q del idx 1 mode mm dir bi
For more information, see
dmactl --help
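As a purely illustrative example, an additional bidirectional memory-mapped queue at index 2 could be created and started following the same pattern:
dmactl qdmab3000 q add idx 2 mode mm dir bi
dmactl qdmab3000 q start idx 2 mode mm dir bi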
6.3.6.4. Get current bitstream info
In order to get information about the bitstream currently loaded into the FPGA, the tool read_bitinfo
is installed in the system.
read_bitinfo
Note that an active slurm reservation is needed in order to query the FPGA.
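For instance, assuming the corresponding modules are loaded in the submitting shell, it can be launched directly inside a single-FPGA allocation:
srun --gres=fpga:alveo_u200:1 read_bitinfo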
This call should return something similar to the sample output for a matrix multiplication application:
Reading bitinfo of FPGA 0000:b3:00.0
Bitstream info version: 11
Number of acc: 8
AIT version: 7.1.0
Wrapper version 13
Board base frequency (Hz) 156250000
Interleaving stride 32768
Features:
[ ] Instrumentation
[ ] Hardware counter
[x] Performance interconnect
[ ] Simplified interconnection
[ ] POM AXI-Lite
[x] POM task creation
[x] POM dependencies
[ ] POM lock
[x] POM spawn queues
[ ] Power monitor (CMS) enabled
[ ] Thermal monitor (sysmon) enabled
Cmd In addr 0x2000 len 128
Cmd Out addr 0x4000 len 128
Spawn In addr 0x6000 len 1024
Spawn Out addr 0x8000 len 1024
Managed rstn addr 0xA000
Hardware counter addr 0x0
POM AXI-Lite addr 0x0
Power monitor (CMS) addr 0x0
Thermal monitor (sysmon) addr 0x0
xtasks accelerator config:
type count freq(KHz) description
5839957875 1 300000 matmulFPGA
7602000973 7 300000 matmulBlock
ait command line:
ait --name=matmul --board=alveo_u200 -c=300 --memory_interleaving_stride=32K --simplify_interconnection --interconnect_opt=performance --interconnect_regslice=all --floorplanning_constr=all --slr_slices=all --placement_file=u200_placement_7x256.json --wrapper_version 13
Hardware runtime VLNV:
bsc:ompss:picosompssmanager:7.3
bitinfo note:
''
6.3.6.5. Remote debugging
Although it is possible to interact with Vivado’s Hardware Manager through ssh-based X forwarding, Vivado’s GUI might not be very responsive over remote connections. To avoid this limitation, one might connect a local Hardware Manager instance to targets hosted on Quar, completely avoiding X forwarding, as follows.
- On Quar, when allocating an FPGA with slurm, a Vivado HW server is automatically launched for each FPGA:
- FPGA 0 uses port 3120
- FPGA 1 uses port 3121
- On the local machine, assuming that Quar's HW server runs on port 3120, forward all connections to local port 3120 to quar:
  ssh -L 3120:quar:3120 [USER]@ssh.hca.bsc.es -p 4819
- Finally, from the local machine, connect to Quar's hardware server:
- Open Vivado’s Hardware Manager.
- Launch the “Open target” wizard.
- Establish a connection to the local HW server, which will be just a bridge to the remote instance.
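The same approach works for the other FPGA: forwarding local port 3121 instead reaches the HW server of FPGA 1, e.g.:
ssh -L 3121:quar:3121 [USER]@ssh.hca.bsc.es -p 4819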
6.3.6.6. Enabling OpenCL / XRT mode
The FPGAs in Quar can be used in OpenCL / XRT mode. Currently, XRT 2022.2 is installed. To enable XRT, the XRT shell has to be configured into the FPGA and the PCIe devices re-enumerated after the configuration has finished.
This is done by running
load_xrt_shell
Note that this has to be done while a slurm job is allocated.
After this process has completed, output from lspci -vd 10ee:
should look similar to this:
b3:00.0 Processing accelerators: Xilinx Corporation Device 5000
Subsystem: Xilinx Corporation Device 000e
Flags: bus master, fast devsel, latency 0, NUMA node 0
Memory at 383ff0000000 (64-bit, prefetchable) [size=32M]
Memory at 383ff4000000 (64-bit, prefetchable) [size=256K]
Capabilities: <access denied>
Kernel driver in use: xclmgmt
Kernel modules: xclmgmt
b3:00.1 Processing accelerators: Xilinx Corporation Device 5001
Subsystem: Xilinx Corporation Device 000e
Flags: bus master, fast devsel, latency 0, IRQ 105, NUMA node 0
Memory at 383ff2000000 (64-bit, prefetchable) [size=32M]
Memory at 383ff4040000 (64-bit, prefetchable) [size=256K]
Memory at 383fe0000000 (64-bit, prefetchable) [size=256M]
Capabilities: <access denied>
Kernel driver in use: xocl
Kernel modules: xocl
Also, XRT devices should show up as ready when running xbutil examine. Note that the xrt/2022.2 module has to be loaded:
module load xrt/2022.2
xbutil examine
And it should show this output:
System Configuration
OS Name : Linux
Release : 5.4.0-97-generic
Version : #110-Ubuntu SMP Thu Jan 13 18:22:13 UTC 2022
Machine : x86_64
CPU Cores : 16
Memory : 63812 MB
Distribution : Ubuntu 18.04.2 LTS
GLIBC : 2.31
Model : PowerEdge T640
XRT
Version : 2.14.354
Branch : 2022.2
Hash : 43926231f7183688add2dccfd391b36a1f000bea
Hash Date : 2022-10-08 09:49:58
XOCL : 2.14.354, 43926231f7183688add2dccfd391b36a1f000bea
XCLMGMT : 2.14.354, 43926231f7183688add2dccfd391b36a1f000bea
Devices present
BDF : Shell Platform UUID Device ID Device Ready*
-------------------------------------------------------------------------------------------------------------------------
[0000:b3:00.1] : xilinx_u200_gen3x16_xdma_base_2 0B095B81-FA2B-E6BD-4524-72B1C1474F18 user(inst=128) Yes
* Devices that are not ready will have reduced functionality when using XRT tools
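Optionally, if the platform validation packages are installed, the card can be further checked with XRT's built-in validation tests, using the device BDF reported above (this takes a few minutes):
module load xrt/2022.2
xbutil validate --device 0000:b3:00.1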