.. index:: single: quar

.. _quar-user_guide:

quar user guide
===============

**quar** takes its name from La Quar, a small town and municipality located in the comarca of Berguedà, in Catalonia.

The OmpSs-2\@FPGA releases are automatically installed on the server and are available through a module file for each target architecture. This document describes how to load and use the modules to compile an example application. Once the modules are loaded, the workflow on the server is the same as in the Docker images.

General remarks
---------------

* The OmpSs-2\@FPGA toolchain is installed in a version folder under the ``/opt/bsc/`` directory.
* Third-party libraries required to run some programs are installed in the corresponding folder under the ``/opt/lib/`` directory.
* The rest of the software (Xilinx toolchain, slurm, modules, etc.) is installed under the ``/tools/`` directory.

Node specifications
-------------------

* CPU: Intel Xeon Silver 4208

  * https://ark.intel.com/content/www/us/en/ark/products/193390/intel-xeon-silver-4208-processor-11m-cache-2-10-ghz.html

* Main memory: 64GB DDR4-3200
* FPGAs:

  * 2x Xilinx Alveo U200

    * https://www.amd.com/en/products/accelerators/alveo/u200/a-u200-a64g-pq-g.html

.. _quar-login:

Logging into the system
-----------------------

quar is accessible from HCA ``ssh.hca.bsc.es``. Alternatively, it can be accessed through port ``8419`` on HCA, and the ssh connection will be redirected to the actual host:

.. code-block:: text

   ssh -p 8419 ssh.hca.bsc.es

This can also be automated by adding a ``quar`` host to the ssh config, after which ``ssh quar`` connects directly:

.. code-block:: text

   Host quar
       HostName ssh.hca.bsc.es
       Port 8419

.. _quar-modules:

Module structure
----------------

The ompss-2 modules are:

* ``ompss-2/x86_64/*[release version]*``

Loading an ompss-2 module automatically loads the default Vivado version, although an arbitrary version can be loaded before ompss-2:

.. code-block:: text

   module load vivado/2023.2 ompss-2/x86_64/git

To list all available modules in the system, run:

.. code-block:: text

   module avail

Build applications
------------------

To generate an application binary and bitstream, you can refer to :ref:`compile-ompss2atfpga-programs`, as the steps are general enough. Note that the appropriate modules need to be loaded; see :ref:`quar-modules`.

.. _quar-running_applications:

Running applications
--------------------

.. _quar-access_fpga:

Get access to an installed FPGA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The server uses Slurm to manage access to computation resources. Therefore, to use the resources of an FPGA, an allocation in one of the partitions has to be made. You can check the number and names of partitions and nodes by running:

.. code-block:: text

   sinfo -Nel

There is 1 partition in the node:

* ``fpga``: 2x alveo_u200

To allocate computing resources, run ``salloc`` with the ``--gres`` option. For instance:

.. code-block:: text

   salloc -p fpga --gres=fpga:BOARD:N

where ``BOARD`` is the FPGA board to allocate and ``N`` is the number of FPGAs to allocate. This command allocates the specified number of FPGAs, with the required tools and file permissions already set by slurm, and prevents other users from using those resources.

Once inside an allocation, you can run a script or an interactive job with a subset of the allocated resources using ``srun``.

For an interactive job, run:

.. code-block:: text

   srun --gres=fpga:BOARD:N --pty bash

To execute a script, run:

.. code-block:: text

   srun --gres=fpga:BOARD:N script.sh

.. note::
   You can also allocate and run a job in a single command with ``srun``. There is no need to pre-allocate resources with ``salloc``.

.. warning::
   Just running an ``salloc`` will not set the OmpSs-2\@FPGA environment variables. In order to do so, you must run your job through ``srun``.
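The ``--gres`` argument used by ``salloc`` and ``srun`` can be composed with a small helper. The sketch below is illustrative only: the ``fpga_gres`` function name is hypothetical (not an installed tool), while ``alveo_u200`` is the board name listed for the ``fpga`` partition above.

.. code-block:: bash

   #!/bin/bash
   # Hypothetical helper: build the --gres argument for salloc/srun.
   # Usage: fpga_gres [BOARD] [N]  ->  --gres=fpga:BOARD:N
   fpga_gres() {
       local board=${1:-alveo_u200}   # board name, as reported by sinfo
       local n=${2:-1}                # number of FPGAs to allocate
       echo "--gres=fpga:${board}:${n}"
   }

   fpga_gres alveo_u200 2   # prints --gres=fpga:alveo_u200:2

With such a helper, an interactive session could then be started with ``srun $(fpga_gres alveo_u200 1) --pty bash``.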
Alternatively, you can run your jobs asynchronously through an ``sbatch`` command, passing a slurm job script as an argument:

.. code-block:: text

   sbatch --gres=fpga:BOARD:N job_script.sh

An example ``job_script.sh``:

.. code:: bash

   #!/bin/bash
   #
   #SBATCH --job-name=ompss-2_fpga_test
   #SBATCH --output=out.txt
   #SBATCH --time=05:00
   #SBATCH --gres=fpga:BOARD:N
   #SBATCH -p fpga

   module load ompss-2/x86_64/git
   cd test
   make binary
   srun --gres=fpga:BOARD:N exec_test.sh

To get information about the active slurm jobs, run:

.. code-block:: text

   squeue

The output should look similar to this:

.. code-block:: text

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1312      fpga     bash afilguer  R      17:14      1 quar

Loading bitstreams
^^^^^^^^^^^^^^^^^^

The FPGA bitstream needs to be loaded before the application can run. The ``load_bitstream`` utility is provided to simplify programming the FPGA:

.. code-block:: text

   load_bitstream bitstream.bit [index]

The utility accepts a second, optional parameter indicating which of the allocated FPGAs to program. The default behavior is to program all the allocated FPGAs with the bitstream. To find out which FPGAs have been allocated, you can run the ``report_slurm_node`` tool. The output should be similar to this:

.. code-block:: text

   LOCAL_ID PCI_DEV      USB_DEV QDMA_DEV HWSERVER_PORT GLOBAL_ID
   0        0000:b3:00.0 001:002 b3000    13330         0
   1        0000:65:00.0 001:003 65000    13331         1

You can also run ``load_bitstream`` with the ``-h`` option to see which FPGAs are available to program:

.. code-block:: text

   Usage load_bitstream bitstream.bit [index]
   Available devices:
   index:  jtag           pcie          usb
   0:      21290594G00LA  0000:b3:00.0  001:002
   1:      21290594G00EA  0000:65:00.0  001:003

Set up qdma queues
^^^^^^^^^^^^^^^^^^

.. note::
   This step is performed by the ``load_bitstream`` script, which creates a single bidirectional memory-mapped queue. It is only needed if a different configuration is required.
For DMA transfers to be performed between the system main memory and the FPGA memory, qdma queues have to be set up by the user *prior to any execution*. The ``dma-ctl`` tool is used for this. For instance, to create and start a memory-mapped qdma queue with index 1, run:

.. code-block:: text

   dma-ctl qdmab3000 q add idx 1 mode mm dir bi
   dma-ctl qdmab3000 q start idx 1 mode mm dir bi

The OmpSs-2\@FPGA runtime system expects an mm queue at index 1, which can be created with the commands listed above.

In the same fashion, these queues can also be removed:

.. code-block:: text

   dma-ctl qdmab3000 q stop idx 1 mode mm dir bi
   dma-ctl qdmab3000 q del idx 1 mode mm dir bi

For more information, see:

.. code-block:: text

   dma-ctl --help

Get current bitstream info
^^^^^^^^^^^^^^^^^^^^^^^^^^

To get information about the bitstream currently loaded into the FPGA, the ``read_bitinfo`` tool is installed in the system:

.. code-block:: text

   read_bitinfo

Note that an active slurm reservation is needed in order to query the FPGA.

This call should return something similar to the following sample output for a cholesky decomposition application:

.. code-block:: text

   Bitinfo of FPGA 0000:b3:00.0:
   Bitinfo version:        16
   Bitstream user-id:      0xA41B665D
   AIT version:            8.2.0
   Wrapper version:        13
   Number of accelerators: 9
   Board base frequency:   156.25 MHz
   Dedicated FPGA memory:  64 GB
   Memory interleaving:    32768
   Features:
   [ ] Instrumentation
   [ ] Hardware counter
   Interconnect optimization
       [ ] Area
       [x] Performance
   Picos OmpSs Manager
       [ ] AXI-Lite
       [x] Task creation
       [x] Dependencies
       [ ] Lock
       [x] Spawn queues
   [ ] Power monitor (CMS)
   [ ] Thermal monitor (sysmon)
   [ ] OMPIF
   [ ] IMP
   Address map:
       Managed rstn             - address 0x1000
       CmdIn                    - address 0x2000 length 64
       CmdOut                   - address 0x4000 length 64
       SpawnIn                  - address 0x6000 length 1024
       SpawnOut                 - address 0x8000 length 1024
       Hardware counter         - not enabled
       POM AXI-Lite             - not enabled
       Power monitor (CMS)      - not enabled
       Thermal monitor (sysmon) - not enabled
   xtasks accelerator config:
   type       count freq(KHz) description
   5862896218 1     300000    cholesky_blocked
   5459683074 1     300000    omp_trsm
   5459653839 1     300000    omp_syrk
   5459186490 6     300000    omp_gemm
   ait command line:
   ait --name=cholesky --board=alveo_u200 -c=300 --user_config=config_files/alveo_u200_performance.json --memory_interleaving_stride=32K --interconnect_priorities --interconnect_regslices --max_deps_per_task=3 --max_args_per_task=3 --max_copies_per_task=3 --picos_tm_size=256 --picos_dm_size=645 --picos_vm_size=775 --wrapper_version 13
   Hardware runtime VLNV:
   bsc:ompss:picos_ompss_manager:7.5

Remote debugging
^^^^^^^^^^^^^^^^

Although it is possible to interact with Vivado's Hardware Manager through ssh-based X forwarding, Vivado's GUI might not be very responsive over remote connections. To avoid this limitation, one can connect a local Hardware Manager instance to targets hosted on quar, completely avoiding X forwarding, as follows.

#. On quar, when allocating an FPGA with Slurm, a Vivado HW server is automatically launched for each FPGA:

   * FPGA 0 uses port 3120
   * FPGA 1 uses port 3121

#. On the local machine, assuming that quar's HW server runs on port 3120, forward all connections to port 3120 to quar by running ``ssh -L 3120:quar:3120 [USER]@ssh.hca.bsc.es -p 8419``.
#. Finally, from the local machine, connect to quar's hardware server:

   * Open Vivado's Hardware Manager.
   * Launch the "Open target" wizard.
   * Establish a connection to the local HW server, which will be just a bridge to the remote instance.

Enabling OpenCL / XRT mode
^^^^^^^^^^^^^^^^^^^^^^^^^^

The FPGAs in quar can be used in OpenCL / XRT mode. Currently, XRT 2022.2 is installed. To enable XRT, the shell has to be configured into the FPGA and the PCIe devices have to be re-enumerated after configuration has finished. This is done by running:

.. code-block:: text

   load_xrt_shell

Note that this has to be done while a slurm job is allocated. After this process has completed, the output from ``lspci -vd 10ee:`` should look similar to this:

.. code-block:: text

   b3:00.0 Processing accelerators: Xilinx Corporation Device 5000
           Subsystem: Xilinx Corporation Device 000e
           Flags: bus master, fast devsel, latency 0, NUMA node 0
           Memory at 383ff0000000 (64-bit, prefetchable) [size=32M]
           Memory at 383ff4000000 (64-bit, prefetchable) [size=256K]
           Capabilities:
           Kernel driver in use: xclmgmt
           Kernel modules: xclmgmt

   b3:00.1 Processing accelerators: Xilinx Corporation Device 5001
           Subsystem: Xilinx Corporation Device 000e
           Flags: bus master, fast devsel, latency 0, IRQ 105, NUMA node 0
           Memory at 383ff2000000 (64-bit, prefetchable) [size=32M]
           Memory at 383ff4040000 (64-bit, prefetchable) [size=256K]
           Memory at 383fe0000000 (64-bit, prefetchable) [size=256M]
           Capabilities:
           Kernel driver in use: xocl
           Kernel modules: xocl

XRT devices should also show up as ready when running ``xbutil examine``. Note that the xrt/2022.2 module has to be loaded:

.. code-block:: text

   module load xrt/2022.2
   xbutil examine

It should show this output:

.. code-block:: text

   System Configuration
     OS Name              : Linux
     Release              : 5.4.0-97-generic
     Version              : #110-Ubuntu SMP Thu Jan 13 18:22:13 UTC 2022
     Machine              : x86_64
     CPU Cores            : 16
     Memory               : 63812 MB
     Distribution         : Ubuntu 18.04.2 LTS
     GLIBC                : 2.31
     Model                : PowerEdge T640

   XRT
     Version              : 2.14.354
     Branch               : 2022.2
     Hash                 : 43926231f7183688add2dccfd391b36a1f000bea
     Hash Date            : 2022-10-08 09:49:58
     XOCL                 : 2.14.354, 43926231f7183688add2dccfd391b36a1f000bea
     XCLMGMT              : 2.14.354, 43926231f7183688add2dccfd391b36a1f000bea

   Devices present
     BDF             :  Shell                            Platform UUID                         Device ID        Device Ready*
     -------------------------------------------------------------------------------------------------------------------------
     [0000:b3:00.1]  :  xilinx_u200_gen3x16_xdma_base_2  0B095B81-FA2B-E6BD-4524-72B1C1474F18  user(inst=128)   Yes

   * Devices that are not ready will have reduced functionality when using XRT tools
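As the ``lspci`` output above shows, each card exposes two PCIe functions: function ``.0`` is bound to the ``xclmgmt`` management driver and function ``.1`` to the ``xocl`` user driver. The sketch below classifies a BDF accordingly; the ``xrt_role`` function is hypothetical, derived only from that output, and is not an installed tool.

.. code-block:: bash

   #!/bin/bash
   # Hypothetical helper: classify a Xilinx PCIe function by its function
   # number, matching the lspci output above
   # (b3:00.0 -> xclmgmt/mgmt, b3:00.1 -> xocl/user).
   xrt_role() {
       case ${1##*.} in         # keep only the digit after the last '.'
           0) echo mgmt ;;      # management function (driver xclmgmt)
           1) echo user ;;      # user function (driver xocl)
           *) echo unknown ;;
       esac
   }

   xrt_role b3:00.1   # prints user

Note that ``xbutil examine`` lists only the user function (``0000:b3:00.1`` above); the management function is handled through ``xclmgmt``.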