.. index::
   single: quar

.. _installation_quar:

Quar cluster installation
=========================

La Quar is a small town and municipality located in the comarca of Berguedà, in
Catalonia. It is also an Intel machine containing a Xilinx Alveo U200
accelerator card.

The OmpSs@FPGA releases are automatically installed in the Quar cluster. They
are available through a module file for each target architecture. This document
describes how to load and use the modules to compile an example application.
Once the modules are loaded, the workflow in the Quar cluster should be the
same as in the Docker images.

General remarks
---------------

* All software is installed in a version folder under the ``/opt/bsc`` directory.
* During updates, the installation is not available for use.
* Usually, the installation takes about 20 minutes.
* After the installation, an informative email is sent.

Node specifications
-------------------

* CPU: Intel Xeon Silver 4208

  * https://ark.intel.com/content/www/us/en/ark/products/193390/intel-xeon-silver-4208-processor-11m-cache-2-10-ghz.html

* Main memory: 64GB DDR4-3200
* FPGA: Xilinx Alveo U200

  * https://www.xilinx.com/products/boards-and-kits/alveo/u200.html

.. _logging_in:

Logging into quar
-----------------

Quar is accessible from HCA ``ssh.hca.bsc.es``.
Alternatively, it can be accessed through port ``4819`` of HCA, and the ssh
connection will be redirected to the actual host:

.. code:: bash

   ssh -p 4819 ssh.hca.bsc.es

Also, this can be automated by adding a ``quar`` host to the ssh config:

::

   Host quar
     HostName ssh.hca.bsc.es
     Port 4819

.. _quar-modules:

Module structure
----------------

The ompss modules are:

* ompss/x86\_fpga/*[release version]*

They require having a vivado module loaded:

.. code:: bash

   module load vivado ompss/x86_fpga/git

To list all available modules in the system run:

.. code:: bash

   module avail

Build applications
------------------

To generate an application binary and bitstream, refer to
:ref:`compile-ompssatfpga-programs`, as the steps are general enough.
Note that the appropriate modules need to be loaded. See :ref:`quar-modules`.

Running applications
--------------------

Get access to an installed fpga
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Quar cluster uses SLURM in order to manage access to computation resources.
Therefore, to be able to use the resources of an FPGA, an allocation in one of
the partitions has to be made.

There is 1 partition in the cluster:

* ``fpga``: Alveo U200 board

In order to make an allocation, you must run ``salloc``:

.. code:: bash

   salloc -p [partition]

For instance:

.. code:: bash

   salloc -p fpga

Then get the node that has been allocated for you:

::

   squeue

The output should look similar to this:

::

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1312      fpga     bash afilguer  R      17:14      1 quar

Loading bitstreams
^^^^^^^^^^^^^^^^^^

The FPGA bitstream needs to be loaded before the application can run.
The ``load_bitstream`` utility is provided in order to simplify the FPGA
configuration.

::

   load_bitstream bitstream.bit
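Putting these steps together, a minimal session sketch might look like the
following; the bitstream and binary names (``matmul.bit``, ``./matmul-p``) are
placeholders for an application built as described in
:ref:`compile-ompssatfpga-programs`:

.. code:: bash

   salloc -p fpga               # allocate the Alveo U200 node (fpga partition)
   squeue                       # check which node has been assigned

   # On the allocated node:
   load_bitstream matmul.bit    # placeholder bitstream name
   ./matmul-p                   # placeholder application binary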
Set up qdma queues
^^^^^^^^^^^^^^^^^^

.. note::

   This step is performed by the ``load_bitstream`` script, which creates a
   single bidirectional memory mapped queue. It is only needed if a different
   configuration is required.

For DMA transfers to be performed between system main memory and the FPGA
memory, qdma queues have to be set up by the user *prior to any execution*.
In this case the ``dmactl`` tool is used.

For instance, in order to create and start a memory mapped qdma queue with
index 1 run:

::

   dmactl qdma02000 q add idx 1 mode mm dir bi
   dmactl qdma02000 q start idx 1 mode mm dir bi

The OmpSs runtime system expects an mm queue at index 1, which can be created
with the commands listed above.

In the same fashion, these queues can also be removed:

::

   dmactl qdma02000 q stop idx 1 mode mm dir bi
   dmactl qdma02000 q del idx 1 mode mm dir bi

For more information, see

::

   dmactl --help

Get current bitstream info
^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to get information about the bitstream currently loaded into the FPGA,
the tool ``read_bitinfo`` is installed in the system.

::

   read_bitinfo

Note that an active slurm reservation is needed in order to query the FPGA.

This call should return something similar to the following sample output for a
matrix multiplication application:

::

   Bitstream info version: 6
   Number of acc: 5
   Base freq: 300 MHz
   AIT version: 3.8
   Wrapper version 10
   Features: 0x1c4
   [ ] Instrumentation
   [ ] DMA engine
   [x] Performance interconnect
   [x] Hardware Runtime
   [x] Extended HW runtime
   [x] SOM
   [ ] Picos
   Interconnect level: basic
   xtasks accelerator config:
   type                 #ins  name         freq
   0000000006708694863  001   matmulFPGA   300
   0000000004353056269  004   matmulBlock  300
   ait command line:
   ait.pyc --disable_utilization_check --name=matmul --board=alveo_u200 -c=300 --hwruntime=som --interconnection_opt=performance --wrapper_version=10
   Hardware runtime VLNV: bsc:ompss:smartompssmanager:3.2

Debugging with HW server
^^^^^^^^^^^^^^^^^^^^^^^^

Although it is possible to interact with Vivado's Hardware Manager through
ssh-based X forwarding, Vivado's GUI might not be very responsive over remote
connections. To avoid this limitation, one can connect a local Hardware Manager
instance to targets hosted on Quar, completely avoiding X forwarding, as
follows (see the sketch after this list):

#. On Quar, launch Vivado's HW server by running ``exec hw_server -d`` on
   Vivado's TCL console.
#. On the local machine, assuming that Quar's HW server runs on port 3121, let
   all connections to port 3121 be forwarded to quar by doing
   ``ssh -L 3121:quar:3121 [USER]@ssh.hca.bsc.es -p 4819``.
#. Finally, from the local machine, connect to Quar's hardware server:

   * Open Vivado's Hardware Manager.
   * Launch the "Open target" wizard.
   * Establish a connection to the local HW server, which will be just a bridge
     to the remote instance.
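For convenience, the local-machine side of this setup is sketched below. This
only restates the forwarding from step 2 (``-N`` keeps the tunnel open without
starting a remote shell) and assumes the HCA access port ``4819`` described in
:ref:`logging_in`:

.. code:: bash

   # Forward local port 3121 to the hw_server running on quar
   # (step 1 above must already have been done on Quar).
   ssh -N -L 3121:quar:3121 [USER]@ssh.hca.bsc.es -p 4819

   # Then, in the local Vivado Hardware Manager, use the "Open target"
   # wizard and connect to the HW server at localhost:3121.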
Enabling OpenCL / XRT mode
^^^^^^^^^^^^^^^^^^^^^^^^^^

The FPGA in Quar can be used in OpenCL / XRT mode. Currently, XRT 2022.2 is
installed. To enable XRT, the shell has to be configured into the FPGA and the
PCIe devices re-enumerated after the configuration has finished. This is done
by running

.. code:: bash

   init_xrt

Note that this has to be done while a slurm job is allocated.

After this process has completed, the output from ``lspci -vd 10ee:`` should
look similar to this:

::

   b3:00.0 Processing accelerators: Xilinx Corporation Device 5000
           Subsystem: Xilinx Corporation Device 000e
           Flags: bus master, fast devsel, latency 0, NUMA node 0
           Memory at 383ff0000000 (64-bit, prefetchable) [size=32M]
           Memory at 383ff4000000 (64-bit, prefetchable) [size=256K]
           Capabilities: <access denied>
           Kernel driver in use: xclmgmt
           Kernel modules: xclmgmt

   b3:00.1 Processing accelerators: Xilinx Corporation Device 5001
           Subsystem: Xilinx Corporation Device 000e
           Flags: bus master, fast devsel, latency 0, IRQ 105, NUMA node 0
           Memory at 383ff2000000 (64-bit, prefetchable) [size=32M]
           Memory at 383ff4040000 (64-bit, prefetchable) [size=256K]
           Memory at 383fe0000000 (64-bit, prefetchable) [size=256M]
           Capabilities: <access denied>
           Kernel driver in use: xocl
           Kernel modules: xocl

XRT devices should also show up as ready when running ``xbutil examine``.
Note that the ``xrt/2022.2`` module has to be loaded.

.. code:: bash

   module load xrt/2022.2
   xbutil examine

It should show an output similar to this:

::

   System Configuration
     OS Name              : Linux
     Release              : 5.4.0-97-generic
     Version              : #110-Ubuntu SMP Thu Jan 13 18:22:13 UTC 2022
     Machine              : x86_64
     CPU Cores            : 16
     Memory               : 63812 MB
     Distribution         : Ubuntu 18.04.2 LTS
     GLIBC                : 2.31
     Model                : PowerEdge T640

   XRT
     Version              : 2.14.354
     Branch               : 2022.2
     Hash                 : 43926231f7183688add2dccfd391b36a1f000bea
     Hash Date            : 2022-10-08 09:49:58
     XOCL                 : 2.14.354, 43926231f7183688add2dccfd391b36a1f000bea
     XCLMGMT              : 2.14.354, 43926231f7183688add2dccfd391b36a1f000bea

   Devices present
   BDF             :  Shell                            Platform UUID                         Device ID        Device Ready*
   -------------------------------------------------------------------------------------------------------------------------
   [0000:b3:00.1]  :  xilinx_u200_gen3x16_xdma_base_2  0B095B81-FA2B-E6BD-4524-72B1C1474F18  user(inst=128)   Yes

   * Devices that are not ready will have reduced functionality when using XRT tools
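Putting the steps of this section together, a typical sequence to switch the
board to XRT mode and verify it might look like the sketch below; the only
assumption is that all commands are run inside a single active slurm
allocation, as noted above:

.. code:: bash

   salloc -p fpga            # allocate the FPGA node
   init_xrt                  # configure the XRT shell and re-enumerate PCIe
   module load xrt/2022.2    # make the XRT tools available
   xbutil examine            # the Alveo U200 should be reported as Ready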