.. index::
   triple: runtime; architectures; CUDA

CUDA Architecture
=================

Performance guidelines and troubleshooting advice related to OmpSs
applications using CUDA are described here.

Tuning OmpSs Applications Using GPUs for Performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In general, the best performance results are achieved when the prefetch and
overlap options are enabled. Usually, a *write-back* cache policy also
enhances performance, unless the application requires frequent data transfers
between host and GPU devices. Other Nanox options for performance tuning can
be found at :ref:`faq-performance`.

Running with cuBLAS (v2)
^^^^^^^^^^^^^^^^^^^^^^^^

Since CUDA 4, the first parameter of any cuBLAS function is of type
``cublasHandle_t``. In the case of OmpSs applications, this handle needs to
be managed by Nanox, so the ``--gpu-cublas-init`` runtime option must be
enabled. From the application's source code, the handle can be obtained by
calling the ``cublasHandle_t nanos_get_cublas_handle()`` API function. The
value returned by this function is the cuBLAS handle and it should be passed
to all cuBLAS calls. Nevertheless, the cuBLAS handle is only valid inside the
task context and should not be stored in a variable, as it may change during
the application's execution. The handle should be obtained through the API
function inside every task that calls cuBLAS.

.. highlight:: c

Example::

   #pragma omp target device (cuda) copy_deps
   #pragma omp task inout ([NB*NB]C) in ([NB*NB]A, [NB*NB]B)
   void matmul_tile (double* A, double* B, double* C, unsigned int NB)
   {
      double alpha = 1.0;

      // The handle is only valid inside this task context
      cublasHandle_t handle = nanos_get_cublas_handle();

      // alpha is reused as beta (both are 1.0)
      cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_T, NB, NB, NB,
                  &alpha, A, NB, B, NB, &alpha, C, NB);
   }

GPU Troubleshooting
^^^^^^^^^^^^^^^^^^^

If the application shows unexpected behavior (it either reports incorrect
results or crashes), try using the Nanox debug version (the application must
be recompiled with the ``--debug`` flag). The debug version performs
additional checks, so it may expose the cause of the error.

How to Understand GPU's Error Messages
""""""""""""""""""""""""""""""""""""""

When Nanox encounters an error, it aborts the execution and throws a
**FATAL ERROR** message. When the error is related to CUDA, the structure of
this message is the following:

**FATAL ERROR: [** *#thread* **]** *what Nanox was trying to do when the error was detected* **:** *error reported by CUDA*

Note that the true error is the one reported by CUDA, not the one reported by
the runtime. The runtime just gives information about the operation it was
performing, but this does not mean that this operation caused the error. For
example, there can be an error launching a GPU kernel, but (unless this is
checked by the user application) the runtime will only detect the error at
the next call to the CUDA runtime, which will probably be a memory copy.
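In such cases, the kernel launch can also be checked by hand from the
application code to catch the error at its source. The following is a minimal
plain-CUDA sketch (the ``scale`` kernel, its launch configuration and the
explicit synchronization are illustrative debugging aids, not part of the
Nanox API)::

   #include <stdio.h>
   #include <cuda_runtime.h>

   __global__ void scale (double* v, int n)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
         v[i] *= 2.0;
   }

   void run_scale (double* d_v, int n)
   {
      scale<<<(n + 127) / 128, 128>>>(d_v, n);

      // cudaGetLastError() catches launch-configuration errors right
      // away; errors raised while the kernel runs only surface after
      // a synchronization point, hence the explicit synchronization
      cudaError_t err = cudaGetLastError();
      if (err == cudaSuccess)
         err = cudaDeviceSynchronize();
      if (err != cudaSuccess)
         fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
   }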
Recommended Steps When an Error Is Found
""""""""""""""""""""""""""""""""""""""""

#. Compile the application with the ``--debug`` and ``-ggdb3`` flags;
   optionally, add the ``-O0`` flag as well.
#. Run normally and check if any **FATAL ERROR** is triggered. This will give
   a hint about what is causing the error. For more information, proceed to
   the next step.
#. Run the application inside gdb or cuda-gdb to find the exact point where
   the application crashes: explore the backtrace and the values of
   variables, set breakpoints for further analysis, or use any other tool you
   find useful. You can also check :ref:`faq-crash` for further information.
#. If you think that the error is a bug in OmpSs, please report it.

Common Errors from GPUs
"""""""""""""""""""""""

Here is a list of common errors found when using GPUs and how they can be
solved.

* **No cuda capable device detected:** This means that the runtime is not
  able to find any GPU in the system, probably because the GPUs are not
  detected by CUDA. Check your CUDA installation or GPU drivers by running
  any sample application from the CUDA SDK, like *deviceQuery*, and see
  whether the GPUs are properly detected by the system.

.. _run-arch-cuda-comm-err-busy:

* **All cuda capable devices are busy or unavailable:** Someone else is using
  the GPU devices, so Nanox cannot access them. Check if there are other
  instances of Nanox running on the machine, or if any other running
  application may be using the GPUs. You will have to wait until this
  application finishes or frees some GPU devices. The *deviceQuery* example
  from the CUDA SDK can be used to check each GPU's memory state, as it
  reports the available amount of memory for each GPU. If this number is low,
  it is most likely that another application is using that GPU device. This
  error is reported by CUDA 3 or lower; CUDA 4 or higher reports an
  :ref:`Out of memory <run-arch-cuda-comm-err-out-mem>` error instead.

.. _run-arch-cuda-comm-err-out-mem:

* **Out of memory:** The application or the runtime is trying to allocate
  memory, but the operation failed because the GPU's main memory is full. By
  default, Nanox allocates around 95% of each GPU device's memory (unless
  this value is modified with the ``--gpu-max-memory`` runtime option). If
  the application needs additional device memory, tune the amount of memory
  that Nanox is allowed to use with the ``--gpu-max-memory`` runtime option.
  This error is also displayed when someone else is using the GPU device and
  Nanox cannot allocate the device's memory. Refer to the
  :ref:`All cuda capable devices are busy or unavailable <run-arch-cuda-comm-err-busy>`
  error to solve this problem.

* **Unspecified launch failure:** This usually means that a segmentation
  fault occurred on the device while running a kernel (an invalid or illegal
  memory position has been accessed by one or more GPU kernel threads).
  Please check the OmpSs source code directives related to dependencies and
  copies, the compiler warnings and your kernel code (see the sketch below).
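As a hypothetical illustration of the directive mistakes mentioned above,
under-declaring the copied region in a task makes the kernel access memory
that was never transferred to the device (the ``scale_tile`` task is made up;
the ``[NB*NB]`` copy syntax matches the cuBLAS example above)::

   // Wrong: only NB doubles are registered and copied, but the task
   // works on the full NB x NB tile. Device threads touching the
   // remaining NB*NB - NB elements access unmapped memory, which CUDA
   // reports as "unspecified launch failure"
   #pragma omp target device (cuda) copy_deps
   #pragma omp task inout ([NB]C)
   void scale_tile (double* C, unsigned int NB);

   // Correct: declare the whole tile
   #pragma omp target device (cuda) copy_deps
   #pragma omp task inout ([NB*NB]C)
   void scale_tile (double* C, unsigned int NB);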