2.3.2.1. CUDA Architecture

Performance guidelines and troubleshooting advice for OmpSs applications that use CUDA are described here.

2.3.2.1.1. Tuning OmpSs Applications Using GPUs for Performance

In general, best performance results are achieved when the prefetch and overlap options are enabled. Usually, a write-back cache policy also enhances performance, unless the application needs a lot of data communication between the host and the GPU devices.

Other Nanox options for performance tuning can be found in the section “My application does not run as fast as I think it could”.
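As a reference, these options are usually passed to the runtime through the NX_ARGS environment variable. The flag names below (--gpu-prefetch, --gpu-overlap and --cache-policy) are assumptions that may vary between Nanox versions, so check the output of nanox --help for your installation; ./app stands for the OmpSs application binary:

NX_ARGS="--gpu-prefetch --gpu-overlap --cache-policy=wb" ./app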

2.3.2.1.2. Running with cuBLAS (v2)

Since CUDA 4, the first parameter of any cuBLAS function is of type cublasHandle_t. In the case of OmpSs applications, this handle needs to be managed by Nanox, so the --gpu-cublas-init runtime option must be enabled.

From the application’s source code, the handle can be obtained by calling the nanos_get_cublas_handle() API function, which returns the cuBLAS handle that should be used in cuBLAS calls. Nevertheless, the handle is only valid inside the task context and should not be stored in a variable, as it may change over the application’s execution. The handle should be obtained through the API function inside every task that calls cuBLAS.

Example:

#include <cublas_v2.h>

// prototype for the Nanox API function that returns the cuBLAS handle
cublasHandle_t nanos_get_cublas_handle(void);

#pragma omp target device (cuda) copy_deps
#pragma omp task inout([NB*NB]C) in([NB*NB]A, [NB*NB]B)
void matmul_tile (double* A, double* B, double* C, unsigned int NB)
{
    double alpha = 1.0;

    // handle is only valid inside this task context
    cublasHandle_t handle = nanos_get_cublas_handle();

    // C = A^T * B^T + C using the cuBLAS v2 API (alpha is reused as beta, both 1.0)
    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_T, NB, NB, NB,
                &alpha, A, NB, B, NB, &alpha, C, NB);
}
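For reference, a minimal host-side sketch of how such a task could be instantiated is shown below. The blocked layout (NT x NT tiles of NB x NB elements) and the function name matmul are assumptions used only for illustration:

// hypothetical host-side driver: A, B and C are arrays of NT*NT tile pointers
void matmul (double **A, double **B, double **C, unsigned int NT, unsigned int NB)
{
    for (unsigned int i = 0; i < NT; i++)
        for (unsigned int j = 0; j < NT; j++)
            for (unsigned int k = 0; k < NT; k++)
                // each call creates a CUDA task; copy_deps makes Nanox move
                // the tiles between host and device as needed
                matmul_tile (A[i*NT + k], B[k*NT + j], C[i*NT + j], NB);

    // wait until all tasks (and their output copies) have completed
    #pragma omp taskwait
}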

2.3.2.1.3. GPU Troubleshooting

If the application shows unexpected behavior (it either reports incorrect results or crashes), try using the Nanox debug version (the application must be recompiled with the --debug flag). The debug version performs additional checks, so it may help uncover the cause of the error.

2.3.2.1.3.1. How to Understand GPU’s Error Messages

When Nanox encounters an error, it aborts the execution and throws a FATAL ERROR message. When the error is related to CUDA, the structure of this message is the following:

FATAL ERROR: [ #thread ] what Nanox was trying to do when the error was detected : error reported by CUDA

Note that the true error is the one reported by CUDA, not the one reported by the runtime. The runtime only gives information about the operation it was performing when the error was detected, but this does not mean that this operation caused the error. For example, there can be an error launching a GPU kernel, but (unless the user application checks for it) the runtime will only detect the error at the next call to the CUDA runtime, which will probably be a memory copy.
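As an illustration of such a check from the user application, the sketch below queries the CUDA runtime right after a kernel launch, so a launch or execution error is reported at its origin instead of at a later, unrelated CUDA call. The kernel my_kernel and the launch configuration are placeholders:

#include <stdio.h>
#include <cuda_runtime.h>

// placeholder kernel: doubles each element of a device array
__global__ void my_kernel (double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // bounds check: avoids invalid device memory accesses
        data[i] *= 2.0;
}

void launch_and_check (double *d_data, int n)
{
    my_kernel<<< (n + 255) / 256, 256 >>>(d_data, n);

    // detects launch errors (e.g. invalid configuration) immediately
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));

    // detects errors raised while the kernel was running (e.g. illegal accesses)
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
}

Note that the explicit synchronization is only meant for debugging, as it serializes the asynchronous transfers and kernel launches the runtime relies on for performance.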

2.3.2.1.3.2. Common Errors from GPUs

Here is a list of common errors found when using GPUs and how they can be solved.

  • No cuda capable device detected:
    This means that the runtime is not able to find any GPU in the system, most likely because CUDA itself does not detect them. Check your CUDA installation or GPU drivers by running a sample application from the CUDA SDK, such as deviceQuery, to see whether the GPUs are properly detected by the system.
  • All cuda capable devices are busy or unavailable:
    Someone else is using the GPU devices, so Nanox cannot access them. Check whether there are other instances of Nanox running on the machine, or any other application that may be using the GPUs; you will have to wait until that application finishes or frees some GPU devices. The deviceQuery example from the CUDA SDK can also be used to check each GPU’s memory state, as it reports the available amount of memory for each device. If this number is low, it is most likely that another application is using that GPU. This error is reported by CUDA 3 or lower; CUDA 4 or higher reports an Out of memory error instead.
  • Out of memory:
    The application or the runtime is trying to allocate memory, but the operation failed because the GPU’s main memory is full. By default, Nanox allocates around 95% of each GPU device’s memory, unless this value is modified with the --gpu-max-memory runtime option. If the application needs additional device memory, try to tune the amount of memory that Nanox is allowed to use with the --gpu-max-memory runtime option (see the example after this list). This error is also displayed when someone else is using the GPU device and Nanox cannot allocate the device’s memory; refer to the All cuda capable devices are busy or unavailable error to solve this problem.
  • Unspecified launch failure:
    This usually means that a segmentation fault occurred on the device while running a kernel (an invalid or illegal memory position has been accessed by one or more GPU kernel threads). Please check the OmpSs source code directives related to dependencies and copies, the compiler warnings and your kernel code.
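As a reference for the Out of memory case, the memory limit can be set through the NX_ARGS environment variable. The value format shown below (an amount in bytes) is an assumption and may differ between Nanox versions, so check the output of nanox --help; ./app stands for the OmpSs application binary:

NX_ARGS="--gpu-max-memory=1073741824" ./app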