3.2. Writing OmpSs programs

The following examples are written in C/C++ or Fortran using OmpSs as the programming model. With each example we provide a short explanation of how it is annotated and, in some cases, how it can be compiled (when a full example is provided).

3.2.1. Data Management

3.2.1.1. Reusing device data among same device kernel executions

Although memory management is handled entirely by the runtime system, in some cases we can assume a predefined behaviour. This is the case in the following Fortran example using an OpenCL kernel. If we assume the runtime uses a write-back cache policy, we can also determine that the second kernel call will not imply any data movement.

kernel_1.cl:

__kernel void vec_sum(int n, __global int* a, __global int* b, __global int* res)
{
    const int idx = get_global_id(0);

    if (idx < n) res[idx] = a[idx] + b[idx];
}

test_1.f90:

! NOTE: Assuming write-back cache policy

SUBROUTINE INITIALIZE(N, VEC1, VEC2, RESULTS)
    IMPLICIT NONE
    INTEGER :: N
    INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I
    DO I=1,N
        VEC1(I) = I
        VEC2(I) = N+1-I
        RESULTS(I) = -1
    END DO
END SUBROUTINE INITIALIZE

PROGRAM P
    IMPLICIT NONE
    INTERFACE
        !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(kernel_1.cl) COPY_DEPS
        !$OMP TASK IN(A, B) OUT(RES)
        SUBROUTINE VEC_SUM(N, A, B, RES)
            IMPLICIT NONE
            INTEGER, VALUE :: N
            INTEGER :: A(N), B(N), RES(N)
        END SUBROUTINE VEC_SUM
    END INTERFACE
    INTEGER, PARAMETER :: N  = 20
    INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I

    CALL INITIALIZE(N, VEC1, VEC2, RESULTS)

    CALL VEC_SUM(N, VEC1, VEC2, RESULTS)
    !  The vectors VEC1 and VEC2 are sent to the GPU. The input transfers at this
    !   point are: 2 x ( 20 x sizeof(INTEGER)) = 2 x (20 x 4) = 160 B.

    CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)
    !  All the input data is already in the GPU. We don't need to send
    !   anything.

    !$OMP TASKWAIT
    !  At this point we copy out from the GPU the computed values of RESULTS
    !   and remove all the data from the GPU

    ! print the final vector's values
    PRINT *, "RESULTS: ", RESULTS
END PROGRAM P

!  Expected IN/OUT transfers:
!   IN = 160B
!   OUT = 80B

Compile with:

oclmfc -o test_1 test_1.f90 kernel_1.cl --ompss
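
For comparison, the same pattern can also be written in C. The following is only a minimal sketch, assuming the usual OmpSs annotations for OpenCL kernel tasks in C; the prototype form, the [n] dependence shaping and the oclmcc driver name mentioned afterwards are assumptions and are not part of the original example.

/* C sketch of test_1.f90, assuming a write-back cache policy. The pragmas turn
   every call to vec_sum into an OmpSs task that offloads kernel_1.cl. */
#include <stdio.h>

#pragma omp target device(opencl) ndrange(1, n, 128) file(kernel_1.cl) copy_deps
#pragma omp task in([n] a, [n] b) out([n] res)
void vec_sum(int n, int *a, int *b, int *res);

int main(void)
{
    enum { N = 20 };
    int vec1[N], vec2[N], results[N];

    for (int i = 0; i < N; i++) {
        vec1[i] = i + 1;        /* same initialization as the Fortran version */
        vec2[i] = N - i;
        results[i] = -1;
    }

    vec_sum(N, vec1, vec2, results);     /* VEC1 and VEC2 sent to the GPU: 160 B in */
    vec_sum(N, vec1, results, results);  /* inputs already on the GPU: no transfers */

    #pragma omp taskwait
    /* RESULTS copied back to the host here: 80 B out */

    printf("RESULTS[0] = %d\n", results[0]);
    return 0;
}

Under the same write-back assumption, the expected transfers match the Fortran version: 160 B in and 80 B out. The C file would presumably be compiled with the corresponding C driver (e.g. oclmcc -o test_1 test_1.c kernel_1.cl --ompss), but check the driver name available in your installation.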

3.2.1.2. Forcing data back using a taskwait

In this example we need to copy the data back to the host between the two kernel calls. We force this copy back using a taskwait. Note that we are assuming a write-back cache policy.

kernel_2.cl:

__kernel void vec_sum(int n, __global int* a, __global int* b, __global int* res)
{
    const int idx = get_global_id(0);

    if (idx < n) res[idx] = a[idx] + b[idx];
}

test_2.f90:

! NOTE: Assuming write-back cache policy

SUBROUTINE INITIALIZE(N, VEC1, VEC2, RESULTS)
    IMPLICIT NONE
    INTEGER :: N
    INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I
    DO I=1,N
        VEC1(I) = I
        VEC2(I) = N+1-I
        RESULTS(I) = -1
    END DO
END SUBROUTINE INITIALIZE

PROGRAM P
    IMPLICIT NONE
    INTERFACE
        !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(kernel_2.cl) COPY_DEPS
        !$OMP TASK IN(A, B) OUT(RES)
        SUBROUTINE VEC_SUM(N, A, B, RES)
            IMPLICIT NONE
            INTEGER, VALUE :: N
            INTEGER :: A(N), B(N), RES(N)
        END SUBROUTINE VEC_SUM
    END INTERFACE
    INTEGER, PARAMETER :: N  = 20
    INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I

    CALL INITIALIZE(N, VEC1, VEC2, RESULTS)

    CALL VEC_SUM(N, VEC1, VEC2, RESULTS)
    !  The vectors VEC1 and VEC2 are sent to the GPU. The input transfers at this
    !   point are: 2 x ( 20 x sizeof(INTEGER)) = 2 x (20 x 4) = 160 B.

    !$OMP TASKWAIT
    !  At this point we copy out from the GPU the computed values of RESULTS
    !   and remove all the data from the GPU

    PRINT *, "PARTIAL RESULTS: ", RESULTS

    CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)
    !  The vectors VEC1 and RESULTS are sent to the GPU. The input transfers at this
    !   point are: 2 x ( 20 x sizeof(INTEGER)) = 2 x (20 x 4) = 160 B.

    !$OMP TASKWAIT
    !  At this point we copy out from the GPU the computed values of RESULTS
    !   and remove all the data from the GPU

    ! print the final vector's values
    PRINT *, "RESULTS: ", RESULTS
END PROGRAM P

!  Expected IN/OUT transfers:
!   IN = 320B
!   OUT = 160B

Compile with:

oclmfc -o test_2 test_2.f90 kernel_2.cl --ompss

3.2.1.3. Forcing data back using a task

This example is similar to the previous one (3.2.1.2), but instead of using a taskwait to force the copy back, we use a task with copies. Note that we are still assuming a write-back cache policy.

kernel_3.cl:

__kernel void vec_sum(int n, __global int* a, __global int* b, __global int* res)
{
    const int idx = get_global_id(0);

    if (idx < n) res[idx] = a[idx] + b[idx];
}

test_3.f90:

! NOTE: Assuming write-back cache policy

SUBROUTINE INITIALIZE(N, VEC1, VEC2, RESULTS)
    IMPLICIT NONE
    INTEGER :: N
    INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I
    DO I=1,N
        VEC1(I) = I
        VEC2(I) = N+1-I
        RESULTS(I) = -1
    END DO
END SUBROUTINE INITIALIZE

PROGRAM P
    IMPLICIT NONE
    INTERFACE
        !$OMP TARGET DEVICE(OPENCL) NDRANGE(1, N, 128) FILE(kernel_3.cl) COPY_DEPS
        !$OMP TASK IN(A, B) OUT(RES)
        SUBROUTINE VEC_SUM(N, A, B, RES)
            IMPLICIT NONE
            INTEGER, VALUE :: N
            INTEGER :: A(N), B(N), RES(N)
        END SUBROUTINE VEC_SUM

        !$OMP TARGET DEVICE(SMP) COPY_DEPS
        !$OMP TASK IN(BUFF)
        SUBROUTINE PRINT_BUFF(N, BUFF)
            IMPLICIT NONE
            INTEGER, VALUE :: N
            INTEGER :: BUFF(N)
        END SUBROUTINE PRINT_BUFF
    END INTERFACE

    INTEGER, PARAMETER :: N  = 20
    INTEGER :: VEC1(N), VEC2(N), RESULTS(N), I

    CALL INITIALIZE(N, VEC1, VEC2, RESULTS)

    CALL VEC_SUM(N, VEC1, VEC2, RESULTS)
    !  The vectors VEC1 and VEC2 are sent to the GPU. The input transfers at this
    !   point are: 2 x ( 20 x sizeof(INTEGER)) = 2 x (20 x 4) = 160 B.

    CALL PRINT_BUFF(N, RESULTS)
    !  The vector RESULTS is copied from the GPU to the CPU. The copy of this vector in
    !   the memory of the GPU is not removed because the task 'PRINT_BUFF' does not modify it.
    !   Output transfers: 80B.
    !   VEC1 and VEC2 are still in the GPU.

    CALL VEC_SUM(N, VEC1, RESULTS, RESULTS)
    !  The vectors VEC1 and RESULTS are already in the GPU. Do not copy anything.

    CALL PRINT_BUFF(N, RESULTS)
    !  The vector RESULTS is copied from the GPU to the CPU. The copy of this vector in
    !   the memory of the GPU is not removed because the task 'PRINT_BUFF' does not modify it.
    !   Output transfers: 80B.
    !   VEC1 and VEC2 are still in the GPU.

    !$OMP TASKWAIT
    !  At this point we remove all the data from the GPU. The correct values of the vector
    !   RESULTS are already in the memory of the CPU, so we don't need to copy anything from the GPU.

END PROGRAM P

SUBROUTINE PRINT_BUFF(N, BUFF)
    IMPLICIT NONE
    INTEGER, VALUE :: N
    INTEGER :: BUFF(N)

    PRINT *, "BUFF: ", BUFF
END SUBROUTINE PRINT_BUFF

!  Expected IN/OUT transfers:
!   IN = 160B
!   OUT = 160B

Compile with:

oclmfc -o test_3 test_3.f90 kernel_3.cl --ompss

3.2.2. Application’s kernels

3.2.2.1. BlackScholes

This benchmark computes the pricing of European-style options. Its kernel has six input arrays and a single output array. Offloading is done by means of the following code:

for (i = 0; i < array_size; i += chunk_size) {
    int elements;
    unsigned int *cpf;

    elements = min(i + chunk_size, array_size) - i;
    cpf = cpflag;

    #pragma omp target device(cuda) copy_in( \
                         cpf   [i;elements], \
                         S0    [i;elements], \
                         K     [i;elements], \
                         r     [i;elements], \
                         sigma [i;elements], \
                         T     [i;elements]) \
                     copy_out(answer[i;elements])
    #pragma omp task firstprivate(local_work_group_size, i)
    {
        dim3 dimBlock(local_work_group_size, 1, 1);
        dim3 dimGrid(elements / local_work_group_size, 1, 1);
        cuda_bsop <<<dimGrid, dimBlock>>> (&cpf[i], &S0[i], &K[i],
                                           &r[i], &sigma[i], &T[i], &answer[i]);
    }
}
#pragma omp taskwait

The following image shows graphically the annotations used to offload tasks to the available GPUs. Data arrays annotated with the copy_in clause are automatically transferred by the Nanos++ runtime system to the GPU global memory. After the CUDA kernel has been executed, the copy_out clause tells the runtime system that the results written by the GPU to the output array must be synchronized back to host memory. This happens at the latest when the host program encounters the taskwait directive.

../../_images/blackscholes_solution.png

3.2.2.2. Perlin Noise

This benchmark generates an image consisting of noise, which is useful in gaming applications to provide realistic effects. The application has a single output array holding the generated image. The annotations are shown here:

for (j = 0; j < img_height; j+=BS) {
  // Each task writes BS rows of the image
  #pragma omp target device(cuda) copy_deps
  #pragma omp task output (output[j*rowstride:(j+BS)*rowstride-1])
  {
    dim3 dimBlock;
    dim3 dimGrid;
    dimBlock.x = (img_width < BSx) ? img_width : BSx;
    dimBlock.y = (BS < BSy) ? BS : BSy;
    dimBlock.z = 1;
    dimGrid.x = img_width/dimBlock.x;
    dimGrid.y = BS/dimBlock.y;
    dimGrid.z = 1;

    cuda_perlin <<<dimGrid, dimBlock>>> (&output[j*rowstride], time, j, rowstride);
  }
}
#pragma omp taskwait noflush

In this example, the noflush clause suppresses the data synchronization normally implied by the taskwait directive. This is useful when the programmer knows that the next task accessing this result will also be executed on the GPUs and the host program does not need to access it. In this case, the runtime system still ensures that the data remains consistent across GPUs.
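
As an illustration only, a second stage of GPU tasks could reuse the same output regions right after the noflush taskwait. The sketch below continues the example above and uses the same variables; cuda_blur is a hypothetical kernel and is not part of the Perlin Noise benchmark:

// Hypothetical follow-up GPU stage reusing 'output' directly from device memory.
for (j = 0; j < img_height; j += BS) {
  #pragma omp target device(cuda) copy_deps
  #pragma omp task inout (output[j*rowstride:(j+BS)*rowstride-1])
  {
    dim3 dimBlock;
    dim3 dimGrid;
    dimBlock.x = (img_width < BSx) ? img_width : BSx;
    dimBlock.y = (BS < BSy) ? BS : BSy;
    dimBlock.z = 1;
    dimGrid.x = img_width/dimBlock.x;
    dimGrid.y = BS/dimBlock.y;
    dimGrid.z = 1;

    // The regions produced by the previous tasks are still valid on the GPUs,
    // so no host-to-device transfer is needed here.
    cuda_blur <<<dimGrid, dimBlock>>> (&output[j*rowstride], j, rowstride);
  }
}
#pragma omp taskwait
// This taskwait has no noflush clause, so 'output' is finally synchronized
// back to host memory at this point.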

The following image shows a graphical representation of the data and the way the annotations split it across tasks.

../../_images/perlin_noise_solution.png

3.2.2.3. N-Body

This benchmark computes the gravitational forces among a set of particles. It works with an input array (this_particle_array) and an output array (output_array). The masses, velocities, and positions of the particles are kept up to date alternately in each array by means of a pointer exchange (a sketch of this exchange is shown after the code). The annotated code is shown here:

void Particle_array_calculate_forces_cuda ( int number_of_particles,
              Particle this_particle_array[number_of_particles],
              Particle output_array[number_of_particles],
              float time_interval )
{
  const int bs = number_of_particles/8;
  size_t num_threads, num_blocks;
  num_threads = ((number_of_particles < MAX_NUM_THREADS) ?
                  number_of_particles : MAX_NUM_THREADS);
  num_blocks = (number_of_particles + MAX_NUM_THREADS) / MAX_NUM_THREADS;
  #pragma omp target device(cuda) copy_deps
  #pragma omp task output(output_array) input(this_particle_array)
    calculate_forces_kernel_naive <<< num_blocks, MAX_NUM_THREADS >>>
                             (time_interval, this_particle_array, number_of_particles,
                              &output_array[first_local], first_local, last_local);
  #pragma omp taskwait
}
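
The pointer exchange mentioned above is not shown in the excerpt. A minimal sketch of a driver loop that alternates the roles of the two particle arrays between time steps is given below; the simulate function and number_of_steps are illustrative names, and the velocity/position integration step of the benchmark is omitted:

/* Illustrative driver, not part of the benchmark: each time step reads from one
   array and writes into the other, then the pointers are exchanged. */
void simulate(int number_of_particles, Particle *array_a, Particle *array_b,
              float time_interval, int number_of_steps)
{
    Particle *input  = array_a;
    Particle *output = array_b;

    for (int step = 0; step < number_of_steps; step++) {
        /* Spawns a CUDA task reading 'input' and writing 'output', then waits. */
        Particle_array_calculate_forces_cuda(number_of_particles,
                                             input, output, time_interval);

        /* Exchange the pointers so the results become the next step's input. */
        Particle *tmp = input;
        input  = output;
        output = tmp;
    }
}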