2.2. Compile OmpSs programs
Once OmpSs is installed, you can start compiling OmpSs programs. To do so, use the Mercurium compiler (which you installed as described in Installation of Mercurium C/C++/Fortran source-to-source compiler).
See sections Compiling an OmpSs program with CUDA tasks and Compiling an OmpSs program with OpenCL tasks for more information.
2.2.1. Drivers
The list of available drivers can be found here: https://github.com/bsc-pm/mcxx/blob/master/doc/md_pages/profiles.md
2.2.1.1. Common flags accepted by Mercurium drivers
Usual flags like -O, -O1, -O2, -O3, -D, -c, -o, … are recognized by Mercurium.
Almost every Mercurium-specific flag is of the form --xxx.
Mercurium drivers are deliberately compatible with gcc. This means that flags of the form -fXXX, -mXXX and -Wxxx are accepted and passed on to the backend compiler without being interpreted by the Mercurium drivers.
Warning
In GCC, a flag of the form -fXXX is equivalent to a flag of the form --XXX. This is not the case in Mercurium.
2.2.1.2. Help
You can get a summary of all the flags accepted by Mercurium using --help with any of the drivers:
$ mcc --help
Usage: mcc options file [file..]
Options:
-h, --help Shows this help and quits
--version Shows version and quits
--v, --verbose Runs verbosely, displaying the programs
invoked by the compiler
...
2.2.1.3. Passing vendor-specific flags
While almost every gcc flag of the form -fXXX or -mXXX can be passed directly to a Mercurium driver, some other vendor-specific flags may be unknown to Mercurium or misunderstood by it. For these cases, Mercurium provides a generic way to pass parameters to the backend compiler and linker.
--Wn,<comma-separated-list-of-flags>
- Passes comma-separated flags to the native compiler. These flags are used when Mercurium invokes the backend compiler to generate the object file (.o).
--Wl,<comma-separated-list-of-flags>
- Passes comma-separated flags to the linker. These flags are used when Mercurium invokes the linker.
--Wp,<comma-separated-list-of-flags>
- Passes comma-separated flags to the C/Fortran preprocessor. These flags are used when Mercurium invokes the preprocessor on a C or Fortran file.
These flags can be repeated and combined: --Wp,a --Wp,b is equivalent to --Wp,a,b, and --Wnp,a is equivalent to --Wn,a --Wp,a.
Important
Do not confuse --Wl and --Wp with the similar gcc flags -Wl and -Wp (note that the gcc ones have a single -). The gcc flags can be passed through the Mercurium ones, as in --Wl,-Wl,muldefs. That said, Mercurium supports -Wl and -Wp directly, so -Wl,muldefs should be enough.
2.2.3. Compiling an OmpSs program with CUDA tasks
To compile an OmpSs + CUDA program, you must use one of the OmpSs/OmpSs-2 profiles (see Drivers) and the --cuda flag. Note that the Nanos++ installation that was provided to Mercurium must have been configured with CUDA (see Nanos++ configure flags).
An OmpSs program with CUDA tasks usually consists of a set of C/C++ files (usually .c or .cpp) and a set of CUDA files (.cu). You can use one of our profiles to compile the CUDA files, but the driver will simply invoke the NVIDIA CUDA compiler, nvcc. We discourage mixing C/C++ code and CUDA code in the same file when using Mercurium. While this was supported in the past, this approach is now deprecated. The recommended approach is to move the CUDA code to a .cu file.
Note
When calling kernels from C, make sure the kernels in the .cu file are declared extern "C". Otherwise, C++ name mangling of the kernel names will cause the link step to fail.
Warning
Fortran support for CUDA is experimental. Currently, you cannot pass a .cu file to Mercurium. Compile the .cu files separately using nvcc.
Consider the following CUDA kernels:
/* cuda-kernels.cu */
extern "C" { // We specify extern "C" because we will call them from a C code
__global__ void init(int n, int *x)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= n)
return;
x[i] = i;
}
__global__ void increment(int n, int *x)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= n)
return;
x[i]++;
}
} /* extern "C" */
You can use the following OmpSs program in C to call them:
/* cuda-test.c */
#include <stdio.h>
#include <assert.h>
#pragma omp target device(cuda) copy_deps ndrange(1, n, 1)
#pragma omp task out(x[0 : n-1])
__global__ void init(int n, int *x);
#pragma omp target device(cuda) copy_deps ndrange(1, n, 1)
#pragma omp task inout(x[0 : n-1])
__global__ void increment(int n, int *x);
int main(int argc, char *argv[])
{
int x[10];
init(10, x);
increment(10, x);
#pragma omp taskwait
int i;
for (i = 0; i < 10; i++)
{
fprintf(stderr, "x[%d] == %d\n", i, x[i]);
assert(x[i] == (i+1));
}
return 0;
}
Compile these files using mcc:
$ mcc -o cuda-test cuda-test.c cuda-kernels.cu --ompss --cuda
cuda-test.c:5: note: adding task function 'init'
cuda-test.c:9: note: adding task function 'increment'
cuda-test.c:16: note: call to task function 'init'
cuda-test.c:17: note: call to task function 'increment'
cudacc_cuda-test.cu: info: enabling experimental CUDA support
Note
You can also compile this example separately, as usual: compile each file using -c and then link the intermediate .o files.
Now you can run it:
$ ./cuda-test
x[0] == 1
x[1] == 2
x[2] == 3
x[3] == 4
x[4] == 5
x[5] == 6
x[6] == 7
x[7] == 8
x[8] == 9
x[9] == 10
MSG: [12] GPU 0 TRANSFER STATISTICS
MSG: [12] Total input transfers: 0 B
MSG: [12] Total output transfers: 0 B
MSG: [12] Total device transfers: 0 B
MSG: [13] GPU 1 TRANSFER STATISTICS
MSG: [13] Total input transfers: 0 B
MSG: [13] Total output transfers: 40 B
MSG: [13] Total device transfers: 0 B
Note that, due to the caching mechanism of Nanos++, only 40 bytes (10 integers) have been transferred from the GPU to the host. See Running OmpSs Programs for detailed information about options affecting the execution of OmpSs programs.
2.2.4. Compiling an OmpSs program with OpenCL tasks
To compile an OmpSs + OpenCL program, you must use one of the OmpSs/OmpSs-2 profiles (see Drivers) and the --opencl flag. Note that the Nanos++ installation that was provided to Mercurium must have been configured with OpenCL (see Nanos++ configure flags).
Consider this OpenCL file with kernels similar to the CUDA example above:
/* ocl_kernels.cl */
__kernel void init(int n, int __global * x)
{
int i = get_global_id(0);
if (i >= n)
return;
x[i] = i;
}
__kernel void increment(int n, int __global * x)
{
int i = get_global_id(0);
if (i >= n)
return;
x[i]++;
}
We can call these kernels from the following OmpSs program:
/* ocl_test.c */
#include <stdio.h>
#include <assert.h>
#pragma omp target device(opencl) copy_deps ndrange(1, n, 8) file(ocl_kernels.cl)
#pragma omp task out(x[0 : n-1])
void init(int n, int *x);
#pragma omp target device(opencl) copy_deps ndrange(1, n, 8) file(ocl_kernels.cl)
#pragma omp task inout(x[0 : n-1])
void increment(int n, int *x);
int main(int argc, char *argv[])
{
int x[10];
init(10, x);
increment(10, x);
#pragma omp taskwait
int i;
for (i = 0; i < 10; i++)
{
fprintf(stderr, "x[%d] == %d\n", i, x[i]);
assert(x[i] == (i+1));
}
return 0;
}
Note
OpenCL files referenced in file clauses must be passed to the driver. This also applies to separate compilation using -c.
Compile the OmpSs program using mcc:
$ mcc -o ocl_test ocl_test.c ocl_kernels.cl --ompss --opencl
ocl_test.c:5: note: adding task function 'init'
ocl_test.c:9: note: adding task function 'increment'
ocl_test.c:18: note: call to task function 'init'
ocl_test.c:19: note: call to task function 'increment'
Now you can execute it:
$ ./ocl_test
x[0] == 1
x[1] == 2
x[2] == 3
x[3] == 4
x[4] == 5
x[5] == 6
x[6] == 7
x[7] == 8
x[8] == 9
x[9] == 10
MSG: [12] OpenCL dev0 TRANSFER STATISTICS
MSG: [12] Total input transfers: 0 B
MSG: [12] Total output transfers: 0 B
MSG: [12] Total dev2dev(in) transfers: 0 B
MSG: [13] OpenCL dev1 TRANSFER STATISTICS
MSG: [13] Total input transfers: 0 B
MSG: [13] Total output transfers: 40 B
MSG: [13] Total dev2dev(in) transfers: 0 B
Again, due to the caching mechanism of Nanos++, only 40 bytes (10 integers) have been transferred from the OpenCL device (in the example above it is a GPU) to the host.
2.2.5. Problems during compilation
While we have put a lot of effort into making Mercurium a reasonably robust compiler, you may still encounter a bug or problem with it.
You may run into errors of several different kinds:
- Mercurium ends abnormally with an internal error telling you to open a ticket.
- Mercurium does not crash but gives an error on your input code and compilation stops, as if your code were not valid.
- Mercurium does not crash, but gives an error involving an internal-source.
- Mercurium generates wrong code and native compilation fails on an intermediate file.
- Mercurium forgets something in the generated code and linking fails.
2.2.5.1. How can you help us to solve the problem quicker?
To fix your problem, we need the preprocessed file.
If your program is C/C++, please do the following:
- Figure out the compilation command of the file that fails to compile. Make sure you can reproduce the problem using that compilation command alone.
- If your compilation command includes -c, replace it with -E. If it does not include -c, simply add -E.
- If your compilation command includes -o file (or -o file.o), replace it with -o file.ii. If it does not include -o, simply add -o file.ii.
- Now run the compiler with this modified compilation command. It should generate file.ii.
- These files are usually very large. Please compress them with gzip (or bzip2 or any similar tool).
Send us an email to pm-tools@bsc.es with the error message you are experiencing and the (compressed) preprocessed file attached.
If your program is written in Fortran, the input file alone may be enough, but you may also have to send all the INCLUDEd files and modules.