5.3. Matrix Multiply

In this exercise we will work with the Matmul kernel. This algorithm is used to multiply two 2D-matrices (A, B) and store the result in a third one (C). Every matrix has the same size.

The approach can be either “naive”, doing the whole multiplication with one kernel, or tiled (blocked), where we can take advantage of multiple levels of parallelism progressively. Which mode is being used is controlled by the first argument of the executable:

  • 1 to enable tiled mode

  • empty or anything else for non-tiled mode. (Default behavior)

In case tiled mode is used, another optional integer parameter may be provided, specifying the divisions on each dimension to define the number of tiles. E.g. 4 will mean Dimension/4, hence 4x4 = 16 tiles, 8 will mean 64 tiles etc. Default value is 8, if no argument is given.

The matrices’ dimensions are controlled statically in the header file.

The tiled approach uses the transposed-B variant, to simplify access patterns for A and B, within the multiplication loops.

Exploiting heterogeneous parallelism with OmpSs-2 + devices

As the tile-decomposition allows for multiple independent subsets of the computation, this example is useful to demonstrate multi-level parallelism made easily possible with OmpSs-2 and devices: Each tile (1) of the product matrix requires a series (2) of sub-matrices multiplications. At the bottom level, each such multiplication can be executed in parallel on a GPU (3). In the above sentences, (1), (2) and (3) all present opportunities for parallelism of different sorts.

We suggest invoking a task for each tile of the product.

Then each such task may create a series of children tasks, that will execute the partial multiplications on the device. This way, we can keep the device oversubscribed, as multiple host threads will be launching device tasks asynchronously, and play around with parameters like number of tiles to observe how they affect performance.

Goals of the exercise
  • Familiarize with the tiled multiplication approach and detect independent parts for parallelism

  • Identify where OmpSs-2 task directives can be placed (the non-tiled matmul is already taskified)

  • Try annotating the tiled computational kernel (mmul_bt_kernel) as an OpenACC device task

Finally, there is also a more advanced challenge: If you choose to launch multiple device tasks concurrently within the same product tile (step 3 above), you have to prevent data races for modifying the product matrix. This will require you to implement manual privatization of the C tile and a manual “reduction” after computations are finished.

In ‘.setup/solved’ folder you will find the completely modified code with manual reductions, that launches multiple device kernels per thread, but we encourage you to try implementing it on your own, and let us know if you come up with a better implementation!