# 5.3. Matrix Multiply¶

In this exercise we will work with the Matmul kernel. This algorithm is used to multiply two 2D-matrices (A, B) and store the result in a third one (C). Every matrix has the same size.

The approach can be either “naive”, doing the whole multiplication with one kernel, or tiled (blocked), where we can take advantage of multiple levels of parallelism progressively. Which mode is being used is controlled by the first argument of the executable:

• 1 to enable tiled mode

• empty or anything else for non-tiled mode. (Default behavior)

In case tiled mode is used, another optional integer parameter may be provided, specifying the divisions on each dimension to define the number of tiles. E.g. 4 will mean Dimension/4, hence 4x4 = 16 tiles, 8 will mean 64 tiles etc. Default value is 8, if no argument is given.

The matrices’ dimensions are controlled statically in the header file.

The tiled approach uses the transposed-B variant, to simplify access patterns for A and B, within the multiplication loops.

Exploiting heterogeneous parallelism with OmpSs-2 + devices

As the tile-decomposition allows for multiple independent subsets of the computation, this example is useful to demonstrate multi-level parallelism made easily possible with OmpSs-2 and devices: Each tile (1) of the product matrix requires a series (2) of sub-matrices multiplications. At the bottom level, each such multiplication can be executed in parallel on a GPU (3). In the above sentences, (1), (2) and (3) all present opportunities for parallelism of different sorts.

We suggest invoking a task for each tile of the product.

Then each such task may create a series of children tasks, that will execute the partial multiplications on the device. This way, we can keep the device oversubscribed, as multiple host threads will be launching device tasks asynchronously, and play around with parameters like number of tiles to observe how they affect performance.

Goals of the exercise
• Familiarize with the tiled multiplication approach and detect independent parts for parallelism