5.6. I am trying to use regions, but tasks are serialised and I think they should not¶
The two available region plugins have an important limitation: you have to align memory to a power of 2. This is an optimisation trade-off (that might be lifted in the distant future).
If you have two matrices of size \(1024 \times 1024\) (32-bit floats), and you want to work with blocks of \(128 \times 128\) elements, you must align to \(sizeof(float[1024][1024])\). The block size does not affect us in this case.
You must carefully choose a way to align memory, depending on your system. We have found problems with static alignment, while posix_memalign usually works for us.
To see if there is, effectively, aliasing, you may want to run you program with the graph instrumentation plugin (Task Dependency Graph). If you see lots of anti/output dependencies, you are likely not aligning your memory properly.
Alignment rules: For every dimension, you must ensure that a rule is true:
Where \(a\) is an address, \(b_i\) is the block size in bytes in dimension \(i\), \(d_j\) the size in number of elements in dimension \(j\).
- For the following example of 2 dimensions:
- The address of a block module the block width must be equal to 0.
- The address of a block module the total width multiplied by the block height must be also 0.
In some cases, if your application uses overlapping blocks with halos, you’ll need to make sure the start position of each block is aligned. Consider the next matrix (6x6):
0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 1 | 2 | 2 | 0 |
0 | 1 | 1 | 2 | 2 | 0 |
0 | 3 | 3 | 4 | 4 | 0 |
0 | 3 | 3 | 4 | 4 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
There are 4 blocks (1, 2, 3 and 4) of 4 elements each. If the application has a halo as an input, and the block as an output dependency, it would look like this for the first block:
X | I | I | X | X | X |
I | O | O | I | X | X |
I | O | O | I | X | X |
X | I | I | X | X | X |
X | X | X | X | X | X |
X | X | X | X | X | X |
X, I, O denotes no dependency, input and output, respectively.
A first approach would be aligning to the next power of 2 of \(6\times 6\times sizeof(float)\), but it is not aligned to a power of 2.
Besides that, the blocks are not properly aligned. The first block (1,1) would be in address \(7\times sizeof(float) = 28\) (0x1C). \(28\mod (2\times sizeof(float))\) is 4, not a power of 2.
To solve this, you must pad the data. In this case, we need to allocate an 8x8 matrix and pad data the following way (although we only need 6 rows to comply with the alignment rules):
P | P | P | P | P | P | P | P |
P | 0 | 0 | 0 | 0 | 0 | 0 | P |
P | 0 | 1 | 1 | 2 | 2 | 0 | P |
P | 0 | 1 | 1 | 2 | 2 | 0 | P |
P | 0 | 3 | 3 | 4 | 4 | 0 | P |
P | 0 | 3 | 3 | 4 | 4 | 0 | P |
P | 0 | 0 | 0 | 0 | 0 | 0 | P |
P | P | P | P | P | P | P | P |
P is the padded data. It does not need to be initialised (thus some those pages will not be used).
Then, for the first block the dependencies will be:
X | X | X | X | X | X | X | X |
X | X | I | I | X | X | X | X |
X | I | O | O | I | X | X | X |
X | I | O | O | I | X | X | X |
X | X | I | I | X | X | X | X |
X | X | X | X | X | X | X | X |
X | X | X | X | X | X | X | X |
X | X | X | X | X | X | X | X |
The address of the output block (2,2) is \((8\times 2+2)\times sizeof(float) = 72\). Its mod is 0, so the rule is satisfied for the first dimension.
For the second rule, \(72\mod( 8\times (2\times sizeof(float))) = 8\). Now we apply the first rule for 8 as the address, and its module is 0. It will now work.