3.3.1. Scheduling policies¶
The scheduling policy defines how ready tasks are executed. Ready tasks are those whose dependencies have already been satisfied, so their execution can start immediately. The scheduling policy decides the order in which tasks are executed and the resource where each task will run.
When you inject a dependent task into the runtime system, you are actually inserting it into the dependency task graph. Once all dependencies of a given task are fulfilled, the task is inserted into a ready queue.
Before taking any scheduling criteria into account, we must consider four different scheduler modifiers:
- Throttling policy at task creation. Just before creating a new task, the runtime system determines (according to a throttling policy) whether to create it and push it into the scheduler system, or to create only a minimal description of the task and execute it right away in the current task context. The throttle mechanism can be configured as a completely independent plugin using the throttle option. See Throttling policies.
- Task Priority. When inserting a new task into a priority queue, if the new task has the same priority as another task already in the queue, the new one is inserted before or after the existing one (according to the FIFO/LIFO behaviour between tasks with the same priority). The priority is a number >= 0. Given two tasks with priorities A and B, where A > B, the task with priority A will be fetched from the queue earlier than the one with priority B. Priority also accumulates from parent to children: when a task T with priority A creates a task Tc that was given priority B by the user, the priority of Tc is added to that of its parent, so the priority of Tc becomes A + B. The smart priority behaviour also propagates the priority to the immediately preceding tasks (when tasks have dependencies). Ready task priority queues are only available in some schedulers. Check the specific scheduler parameters and/or description for more details.
- Immediate successor. Releasing the last dependency when exiting a task: when a task is in the dependency graph and its last immediate predecessor finishes execution, the runtime can run the blocked task immediately instead of adding it to the ready queue. Sometimes immediate successor is not a configurable option. Check the specific scheduler description and/or parameters for more details.
- Parent continuation when the last child has been executed. When executing nested OpenMP/OmpSs applications (explicit tasks creating more explicit tasks) it may happen that a given task becomes blocked due to the execution of a taskwait or taskwait on directive. When the last child finishes its execution, the parent is promptly resumed instead of being pushed back into the ready queue (which could potentially delay its execution). Parent continuation is only available in the wf scheduler. Check the specific scheduler description for more details.
When a thread has no work assigned, or when it is waiting until a certain condition is met (e.g. all of a task's children complete in a taskwait), it is in the idle loop or in a conditional wait, respectively. Users can also specify the behaviour of these thread duties through the following options (for more info see Thread Manager):
--checks=<n> Set the number of checks in a conditional wait loop before spinning on the ready task pool (default = 1).
--spins=<n> Set the number of spin iterations before yielding in the idle loop and conditional wait loop (default = 1).
--enable-yield Enable thread yielding in the idle loop and conditional wait loop (disabled by default).
--yields=<n> Set the number of yields before blocking in the idle loop and conditional wait loop (default = 1).
--enable-block Enable thread blocking in the idle loop and conditional wait loop (disabled by default).
A complete idle loop cycle starts by spinning as many times as the spins parameter specifies. After spinning, if yield is enabled, the thread yields once and then starts spinning again. The whole spin/yield process is repeated as many times as the yields option specifies and, finally, if thread blocking is enabled, the thread blocks. Yield and block cannot happen in the same cycle.
The conditional wait loop starts by checking the wait condition as many times as the checks parameter specifies. After that number of checks it starts an idle loop cycle (described in the previous paragraph).
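For example, passing these options through the NX_ARGS environment variable (described below), a run that spins 200 times per cycle, yields up to 10 times and finally blocks could be requested as follows (the values are illustrative only):

$ export NX_ARGS="--spins=200 --enable-yield --yields=10 --enable-block"
$ ./myProgram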
The scheduler plug-in can be specified by the NX_SCHEDULE environment variable or by the --schedule option included in the NX_ARGS environment variable:
$ export NX_SCHEDULE=<module>
$ ./myProgram

$ export NX_ARGS="--schedule[ |=]<module> ..."
$ ./myProgram
--schedule=<module> Set the scheduling policy to be used during the execution. The argument module can be one of the following options: bf, dbf, wf, socket, affinity, affinity-smartpriority or versioning. The most suitable scheduling policy can depend on the application and architecture used.
Nanos++ implements several schedule plug-ins. The following sections will show a description of the available plug-ins with respect to scheduling policies. Some of these modules also incorporate specific options.
3.3.1.1. Breadth First¶
Important
This is the default scheduler
Configuration string:
NX_SCHEDULE=bf
orNX_ARGS="--schedule=bf"
Description: This scheduling policy implements only a single/global ready queue. When a task is created with no dependences (or when a task becomes ready after all its dependences have been fulfilled) it is placed in this ready queue. The ready queue is ordered following a FIFO (First In, First Out) algorithm by default, but this can be changed through parameters. Breadth first implements the immediate successor mechanism by default (if not in conflict with priority).
Parameters:
--bf-stack, --no-bf-stack Retrieve from the queue's back or the queue's front, respectively. Using the --bf-stack parameter transforms the ready queue into a LIFO (Last In, First Out) structure. Effectively, this means that tasks are inserted at the queue's back and retrieved from the same position (the queue's back). (Default behaviour is --no-bf-stack.)
--bf-use-stack, --no-bf-use-stack This is an alias of the previous parameter.
--schedule-priority, --no-schedule-priority Use priority queues.
--schedule-smart-priority, --no-schedule-smart-priority Use smart priority queues.
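For instance, an illustrative combination of the parameters above, running breadth first as a LIFO with priority queues:

$ export NX_ARGS="--schedule=bf --bf-stack --schedule-priority"
$ ./myProgram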
3.3.1.2. Distributed Breadth First¶
Configuration string:
NX_SCHEDULE=dbf
orNX_ARGS="--schedule=dbf"
Description: This is a breadth first algorithm implemented with thread-local queues instead of a single/global ready queue. Each thread inserts the tasks it creates into its own ready queue following a FIFO (First In, First Out) policy, inserting at the queue's front and retrieving from the queue's back. If the thread-local ready queue is empty, the thread tries to execute the current task's parent (if it is queued in any other thread's ready queue). If the parent task is not eligible for execution, it steals from the next thread's ready queue. Stealing retrieves tasks from the queue's front (i.e. the opposite side from local retrieval).
--schedule-priority, --no-schedule-priority Use priority queues.
--schedule-smart-priority, --no-schedule-smart-priority Use smart priority queues.
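As with breadth first, priority queues can be requested when selecting this policy (illustrative only):

$ export NX_ARGS="--schedule=dbf --schedule-priority"
$ ./myProgram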
3.3.1.3. Work First¶
Configuration string:
NX_SCHEDULE=wf
orNX_ARGS="--schedule=wf"
Description: This scheduling policy implements a local ready queue per thread. When a task is created, the thread continues with the newly created task, leaving the current task (the creator) in the current thread's ready queue. The default behaviour is FIFO access to the local queue, LIFO access on steal, and stealing the parent if available, which is equivalent to the Cilk scheduler behaviour.
Parameters:
--wf-steal-parent, --no-wf-steal-parent If the local ready queue is empty, try to steal the parent task if available (default: wf-steal-parent).
--wf-local-policy=<string> Defines queue access for the local ready queue. The configuration string can be:
FIFO: First In, First Out queue access (default).
LIFO: Last In, First Out queue access.
--wf-steal-policy=<string> Defines queue access for stealing. The configuration string can be:
FIFO: First In, First Out queue access (default).
LIFO: Last In, First Out queue access.
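An illustrative work first configuration that uses LIFO access when stealing and does not fall back to the parent task:

$ export NX_ARGS="--schedule=wf --wf-steal-policy=LIFO --no-wf-steal-parent"
$ ./myProgram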
3.3.1.4. Socket-aware scheduler¶
Configuration string:
NX_SCHEDULE=socket
orNX_ARGS="--schedule=socket"
Description: This scheduler assigns top level tasks (depth 1) to a NUMA node set by the user before task creation, while nested tasks run on the same node as their parent. To do that, the user must call the nanos_current_socket function before executing tasks to set the NUMA node the task will be assigned to. The queues are sorted by priority, and there are as many queues as NUMA nodes specified (see the num-sockets parameter). Besides that, changing the binding start and stride is not supported. Work stealing is optional; by default, work stealing of child tasks is enabled. Upon initialisation, the policy creates lists of valid nodes to steal from; for each node, the policy will only steal from the closest nodes. Use numactl --hardware to print the distance matrix that the policy will use. For each node, a pointer to the next node to steal from is kept; if a node steals a task, it does not affect where other nodes will steal. There are also options to steal from random nodes and to steal top level tasks instead of child tasks.
Parameters: Important parameters include the number of sockets and the number of cores per socket. If Nanos++ is linked against the hwloc library, it will be used to obtain that information automatically. Besides that, if the number of cores per socket, the number of sockets and the number of processing elements are inconsistent, the number of sockets will be adjusted.
--num-sockets Sets the number of NUMA nodes.
--cores-per-socket Sets the number of hardware threads per NUMA node.
--socket-steal, --no-socket-steal Enable work stealing from the ready queues of other NUMA nodes (enabled by default).
--socket-random-steal, --no-socket-random-steal Instead of round robin stealing, steal from random queues.
--socket-steal-parents, --no-socket-steal-parents Steal depth 1 tasks instead of child tasks (disabled by default).
--socket-steal-spin Number of spins before attempting work stealing.
--socket-smartpriority, --no-socket-smartpriority Enable smart priority propagation (disabled by default).
--socket-immediate, --no-socket-immediate Use getImmediateSuccessor when prefetching (disabled by default). Note: this is not completely implemented (atBeforeExit).
--socket-steal-low-priority, --no-socket-steal-low-priority Steal low priority tasks from the other nodes’ queues. Note: this is currently broken.
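A sketch of a possible configuration for a machine with 2 NUMA nodes and 8 cores per node (the numbers are illustrative; when Nanos++ is linked against hwloc they may be detected automatically). Remember that the application itself must also call nanos_current_socket before creating depth-1 tasks, as described above:

$ export NX_ARGS="--schedule=socket --num-sockets=2 --cores-per-socket=8"
$ ./myProgram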
3.3.1.5. Affinity¶
Warning
This policy has been discontinued in the master development branch, but it is still used in the cluster development branch.
- Configuration string:
NX_SCHEDULE=affinity
orNX_ARGS="--schedule=affinity"
- Description: Takes into account where the data used by a task is located. Meaningful only on Multiple Address Space architectures (or SMP NUMA). Affinity implements immediate successor by default.
3.3.1.6. Affinity Smart Priority¶
Warning
This policy has been discontinued in the master development branch, but it is still used in the cluster development branch.
- Configuration string:
NX_SCHEDULE=affinity-smartpriority
orNX_ARGS="--schedule=affinity-smart-priority"
- Description: Affinity policy with smart priority support. It works exactly like the affinity policy (same number of queues, and it also has work stealing) but uses sorted priority queues with priority propagation to the immediately preceding tasks. When inserting a new task into the queue, if the new task has the same priority as another task already in the queue, the new one is inserted after the existing one (note that, unlike in the priority scheduling policy, there is no option to change this behaviour). Affinity implements immediate successor by default.
3.3.1.7. Versioning¶
Configuration string:
NX_SCHEDULE=versioning
orNX_ARGS="--schedule=versioning"
Description: This scheduler can handle multiple implementations of the same task and gives support to the implements clause. The scheduler automatically profiles each task implementation and chooses the most suitable one each time the task must be run. Each thread has its own task queue, where task implementations are added depending on the scheduler's decisions. The main criterion for assigning a task to a thread is that the thread must be the earliest executor of that task (that is, the thread that will finish the task's execution at the earliest time). In some cases (when the number of tasks in the graph is big enough), idle threads are also allowed to run task implementations even though they are not the fastest executors. Specifically, the profile for each task implementation includes its execution time and the data set size it uses, and it is used to estimate the behaviour of future executions of the same implementation. Versioning implements immediate successor by default.

Parameters:
--versioning-stack, --no-versioning-stack Retrieve from the queue's back or the queue's front, respectively. Using the --versioning-stack parameter transforms the ready queue into a LIFO (Last In, First Out) structure. Effectively, this means that tasks are inserted at the queue's back and retrieved from the same position (the queue's back). (Default behaviour is --no-versioning-stack.)
--versioning-use-stack, --no-versioning-use-stack This is an alias of the previous parameter.
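An illustrative invocation selecting the versioning scheduler with LIFO retrieval from the ready queue:

$ export NX_ARGS="--schedule=versioning --versioning-stack"
$ ./myProgram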
Note
A detailed explanation of this scheduler can be found in the following conference paper: Judit Planas, Rosa Badia, Eduard Ayguade, Jesus Labarta, "Self-Adaptive OmpSs Tasks in Heterogeneous Environments", Proceedings of the 27th IEEE International Parallel and Distributed Processing Symposium (IEEE IPDPS), 2013.
3.3.1.8. Bottom level-aware scheduler¶
Configuration string:
NX_SCHEDULE=botlev
orNX_ARGS="--schedule=botlev"
Description: This scheduler targets single-ISA heterogeneous machines with two kinds of cores (fast and slow, such as ARM big.LITTLE). The scheduler dynamically detects the longest path of the task dependency graph and assigns the tasks that belong to this path (critical tasks) to the fast cores of the system. The detection of the longest path is based on the computation and use of bottom-level longest-path priorities, that is, the length of the longest dependency chain from each node to a leaf node. There are two queues for the ready tasks, one per processor kind: fast cores retrieve tasks from their own queue and slow cores from the other. The queues are sorted according to the task priority (bottom level). Work stealing is enabled by default for fast cores: a fast core steals a task from the slow cores' queue if it is idle. Optionally, work stealing can be performed in both directions if the parameter NX_STEALB is enabled. The policy can be flexible or strict: the flexible policy considers more tasks as critical, while the strict one limits the number of critical tasks.
Parameters:
--update-freq or NX_BL_FREQ Sets the update frequency of the priorities. By default it is set to zero, which means the priorities are updated every time a new dependency occurs. It can be set to any positive integer value and affects the priorities of the tasks. Setting this parameter to a positive value makes the bottom levels of the tasks less accurate, but the policy has slightly less scheduling overhead.
--numSpins or NX_NUM_SPINS Sets the number of spins before attempting work stealing (by default it is set to 300).
--from or NX_HP_FROM Sets the thread id of the first fast core.
--to or NX_HP_TO Sets the thread id of the last fast core.
--strict or NX_STRICTB Defines whether the policy is strict or flexible (--strict=0 for flexible or --strict=1 for strict).
--steal or NX_STEALB Defines whether the policy uses uni- or bi-directional work stealing (--steal=0 for uni-directional or --steal=1 for bi-directional).
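For example, on a hypothetical big.LITTLE system where threads 0-3 run on the fast cores, a strict policy with bi-directional stealing could be configured as follows (the thread ids and values are illustrative only):

$ export NX_SCHEDULE=botlev
$ export NX_HP_FROM=0
$ export NX_HP_TO=3
$ export NX_STRICTB=1
$ export NX_STEALB=1
$ ./myProgram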