Implement a more asynchronous flow for worker threads
The current scheme is too constrained for CudaThreads, and probably for ClusterThreads as well. A more event-based scheme is needed to overlap work better.