"Enabling Scalable and Fine-Grained Nested Parallelism on Embedded Many-cores", in Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015 IEEE 9th International Symposium on 23-25 Sept. 2015, Turin, Italy
Current high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a general-purpose host processor is coupled to a programmable manycore accelerator (PMCA). Such PMCAs typically leverage hierarchical interconnect and distributed memory with non-uniform access (NUMA). Nested parallelism is a convenient programming abstraction for large-scale cc-NUMA systems, which allows to hierarchically (and dynamically) create multiple levels of fine-grained parallelism whenever it is available. Available implementations for cc-NUMA systems introduce large overheads for nested parallelism management, which cannot be tolerated due to the extremely fine-grained nature of embedded parallel workloads. In particular, creating a team of parallel threads has a cost that increases linearly with the number of threads, which is inherently non scalable. This work presents a software cache mechanism for frequently-used parallel team configurations to speed up parallel thread creation overheads in PMCA systems. When a configuration is found in the cache the cost for parallel team creation has a constant time, providing a scalable mechanism. We evaluated our support on the STMicroelectronics STHORM many-core. Compared to the state-of-the art, our solution shows that: i) the cost for parallel team creation is reduced by up to 67%, ii) the tangible effect on real ultra-fine-grained parallel kernels is a speedup of up to 80%.