Browsing by Author "Prasad, B.M.P."

Now showing 1 - 3 of 3

Analysis of cache behaviour and software optimizations for faster on-chip network simulations
(2019) Prasad, B.M.P.; Parane, K.; Talawar, B.
Fast simulations are critical in reducing time to market in chip multiprocessors and system-on-chips. Several simulators have been used to evaluate the performance and power consumed by network-on-chips (NoCs). To speedup the simulations, it is necessary to investigate and optimize the hotspots in the simulator source code. Among several simulators available, Booksim2.0 has been chosen for the experimentation as it is being extensively used in the NoC community. In this paper, the cache and memory system behavior of Booksim2.0 have been analyzed to accurately monitor input dependent performance bottlenecks. The measurements show that cache and memory usage patterns vary widely based on the input parameters given to Booksim2.0. Based on these measurements, the cache configuration having the least misses has been identified. To further reduce the cache misses, software optimization techniques such as removal of unused functions, loop interchanging and replacing post-increment operator with pre-increment operator for non-primitive data types have been employed. The cache misses were reduced by 18.52%, 5.34% and 3.91% by employing above technology respectively. Thread parallelization and vectorization have been employed to improve the overall performance of Booksim2.0. The OpenMP programming model and SIMD are used for parallelizing and vectorizing the more time-consuming portions of Booksim2.0. Speedups of 2.93 and 3.97 were observed for the Mesh topology with 30 30 network size by employing thread parallelization and vectorization respectively. 2019, The Society for Reliability Engineering, Quality and Operations Management (SREQOM), India and The Division of Operation and Maintenance, Lulea University of Technology, Sweden.
Analysis of cache behaviour and software optimizations for faster on-chip network simulations
(Springer, 2019) Prasad, B.M.P.; Parane, K.; Talawar, B.
Fast simulations are critical in reducing time to market in chip multiprocessors and system-on-chips. Several simulators have been used to evaluate the performance and power consumed by network-on-chips (NoCs). To speedup the simulations, it is necessary to investigate and optimize the hotspots in the simulator source code. Among several simulators available, Booksim2.0 has been chosen for the experimentation as it is being extensively used in the NoC community. In this paper, the cache and memory system behavior of Booksim2.0 have been analyzed to accurately monitor input dependent performance bottlenecks. The measurements show that cache and memory usage patterns vary widely based on the input parameters given to Booksim2.0. Based on these measurements, the cache configuration having the least misses has been identified. To further reduce the cache misses, software optimization techniques such as removal of unused functions, loop interchanging and replacing post-increment operator with pre-increment operator for non-primitive data types have been employed. The cache misses were reduced by 18.52%, 5.34% and 3.91% by employing above technology respectively. Thread parallelization and vectorization have been employed to improve the overall performance of Booksim2.0. The OpenMP programming model and SIMD are used for parallelizing and vectorizing the more time-consuming portions of Booksim2.0. Speedups of 2.93× and 3.97× were observed for the Mesh topology with 30 × 30 network size by employing thread parallelization and vectorization respectively. © 2019, The Society for Reliability Engineering, Quality and Operations Management (SREQOM), India and The Division of Operation and Maintenance, Lulea University of Technology, Sweden.
FPGA friendly NoC simulation acceleration framework employing the hard blocks
(Springer, 2021) Prasad, B.M.P.; Parane, K.; Talawar, B.
A major role is played by Modeling and Simulation platforms in development of a new Network-on-Chip (NoC) architecture. The cycle accurate software simulators tend to become slow when simulating thousands of cores on a single chip. FPGAs have become the vehicle for simulation acceleration due to the properties of parallelism. Most of the state-of-the-art FPGA based NoC simulators utilize soft logic only for modeling the NoCs, leaving out the hard blocks to be unutilized. In this work, the FIFO Buffer and Crossbar switch functionalities of the NoC router have been embedded in the Block RAM (BRAMs) and the DSP48E1 slices with large multiplexer respectively. Employing the proposed techniques of mapping the NoC router components on the FPGA hard blocks, an NoC simulation acceleration framework based on the FPGA is presented in this work. A huge reduction in the use of the Configurable Logic Blocks (CLBs) has been observed when the FIFO buffer and Crossbar components of the NoC topology’s router micro-architecture are embedded in FPGA hard blocks. Our experimental results show that the topologies implemented employing the proposed FPGA friendly mapping of the NoC router components on the hard blocks consume 43.47% fewer LUTs and 41.66% fewer FFs than the topologies with CLB implementation. To optimize the latency of the NoC under consideration, a control unit called “buf_empty_checker” has been employed. A reduction in average latency has been observed compared to the CLB based topology implementation employing the proposed mapping. The proposed work consumes 10.88% fewer LUTs than the CONNECT NoC generation tool. Compared to DART, a reduction of 73.38% and 66.55% in LUTs and FFs has been observed with respect to the proposed work. The average packet latency of the proposed NoC architecture is 24.8% and 19.1% lesser than the CONNECT and DART architectures. © 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH, AT part of Springer Nature.