Faculty Publications

Permanent URI for this community: https://idr.nitk.ac.in/handle/123456789/18736

Publications by NITK Faculty

New sparse matrix storage format to improve the performance of total SPMV time
(2012) Bayyapu, B.; Raghavendra, S.R.; Guddeti, G.
Graphics Processing Units (GPUs) are massively data-parallel processors. High performance on data-parallel processors such as GPUs comes only at the cost of identifying data parallelism in the application. This is straightforward for applications that have regular memory access and high computation intensity. GPUs are equally attractive for sparse matrix vector multiplication (SPMV for short), which has irregular memory access. SPMV is an important computation in most scientific and engineering applications, and scaling the performance, bandwidth utilization and compute intensity (ratio of computation to data access) of SPMV is a priority in both academia and industry. Various data structures and access patterns have been proposed for sparse matrix representation on GPUs, and optimizing and improving these data structures is a continuous effort. This paper proposes a new format for sparse matrix representation that reduces the data organization time and the memory transfer time from CPU to GPU for the memory-bound SPMV computation. The BLSI (Bit Level Single Indexing) sparse matrix representation is up to 204% faster than COO (Co-ordinate), 104% faster than CSR (Compressed Sparse Row) and 217% faster than HYB (Hybrid) formats in memory transfer time from CPU to GPU. The proposed sparse matrix format is implemented in CUDA-C on CUDA (Compute Unified Device Architecture) supported NVIDIA graphics cards. © 2012 SCPE.
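As a rough illustration of the baseline formats the abstract above compares against, the following sketch builds COO and CSR storage for one small matrix and runs a sequential CSR SPMV. It is a minimal sketch under stated assumptions: the function names and the example matrix are hypothetical, the code runs on the CPU in plain Python, and it does not implement the BLSI format, whose layout the abstract does not describe.

```python
# Illustrative only: COO and CSR storage for the same matrix, plus a CSR SpMV.
# All names are hypothetical; this is NOT the BLSI format from the paper.

def dense_to_coo(dense):
    """Return (rows, cols, vals) for every nonzero entry, in row-major order."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals


def coo_to_csr(n_rows, rows, cols, vals):
    """Compress the row index array into row pointers (input must be row-sorted)."""
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1          # count nonzeros per row
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]  # prefix sum gives row start offsets
    return row_ptr, list(cols), list(vals)


def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A * x. Rows are independent, so a GPU could assign one per thread."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]  # irregular access into x
        y.append(acc)
    return y


dense = [
    [4, 0, 0, 1],
    [0, 0, 2, 0],
    [0, 3, 0, 0],
    [5, 0, 0, 6],
]
rows, cols, vals = dense_to_coo(dense)
row_ptr, col_idx, csr_vals = coo_to_csr(len(dense), rows, cols, vals)
y = spmv_csr(row_ptr, col_idx, csr_vals, [1, 2, 3, 4])

# Number of index entries that would accompany the values in a host-to-device copy.
coo_index_count = len(rows) + len(cols)
csr_index_count = len(row_ptr) + len(col_idx)
```

The index counts at the end hint at why the choice of format affects CPU-to-GPU transfer time: the nonzero values are the same in both formats, but the amount of index data that travels with them differs.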
Communication and computation optimization of concurrent kernels using kernel coalesce on a GPU
(John Wiley and Sons Ltd, 2015) Bayyapu, B.; Guddeti, R.M.R.; Raghavendra, P.S.
General purpose computation on graphics processing units (GPUs) is rapidly entering various scientific and engineering fields. Many applications are being ported onto GPUs for better performance. Various optimizations, frameworks, and tools are being developed for effective programming of GPUs. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called kernel coalesce that further enhances GPU performance and also optimizes CPU to GPU communication time. With the kernel coalesce methods proposed in this paper, kernel launch overheads are reduced by coalescing the concurrent kernels, and data transfers are reduced in the case of intermediate data generated and used among kernels. Computation optimization on a device (GPU) is performed by tuning the number of blocks and threads launched to the architecture. The block level kernel coalesce method resulted in prominent performance improvement on a device without support for concurrent kernels. The thread level kernel coalesce method is better than the block level kernel coalesce method when the design of the grid structure (i.e., number of blocks and threads) is not optimal for the device architecture, which leads to underutilization of the device resources. Both methods perform similarly when the number of threads per block is approximately the same in different kernels and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. The thread multi-clock cycle coalesce method can be chosen if the programmer wants to coalesce more than two concurrent kernels that together or individually exceed the thread capacity of the device. If the kernels have lightweight thread computations, the multi-clock cycle kernel coalesce method gives better performance than the thread and block level kernel coalesce methods. If the kernels to be coalesced are a combination of compute intensive and memory intensive kernels, warp interleaving gives higher device occupancy and improves performance. The multi-clock cycle kernel coalesce method for micro-benchmark1 considered in this paper resulted in 10-40% and 80-92% improvement compared with separate kernel launch, without and with shared input and intermediate data among the kernels, respectively, on a Fermi architecture device, that is, GTX 470. A nearest neighbor (NN) kernel from the Rodinia benchmark is coalesced to itself using the thread level kernel coalesce method and warp interleaving, giving 131.9% and 152.3% improvement compared with separate kernel launch and 39.5% and 36.8% improvement compared with the block level kernel coalesce method, respectively. © 2014 John Wiley & Sons, Ltd.
MPI + OpenCL implementation of a phase-field method incorporating CALPHAD description of Gibbs energies on heterogeneous computing platforms
(Elsevier, 2015) Tennyson, P.G.; Karthik, G.M.; Gandham, G.
The phase-field method uses a non-conserved order parameter to define the phase state of a system and is a versatile method for moving boundary problems. It is a method of choice for simulating microstructure evolution in the domain of materials engineering. Solution of phase-field evolution equations avoids explicit tracking of interfaces and is often implemented on a structured grid to capture microstructure evolution in a simple and elegant manner. Restrictions on the grid size needed to accurately capture the interface curvature effects lead to a large number of grid points in the computational domain and render realistic 3D simulations computationally intensive. However, the availability of powerful heterogeneous computing platforms and super clusters makes it possible to perform large scale phase-field simulations efficiently. This paper discusses a portable implementation that extends simulations running across multiple CPUs with MPI to also use GPUs through OpenCL. The solution scheme adapts an isotropic stencil that avoids grid-induced anisotropy. Use of separate OpenCL kernels for problem specific portions of the code ensures that the approach can be extended to different problems. Performance analysis of the parallel strategies used in the study illustrates the massively parallel computing possibility for phase-field simulations across heterogeneous platforms. © 2014 Elsevier B.V. All rights reserved.
GPU implementation of non-local maximum likelihood estimation method for denoising magnetic resonance images
(Springer Verlag, 2017) Upadhya, A.H.K.; Talawar, B.; Rajan, J.
Magnetic resonance imaging (MRI) is a widely deployed medical imaging technique used for various applications such as neuroimaging, cardiovascular imaging and musculoskeletal imaging. However, MR images degrade in quality due to noise. Magnitude MRI data in the presence of noise generally follows a Rician distribution if acquired with single-coil systems. Among the methods proposed in the literature for denoising MR images corrupted with Rician noise, the non-local maximum likelihood (NLML) method and its variants are popular. Despite its performance and denoising quality, the NLML algorithm suffers from a high time complexity of O(m^3 N^3), where m^3 and N^3 represent the search window and image size, respectively, for a 3D image. This makes the algorithm challenging to deploy in real-time applications where fast and prompt results are required. A viable solution to this shortcoming is to apply a data parallel processing framework such as Nvidia CUDA to exploit the independent and computationally intensive calculations. The GPU-based implementation of NLML-based image denoising achieves significant speedup compared to the serial implementation. This research paper describes the first successful attempt to implement a GPU-accelerated version of the NLML algorithm. The main focus of the research was on the parallelization and acceleration of one computationally intensive section of the algorithm so as to demonstrate the execution time improvement through the application of parallel processing concepts on a GPU. Our results suggest the possibility of practical deployment of NLML and its variants for MRI denoising. © 2016, Springer-Verlag Berlin Heidelberg.
An efficient cost optimized scheduling for spot instances in heterogeneous cloud environment
(Elsevier B.V., 2018) Domanal, S.; Guddeti, G.
In this paper, we propose a novel, efficient and cost optimized scheduling algorithm for a Bag of Tasks (BoT) on Virtual Machines (VMs). Further, we use an artificial neural network to predict the future values of Spot instances and then validate these predicted values against the current (actual) values of Spot instances. On-Demand and Spot are the key instances procured by cloud customers, and hence we use these instances for the cost optimization. The key idea of our proposed algorithm is to efficiently utilize the cloud resources (mainly VM instances, Central Processing Unit (CPU) and memory) and also to optimize the cost of executing the BoT in a heterogeneous Infrastructure as a Service (IaaS) based cloud environment. Experimental results demonstrate that our proposed scheduling algorithm outperforms state-of-the-art benchmark algorithms (Round Robin, First Come First Serve, Ant Colony Optimization, Genetic Algorithm, etc.) in terms of Quality of Service (QoS) parameters (Reliability, Time and Cost) while executing the BoT in the heterogeneous cloud environment. Since the obtained results are ordinal, we carried out the statistical analysis on both predicted and actual Spot instances using Spearman's Rho test. © 2018 Elsevier B.V.
Parallel iterative hill climbing algorithm to solve TSP on GPU
(John Wiley and Sons Ltd, 2019) Yelmewad, P.; Talawar, B.
The Traveling Salesman Problem (TSP) is an NP-hard combinatorial optimization problem. Heuristic algorithms provide satisfactory solutions to large TSP instances in a reasonable amount of time. However, heuristic methods result in suboptimal solutions as they do not cover the search space adequately. Sequential heuristic approaches spend significant CPU time in neighborhood generation for large input instances. Neighborhood generation time can be reduced by generating neighbors in parallel. GPUs have been shown to be effective in exploiting data and memory level parallelism in large complex problems. This work presents a GPU-based Parallel Iterative Hill Climbing (PIHC) algorithm using the nearest neighborhood heuristic to arrive at near-optimal solutions of large TSPLIB instances in a reasonable amount of time. Multiple construction heuristic approaches, thread mapping strategies, and data structures for TSPLIB instances have been evaluated. We demonstrate improved cost quality on symmetric TSPLIB instances of up to 85,900 cities. The PIHC GPU implementation gives up to 193× speedup over its sequential counterpart and up to 979.96× speedup over a state-of-the-art GPU-based TSP solver. The PIHC implementation gives a cost quality with an error rate of 0.72% in the best case and 8.06% in the worst case. © 2018 John Wiley & Sons, Ltd.
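To make the idea of hill climbing over a TSP neighborhood concrete, here is a small sequential sketch: a nearest neighbor construction followed by repeated best-improvement moves. It is a sketch under assumptions, not the PIHC implementation: the 2-opt (segment reversal) neighborhood, the function names, and the four-point example are all choices made here for illustration, and the abstract does not say which neighborhood the paper uses.

```python
import math


def tour_length(tour, pts):
    """Length of the closed tour that visits pts in the given order."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))


def nearest_neighbor_tour(pts, start=0):
    """Greedy construction: always move to the closest unvisited city."""
    unvisited = set(range(len(pts)))
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        last = tour[-1]
        # Ties are broken by city index so the result is deterministic.
        nxt = min(unvisited, key=lambda c: (math.dist(pts[last], pts[c]), c))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


def hill_climb_2opt(tour, pts):
    """Repeat best-improvement 2-opt moves until no neighbor is shorter.

    The 2-opt neighborhood is an assumption of this sketch. Each (i, j) pair is
    evaluated independently, which is the part a GPU could generate in parallel.
    """
    best = list(tour)
    best_len = tour_length(best, pts)
    improved = True
    while improved:
        improved = False
        candidate, candidate_len = None, best_len
        n = len(best)
        for i in range(1, n - 1):
            for j in range(i + 1, n):
                neighbor = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                length = tour_length(neighbor, pts)
                if length < candidate_len - 1e-12:
                    candidate, candidate_len = neighbor, length
        if candidate is not None:
            best, best_len = candidate, candidate_len
            improved = True
    return best, best_len


# Corners of a unit square, listed so that the order 0, 1, 2, 3 crosses itself.
pts = [(0, 0), (1, 1), (1, 0), (0, 1)]
crossing_tour = [0, 1, 2, 3]
climbed_tour, climbed_len = hill_climb_2opt(crossing_tour, pts)
nn_tour = nearest_neighbor_tour(pts)
```

On this toy input the hill climber removes the crossing and reaches the square's perimeter. Real instances have many local optima, which is why iterated variants restart or perturb the tour.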
GPU-aware resource management in heterogeneous cloud data centers
(Springer, 2021) Kulkarni, A.K.; Annappa, B.
The rapid scalability and easy maintainability of cloud services are driving many high-performance computing applications from company server racks into cloud data centers. Graphics Processing Units (GPUs), which comprise an extensive array of parallel single-instruction-multiple-data processors, are being considered as a platform for high-performance computing because of their high throughput. Many cloud providers have begun offering GPU-enabled services for users whose workloads need GPUs (for high computational power) to meet the desired Quality-of-service. Virtual machine (VM) placement and load balancing of GPUs in virtualized environments like the cloud is still an evolving area of research, and it is of prime importance for achieving higher resource efficiency and saving energy. Current VM placement techniques do not consider the impact of VM workload type and GPU memory status on VM placement decisions. This paper discusses the current issues with the First Fit policy of virtual machine placement used in VMware Horizon and proposes a GPU-aware VM placement technique for GPU-enabled virtualized environments like cloud data centers. The experiments conducted using synthetic workloads indicate reductions in energy consumption, in the search space of physical hosts, and in the makespan of the system. The paper also presents a summary of the current challenges for GPU resource management in virtualized environments and specific issues in developing cloud applications targeting GPUs under the virtualization layer. © 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
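The limitation of a First Fit policy mentioned in the abstract above can be illustrated with a toy placement model. This is a hypothetical sketch: hosts are reduced to a single number (free GPU memory in GB), the `best_fit` policy is a generic memory-aware alternative chosen here for contrast, and neither function reproduces VMware Horizon's behavior or the paper's proposed technique.

```python
def first_fit(free_gpu_mem, demand):
    """Pick the first host whose free GPU memory can hold the VM, or None."""
    for idx, free in enumerate(free_gpu_mem):
        if free >= demand:
            return idx
    return None


def best_fit(free_gpu_mem, demand):
    """Pick the feasible host that leaves the least GPU memory unused, or None.

    This stands in for a policy that looks at GPU memory status; it is an
    assumption of this sketch, not the technique proposed in the paper.
    """
    feasible = [(free - demand, idx)
                for idx, free in enumerate(free_gpu_mem) if free >= demand]
    return min(feasible)[1] if feasible else None


def place_all(free_gpu_mem, demands, policy):
    """Place VMs one by one; return the chosen host per VM and the leftover memory."""
    free = list(free_gpu_mem)  # copy so the caller's list is not modified
    placement = []
    for demand in demands:
        idx = policy(free, demand)
        if idx is not None:
            free[idx] -= demand
        placement.append(idx)
    return placement, free


hosts = [8, 4]      # free GPU memory per host, in GB (hypothetical values)
requests = [4, 8]   # GPU memory requested by each arriving VM, in GB

ff_placement, ff_free = place_all(hosts, requests, first_fit)
bf_placement, bf_free = place_all(hosts, requests, best_fit)
```

With these inputs First Fit puts the small VM on the large host and then cannot place the second VM, while the memory-aware policy places both. The example is deliberately tiny and says nothing about energy or makespan, which the paper evaluates.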
GPGPU-based randomized visual secret sharing (GRVSS) for grayscale and colour images
(Taylor and Francis Ltd., 2022) Holla, R.; Mhala, N.C.; Pais, A.R.
Visual Secret Sharing (VSS) is a technique used for sharing secret images between users. The existing VSS schemes reconstruct the original secret image as a halftone image with only 50% contrast. The Randomized Visual Secret Sharing (RVSS) scheme overcomes the disadvantages of existing VSS schemes. Although RVSS extracts the secret image with better contrast, it is computationally expensive. This paper proposes a General Purpose Graphics Processing Unit (GPGPU)-based Randomized Visual Secret Sharing (GRVSS) technique that leverages data parallelism in the RVSS pipeline. The performance of GRVSS is compared with that of RVSS on a generic architecture and on the PARAM Shavak supercomputer architecture. GRVSS outperforms RVSS on both architectures. © 2020 Informa UK Limited, trading as Taylor & Francis Group.
An Effective GPGPU Visual Secret Sharing by Contrast-Adaptive ConvNet Super-Resolution
(Springer, 2022) Holla, M.R.; Pais, A.R.
In this paper, we propose an effective secret image sharing model with super-resolution utilizing a Contrast-adaptive Convolution Neural Network (CCNN or CConvNet). The two stages of this model are share generation and secret image reconstruction. The share generation step generates information-embedded shadows (shares) equal in number to the participants. The activities involved in share generation are creating a halftone image, creating shadows, and transforming the image to the wavelet domain using Discrete Wavelet Transformation (DWT) to embed information into the shadows. The reconstruction stage is the inverse of share generation, supplemented with the CCNN to improve the reconstructed image's quality. This work is significant as it exploits the computational power of the General-Purpose Graphics Processing Unit (GPGPU) to perform the operations. The extensive use of memory optimization using GPGPU constant memory in all the activities brings uniqueness and efficiency to the proposed model. The contrast-adaptive normalization between the CCNN layers, which improves quality during super-resolution, imparts novelty to our investigation. The objective quality assessment showed that the proposed model produces a high-quality reconstructed image with an SSIM of 89-99.8% for the noise-like shares and 71.6-90% for the meaningful shares. The proposed technique achieved a speedup of 800× in comparison with the sequential model. © 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
High-performance medical image secret sharing using super-resolution for CAD systems
(Springer, 2022) Holla, M.R.; Pais, A.R.
Visual Secret Sharing (VSS) is a field of Visual Cryptography (VC) in which the secret image (SI) is distributed to a certain number of participants in the form of different encrypted shares. The decryption then uses authorized shares in a pre-defined manner to obtain that secret information. Medical image secret sharing (MISS) is an emerging VSS field that addresses the performance challenges in sharing medical images, such as efficiency and effectiveness. Here, we propose a novel MISS for histopathological medical images to achieve high performance in these two parameters. The novelty lies in using the Graphics Processing Unit (GPU) to exploit the data parallelism in MISS during encryption and super-resolution (SR), supplementing effectiveness with efficiency. A Convolution Neural Network (CNN) for SR produces a high-contrast reconstructed image. We evaluate the presented model using standard objective assessment parameters and Computer-Aided Diagnosis (CAD) systems. The result analysis confirmed the high performance of the proposed MISS, with a 98% SSIM for the deciphered image. Compared with state-of-the-art deep learning models designed for histopathological medical images, MISS performed better, with 99.71% accuracy. We also achieved a categorization precision that fits CAD systems. We attained an overall speedup of 800× over the sequential model. This speedup is significant compared to the speedups of the benchmark GPGPU-based medical image reconstruction models. © 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
