Dynamic Checkpointing: Fault Tolerance in High-Performance Computing

dc.contributor.author: Bhowmik, B.
dc.contributor.author: Verma, T.
dc.contributor.author: Dineshbhai, N.D.
dc.contributor.author: Reddy, M.R.V.
dc.contributor.author: Girish, K.K.
dc.date.accessioned: 2026-02-06T06:34:20Z
dc.date.issued: 2024
dc.description.abstract: Parallel computing has become a cornerstone of modern computational systems, enabling the rapid processing of complex tasks by utilizing multiple processors simultaneously. However, the efficiency and reliability of these systems can be significantly compromised by inherent challenges such as hardware failures, communication delays, and uneven workload distribution. These issues not only slow down computations but also threaten the dependability of applications reliant on parallel processing. To address these challenges, researchers have developed strategies like dynamic checkpointing and load balancing, which are crucial for enhancing fault tolerance and optimizing performance. Dynamic checkpointing periodically saves the computational state, allowing for recovery from failures without significant data loss, while load balancing ensures that tasks are evenly distributed across processors, preventing bottlenecks and underutilization of resources. By integrating these mechanisms, this paper proposes a robust framework that improves the reliability and efficiency of parallel systems, particularly in high-performance computing environments where the ability to handle large-scale data processing with minimal downtime is critical. © 2024 IEEE.
dc.identifier.citation: 3rd International Conference on Communication, Control, and Intelligent Systems, CCIS 2024, 2024
dc.identifier.uri: https://doi.org/10.1109/CCIS63231.2024.10932122
dc.identifier.uri: https://idr.nitk.ac.in/handle/123456789/29195
dc.publisher: Institute of Electrical and Electronics Engineers Inc.
dc.subject: Dynamic Checkpointing
dc.subject: Fault Tolerance
dc.subject: High-Performance Computing (HPC)
dc.subject: Load Balancing
dc.subject: Parallel Computing
dc.title: Dynamic Checkpointing: Fault Tolerance in High-Performance Computing
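
The abstract describes checkpointing as periodically saving the computational state so that work can resume after a failure. The Python sketch below illustrates that general idea only; it is not the framework proposed in the paper. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the fixed `interval`, the toy summation workload, and the `fail_at` failure hook are all hypothetical and chosen purely for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `interval` steps.

    fail_at is a hypothetical hook that simulates a crash at a given step.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` a second time with the same `path` after a failure resumes from the last saved step instead of from step 0, so only the work done since that checkpoint is repeated. A dynamic scheme would presumably adjust `interval` at run time rather than keep it fixed, which this sketch does not attempt.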
