Dynamic Checkpointing: Fault Tolerance in High-Performance Computing

Authors

Bhowmik, B.; Verma, T.; Dineshbhai, N.D.; Reddy, M.R.V.; Girish, K.K.

Date

2024

Publisher

Institute of Electrical and Electronics Engineers Inc.

Abstract

Parallel computing has become a cornerstone of modern computational systems, enabling the rapid processing of complex tasks by utilizing multiple processors simultaneously. However, the efficiency and reliability of these systems can be significantly compromised by inherent challenges such as hardware failures, communication delays, and uneven workload distribution. These issues not only slow down computations but also threaten the dependability of applications reliant on parallel processing. To address these challenges, researchers have developed strategies like dynamic checkpointing and load balancing, which are crucial for enhancing fault tolerance and optimizing performance. Dynamic checkpointing periodically saves the computational state, allowing for recovery from failures without significant data loss, while load balancing ensures that tasks are evenly distributed across processors, preventing bottlenecks and underutilization of resources. By integrating these mechanisms, this paper proposes a robust framework that improves the reliability and efficiency of parallel systems, particularly in high-performance computing environments where the ability to handle large-scale data processing with minimal downtime is critical. © 2024 IEEE.
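
The checkpoint-and-recover idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's framework: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the fixed `interval` parameter, the toy summation workload, and the `fail_at` switch used to simulate a crash are all assumptions made for illustration. A truly dynamic scheme would presumably adapt the interval at run time, which this sketch does not attempt.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a failure at that step.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` once with `fail_at` set and then again without it resumes from the most recent checkpoint, so only the steps completed after that checkpoint are recomputed. A shorter interval loses less work per failure but spends more time writing state, which is the trade-off a dynamic policy would try to tune.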

Keywords

Dynamic Checkpointing, Fault Tolerance, High-Performance Computing (HPC), Load Balancing, Parallel Computing

Citation

3rd International Conference on Communication, Control, and Intelligent Systems, CCIS 2024, 2024

URI

https://doi.org/10.1109/CCIS63231.2024.10932122
https://idr.nitk.ac.in/handle/123456789/29195

Collections

Conference Papers
