Dynamic Checkpointing: Fault Tolerance in High-Performance Computing

No Thumbnail Available

Date

2024

Journal Title

Journal ISSN

Volume Title

Publisher

Institute of Electrical and Electronics Engineers Inc.

Abstract

Parallel computing has become a cornerstone of modern computational systems, enabling the rapid processing of complex tasks by utilizing multiple processors simultaneously. However, the efficiency and reliability of these systems can be significantly compromised by inherent challenges such as hardware failures, communication delays, and uneven workload distribution. These issues not only slow down computations but also threaten the dependability of applications reliant on parallel processing. To address these challenges, researchers have developed strategies like dynamic checkpointing and load balancing, which are crucial for enhancing fault tolerance and optimizing performance. Dynamic checkpointing periodically saves the computational state, allowing for recovery from failures without significant data loss, while load balancing ensures that tasks are evenly distributed across processors, preventing bottlenecks and underutilization of resources. By integrating these mechanisms, this paper proposes a robust framework that improves the reliability and efficiency of parallel systems, particularly in high-performance computing environments where the ability to handle large-scale data processing with minimal downtime is critical. © 2024 IEEE.

Description

Keywords

Dynamic Checkpointing, Fault Tolerance, High-Performance Computing (HPC), Load Balancing, Parallel Computing

Citation

3rd International Conference on Communication, Control, and Intelligent Systems, CCIS 2024, 2024, Vol., , p. -

Endorsement

Review

Supplemented By

Referenced By