Conference Papers

Search Results

Yatch: Leaderless, Fault Tolerant Consensus Protocol
(Institute of Electrical and Electronics Engineers Inc., 2022) Khaishagi, M.A.K.; Ananthanarayana, V.S.
With advances in computing power and faster internet connections, more and more applications run on geographically separated machines that work together to deliver combined computing power greater than that of supercomputers, along with quick response times, better availability, and reliability. To work together, these machines must coordinate and reach agreement. Consensus protocols provide this coordination among geographically distant machines, and they should be fast and simple. Protocols such as Paxos, Raft, and EPaxos solve the consensus problem in distributed systems. Most protocols are leader-based, which makes them simpler, but the leader can become a performance bottleneck because a single machine handles all communication. Leaderless protocols avoid the single-leader problem but take more round trips. The number of round trips is an important criterion in distributed algorithms because it determines the speed and throughput of the algorithm. Distributed algorithms generally take more rounds when operations are concurrent. This paper proposes a leaderless algorithm that takes two round trips in the case of concurrent conflicting write operations. © 2022 IEEE.
Dynamic Checkpointing: Fault Tolerance in High-Performance Computing
(Institute of Electrical and Electronics Engineers Inc., 2024) Bhowmik, B.; Verma, T.; Dineshbhai, N.D.; Reddy, M.R.V.; Girish, K.K.
Parallel computing has become a cornerstone of modern computational systems, enabling the rapid processing of complex tasks by utilizing multiple processors simultaneously. However, the efficiency and reliability of these systems can be significantly compromised by inherent challenges such as hardware failures, communication delays, and uneven workload distribution. These issues not only slow down computations but also threaten the dependability of applications reliant on parallel processing. To address these challenges, researchers have developed strategies like dynamic checkpointing and load balancing, which are crucial for enhancing fault tolerance and optimizing performance. Dynamic checkpointing periodically saves the computational state, allowing for recovery from failures without significant data loss, while load balancing ensures that tasks are evenly distributed across processors, preventing bottlenecks and underutilization of resources. By integrating these mechanisms, this paper proposes a robust framework that improves the reliability and efficiency of parallel systems, particularly in high-performance computing environments where the ability to handle large-scale data processing with minimal downtime is critical. © 2024 IEEE.
