Faculty Publications

Permanent URI for this communityhttps://idr.nitk.ac.in/handle/123456789/18736

Publications by NITK Faculty

Browse

Search Results

Now showing 1 - 2 of 2

Multi-level per node combiner (MLPNC) to minimize mapreduce job latency on virtualized environment
(Association for Computing Machinery acmhelp@acm.org, 2018) Jeyaraj, R.; Ananthanarayana, V.S.
Big data drove businesses and researches more data driven. Hadoop MapReduce is one of the cost-effective ways for processing huge amount of data and also offered as a service from cloud on cluster of Virtual Machines (VM). In Cloud Data Center (CDC), Hadoop VMs are co-located with other general purpose VMs across racks. Such a multi-tenancy leads to varying local network bandwidth availability for Hadoop VMs, which directly impacts MapReduce job latency. Because, shuffle phase in MapReduce execution sequence itself contributes 26%-70% of overall job latency due to large number of intermediate records. Therefore, Hadoop virtual cluster requires to ensure a maximum bandwidth to minimize job latency, but, it also increases the bandwidth usage cost. In this paper, we propose "Multi-Level Per Node Combiner" (MLPNC) that curtails the number of intermediate records in shuffle phase resulting to reduction in overall job latency. It also minimizes bandwidth usage cost as well. We evaluate MLPNC results on wordcount job against default combiner, and Per Node Combiner (PNC). We also discuss the results based on number of shuffled records, shuffle latency, average merge latency, average reduce latency, average reduce task start time, and overall job latency. Finally, we argue in favor of MLPNC as it achieves up to 33% reduction in number of intermediate records and up to 32% reduction in average job latency than PNC. Â© 2018 ACM.
Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment
(Springer Science and Business Media Deutschland GmbH info@springer-sbm.com, 2020) Jeyaraj, R.; Ananthanarayana, V.S.; Paul, A.
Big data overwhelmed industries and research sectors. Reliable decision making is always a challenging task, which requires cost-effective big data processing tools. Hadoop MapReduce is being used to store and process huge volume of data in a distributed environment. However, due to huge capital investment and lack of expertise to set up an on-premise Hadoop cluster, big data users seek cloud-based MapReduce service over the Internet. Mostly, MapReduce on a cluster of virtual machines is offered as a service for a pay-per-use basis. Virtual machines in MapReduce virtual cluster reside in different physical machines and co-locate with other non-MapReduce VMs. This causes to share IO resources such as disk and network bandwidth, leading to congestion as most of the MapReduce jobs are disk and network intensive. Especially, the shuffle phase in MapReduce execution sequence consumes huge network bandwidth in a multi-tenant environment. This results in increased job latency and bandwidth consumption cost. Therefore, it is essential to minimize the amount of intermediate data in the shuffle phase rather than supplying more network bandwidth that results in increased service cost. Considering this objective, we extended multi-level per node combiner for a batch of MapReduce jobs to improve makespan. We observed that makespan is improved up to 32.4% by minimizing the number of intermediate data in shuffle phase when compared to classical schedulers with default combiners. © 2020, Springer-Verlag GmbH Germany, part of Springer Nature.

Faculty Publications

Browse

Filters

Settings

Sort By

Results per page

Search Results