An Efficient Mapreduce Scheduler for Cloud Environment
Date
2020
Authors
Jeyaraj, Rathinaraja.
Publisher
National Institute of Technology Karnataka, Surathkal
Abstract
Hadoop MapReduce is one of the cost-effective ways to process a large volume of data
for reliable and effective decision-making. As an on-premise Hadoop cluster is not affordable for short-term users, many public cloud service providers such as Amazon, Google,
and Microsoft offer Hadoop MapReduce and related applications as a service via a cluster of virtual machines over the Internet. In general, these Hadoop virtual machines are launched on different physical machines across the cloud data center and
co-located with non-Hadoop virtual machines. This introduces many challenges, most
notably a layer of heterogeneities (hardware heterogeneity, virtual machine heterogeneity, performance heterogeneity, and workload heterogeneity) that impacts the
performance of MapReduce job and task schedulers. The presence of physical servers of
different configurations and performance in cloud data centers is called hardware heterogeneity. The presence of virtual machines of different sizes in a Hadoop virtual cluster
is called virtual machine heterogeneity. Hardware heterogeneity, virtual machine heterogeneity, and interference from co-located non-Hadoop virtual machines together cause
varying performance for the same map/reduce task of a job. This is called performance
heterogeneity. Recent MapReduce versions allow users to customize the resource capacity (container size) for the map/reduce tasks of different jobs. This makes a batch of
MapReduce jobs heterogeneous, which is called workload heterogeneity.
These heterogeneities are inevitable and profoundly affect the performance of MapReduce job and task schedulers in terms of job latency, makespan, and virtual resource utilization. Therefore, it is essential to exploit these heterogeneities while offering Hadoop
MapReduce as a service to improve MapReduce scheduler performance in real time.
Existing MapReduce job and task schedulers addressed some of these heterogeneities
but fell short in improving performance. To improve these qualities of service further, we proposed the following set of methods: a Dynamic Ranking-based MapReduce Job Scheduler (DRMJS) to exploit performance heterogeneity, a Multi-Level Per
Node Combiner (MLPNC) to minimize the number of intermediate records in the shuffle phase, Roulette Wheel Scheme (RWS) based data block placement and a constrained
2-dimensional bin packing model to exploit virtual machine and workload level heterogeneities, and Fine-Grained Data Locality Aware (FGDLA) job scheduling, which extends MLPNC to a batch of jobs.
First, DRMJS is proposed to improve MapReduce job latency and resource utilization by exploiting performance heterogeneity. DRMJS calculates a performance
score for each Hadoop virtual machine, based on CPU and disk I/O for map tasks and on CPU
and network I/O for reduce tasks. A ranked list is then prepared for scheduling
map tasks based on the map performance score and reduce tasks based on the reduce performance score. Ultimately, DRMJS improved overall job latency, makespan, and resource
utilization by up to 30%, 28%, and 60%, respectively, on average compared to existing
MapReduce schedulers. To improve job latency further, MLPNC is introduced to minimize the number of intermediate records in the shuffle phase, which is responsible for
a significant portion of MapReduce job latency. In general, each map task runs a dedicated combiner function to minimize the number of intermediate records. In MLPNC,
we split the combiner function from the map task and run a single MLPNC in every Hadoop
virtual machine for a set of map tasks of the same job. These map tasks write their output
to the common MLPNC, which reduces the number of intermediate records level
by level. Ultimately, MLPNC improved job latency by up to 33% compared to existing
MapReduce schedulers for a single job. However, in a production environment, a batch
of MapReduce jobs is executed periodically. Therefore, to extend MLPNC to a batch
of jobs, we introduced the FGDLA job scheduler. Results showed that FGDLA reduced
the amount of intermediate data and makespan by up to 62.1% and 32.4%, respectively, when compared
to existing schedulers.
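The idea of sharing one combiner among the map tasks on a node can be illustrated with a small sketch. It is not the thesis implementation: it uses word count as a stand-in job, the function names (`map_task`, `combine`, `per_task_combining`, `per_node_combining`) are hypothetical, and it models only a single combining level rather than the level-by-level behaviour of MLPNC. It only shows why a shared combiner can emit fewer intermediate records than one combiner per map task.

```python
from collections import Counter

def map_task(split):
    """Emit (word, 1) pairs for one input split (word count as a stand-in job)."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Sum the values for each key, as a word-count combiner would."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def per_task_combining(splits):
    """Baseline: every map task runs its own combiner on its own output."""
    return [combine(map_task(split)) for split in splits]

def per_node_combining(splits):
    """Assumed sketch: all map tasks on the node feed one shared combiner."""
    shared_input = []
    for split in splits:
        shared_input.extend(map_task(split))
    return combine(shared_input)

# Three splits processed by three map tasks on the same (hypothetical) node.
splits = ["a b a", "b c b", "a c c"]
baseline_records = sum(len(output) for output in per_task_combining(splits))
shared_records = len(per_node_combining(splits))
print(baseline_records, shared_records)
```

In this toy input each per-task combiner still emits two records, so six records would be shuffled, whereas the shared combiner emits one record per distinct key.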
Second, virtual machine and workload level heterogeneities cause resource underutilization in the Hadoop virtual cluster and impact makespan for a batch of MapReduce
jobs. Considering this, we proposed RWS based data block placement and a constrained 2-dimensional bin packing model to place heterogeneous map/reduce tasks onto heterogeneous virtual machines. RWS places data blocks based on the processing capacity
of each virtual machine, and the bin packing model helps to find the right combination of
map/reduce tasks of different jobs for each bin to improve makespan and resource utilization. The experimental results showed that the proposed model improved makespan
and resource utilization by up to 57.9% and 59.3%, respectively, over the MapReduce fair scheduler.
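Capacity-proportional block placement can be sketched with generic roulette wheel selection. The abstract does not specify how RWS measures processing capacity or handles replicas, so the capacity values, VM names, and the functions `roulette_pick` and `place_blocks` below are assumptions made purely for illustration.

```python
import random

def roulette_pick(capacities, rng):
    """Pick one VM with probability proportional to its capacity."""
    total = sum(capacities.values())
    if total <= 0:
        raise ValueError("at least one VM needs positive capacity")
    spin = rng.random() * total      # a point on the wheel, in [0, total)
    running = 0.0
    for vm, capacity in capacities.items():
        running += capacity          # this VM owns the slice ending at `running`
        if spin < running:
            return vm
    return vm  # guard against floating-point rounding at the wheel's edge

def place_blocks(num_blocks, capacities, seed=0):
    """Assign each data block to a VM chosen by one spin of the wheel."""
    rng = random.Random(seed)        # seeded so a run can be repeated
    placement = {vm: 0 for vm in capacities}
    for _ in range(num_blocks):
        placement[roulette_pick(capacities, rng)] += 1
    return placement

# Hypothetical relative processing capacities for three VM sizes.
capacities = {"vm-small": 1.0, "vm-medium": 2.0, "vm-large": 4.0}
placement = place_blocks(700, capacities)
print(placement)
```

With these weights the larger virtual machine is expected to receive roughly four times as many blocks as the smallest one, so more map tasks can run data-local on the machines that process them fastest.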
Keywords
Department of Information Technology, Bin Packing, Combiner, Heterogeneous Performance, Heterogeneous MapReduce Workloads, MapReduce Job Scheduler, MapReduce Task Placement