Conference Papers
Permanent URI for this collection: https://idr.nitk.ac.in/handle/123456789/28506
Item: Resource aware scheduling in Hadoop for heterogeneous workloads based on load estimation (2013). Kapil, B.S.; Kamath S., S.S.
Currently, most cloud-based applications require large-scale data processing capability, and the data to be processed is growing at a rate much faster than the available computing power. Hadoop is used to enable distributed processing on large clusters of commodity hardware. In large clusters, the workloads may be heterogeneous in nature: I/O-bound, CPU-bound or network-intensive jobs that demand different types of resources run simultaneously on the cluster. Hadoop's job scheduling is based on FIFO, which does not take parallelization based on job type into account. In this paper, we propose a new scheduling algorithm for Hadoop-based distributed systems that classifies workloads and assigns a specific category of job to a particular cluster according to the cluster's current load. The proposed scheduler improves the utilization of both CPU and I/O resources in a cluster under heterogeneous workloads by approximately 12% compared to Hadoop's FIFO scheduler. © 2013 IEEE.

Item: Performance analysis of graph based iterative algorithms on MapReduce framework (Institute of Electrical and Electronics Engineers Inc., 2014). Debbarma, A.; Annappa, B.; Mude, R.G.
In recent years, there has been enormous growth in the amount of digital data being produced, and numerous attempts are being made to process this large amount of data quickly and effectively. Hadoop MapReduce is one such software framework that has gained popularity for distributed computation on Big Data. It provides a scalable, economical and easy way to process massive amounts of data in parallel on a large computing cluster while preserving fault tolerance in a transparent manner. However, Hadoop always stores intermediate results to the local disk when running iterative jobs.
As a result, Hadoop usually suffers from long execution times for iterative jobs, paying a high I/O cost and wasting CPU cycles and network bandwidth. This paper analyses the problems of existing Hadoop and compares its performance against iMapReduce and HaLoop for graph-based iterative algorithms. HaLoop offers better performance because it stores intermediate results in a cache and reuses that data in the next iteration. To exploit loop-invariant data (inter-iteration locality), it schedules tasks of different iterations onto the same node. © 2014 IEEE.

Item: Capturing Node Resource Status and Classifying Workload for Map Reduce Resource Aware Scheduler (Springer Verlag, 2015). Mude, R.G.; Betta, A.; Debbarma, A.
There has been enormous growth in the amount of digital data, and numerous software frameworks have been built to process it. Hadoop MapReduce is one such popular framework, which processes large data on commodity hardware. The job scheduler is a key component of Hadoop for assigning tasks to nodes. The existing MapReduce scheduler assigns tasks to nodes without considering node heterogeneity, workload type or the amount of available resources. This overburdens a node with one type of job and reduces overall throughput. In this paper, we propose a new scheduler that captures the node resource status after every heartbeat, classifies jobs into two types, CPU-bound and I/O-bound, and assigns each task to the node with the lower CPU/I/O utilization. Experimental results show an improvement of 15-20% on a heterogeneous cluster and around 10% on a homogeneous cluster with respect to the Hadoop native scheduler.
© Springer India 2015.

Item: An improved K-means algorithm using modified cosine distance measure for document clustering using Mahout with Hadoop (Institute of Electrical and Electronics Engineers Inc., 2015). Sahu, L.; Mohan, R.
In this paper, we propose a novel K-means algorithm with a modified cosine distance measure for clustering large datasets such as the Wikipedia latest-articles and Reuters datasets. We customize the cosine distance measure used to compute similarity between objects in order to improve cluster quality. Our method calculates the distance between objects with the cosine distance measure and then brings similar objects closer by squaring the distance if it lies between 0 and 0.5, and increases it otherwise. This minimizes intra-cluster distance and maximizes inter-cluster distance. We measure cluster quality in terms of inter- and intra-cluster distances, feature weighting such as TF-IDF, cluster size and the top terms of the clusters. We compare K-means with the cosine and modified cosine distance measures using performance metrics such as inter-cluster and intra-cluster distance, cluster size and execution time. Our experimental results show a 0.016% reduction in intra-cluster distance, a 0.012% increase in inter-cluster distance, a 1.5% reduction in cluster size and a 4% reduction in sequence-file size, yielding better cluster quality. © 2014 IEEE.

Item: Analysis of MapReduce scheduling and its improvements in cloud environment (Institute of Electrical and Electronics Engineers Inc., 2015). D'Souza, S.; Chandrasekaran, K.
MapReduce has become a prominent parallel processing model used for analysing large-scale data. MapReduce applications are increasingly being deployed in the cloud alongside other applications sharing the same physical resources. In this scenario, efficient scheduling of MapReduce applications is of utmost importance.
Besides performance, MapReduce also has to consider parameters such as energy efficiency and meeting SLA goals when executing jobs in cloud environments. In this work, we classify MapReduce scheduling into cluster-based scheduling and objective-based scheduling. We then summarize and analyse the different classes of schedulers, highlighting the strong points and limitations of each scheduling approach. Adaptive scheduling techniques provide dynamic resource management and meet performance goals, while energy-efficient scheduling techniques aim to cut data centre costs through a variety of approaches. Finally, we discuss current challenges and future work. © 2015 IEEE.

Item: Workload characteristics and resource aware Hadoop scheduler (Institute of Electrical and Electronics Engineers Inc., 2015). Divya, M.; Annappa, B.
Hadoop MapReduce is one of the most widely used platforms for large-scale data processing. A Hadoop cluster has machines with different resources, including memory size, CPU capability and disk space, which raises the challenging research issue of improving Hadoop's performance through proper resource provisioning. The work presented in this paper focuses on optimizing job scheduling in Hadoop. A Workload Characteristic and Resource Aware (WCRA) Hadoop scheduler is proposed that classifies jobs into CPU-bound and disk-I/O-bound. Based on their performance, nodes in the cluster are classified as CPU-busy or disk-I/O-busy. Before scheduling a job, the scheduler ensures that more than 25% of the node's primary memory is available. Performance parameters of Map tasks, such as the time required to parse the data, map, sort and merge the result, and of Reduce tasks, such as the time to merge, parse and reduce, are considered to categorize a job as CPU-bound or disk-I/O-bound. Tasks are assigned priority based on their minimum estimated completion time.
Jobs are scheduled on a compute node in such a way that jobs already running on it are not affected. Experimental results show a 30% improvement in performance compared to Hadoop's FIFO, Fair and Capacity schedulers. © 2015 IEEE.

Item: Genome Data Analysis Using MapReduce Paradigm (Institute of Electrical and Electronics Engineers Inc., 2015). Pahadia, M.; Srivastava, A.; Srivastava, D.; Patil, N.
Counting the number of occurrences of a substring in a string is a problem that arises in many applications. This paper suggests a fast and efficient solution for the field of bioinformatics. A k-mer is a k-length substring of a biological sequence, and k-mer counting is defined as counting the number of occurrences of all possible k-mers in a biological sequence. K-mer counting has uses in applications ranging from error correction of sequencing reads and genome assembly to disease prediction and feature extraction. Current k-mer counting tools are costly in both time and space. We provide a solution that uses MapReduce and Hadoop to reduce the time complexity. After applying the algorithms to real genome datasets, we conclude that the algorithm using the Hadoop MapReduce paradigm runs more efficiently and reduces the time complexity significantly. © 2015 IEEE.

Item: Improved resource provisioning in Hadoop (Springer Science and Business Media Deutschland GmbH, 2016). Divya, M.; Annappa, B.
Extensive use of the Internet is generating large amounts of data, and the mechanisms to handle and analyze these data are becoming more complicated day by day. The Hadoop platform provides a solution for processing huge data on large clusters of nodes, and the scheduler plays a vital role in improving Hadoop's performance. In this paper, MRPPR, a MapReduce Performance Parameter based Resource aware Hadoop Scheduler, is proposed.
In MRPPR, performance parameters of Map tasks, such as the time required to parse the data, map, sort and merge the result, and of Reduce tasks, such as the time to merge, parse and reduce, are considered to categorize a job as CPU-bound, disk-I/O-bound or network-I/O-bound. Based on the node status obtained from the TaskTracker's response, nodes in the cluster are classified as CPU-busy, disk-I/O-busy or network-I/O-busy. A cost model is proposed to schedule a job to a node based on this classification, minimizing the makespan and attaining effective resource utilization. A performance improvement of 25-30% is achieved with the proposed scheduler. © Springer India 2016.

Item: Improving false alarm rate in intrusion detection systems using Hadoop (Institute of Electrical and Electronics Engineers Inc., 2016). Mukund, Y.R.; Nayak, S.S.; Chandrasekaran, K.
Intrusion detection systems are a vital part of an organization's security. This paper gives an account of existing algorithms for intrusion detection using machine learning, along with new ideas for improving them. The paper mainly discusses employing the decision tree mechanism for intrusion detection and improving it with the Hadoop distributed file system. Initially, a method is employed that uses a dirty flag to check the consistency of the decision tree, which changes with every wrong classification by the system; the wrong classification is identified by a user who informs the system about it and helps it learn. A new method that does not use a dirty flag, but instead modifies the key-value pairs in the results of the reduce() function, is then tested as an improvement on the previous method. The two methods are compared with the help of Hadoop's YARN.
The main aim of the paper is to propose the use of a distributed file system for machine learning, along with improvements to the current Hadoop file system, so as to reduce the total time taken when machine learning algorithms are employed with it. © 2016 IEEE.

Item: Frequent pattern mining on stream data using Hadoop CanTree-GTree (Elsevier B.V., 2017). Kusumakumari, V.; Sherigar, D.; Chandran, R.; Patil, N.
The need for knowledge discovery from real-time stream data is continuously increasing, and mining patterns from transactions requires efficient data structures and algorithms. We propose a time-efficient Hadoop CanTree-GTree algorithm using Apache Hadoop. The algorithm mines the complete set of frequent itemsets (patterns) from real-time transactions using the sliding window technique; these itemsets are then used to mine closed frequent itemsets, from which association rules are derived. It makes use of two data structures, CanTree and GTree. The results show that the Hadoop implementation of the algorithm performs 5 times better than the plain Java implementation. © 2017 The Author(s).
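As an illustration of the modified cosine distance described in the K-means entry above: distances in [0, 0.5] are squared to pull similar documents closer, and larger distances are stretched. A minimal sketch follows; the function names are ours, and the use of a square root for the "increase" branch is an assumption, since the abstract only says the distance is increased.

```python
import math

def cosine_distance(a, b):
    """Plain cosine distance between two equal-length vectors: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def modified_cosine_distance(a, b):
    """Modified measure: square small distances (<= 0.5) so similar
    documents move closer; stretch larger ones (sqrt is our assumption)."""
    d = cosine_distance(a, b)
    if d <= 0.5:
        return d * d       # squaring a value in [0, 1] shrinks it
    return math.sqrt(d)    # sqrt of a value in (0.5, 1] enlarges it
```

For non-negative TF-IDF vectors the cosine distance stays in [0, 1], so the two branches shrink and stretch as intended, tightening intra-cluster distances while pushing dissimilar documents further apart.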

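The k-mer counting approach in the genome-analysis entry above amounts to a MapReduce word count over k-length substrings: the map phase emits (k-mer, 1) pairs per read, and the reduce phase sums the counts per k-mer. A minimal single-machine sketch (the names and structure are ours, not the paper's):

```python
from collections import Counter
from itertools import chain

def kmer_map(sequence, k):
    """Map phase: emit a (k-mer, 1) pair for every k-length substring of one read."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]

def kmer_reduce(pairs):
    """Reduce phase: sum the counts grouped by k-mer key."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

# Example: count 3-mers across two short reads.
reads = ["ATGATG", "TGA"]
pairs = chain.from_iterable(kmer_map(r, 3) for r in reads)
counts = kmer_reduce(pairs)  # ATG and TGA each occur twice, GAT once
```

In an actual Hadoop job the map function would run per input split and the framework's shuffle would do the grouping by key; the sketch only shows the data flow.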