A Comprehensive Review on Scaling Machine Learning Workflows Using Cloud Technologies and DevOps
Date
2025
Journal Title
Journal ISSN
Volume Title
Publisher
Institute of Electrical and Electronics Engineers Inc.
Abstract
Scaling Machine Learning (ML) workflows in cloud environments presents critical challenges in ensuring reproducibility, low-latency inference, infrastructure reliability, and regulatory compliance. This review addresses the lack of a comprehensive synthesis of how integrated DevOps practices and cloud-native technologies enable scalable, production-grade ML systems. We analyze the convergence of MLOps with tools such as Kubernetes, Jenkins, and Terraform, detailing their role in automating CI/CD pipelines, infrastructure provisioning, and model lifecycle management. The review highlights strategies for optimizing resource utilization, minimizing inference latency, and managing data versioning across hybrid and multi-cloud architectures (AWS, Azure, GCP). We also examine serverless computing, container orchestration, and monitoring practices that enhance scalability and governance. By categorizing challenges chronologically and evaluating emerging practices such as federated learning and security-by-design, this work bridges a key gap in the existing literature. It offers a unified perspective on building reliable, reproducible, and compliant ML workflows, thereby advancing the state of scalable AI system engineering. © 2025 IEEE.
Description
Keywords
automation, cloud computing, DevOps, Kubernetes, Machine learning (ML) workflows, MLOps, scalability
Citation
IEEE Access, 2025, Vol. 13, pp. 148559-148594
