Title: A Comprehensive Review on Scaling Machine Learning Workflows Using Cloud Technologies and DevOps
Authors: Ramesh, G.; Vaikunta Pai, T.; Birău, R.; Poojary, K.K.; Abhay; Shingad, A.R.; Sowjanya, N.; Popescu, V.; Mitroi, A.T.; Nioata, R.M.; Kiran Raj, K.M.
Issue Date: 2025
Date Accessioned: 2026-02-05
Citation: IEEE Access, 2025, Vol. 13, p. 148559-148594
DOI: https://doi.org/10.1109/ACCESS.2025.3599281
URI: https://idr.nitk.ac.in/handle/123456789/28227
Keywords: automation; cloud computing; DevOps; Kubernetes; Machine Learning (ML) workflows; MLOps; scalability

Abstract: Scaling Machine Learning (ML) workflows in cloud environments presents critical challenges in ensuring reproducibility, low-latency inference, infrastructure reliability, and regulatory compliance. This review addresses the lack of a comprehensive synthesis of how integrated DevOps practices and cloud-native technologies enable scalable, production-grade ML systems. We analyze the convergence of MLOps with tools such as Kubernetes, Jenkins, and Terraform, detailing their roles in automating CI/CD pipelines, infrastructure provisioning, and model lifecycle management. The review highlights strategies for optimizing resource utilization, minimizing inference latency, and managing data versioning across hybrid and multi-cloud architectures (AWS, Azure, GCP). We also examine serverless computing, container orchestration, and monitoring practices that enhance scalability and governance. By categorizing challenges chronologically and evaluating emerging practices such as federated learning and security-by-design, this work bridges a key gap in the existing literature. It offers a unified perspective on building reliable, reproducible, and compliant ML workflows, thereby advancing the state of scalable AI system engineering. © IEEE.