A Comprehensive Review on Scaling Machine Learning Workflows Using Cloud Technologies and DevOps
Date
2025
Journal Title
Journal ISSN
Volume Title
Publisher
Institute of Electrical and Electronics Engineers Inc.
Abstract
Scaling Machine Learning (ML) workflows in cloud environments presents critical challenges in ensuring reproducibility, low-latency inference, infrastructure reliability, and regulatory compliance. This review addresses the lack of a comprehensive synthesis of how integrated DevOps practices and cloud-native technologies enable scalable, production-grade ML systems. We analyze the convergence of MLOps with tools such as Kubernetes, Jenkins, and Terraform, detailing their role in automating CI/CD pipelines, infrastructure provisioning, and model lifecycle management. The review highlights strategies for optimizing resource utilization, minimizing inference latency, and managing data versioning across hybrid and multi-cloud architectures (AWS, Azure, GCP). We also examine serverless computing, container orchestration, and monitoring practices that enhance scalability and governance. By categorizing challenges chronologically and evaluating emerging practices such as federated learning and security-by-design, this work bridges a key gap in the existing literature. It offers a unified perspective on building reliable, reproducible, and compliant ML workflows, thereby advancing the state of scalable AI system engineering. © 2025 IEEE.
Description
Keywords
automation, cloud computing, DevOps, Kubernetes, Machine learning (ML) workflows, MLOps, scalability
Citation
IEEE Access, 2025, Vol. 13, pp. 148559-148594
