RL-Driven Scheduler for Always-On Batch Pipelines

Vasudevan Ananthakrishnan; Shemeer Sulaiman Kunju

Authors

Vasudevan Ananthakrishnan, Yakshna Solutions Inc, USA
Shemeer Sulaiman Kunju, HCL America Inc, USA

Keywords:

reinforcement learning, job scheduling, batch pipelines, SLA compliance, AutoSys, compute resource allocation

Abstract

This research introduces a reinforcement learning (RL)-based scheduling technique to improve compute slot allocation in continuous batch processing pipelines. In the proposed scheduler, RL agents simulate queue length fluctuations and evaluate service-level agreement (SLA) risk to dynamically allocate resources across parallel work queues. In simulated high-stress conditions, where workloads are high and resources are scarce, the RL scheduler outperforms rule-based systems such as AutoSys, reducing SLA breaches by 25%.
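To make the idea of allocating a compute slot from queue state concrete, the following is a minimal sketch only, not the authors' scheduler. It assumes a toy setting with two queues, one compute slot per time step, and tabular Q-learning as a stand-in for whatever RL method the paper uses. Every name and constant here (N_QUEUES, MAX_LEN, SLA_LIMIT, the 0.4 arrival probability, step, train, allocate) is hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict

# All constants below are illustrative assumptions, not values from the paper.
N_QUEUES = 2    # number of parallel work queues
MAX_LEN = 5     # queue lengths are capped to keep the state space small
SLA_LIMIT = 3   # a queue longer than this is counted as an SLA breach


def step(queues, action, rng):
    """Give the single compute slot to queue `action`, then let jobs arrive."""
    queues = list(queues)
    if queues[action] > 0:
        queues[action] -= 1  # the slot completes one job from the chosen queue
    for i in range(N_QUEUES):
        if rng.random() < 0.4:  # assumed per-queue arrival probability
            queues[i] = min(MAX_LEN, queues[i] + 1)
    breaches = sum(1 for q in queues if q > SLA_LIMIT)
    return tuple(queues), -float(breaches)  # reward penalizes SLA breaches


def train(episodes=200, horizon=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: state = queue lengths, action = queue to serve."""
    rng = random.Random(seed)
    q_table = defaultdict(lambda: [0.0] * N_QUEUES)
    for _ in range(episodes):
        state = (0,) * N_QUEUES
        for _ in range(horizon):
            if rng.random() < epsilon:
                action = rng.randrange(N_QUEUES)  # explore
            else:
                action = max(range(N_QUEUES), key=lambda a: q_table[state][a])
            next_state, reward = step(state, action, rng)
            target = reward + gamma * max(q_table[next_state])
            q_table[state][action] += alpha * (target - q_table[state][action])
            state = next_state
    return q_table


def allocate(q_table, queues):
    """Greedy slot assignment from the learned Q-values."""
    values = q_table[tuple(queues)]
    return max(range(N_QUEUES), key=lambda a: values[a])
```

After training, allocate(train(), (4, 1)) returns the index of the queue that the learned policy would serve for that state. A rule-based baseline in this toy setting would replace allocate with a fixed rule, such as always serving the longest queue, which allows a like-for-like comparison of breach counts.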
