| ESP Journal of Engineering & Technology Advancements |
| © 2026 by ESP JETA |
| Volume 6 Issue 2 |
| Year of Publication : 2026 |
| Authors : Ankush Jitendrakumar Tyagi |
DOI : 10.5281/zenodo.20344300 |
Ankush Jitendrakumar Tyagi, 2026. Compiler-Assisted Performance Optimization of Large-Scale Machine Learning Pipelines Using MLIR-Based Volume 6 Issue 2: 149-157.
Recent progress in large-scale machine learning (ML) systems has increased the need for highly efficient and scalable computational pipelines that run across heterogeneous hardware platforms. Compiler-assisted optimization has become a key factor in improving performance, portability, and energy efficiency in these systems.
Compiler Optimization, MLIR, Machine Learning Pipelines, Performance Optimization, Heterogeneous Computing, Tensor Compilation