Designing Scalable Data Engineering Pipelines Using Azure and Databricks

Santosh Kumar Singu

ESP Journal of Engineering & Technology Advancements

Volume 1 Issue 2

Year of Publication : 2021

Authors : Santosh Kumar Singu

DOI : 10.56472/25832646/ESP-V1I2P119

Citation:

Santosh Kumar Singu, 2021. "Designing Scalable Data Engineering Pipelines Using Azure and Databricks", ESP Journal of Engineering & Technology Advancements, 1(2): 176-187.

Abstract:

Data engineering pipelines form the backbone of modern data-driven organizations: they process large volumes of data and prepare it for analysis. As organizations move more of their pipelines to the cloud, those pipelines must be scalable and flexible. This paper focuses on the design of scalable data engineering pipelines using Microsoft Azure and Databricks as the two platforms for handling large-scale data operations. Azure is a highly scalable cloud platform that offers services such as Azure Data Lake, Azure Synapse Analytics, and Azure Data Factory. Databricks, in turn, provides a unified data analytics architecture that integrates with Azure and supports Apache Spark analytics and modern machine learning applications. Together, these technologies give organizations a strong foundation for building highly scalable, efficient data processing pipelines that avoid bottlenecks. The paper discusses the basic architectural concerns and components needed to construct fault-tolerant pipelines. Topics covered include data ingestion, data storage with Azure Data Lake, real-time processing with Databricks, and data management with Azure Data Factory. Particular emphasis is placed on data consistency, latency, and pipeline throughput. Scalability is also examined, including the management of large data volumes, redundancy, and efficient resource usage in distributed systems. The integration between Azure and Databricks is discussed in detail, with a focus on the configuration needed for scalable and cost-optimal pipelines.
The paper then walks through an end-to-end process for constructing a scalable pipeline for real-time data analytics in the financial sector and presents the approach and the results. It also compares existing batch processing pipelines with newer real-time streaming pipelines. The paper closes with prospective directions for scalable data engineering and the ways organizations can improve pipeline efficiency with emerging trends such as serverless computing and artificial intelligence.
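
The batch versus streaming contrast mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation and uses no Azure or Databricks APIs: it is plain Python, and the trade records, ticker symbols, and function names (`batch_totals`, `micro_batches`, `streaming_totals`) are hypothetical, chosen only for illustration. The batch version aggregates once over the full data set, while the streaming version updates running totals after each micro-batch and emits an intermediate snapshot, which is the general idea behind micro-batch stream processing.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape: (ticker symbol, trade amount).
Trade = Tuple[str, float]


def batch_totals(trades: Iterable[Trade]) -> Dict[str, float]:
    """Batch style: read the whole data set, then aggregate once."""
    totals: Dict[str, float] = defaultdict(float)
    for symbol, amount in trades:
        totals[symbol] += amount
    return dict(totals)


def micro_batches(trades: Iterable[Trade], size: int) -> Iterator[List[Trade]]:
    """Split an incoming sequence of trades into fixed-size micro-batches."""
    chunk: List[Trade] = []
    for trade in trades:
        chunk.append(trade)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly smaller, micro-batch
        yield chunk


def streaming_totals(trades: Iterable[Trade], size: int) -> Iterator[Dict[str, float]]:
    """Streaming style: keep running state and emit a snapshot per micro-batch."""
    state: Dict[str, float] = defaultdict(float)
    for chunk in micro_batches(trades, size):
        for symbol, amount in chunk:
            state[symbol] += amount
        yield dict(state)  # copy, so later updates do not alter earlier snapshots


# Made-up sample data, for illustration only.
SAMPLE_TRADES: List[Trade] = [
    ("AAA", 100.0), ("BBB", 50.0), ("AAA", 25.0), ("CCC", 10.0), ("BBB", 5.0),
]

if __name__ == "__main__":
    print("batch result:", batch_totals(SAMPLE_TRADES))
    for i, snapshot in enumerate(streaming_totals(SAMPLE_TRADES, size=2), start=1):
        print(f"after micro-batch {i}:", snapshot)
```

In this sketch the last streaming snapshot equals the batch result, so the two styles agree on the final answer. The difference is that the streaming version makes partial results available before all of the data has arrived, at the cost of maintaining state between micro-batches.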

Keywords:

Data Engineering, Azure, Databricks, Scalable Pipelines, Cloud Computing, Apache Spark, Data Ingestion, Fault Tolerance.

ISSN : 2583-2646