ISSN : 2583-2646

Data Quality Framework-Using Great Expectations for ETL Pipelines

ESP Journal of Engineering & Technology Advancements
© 2025 by ESP JETA
Volume 5  Issue 1
Year of Publication : 2025
Authors : Sanjay Puthenpariyarath
DOI : 10.56472/25832646/JETA-V5I1P113

Citation:

Sanjay Puthenpariyarath, 2025. "Data Quality Framework-Using Great Expectations for ETL Pipelines", ESP Journal of Engineering & Technology Advancements  5(1): 106-112.

Abstract:

Although Extract, Transform, Load (ETL) pipelines process billions of records per day in industries such as finance, healthcare, and e-commerce, maintaining high data quality remains a major bottleneck. This paper proposes a scalable data quality framework built on Great Expectations (GX), an open-source Python tool. The framework addresses duplicate records, row-count disparities between source and target tables, column format validation, and null detection in essential columns. By automating validation checks and integrating GX into ETL workflows (e.g., Apache Airflow, AWS Glue, Apache Spark), the solution removes the need for additional visualization tools and costs less than those tools. GX is compared with Deequ and Soda Core/Cloud in terms of flexibility, ease of integration, and scalability. The implementation steps of the methodology are validated through a simulated ETL workflow processing 100 million records, which shows a significant improvement in data quality from pre-processing to post-processing. The findings support GX as a tool for managing real-world data quality issues and indicate ways in which it could be improved, such as AI-driven expectation generation.
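
To make the four checks named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation and does not use the GX API; the function names, the sample rows, and the email pattern are all hypothetical. In GX itself, these checks roughly correspond to expectations such as `expect_column_values_to_be_unique`, `expect_table_row_count_to_equal`, `expect_column_values_to_match_regex`, and `expect_column_values_to_not_be_null`.

```python
import re
from collections import Counter


def null_count(rows, column):
    """Count rows where `column` is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def duplicate_keys(rows, key):
    """Return key values that occur more than once, in first-seen order."""
    counts = Counter(row.get(key) for row in rows)
    return [value for value, n in counts.items() if n > 1]


def format_violations(rows, column, pattern):
    """Return non-null values in `column` that do not fully match `pattern`."""
    regex = re.compile(pattern)
    return [
        row[column]
        for row in rows
        if row.get(column) is not None and not regex.fullmatch(str(row[column]))
    ]


def validate(source_rows, target_rows, key, required, formats):
    """Run the four checks on the target table and collect a report."""
    report = {
        "row_count_match": len(source_rows) == len(target_rows),
        "duplicate_keys": duplicate_keys(target_rows, key),
        "null_counts": {c: null_count(target_rows, c) for c in required},
        "format_violations": {
            c: format_violations(target_rows, c, p) for c, p in formats.items()
        },
    }
    # The run succeeds only if every individual check is clean.
    report["success"] = (
        report["row_count_match"]
        and not report["duplicate_keys"]
        and all(n == 0 for n in report["null_counts"].values())
        and all(not v for v in report["format_violations"].values())
    )
    return report


# Hypothetical sample data: the target has one extra row, a duplicate id,
# a null email, and a malformed email.
source = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "c@example.com"},
]
target = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "not-an-email"},
    {"id": 3, "email": "c@example.com"},
]

report = validate(
    source,
    target,
    key="id",
    required=["email"],
    formats={"email": r"[^@\s]+@[^@\s]+\.[^@\s]+"},
)
print(report)
```

In a pipeline, a report of this kind would typically gate the load step: an orchestrator task could fail when `report["success"]` is false, so that invalid batches are not published downstream.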

Keywords:

Apache Airflow, AWS Glue, Apache Spark, Data Anomalies, Data Pipeline, Data Quality, Data Quality Framework, Data Quality Reports, Deequ, Expectation Suite, ETL, Great Expectations (GX), HTML reports, Soda Core/Cloud.