ISSN : 2583-2646

Scalable PII Discovery across Mainframe, SAP, RDBMS & Unstructured Systems

ESP Journal of Engineering & Technology Advancements
© 2022 by ESP JETA
Volume 2  Issue 4
Year of Publication : 2022
Authors : Narasimha Chaitanya Samineni
DOI : 10.56472/25832646/ESP-V2I4P127

Citation:

Narasimha Chaitanya Samineni, 2022. "Scalable PII Discovery across Mainframe, SAP, RDBMS & Unstructured Systems", ESP Journal of Engineering & Technology Advancements, 2(4): 179-191.

Abstract:

Enterprises increasingly operate heterogeneous data estates that span legacy mainframes, SAP ERP landscapes, relational database platforms, and rapidly growing unstructured repositories such as documents, emails, call transcripts, and collaboration content. In such environments, scalable discovery of personally identifiable information (PII) is foundational for privacy compliance, security risk reduction, and governance readiness. However, PII discovery at enterprise scale is challenging due to inconsistent data models, limited metadata in legacy systems, varied encodings and field semantics, and the complexity of detecting PII in unstructured content with acceptable accuracy and performance [4], [9]. This paper proposes a scalable, hybrid PII discovery framework that combines rule-based detection, metadata-driven inference, sampling strategies, and content analytics to identify and classify PII consistently across mainframe datasets, SAP tables, RDBMS, and unstructured systems. The framework integrates distributed scanning, centralized indexing, lineage-aware governance, and audit-grade reporting to improve discovery completeness and reduce operational effort. Evaluation outcomes demonstrate improved coverage, reduced false negatives in mixed environments, and practical performance characteristics suitable for large enterprise deployments.
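As an illustration of the hybrid approach the abstract describes, the following minimal Python sketch combines metadata-driven inference (column-name hints), sampling, and rule-based detection (regular expressions) for a single column. It is not the paper's implementation: the rule set `RULES`, the hint table `NAME_HINTS`, the function `classify_column`, and the `sample_size` and `threshold` parameters are all hypothetical names and values chosen for illustration.

```python
import random
import re

# Hypothetical rule set: regex patterns for two common PII types.
RULES = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical metadata hints: column-name substrings that suggest a PII type.
NAME_HINTS = {"email": "email", "mail": "email", "ssn": "us_ssn"}


def classify_column(name, values, sample_size=100, threshold=0.5, seed=0):
    """Return the set of PII labels inferred for one column.

    A label is assigned if the column name contains a known hint, or if
    at least `threshold` of the sampled non-empty values match a rule.
    """
    labels = set()

    # Metadata-driven inference: inspect the column name only.
    lowered = name.lower()
    for hint, label in NAME_HINTS.items():
        if hint in lowered:
            labels.add(label)

    # Sampling: inspect a bounded random subset instead of every value.
    non_empty = [v for v in values if v]
    if len(non_empty) > sample_size:
        non_empty = random.Random(seed).sample(non_empty, sample_size)

    # Rule-based detection: share of sampled values matching each pattern.
    for label, pattern in RULES.items():
        if non_empty:
            hits = sum(1 for v in non_empty if pattern.search(str(v)))
            if hits / len(non_empty) >= threshold:
                labels.add(label)
    return labels
```

In this sketch, a column named `CONTACT_EMAIL` is labeled from its name alone, while a generically named column such as `FIELD_07` is labeled only when enough sampled values match a pattern. The second case is the one that matters for legacy sources with limited metadata.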

References:

[1] California State Legislature, “California Consumer Privacy Act of 2018 (CCPA),” Civil Code §1798.100 et seq., 2018.

[2] European Union, “General Data Protection Regulation (GDPR),” Regulation (EU) 2016/679, 2018.

[3] NIST, Guide to Protecting the Confidentiality of Personally Identifiable Information (PII), NIST SP 800-122, 2010.

[4] NIST, Security and Privacy Controls for Information Systems and Organizations, NIST SP 800-53 Rev. 5, 2020.

[5] NIST, Privacy Framework: A Tool for Improving Privacy Through Enterprise Risk Management, Version 1.0, 2020.

[6] ISO/IEC 27018, Code of Practice for Protection of PII in Public Clouds Acting as PII Processors, ISO, 2019.

[7] ISO/IEC 27701, Extension to ISO/IEC 27001 and ISO/IEC 27002 for Privacy Information Management, ISO, 2019.

[8] A. Cavoukian, Privacy by Design: The 7 Foundational Principles, Information and Privacy Commissioner of Ontario, 2011.

[9] DAMA International, DAMA-DMBOK: Data Management Body of Knowledge, 2nd ed., Technics Publications, 2017.

[10] PCI Security Standards Council, PCI DSS: Requirements and Testing Procedures, v3.2.1, 2018.

[11] R. J. Anderson, Security Engineering: A Guide to Building Dependable Distributed Systems, 2nd ed., Wiley, 2008.

[12] M. Bishop, Computer Security: Art and Science, 2nd ed., Addison-Wesley, 2018.

[13] D. Loshin, The Practitioner’s Guide to Data Quality Improvement, Morgan Kaufmann, 2010.

[14] Apache Tika Project, “Apache Tika: Content Analysis Toolkit,” Apache Software Foundation Documentation, 2021.

[15] Apache Spark Project, “Apache Spark: Unified Analytics Engine,” Apache Software Foundation Documentation, 2021.

[16] IBM, Data Governance and Privacy Management for Hybrid Cloud, IBM Redbooks, 2020.

[17] Microsoft, Data Protection and Privacy in Azure, Microsoft Documentation, 2021.

[18] Amazon Web Services, Data Protection and Privacy Best Practices, AWS Whitepaper, 2021.

[19] Oracle, Data Governance and Compliance for Enterprise Data Platforms, Oracle Documentation, 2021.

[20] SAP, Data Protection and Privacy Guide, SAP Documentation, 2021.

Keywords:

PII Discovery, Data Classification, Mainframe Data Governance, SAP Data Privacy, RDBMS Profiling, Unstructured Data Analytics, Content Parsing, Entity Extraction, Metadata Catalog, Data Lineage, Privacy Compliance, Data Inventory, Sensitive Data Detection, Regulatory Technology (RegTech), Security Controls.