ISSN : 2583-2646

PII De-Identification Techniques for Healthcare Data Warehouses

ESP Journal of Engineering & Technology Advancements
© 2023 by ESP JETA
Volume 3 Issue 1
Year of Publication : 2023
Authors : Narasimha Chaitanya Samineni
DOI : 10.56472/25832646/JETA-V3I3P118

Citation:

Narasimha Chaitanya Samineni, 2023. "PII De-Identification Techniques for Healthcare Data Warehouses", ESP Journal of Engineering & Technology Advancements 3(1): 229-236.

Abstract:

Healthcare data warehouses consolidate electronic health records, claims, lab systems, and operational data to support analytics, quality improvement, and research. However, these warehouses often contain protected health information (PHI) and other personally identifiable information (PII) that can be linked across sources, which makes privacy risk higher than in single-system datasets. This article presents a practical, risk-based framework for de-identifying healthcare warehouse data while preserving analytic utility. The framework aligns with regulatory expectations for de-identification, emphasizes realistic threat models, and maps common privacy techniques to healthcare warehouse patterns such as longitudinal patient linkage, cohort discovery, and machine learning feature stores. We describe a technique taxonomy spanning masking, suppression, pseudonymization and tokenization, generalization-based privacy models (the k-anonymity family), distributional protections (t-closeness), and differential privacy for aggregate outputs. We then propose a layered warehouse architecture that separates a restricted PHI vault from an analytics-ready de-identification zone, supported by strong access control, key management, and auditing. Finally, we provide utility and quality validation methods, operational governance controls, and an implementation blueprint to help organizations deploy de-identification as an engineered capability rather than a one-time data export step. [1][2][4][5][8]
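
To make the technique taxonomy named in the abstract concrete, the following is a minimal Python sketch of three of the listed techniques: keyed pseudonymization, generalization with a k-anonymity check, and Laplace noise for an aggregate count. It is an illustration under stated assumptions, not the article's implementation. The field names (mrn, age, zip3, dx), the sample rows, the key, the 10-year age bands, and all function names are hypothetical, and the sketch is not a compliance procedure for any regulation.

```python
import hashlib
import hmac
import random
from collections import Counter

# Hypothetical key for illustration only; a real deployment would keep
# the key in a managed key store inside the restricted PHI zone.
KEY = b"demo-key-not-for-production"


def pseudonymize(patient_id: str, key: bytes) -> str:
    """Keyed pseudonym (HMAC-SHA256): the same id and key always give the
    same token, so longitudinal linkage survives without exposing the id."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a fixed-width band."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def smallest_class_size(rows, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifiers.
    A table is k-anonymous for every k up to this value."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())


def laplace_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a counting query (sensitivity 1).
    The difference of two Exponential(rate=epsilon) draws is
    Laplace-distributed with scale 1/epsilon."""
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Made-up rows standing in for a restricted source table.
raw = [
    {"mrn": "A100", "age": 34, "zip3": "021", "dx": "asthma"},
    {"mrn": "A101", "age": 37, "zip3": "021", "dx": "diabetes"},
    {"mrn": "A102", "age": 52, "zip3": "100", "dx": "asthma"},
    {"mrn": "A103", "age": 58, "zip3": "100", "dx": "copd"},
]

# Analytics-zone view: direct identifier replaced by a token, age generalized.
deidentified = [
    {
        "token": pseudonymize(r["mrn"], KEY),
        "age_band": generalize_age(r["age"]),
        "zip3": r["zip3"],
        "dx": r["dx"],
    }
    for r in raw
]

k = smallest_class_size(deidentified, ["age_band", "zip3"])
asthma_count = sum(1 for r in deidentified if r["dx"] == "asthma")
noisy_asthma_count = laplace_count(asthma_count, epsilon=1.0, rng=random.Random())

print("smallest equivalence class:", k)
print("noisy asthma count:", round(noisy_asthma_count, 2))
```

The keyed token keeps records for one patient linkable across tables, while the class-size check and the noisy count address release-time risk for row-level and aggregate outputs respectively. Band width, quasi-identifier choice, and epsilon are policy decisions that this sketch fixes arbitrarily.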

References:

[1] U.S. Department of Health and Human Services (HHS), Office for Civil Rights, “Guidance on De-identification of Protected Health Information,” Nov. 26, 2012.

[2] U.S. Government Publishing Office, “45 CFR § 164.514: Standard: De-identification of protected health information (HIPAA Privacy Rule),” Electronic Code of Federal Regulations, accessed 2022.

[3] European Union, “Regulation (EU) 2016/679 (General Data Protection Regulation),” Official Journal of the European Union, 2016.

[4] ISO/IEC, “ISO/IEC 20889:2018 Privacy enhancing data de-identification terminology and classification of techniques,” International Organization for Standardization, 2018.

[5] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557–570, 2002.

[6] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “l-Diversity: Privacy beyond k-anonymity,” ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, 2007.

[7] N. Li, T. Li, and S. Venkatasubramanian, “t-Closeness: Privacy beyond k-anonymity and l-diversity,” in Proc. IEEE International Conference on Data Engineering (ICDE), 2007.

[8] C. Dwork, “Differential privacy,” in Proc. International Colloquium on Automata, Languages and Programming (ICALP), 2006.

[9] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Proc. Theory of Cryptography Conference (TCC), 2006.

[10] A. Narayanan and V. Shmatikov, “Robust de-anonymization of large sparse datasets,” in Proc. IEEE Symposium on Security and Privacy, 2008.

[11] K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A systematic review of re-identification attacks on health data,” PLoS ONE, vol. 6, no. 12, e28071, 2011.

[12] P. Ohm, “Broken promises of privacy: Responding to the surprising failure of anonymization,” UCLA Law Review, vol. 57, pp. 1701–1777, 2010.

[13] K. El Emam, Guide to the De-Identification of Personal Health Information. Boca Raton, FL, USA: Auerbach Publications, 2013.

[14] National Institute of Standards and Technology (NIST), “SP 800-38G: Recommendation for block cipher modes of operation: Methods for format-preserving encryption,” 2016.

[15] B. C. M. Fung, K. Wang, and P. S. Yu, “Privacy-preserving data publishing: A survey of recent developments,” ACM Computing Surveys, vol. 42, no. 4, 2010.

[16] L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control. New York, NY, USA: Springer, 2001.

[17] D. B. Rubin, “Statistical disclosure limitation,” Journal of Official Statistics, vol. 9, no. 2, pp. 461–468, 1993.

[18] T. Dalenius, “Towards a methodology for statistical disclosure control,” Statistik Tidskrift, vol. 15, pp. 429–444, 1977.

[19] S. N. Murphy et al., “Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2),” Journal of the American Medical Informatics Association (JAMIA), 2010.

[20] E. A. Voss et al., “Feasibility and utility of applications of the common data model to multiple, disparate observational health databases,” Journal of the American Medical Informatics Association (JAMIA), vol. 22, no. 3, pp. 553–564, 2015.

Keywords:

Healthcare Data Warehouse, PHI, PII, De-identification, Tokenization, k-anonymity, Differential Privacy, Re-identification Risk, Governance. [1][4][5][8]