Apache Iceberg for Longitudinal Patient Record Versioning in Cloud Data Lakes

Authors

  • Parth Jani, Business SME/Product Owner at Florida Blue, USA
  • Sangeeta Anand, Senior Business System Analyst at Continental General, USA

Keywords:

Apache Iceberg, longitudinal patient records, healthcare data management

Abstract

Apache Iceberg is an open-source table format designed for managing large data lakes, with notable strengths in versioning, auditing, and data integrity. In healthcare, and in cloud data lakes in particular, maintaining exact historical records of longitudinal patient information is challenging. Appropriate audit trails depend on properly versioning patient data over time, which makes it difficult for healthcare providers to ensure compliance with privacy and security rules. Apache Iceberg addresses these problems by enabling more efficient management of big data through built-in versioning and incremental updates. These features let healthcare providers track and preserve every change to patient data, supporting data integrity and compliance with legal standards such as HIPAA. In complex data settings, Iceberg's ACID transaction support also guarantees consistency and correctness. The results show that with Apache Iceberg, patient records remain auditable, accurate, and properly versioned over time, providing a dependable solution for healthcare data management. Integrating Apache Iceberg into cloud-based healthcare data lakes improves data governance, patient data monitoring, and secure, compliant auditing of longitudinal records.

Published

10-09-2021

How to Cite

[1]
P. Jani and S. Anand, “Apache Iceberg for Longitudinal Patient Record Versioning in Cloud Data Lakes”, Essex Journal of AI Ethics and Responsible Innovation, vol. 1, pp. 338–357, Sep. 2021, Accessed: May 31, 2026. [Online]. Available: https://ejaeai.org/index.php/publication/article/view/63