HyperLogLog-Based Compliance Coverage Estimation for Distributed Datasets

Chiranjeevi Devi; Nithin Vunnam; Jawaharbabu Jeyaraman

Authors

Chiranjeevi Devi, LinkedIn Corp, USA
Nithin Vunnam, Cardinal Health, USA
Jawaharbabu Jeyaraman, TransUnion, USA

Keywords:

HyperLogLog, compliance estimation, differential privacy, data governance, multicloud environments

Abstract

This work uses HyperLogLog (HLL) sketches to estimate how many records in large, distributed datasets are subject to regulatory constraints, and thereby to evaluate compliance coverage. Traditional methods for detecting and counting regulated records in data systems with thousands of tables spanning multiple clouds require extensive data scans and expose sensitive identifiers, making them expensive and risky. Compliance-relevant record sets can instead be represented as compact HLL sketches, which speed up set union operations across datasets without compromising data privacy. Differential privacy masking obscures the cardinality estimates and guards against inference attacks. The approach also produces probabilistic confidence intervals to support policy-aware decision-making. In the experiment, real-time compliance dashboards were used to prioritize maintenance. The estimates can be accurate to within statistical error margins, and the approach works in resource-constrained settings. Overall, the work improves the scalability and privacy of compliance analytics in data governance architectures.
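
The pipeline summarized above (per-dataset sketches, union across datasets, a confidence interval, and a noise-masked estimate) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `HLLSketch`, `merge`, `interval`, and `noisy_estimate`, the precision `p=12`, and the use of SHA-256 as the hash are all assumptions made for illustration. The Laplace step in particular treats the estimate as if it had sensitivity 1, which is a simplification and not a proven differential privacy guarantee for sketches.

```python
import hashlib
import math
import random


class HLLSketch:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the record identifier.
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Sketch of the set union: element-wise maximum of registers."""
        assert self.p == other.p, "sketches must share the same precision"
        merged = HLLSketch(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)      # linear counting, small range
        return raw

    def interval(self, z=1.96):
        """Approximate interval from the relative standard error 1.04/sqrt(m)."""
        est = self.estimate()
        se = 1.04 / math.sqrt(self.m)
        return est * (1 - z * se), est * (1 + z * se)


def noisy_estimate(sketch, epsilon, rng=random):
    """Estimate plus Laplace noise of scale 1/epsilon (assumes sensitivity 1)."""
    # The difference of two Exp(epsilon) draws is Laplace with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return sketch.estimate() + noise


# Two hypothetical tables whose regulated record IDs overlap.
table_a = HLLSketch()
table_b = HLLSketch()
for record_id in range(0, 20000):
    table_a.add(record_id)
for record_id in range(10000, 30000):
    table_b.add(record_id)

union = table_a.merge(table_b)   # no raw identifiers are exchanged
low, high = union.interval()
print(round(union.estimate()), (round(low), round(high)))
print(round(noisy_estimate(union, epsilon=1.0)))
```

Only the register arrays need to cross dataset or cloud boundaries for the union, which is what avoids a full scan over shared identifiers. With `p=12` the relative standard error is about 1.6%, so a larger `p` trades memory for tighter intervals.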
