WIP - Privacy-preserving machine learning: Methods, challenges and directions#

Note

Hi, these are my personal reading notes, and there may be mistakes in my understanding. Please feel free to correct me (hsiangjenli@gmail.com) if you find any. Thanks!

Before starting#

Before reading the paper, you should be familiar with the following basic concepts:

  • The entire ML pipeline

  • The participants in the ML pipeline

Key Terms#

  1. Privacy-preserving machine learning (PPML)

  2. Complete Model \(\rightarrow\) trained on a single machine

  3. Global Model \(\rightarrow\) trained across multiple machines

  4. Data Producer (DP)

  5. Model Consumer (MC)

  6. Computational Facility (CF)

  7. Confidential-level privacy

  8. Homomorphic encryption (HE)

  9. Functional encryption (FE)

  10. Differential privacy

  11. Multi-party computation (MPC)

  12. Secure multi-party computation (SMPC)

  13. Garbled circuit

  14. Oblivious transfer

Contributions [1]#

  1. Reviews existing privacy-preserving approaches

  2. Proposes an evaluation framework for PPML, which decomposes privacy-preserving features into distinct Phase, Guarantee, and Utility aspects (PGU).

  • Phase : Represents the use of privacy-preserving techniques at different stages in the ML pipeline

  • Guarantee : The level of privacy protection a technique provides in a specific scenario

  • Utility : The impact of privacy-preserving techniques on the model’s performance

Phases of ML Pipeline#

  • Techniques that can be applied to the training phase can also be applied to the serving phase. [1]

  • However, techniques that can be applied to the serving phase may not be applicable to the training phase. [1]

Figure: taken from the paper [1]

Privacy Preserving Data Preparation (Data Perspective)#

  1. Traditional anonymization mechanisms : Remove identifier information before training (a minimal \(k\)-anonymity check is sketched after this list)

    • \(k\)-anonymity [2]

    • \(l\)-diversity [3]

    • \(t\)-closeness [4]

  2. Surrogate dataset

    • Grouping the anonymized data [5]

    • Abstracting the data by sketch techniques [6, 7] (see the count-min sketch after this list)

  3. Differential privacy mechanism [8, 9, 10] : Add noise to the data to avoid privacy leakage (a Laplace-mechanism sketch appears after this list)

    • The leakage in question comes from inference or de-anonymization attacks [1], such as [11, 12, 13, 14]

  4. Encrypted data

    • Confidential-level privacy
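
To make the anonymization item above concrete, here is a minimal sketch of a \(k\)-anonymity check. The records, quasi-identifier columns, and generalized values (age brackets, truncated ZIP codes) are made up for the demo; real \(k\)-anonymity [2] also involves choosing how to generalize or suppress values, which is omitted here.

```python
from collections import Counter

# Minimal k-anonymity check sketch (assumption: records are already generalized;
# the quasi-identifiers and values below are made up for the demo).
records = [
    {"age": "30-39", "zip": "537**", "disease": "flu"},
    {"age": "30-39", "zip": "537**", "disease": "cold"},
    {"age": "40-49", "zip": "537**", "disease": "flu"},
    {"age": "40-49", "zip": "537**", "disease": "asthma"},
]
quasi_identifiers = ("age", "zip")

def is_k_anonymous(rows, qi, k):
    """Every combination of quasi-identifier values must appear at least k times."""
    groups = Counter(tuple(r[c] for c in qi) for r in rows)
    return min(groups.values()) >= k

print(is_k_anonymous(records, quasi_identifiers, k=2))  # True
print(is_k_anonymous(records, quasi_identifiers, k=3))  # False
```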
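For the sketch-based surrogate idea, the cited papers [6, 7] sketch gradients for communication-efficient federated learning; the generic count-min sketch below is only a minimal illustration of replacing raw data with a compact, lossy summary. The class name, table sizes, and hashing scheme are my own demo choices.

```python
import numpy as np

# Minimal count-min sketch: a compact, lossy summary of value frequencies.
class CountMinSketch:
    def __init__(self, width=64, depth=4, seed=0):
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.seeds = np.random.default_rng(seed).integers(1, 2**31, size=depth)

    def _cols(self, item):
        # One hashed column index per row of the table.
        return [hash((int(s), item)) % self.table.shape[1] for s in self.seeds]

    def add(self, item):
        for row, col in enumerate(self._cols(item)):
            self.table[row, col] += 1

    def estimate(self, item):
        # Taking the minimum over rows bounds the overestimation error.
        return min(self.table[row, col] for row, col in enumerate(self._cols(item)))

cms = CountMinSketch()
for zip_code in ["53703"] * 5 + ["53704"] * 2:
    cms.add(zip_code)
print(cms.estimate("53703"))  # ≈ 5 (never an underestimate)
```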
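And for the differential privacy item, here is a minimal Laplace-mechanism sketch for a counting query. The dataset, threshold, and \(\epsilon\) value are arbitrary demo choices; a count query has sensitivity 1, so the noise scale is \(1/\epsilon\) [10].

```python
import numpy as np

# Minimal Laplace-mechanism sketch for a counting query (sensitivity of a
# count is 1; the epsilon value is arbitrary for the demo).
def noisy_count(values, threshold, epsilon=1.0, rng=np.random.default_rng(0)):
    true_count = sum(v > threshold for v in values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)   # scale = sensitivity / ε
    return true_count + noise

incomes = [30_000, 52_000, 75_000, 41_000, 98_000]
print(noisy_count(incomes, threshold=50_000))  # true answer is 3, plus Laplace noise
```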

Privacy Preserving Model Training (Computational Perspective)#

This phase supports computation on encrypted data [1]. Typically, encryption techniques involve two main steps, encoding and decoding [1]; minimal sketches of fixed-point encoding and of a toy additively homomorphic scheme follow below.

  • Encoding \(\rightarrow\) Transform floating-point values into integers

  • Decoding \(\rightarrow\) Recover the floating-point values from trained model or crypto-based training results
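
A minimal fixed-point sketch of this encode/decode step, assuming a simple power-of-two scale factor; real schemes such as CKKS [16] use their own, more elaborate encodings.

```python
# Minimal fixed-point encoding sketch (assumption: a scale factor of 2**16).
SCALE = 2 ** 16

def encode(x: float) -> int:
    """Map a float to an integer so crypto schemes can operate on it."""
    return round(x * SCALE)

def decode(v: int, depth: int = 1) -> float:
    """Recover the float; after multiplying two encodings, divide by SCALE twice."""
    return v / SCALE ** depth

a, b = encode(3.14), encode(2.0)
print(decode(a + b))           # ≈ 5.14  (addition keeps one scale factor)
print(decode(a * b, depth=2))  # ≈ 6.28  (multiplication squares the scale)
```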

  1. Homomorphic encryption :

    • BGV scheme [15]

    • CKKS [16] : Supports approximate arithmetic computation

  2. Functional encryption :

    • Multi-party functional encryption [17, 18]
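
To show what "computation on encrypted data" looks like end to end, here is a toy additively homomorphic scheme (textbook Paillier, not the BGV or CKKS schemes cited above): multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts. The tiny primes and lack of any hardening are for demonstration only.

```python
import math
import random

# Toy Paillier cryptosystem: additively homomorphic encryption.
# DEMO ONLY: tiny Mersenne primes, no padding, not constant-time.
p, q = 2**13 - 1, 2**17 - 1           # small, well-known primes (demo only)
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)          # λ = lcm(p-1, q-1)
mu = pow(lam, -1, n)                  # μ = λ⁻¹ mod n (valid because g = n+1)

def encrypt(m: int) -> int:
    """E(m) = (1+n)^m * r^n mod n²  with random r coprime to n."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    """D(c) = L(c^λ mod n²) * μ mod n,  where L(u) = (u-1) // n."""
    u = pow(c, lam, n2)
    return ((u - 1) // n) * mu % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts.
c1, c2 = encrypt(123), encrypt(456)
assert decrypt((c1 * c2) % n2) == 123 + 456
```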

Privacy Preserving Model Serving (Model Perspective)#

Includes model deployment and inference [1]

  1. Private aggregation of teacher ensembles (PATE); a noisy teacher-vote sketch appears after this list

  2. Model transform

  3. Model compression
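
As a rough sketch of the PATE aggregation step: each teacher model votes for a label, Laplace noise is added to the vote counts, and the noisy argmax becomes the label the student trains on. The teacher predictions and noise scale below are made-up demo values, and the real framework also tracks a privacy budget, which is omitted here.

```python
import numpy as np

# Minimal PATE-style noisy aggregation sketch for a single query.
def noisy_teacher_vote(teacher_preds, num_classes, gamma=1.0, rng=np.random.default_rng(0)):
    votes = np.bincount(teacher_preds, minlength=num_classes)      # count teacher votes
    noisy_votes = votes + rng.laplace(0.0, 1.0 / gamma, size=num_classes)
    return int(np.argmax(noisy_votes))   # the student trains on this noisy label

teacher_preds = np.array([2, 2, 1, 2, 0, 2, 1, 2])   # 8 teachers, 3 classes
print(noisy_teacher_vote(teacher_preds, num_classes=3))
```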

Privacy Guarantee#

  1. Object-Oriented Privacy Guarantee

    • Data-oriented privacy guarantee : Prevents the leakage of data, but sacrifices some data utility [1]

      • Anonymization mechanisms need to aggregate and remove the appropriate feature values; at the same time, certain quasi-identifier feature values are erased altogether

      • Differential privacy requires adding noise, controlled by a privacy budget, to the data samples.

      • Encrypted data may ensure the dataset’s confidentiality, but it adds an extra processing burden to the subsequent machine learning training.

    • Model-oriented privacy guarantee : Prevents adversaries from extracting private information through repeated model queries [1]

      • Perturb the trained model

        • DP-SGD [19] : Adds noise to the clipped gradients to achieve a differentially private model (a minimal DP-SGD step sketch appears after this list)

      • Regulate the model access times and patterns

  2. Pipeline-Oriented Privacy Guarantee
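
A minimal sketch of one DP-SGD [19] step, assuming plain NumPy and a linear model with squared loss: compute per-example gradients, clip each to a fixed norm, add Gaussian noise to the sum, and take an averaged step. The learning rate, clip norm, and noise multiplier are arbitrary demo values, and the privacy accountant used in the real algorithm is omitted.

```python
import numpy as np

# Minimal DP-SGD step sketch: per-example gradient clipping + Gaussian noise.
def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_mult=1.1,
                rng=np.random.default_rng(0)):
    grads = []
    for xi, yi in zip(X, y):                                  # per-example gradients
        g = 2 * (xi @ w - yi) * xi                            # gradient of squared loss
        g = g / max(1.0, np.linalg.norm(g) / clip_norm)       # clip to clip_norm
        grads.append(g)
    g_sum = np.sum(grads, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=w.shape)  # Gaussian noise
    return w - lr * (g_sum + noise) / len(X)                  # noisy averaged step

w = np.zeros(3)
X, y = np.random.default_rng(1).normal(size=(8, 3)), np.ones(8)
w = dp_sgd_step(w, X, y)
```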

References#

[1]

Runhua Xu, Nathalie Baracaldo, and James Joshi. Privacy-preserving machine learning: methods, challenges and directions. arXiv preprint arXiv:2108.04417, 2021.

[2]

Latanya Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.

[3]

Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. l-diversity: privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3–es, 2007.

[4]

Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering, 106–115. IEEE, 2007.

[5]

Mengwei Yang, Linqi Song, Jie Xu, Congduan Li, and Guozhen Tan. The tradeoff between privacy and accuracy in anomaly detection using federated xgboost. arXiv preprint arXiv:1907.07157, 2019.

[6]

Tian Li, Zaoxing Liu, Vyas Sekar, and Virginia Smith. Privacy for free: communication-efficient learning with differential privacy using sketches. arXiv preprint arXiv:1911.00972, 2019.

[7]

Farzin Haddadpour, Belhal Karimi, Ping Li, and Xiaoyun Li. Fedsketch: communication-efficient and private federated learning via sketching. arXiv preprint arXiv:2008.04975, 2020.

[8]

Cynthia Dwork. Differential privacy: a survey of results. In International conference on theory and applications of models of computation, 1–19. Springer, 2008.

[9]

Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. Boosting and differential privacy. In 2010 IEEE 51st annual symposium on foundations of computer science, 51–60. IEEE, 2010.

[10]

Cynthia Dwork, Aaron Roth, and others. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.

[11]

Gilbert Wondracek, Thorsten Holz, Engin Kirda, and Christopher Kruegel. A practical attack to de-anonymize social network users. In 2010 IEEE Symposium on Security and Privacy, 223–238. IEEE, 2010.

[12]

Md Atiqur Rahman, Tanzila Rahman, Robert Laganière, Noman Mohammed, and Yang Wang. Membership inference attack against differentially private deep learning model. Trans. Data Priv., 11(1):61–79, 2018.

[13]

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), 3–18. IEEE, 2017.

[14]

Jianwei Qian, Xiang-Yang Li, Chunhong Zhang, and Linlin Chen. De-anonymizing social networks and inferring private attributes using knowledge graphs. In IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications, 1–9. IEEE, 2016.

[15]

Masahiro Yagisawa. Fully homomorphic encryption without bootstrapping. Cryptology ePrint Archive, 2015.

[16]

Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. Homomorphic encryption for arithmetic of approximate numbers. In Advances in Cryptology–ASIACRYPT 2017: 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, December 3-7, 2017, Proceedings, Part I 23, 409–437. Springer, 2017.

[17]

Michel Abdalla, Florian Bourse, Angelo De Caro, and David Pointcheval. Simple functional encryption schemes for inner products. In IACR International Workshop on Public Key Cryptography, 733–751. Springer, 2015.

[18]

Michel Abdalla, Dario Catalano, Dario Fiore, Romain Gay, and Bogdan Ursu. Multi-input functional encryption for inner products: function-hiding realizations and constructions without pairings. In Advances in Cryptology–CRYPTO 2018: 38th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 19–23, 2018, Proceedings, Part I 38, 597–627. Springer, 2018.

[19]

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, 308–318. 2016.