class: theme layout: true --- class: center, middle # Talk 15 - NeurIPS 2020 ## Privacy during training: from prototype to Large Scale Data ![:scale 75%](intro.jpg) 17/09/2021 - Talk 15 - Saurabh B --- class: middle ## Information * Github Repository - https://github.com/bakkenbaeck/ml-seminar-information-theory * Original talk - "Robustness, Verification, Privacy: Addressing Machine Learning Adversaries" * Focus on "Privacy" ??? - Practical information about updates to index page and talks on the github repo - Original talk had three parts, explain each part - Cryptography inspired models and results to address three challenges that emerge when worst-case adversaries enter the machine learning landscape. - These challenges include verification of machine learning models given limited access to good data - training at scale on private training data - robustness against adversarial examples controlled by worst case adversaries. - robustness - accurate predictions when test distribution adversarial deviates from training distribution --- class: img-left ## Author Bio
1
.left-column[ ![:scale 100%](author.jpeg) Shafrira Goldwasser ] .right-column[ * Professor of Electrical Engineering and Computer Science at MIT * Professor of mathematical sciences at the Weizmann Institute of Science, Israel * Winner of the Turing Award in 2012, Gödel Prize (1993, 2001) * Area of interest include: Theoretical computer science, Cryptography ] .footnote[.red.bold[1:] https://simons.berkeley.edu/people/shafi-goldwasser] ??? - Prestigious Turing award (Nobel prize of Computing!) and many more awards - One of the inventor of Zero knowledge proofs - Presentation and paper co-author with other researchers --- class: middle ## Addressing Machine Learning Adversaries: Privacy * Based on the Paper "Secure large-scale genome-wide association studies using homomorphic encryption" by Blatt et al.
2
* Concerned with large-scale statistical analyses over encrypted data .footnote[.red.bold[2:] https://www.pnas.org/content/pnas/117/21/11608.full.pdf] ??? - secure and efficient multi-party computation over large scale data --- class: middle ## Privacy: Why? * Quality/Quantity of data usually determines the performance of the model * In many cases the data comes from/belongs to an individual * Data goes to third party ??? - data privacy is Data used in both training and testing - Third parties may or may not be trusted - Data is power! ensuring privacy to make sure that individuals has power and not the big companies --- class: middle ## Data privacy at Scale * Example: Data driven analytics in Health Care and Drug development * Collaborative research ??? - there are methods which can be used to ensure privacy but real challenge in practical ML world concerns big data/scale - good examples is analytics in Health Care and Drug development requires large data sets and even combining data sets from different sources - Collaborative research is quite useful but bring in problem of data at scale --- class: middle ## Real world Example: Genome Wide Association Studies * Large scale observation based study * Association between genetic variants and trait/disease * These variants are represented by Single-Nucleotide Polymorphisms (SNPs) ??? - each genotype includes roughly 0.5 - 5 million SNP, i.e. larrgee data - GWAS evaluates over one SNP at a time --- class: middle ## GWAS for COVID-19 * Labs with private genomic data doing GWAS to identify genetic variants associated with response/proneness to COVID * Understand who's most at risk, prioritize for treatment/vaccine * Find new drugs/Vaccines ??? - study participant includes those we have and have not tested positive for the virus - AND those who have had different levels(mild/severe/no) symptoms - compare DNA of people with severe symptoms to those had mild/no symptoms - effectively looking for genetic variant common in risk group - this sounds and great but the issue of privacy prevails --- class: middle ## Privacy issues prevails * Individual level privacy is at stake * Identifying Personal Genomes by Surname Inference by Gymrek et al.
3
.footnote[.red.bold[3:] https://pubmed.ncbi.nlm.nih.gov/23329047/] ??? - you can learn a lot about a person by looking at their genomic data even if data doesn't contain their name or other identifying information - paper showed that by looking at Y-chromosome data and combining other meta data such as age one can identify the person - additionally many times data is not stored at one place which increases the probability of data leak leading to severe privacy problems --- class: middle ## Cryptography to the rescue! * Garbled circuits * Arithmetic MPC * Homomorphic Encryption ??? - using cryptography to ensure privacy of data - there has lot of development since 1980 in this area - similar to zero knowledge proofs in last talk - these methods are great but has there pros and cons, pick based on the problem --- class: middle ## Garbled circuits * Inspired from Yao's Millionaires' problem * Enables secure two party secure computation ??? - The problem discusses two millionaires, Alice and Bob, who are interested in knowing which of them is richer without revealing their actual wealth. - introduced in 1982 by computer scientist and computational theorist Andrew Yao - two mistrusting parties can jointly evaluate a function over their private inputs without the presence of a trusted third party - In the garbled circuit protocol, the function has to be described as a Boolean circuit. - convert ML model to boolean circuits --- class: middle ## Oblivious Transfer * In the garbled circuit protocol, oblivious transfer is used to communicate between two parties involved * The sender has two messages \\( m_0 \\) and \\( m_1 \\), and the receiver has a bit b * The receiver wishes to receive \\( m_b \\), without the sender learning b * The sender wants to ensure that the receiver receives only one of the two messages. ??? - to understand garbled gate we need to understand way of communication first - A garbled boolean circuit is a collection of garbled boolean gates. - First understand what a single garbled gate is and then easily generalize to the entire garbled circuit. - Once input values a, b are selected, we want to compute o(a, b) securely. --- class: middle ## Garbled Computation Table ![:scale 100%](yao_circuit.png) ??? - example: let's say two people are trying to decide if they should work together - problem can be understood as a AND gate - W_0_G, W_0_E, W_1_G, W_1_E, O_0, O_1 are labels for g, e and output - View W_0_G, W_0_E, W_1_G, W_1_E, O_0, O_1 as encryption keys and encrypt O_0, O_1 under appropriate pairs of input keys. - Given W_a_G, W_b_E only one row can be decrypted - sender encrypts each row of the table, creating the GCT - sender permutes it (rearranges it), so that the key’s position reveals nothing about the value that it is associated with. - sender sends it over to receiver, along with her input key W_a_prime_G, with a_prime her input value. - receiver still needs his own key to decrypt the GCT so sender must send it to him. - If sender sends both W_0_E, W_1_E to receiver, then receiver can decrypt more and if receiver asks for sender the key that corresponds to his input, then sender learns receiver’s input - In order to calculate the circuit, receiver needs to garble his own input as well. - To this end, receiver needs sender to help him, because only the garbler knows how to encrypt. - Finally, receiver can encrypt his input through oblivious transfer. - receiver now has both keys and the GCT and he can compute the gate by decrypting GCT. - receiver can decrypt only one line of the GCT, exactly because of its construction. - receiver Sends output to sender and the computation of the garbled gate is over --- class: middle ## Homomorphic Encryption * Form of encryption that permits users to perform computations directly on its encrypted data * The encryption and decryption functions can be thought of as homomorphisms between plaintext and ciphertext spaces * Trivial example: Anonymous polling ??? - types of homomorphic encryption are partially homomorphic, somewhat homomorphic, leveled fully homomorphic, and fully homomorphic encryption - Poll: if CF should be held or not? (not on slack) tobias will generate public keys and you can encrypt your answer (yes or no i.e. 0/1) - Send encrypted vote to tobias, tobias will add up all the encrypted votes and obtain the encrypted result - Finally tobias will decrpyt the result using the private key and check if total is more than half of total number of people in BB --- class: middle ## GWAS using Secure MPC * Based on Jagadeesh et al., Cho et al * Study of age related macular generation on 21k individuals and 220k SNPs * Extrapolated runtime for 100k individuals and 0.5 mil SNPs: 193 hrs ??? - secure MPC: split data using sophisticated techniques & secret share -> send to severs (servers don't communicate) -> reconstruct results - The studies achieved very high accuracy on statistical tests - but extrapolated was not efficient -- * Why not use HE? * Cho et al. predicts that HE is not viable for large scale GWAS * Jagadeesh et al. predicts that HE would be 5-10k slower that secure MPC for GWAS ??? - HE looks like a good candidates to ensure privacy - Cho et al.: HE will take many years or computation or petabytes of communication at the scale of million genomes --- class: middle ## GWAS using Homomorphic Encryption * Paper
4
proposes toolbox of statistical techniques that * uses HE to perform large scale GWAS without decryption .footnote[.red.bold[4:] https://www.pnas.org/content/pnas/117/21/11608.full.pdf] ??? - in contrast to the claim that HE is not viable for large-scale GWASs --- class: middle .right-column[ ![:scale 100%](gwa_homomorphic_encryption.png) ] .left-column[ ### HE GWAS schema * Encrypt and send the data * Process in the "Cloud", return the encrypted results * Decrypt the result ] ??? - First, study participants obtain a public key from the GWAS coordinator (not shown in fig) - Each of them encrypts their data using the public key, and sends the encrypted data to the Encrypted Data Bank, storing all encrypted individual-level data from many study participants. - When a specific study is initiated by the GWAS coordinator, the encrypted data for the individuals in the study get transmitted to the HE Compute Cloud for a noninteractive secure computation. - The HE Compute Cloud computes the results and sends them in encrypted form to the GWAS coordinator. - Finally, the GWAS coordinator decrypts the results and routes them to one of the viewers. --- class: middle ## Findings * Could run statistical tests (Alleclic chi-sqaure, logisitc regression) for * Dataset: Approx 26k individuals, 260k number of SNPs * Extrapolated runtime for 100k individuals and 0.5 million SNPs: 5.6 hrs, 234 hrs ??? - Extrapolated runtime: chi-square(prior 193 hrs) and logistic regression(practically impossible) - prior study using multi-party computation --- class: middle ### Linear run time scaling and extrapolation to 100K individuals ![:scale 100%](extrapolated_time.png) ??? - findings shows that there in NO loss in accuracy overall - one of the important results concerns scaling i.e. what happens when you 100k or more individuals involved --- class: middle ## Future work * HE can be used for cases beyond GWAS * Parallelism * Open sources packages ??? - wide variety of use cases for large scale data - this should become norm in future - improve efficiency using parallelism and other optimization techniques - build community and usages with open source --- # Thank You! # Thank You! # Thank You!