17/09/2021 - Talk 15 - Saurabh B
GitHub Repository - https://github.com/bakkenbaeck/ml-seminar-information-theory
Original talk - "Robustness, Verification, Privacy: Addressing Machine Learning Adversaries"
Focus on "Privacy"
Practical information about updates to the index page and talks on the GitHub repo
The original talk had three parts; this talk walks through each part
Cryptography-inspired models and results to address three challenges that emerge when worst-case adversaries enter the machine learning landscape
The challenges are: verification of machine learning models given limited access to good data
training at scale on private training data
robustness against adversarial examples controlled by worst-case adversaries
robustness: accurate predictions even when the test distribution deviates adversarially from the training distribution
Shafrira Goldwasser
Professor of Electrical Engineering and Computer Science at MIT
Professor of Mathematical Sciences at the Weizmann Institute of Science, Israel
Winner of the Turing Award (2012) and the Gödel Prize (1993, 2001)
Areas of interest include: theoretical computer science, cryptography
1: https://simons.berkeley.edu/people/shafi-goldwasser
The prestigious Turing Award (the Nobel Prize of computing!) and many more awards
One of the inventors of zero-knowledge proofs
The presentation and paper are co-authored with other researchers
Based on the Paper "Secure large-scale genome-wide association studies using homomorphic encryption" by Blatt et al.2
Concerned with large-scale statistical analyses over encrypted data
2: https://www.pnas.org/content/pnas/117/21/11608.full.pdf
Quality/Quantity of data usually determines the performance of the model
In many cases the data comes from/belongs to an individual
The data goes to a third party
Data privacy concerns data used in both training and testing
Third parties may or may not be trusted
Data is power! Ensuring privacy keeps that power with individuals rather than with big companies
Example: Data driven analytics in Health Care and Drug development
Collaborative research
There are methods that can be used to ensure privacy, but the real challenge in practical ML concerns big data/scale
A good example is analytics in health care and drug development, which requires large data sets and often combining data sets from different sources
Collaborative research is quite useful but brings in the problem of data at scale
Large-scale observational study
Association between genetic variants and trait/disease
These variants are represented by Single-Nucleotide Polymorphisms (SNPs)
each genotype includes roughly 0.5-5 million SNPs, i.e. large-scale data
GWAS evaluates one SNP at a time
Labs with private genomic data doing GWAS to identify genetic variants associated with response/proneness to COVID
Understand who's most at risk, prioritize for treatment/vaccine
Find new drugs/Vaccines
Study participants include those who have and have not tested positive for the virus
and those who have had different levels of symptoms (mild/severe/none)
Compare the DNA of people with severe symptoms to that of people with mild/no symptoms
effectively looking for genetic variants common in the risk group
This sounds great, but the issue of privacy remains
Individual level privacy is at stake
Identifying Personal Genomes by Surname Inference by Gymrek et al.3
3: https://pubmed.ncbi.nlm.nih.gov/23329047/
You can learn a lot about a person from their genomic data, even if the data doesn't contain their name or other identifying information
The paper showed that by looking at Y-chromosome data and combining it with other metadata such as age, one can identify the person
Additionally, data is often not stored in one place, which increases the probability of a data leak and thus severe privacy problems
Garbled circuits
Arithmetic MPC
Homomorphic Encryption
using cryptography to ensure privacy of data
There has been a lot of development in this area since the 1980s
similar to the zero-knowledge proofs from the last talk
These methods are great, but each has its pros and cons; pick based on the problem
Inspired by Yao's Millionaires' Problem
Enables secure two-party computation
The problem discusses two millionaires, Alice and Bob, who are interested in knowing which of them is richer without revealing their actual wealth.
introduced in 1982 by computer scientist and computational theorist Andrew Yao
two mistrusting parties can jointly evaluate a function over their private inputs without the presence of a trusted third party
In the garbled circuit protocol, the function has to be described as a Boolean circuit.
To apply this to ML, the model has to be converted to Boolean circuits
In the garbled circuit protocol, oblivious transfer is used for communication between the two parties involved
The sender has two messages m_0 and m_1, and the receiver has a bit b
The receiver wishes to receive m_b without the sender learning b
The sender wants to ensure that the receiver receives only one of the two messages.
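A minimal sketch of 1-out-of-2 oblivious transfer in Python, loosely following a simplified Diffie-Hellman-style construction; the group parameters, key-derivation function, and variable names are our own toy choices (deliberately insecure sizes), not something specified in the talk:

```python
# Sketch of 1-out-of-2 oblivious transfer (simplified Diffie-Hellman style).
# Toy parameters and no authentication: for illustration only.
import hashlib
import secrets

P = 2**255 - 19   # a prime modulus (toy choice for the sketch)
G = 2             # fixed group element

def kdf(x: int) -> int:
    """Derive a one-time pad from a group element."""
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")

m0, m1 = 1111, 2222   # sender's two messages (small ints for the demo)
b = 1                 # receiver's choice bit

# Sender: pick secret a, publish A = G^a mod P.
a = secrets.randbelow(P - 2) + 1
A = pow(G, a, P)

# Receiver: pick secret k; send B = G^k if b == 0, or B = A * G^k if b == 1.
k = secrets.randbelow(P - 2) + 1
B = pow(G, k, P) if b == 0 else (A * pow(G, k, P)) % P

# Sender: derive two keys; the receiver can recompute only one of them.
key0 = kdf(pow(B, a, P))                        # equals kdf(G^(a*k)) iff b == 0
key1 = kdf(pow((B * pow(A, -1, P)) % P, a, P))  # equals kdf(G^(a*k)) iff b == 1
c0, c1 = m0 ^ key0, m1 ^ key1                   # send both ciphertexts

# Receiver: recompute the key for the chosen message only, and decrypt it.
key_b = kdf(pow(A, k, P))
received = (c1 if b else c0) ^ key_b
assert received == (m1 if b else m0)
print("receiver learned m_b =", received)       # sender never learns b
```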
To understand a garbled gate, we first needed to understand this way of communicating
A garbled boolean circuit is a collection of garbled boolean gates.
First understand what a single garbled gate is and then easily generalize to the entire garbled circuit.
Once input values a, b are selected, we want to compute o(a, b) securely.
Example: say two people are trying to decide whether they should work together
The problem can be modeled as an AND gate
W_0_G, W_1_G, W_0_E, W_1_E, O_0, O_1 are labels for the garbler's wire, the evaluator's wire, and the output wire
View W_0_G, W_0_E, W_1_G, W_1_E as encryption keys and encrypt O_0, O_1 under the appropriate pairs of input keys
Given W_a_G, W_b_E only one row can be decrypted
The sender encrypts each row of the table, creating the garbled circuit table (GCT)
The sender permutes it (rearranges the rows), so that a key's position reveals nothing about the value it is associated with
The sender sends it over to the receiver, along with her input key W_a_prime_G, where a_prime is her input value
The receiver still needs his own key to decrypt the GCT, so the sender must send it to him
If the sender sends both W_0_E and W_1_E to the receiver, the receiver can decrypt more than one row; and if the receiver asks the sender for the key corresponding to his input, the sender learns the receiver's input
In order to evaluate the circuit, the receiver needs to garble his own input as well
To this end, the receiver needs the sender's help, because only the garbler knows how to encrypt
Finally, the receiver can obtain the key for his input through oblivious transfer
The receiver now has both keys and the GCT, and he can compute the gate by decrypting the GCT
The receiver can decrypt only one row of the GCT, exactly because of its construction
The receiver sends the output to the sender, and the computation of the garbled gate is over
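Putting the pieces together, here is a minimal sketch of a single garbled AND gate, assuming hash-derived one-time pads as the row encryption and a block of zero bytes as the "did this row decrypt?" check; the wire labels mirror the W/O notation above, while the helper names and 16-byte label size are our own illustration:

```python
# Toy garbled AND gate: hash-based encryption, for illustration only.
import os, hashlib, random

def H(k1: bytes, k2: bytes) -> bytes:
    """Derive a 32-byte pad from a pair of wire labels."""
    return hashlib.sha256(k1 + k2).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# --- Garbler (sender): one random label per possible value of each wire.
W = {w: {bit: os.urandom(16) for bit in (0, 1)} for w in ("G", "E", "out")}

# Garbled table: each row encrypts the output label (padded with 16 zero bytes
# so the evaluator can tell which row decrypts correctly) under both input labels.
table = []
for a in (0, 1):
    for b in (0, 1):
        plaintext = W["out"][a & b] + b"\x00" * 16          # AND truth table
        table.append(xor(H(W["G"][a], W["E"][b]), plaintext))
random.shuffle(table)  # permute rows so position reveals nothing

# Garbler's input is 1: she sends W["G"][1] in the clear (just a random label).
# Evaluator's input is 0: he gets W["E"][0] via oblivious transfer in the real protocol.
label_g, label_e = W["G"][1], W["E"][0]

# --- Evaluator (receiver): exactly one row decrypts to a value with the zero-byte tag.
for row in table:
    candidate = xor(H(label_g, label_e), row)
    if candidate[16:] == b"\x00" * 16:
        out_label = candidate[:16]

# Map the output label back to a bit (in practice the garbler reveals this mapping).
print("AND(1, 0) =", [bit for bit, lab in W["out"].items() if lab == out_label][0])
```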
A form of encryption that permits users to perform computations directly on encrypted data
The encryption and decryption functions can be thought of as homomorphisms between plaintext and ciphertext spaces
Trivial example: Anonymous polling
types of homomorphic encryption are partially homomorphic, somewhat homomorphic, leveled fully homomorphic, and fully homomorphic encryption
Poll: should CF be held or not? (not on Slack) Tobias will generate a public key and you can encrypt your answer (yes or no, i.e. 0/1)
Send your encrypted vote to Tobias; he will add up all the encrypted votes and obtain the encrypted result
Finally, Tobias will decrypt the result using the private key and check whether the total is more than half of the total number of people in BB
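The poll works because the encryption is additively homomorphic: combining two ciphertexts yields an encryption of the sum of the votes, so the collector can tally without seeing any individual vote. A toy sketch using the Paillier scheme with deliberately tiny keys; the helper names and example votes are ours, not from the talk:

```python
# Minimal additively homomorphic (Paillier) sketch of the anonymous poll.
# Toy key sizes, for illustration only.
import math, secrets

# --- Key generation (toy primes, NOT secure) ---
p, q = 104729, 1299709
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)            # valid because g = n + 1

def encrypt(m: int) -> int:
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % n2

# Each participant encrypts yes=1 / no=0 with the public key (n, g).
votes = [1, 0, 1, 1, 0, 1, 1]
encrypted_votes = [encrypt(v) for v in votes]

# The collector tallies without ever seeing an individual vote.
encrypted_tally = encrypted_votes[0]
for c in encrypted_votes[1:]:
    encrypted_tally = add_encrypted(encrypted_tally, c)

# Only the holder of the private key decrypts the total.
total_yes = decrypt(encrypted_tally)
print(f"{total_yes} of {len(votes)} voted yes")  # -> 5 of 7
```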
Based on Jagadeesh et al., Cho et al
Study of age-related macular degeneration on 21k individuals and 220k SNPs
Extrapolated runtime for 100k individuals and 0.5 mil SNPs: 193 hrs
Why not use HE?
Cho et al. predict that HE is not viable for large-scale GWAS
Jagadeesh et al. predict that HE would be 5-10k times slower than secure MPC for GWAS
Secure MPC: split the data using sophisticated techniques & secret sharing -> send to servers (the servers don't communicate) -> reconstruct the results
The studies achieved very high accuracy on statistical tests
but the extrapolated runtimes were not efficient
HE looks like a good candidate to ensure privacy
Cho et al.: HE would take many years of computation or petabytes of communication at the scale of a million genomes
The paper4 proposes a toolbox of statistical techniques that
use HE to perform large-scale GWAS without decryption
4: https://www.pnas.org/content/pnas/117/21/11608.full.pdf
Encrypt and send the data
Process in the "Cloud", return the encrypted results
Decrypt the result
First, study participants obtain a public key from the GWAS coordinator (not shown in fig)
Each of them encrypts their data using the public key and sends the encrypted data to the Encrypted Data Bank, which stores all encrypted individual-level data from many study participants.
When a specific study is initiated by the GWAS coordinator, the encrypted data for the individuals in the study get transmitted to the HE Compute Cloud for a noninteractive secure computation.
The HE Compute Cloud computes the results and sends them in encrypted form to the GWAS coordinator.
Finally, the GWAS coordinator decrypts the results and routes them to one of the viewers.
Could run statistical tests (allelic chi-square, logistic regression) for
Dataset: approx. 26k individuals, 260k SNPs
Extrapolated runtime for 100k individuals and 0.5 million SNPs: 5.6 hrs and 234 hrs (chi-square and logistic regression, respectively)
Compared to the prior study: chi-square 193 hrs, logistic regression practically impossible
prior study using multi-party computation
Findings show that there is NO loss in accuracy overall
One of the important results concerns scaling, i.e. what happens when 100k or more individuals are involved
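For reference, the allelic chi-square test mentioned above compares allele counts between cases and controls for each SNP. A plain, unencrypted sketch with made-up counts (in the paper this statistic is computed under HE):

```python
# Allelic chi-square test per SNP: a 2x2 table of allele counts
# (cases vs. controls, alternative vs. reference allele). Toy numbers.

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic with 1 degree of freedom."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: the alternative allele is more common among severe cases.
print(allelic_chi_square(case_alt=620, case_ref=1380, ctrl_alt=480, ctrl_ref=1520))
```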
HE can be used for cases beyond GWAS
Parallelism
Open-source packages
Wide variety of use cases for large-scale data
This should become the norm in the future
Improve efficiency using parallelism and other optimization techniques
Build community and usage with open source