+ - 0:00:00
Notes for current slide
Notes for next slide
  • Practical information about updates to index page and talks on the github repo

  • Original talk had three parts, explain each part

  • Cryptography inspired models and results to address three challenges that emerge when worst-case adversaries enter the machine learning landscape.

  • These challenges include verification of machine learning models given limited access to good data

  • training at scale on private training data

  • robustness against adversarial examples controlled by worst case adversaries.

  • robustness - accurate predictions when test distribution adversarial deviates from training distribution

Talk 15 - NeurIPS 2020

Privacy during training: from prototype to Large Scale Data

17/09/2021 - Talk 15 - Saurabh B

1 / 21

Information

2 / 21
  • Practical information about updates to index page and talks on the github repo

  • Original talk had three parts, explain each part

  • Cryptography inspired models and results to address three challenges that emerge when worst-case adversaries enter the machine learning landscape.

  • These challenges include verification of machine learning models given limited access to good data

  • training at scale on private training data

  • robustness against adversarial examples controlled by worst case adversaries.

  • robustness - accurate predictions when test distribution adversarial deviates from training distribution

Author Bio1

Shafrira Goldwasser

  • Professor of Electrical Engineering and Computer Science at MIT

  • Professor of mathematical sciences at the Weizmann Institute of Science, Israel

  • Winner of the Turing Award in 2012, Gödel Prize (1993, 2001)

  • Area of interest include: Theoretical computer science, Cryptography

1: https://simons.berkeley.edu/people/shafi-goldwasser

3 / 21
  • Prestigious Turing award (Nobel prize of Computing!) and many more awards

  • One of the inventor of Zero knowledge proofs

  • Presentation and paper co-author with other researchers

Addressing Machine Learning Adversaries: Privacy

  • Based on the Paper "Secure large-scale genome-wide association studies using homomorphic encryption" by Blatt et al.2

  • Concerned with large-scale statistical analyses over encrypted data

2: https://www.pnas.org/content/pnas/117/21/11608.full.pdf

4 / 21
  • secure and efficient multi-party computation over large scale data

Privacy: Why?

  • Quality/Quantity of data usually determines the performance of the model

  • In many cases the data comes from/belongs to an individual

  • Data goes to third party

5 / 21
  • data privacy is Data used in both training and testing

  • Third parties may or may not be trusted

  • Data is power! ensuring privacy to make sure that individuals has power and not the big companies

Data privacy at Scale

  • Example: Data driven analytics in Health Care and Drug development

  • Collaborative research

6 / 21
  • there are methods which can be used to ensure privacy but real challenge in practical ML world concerns big data/scale

  • good examples is analytics in Health Care and Drug development requires large data sets and even combining data sets from different sources

  • Collaborative research is quite useful but bring in problem of data at scale

Real world Example: Genome Wide Association Studies

  • Large scale observation based study

  • Association between genetic variants and trait/disease

  • These variants are represented by Single-Nucleotide Polymorphisms (SNPs)

7 / 21
  • each genotype includes roughly 0.5 - 5 million SNP, i.e. larrgee data

  • GWAS evaluates over one SNP at a time

GWAS for COVID-19

  • Labs with private genomic data doing GWAS to identify genetic variants associated with response/proneness to COVID

  • Understand who's most at risk, prioritize for treatment/vaccine

  • Find new drugs/Vaccines

8 / 21
  • study participant includes those we have and have not tested positive for the virus

  • AND those who have had different levels(mild/severe/no) symptoms

  • compare DNA of people with severe symptoms to those had mild/no symptoms

  • effectively looking for genetic variant common in risk group

  • this sounds and great but the issue of privacy prevails

Privacy issues prevails

  • Individual level privacy is at stake

  • Identifying Personal Genomes by Surname Inference by Gymrek et al.3

3: https://pubmed.ncbi.nlm.nih.gov/23329047/

9 / 21
  • you can learn a lot about a person by looking at their genomic data even if data doesn't contain their name or other identifying information

  • paper showed that by looking at Y-chromosome data and combining other meta data such as age one can identify the person

  • additionally many times data is not stored at one place which increases the probability of data leak leading to severe privacy problems

Cryptography to the rescue!

  • Garbled circuits

  • Arithmetic MPC

  • Homomorphic Encryption

10 / 21
  • using cryptography to ensure privacy of data

  • there has lot of development since 1980 in this area

  • similar to zero knowledge proofs in last talk

  • these methods are great but has there pros and cons, pick based on the problem

Garbled circuits

  • Inspired from Yao's Millionaires' problem

  • Enables secure two party secure computation

11 / 21
  • The problem discusses two millionaires, Alice and Bob, who are interested in knowing which of them is richer without revealing their actual wealth.

  • introduced in 1982 by computer scientist and computational theorist Andrew Yao

  • two mistrusting parties can jointly evaluate a function over their private inputs without the presence of a trusted third party

  • In the garbled circuit protocol, the function has to be described as a Boolean circuit.

  • convert ML model to boolean circuits

Oblivious Transfer

  • In the garbled circuit protocol, oblivious transfer is used to communicate between two parties involved

  • The sender has two messages m0 m_0 and m1 m_1 , and the receiver has a bit b

  • The receiver wishes to receive mb m_b , without the sender learning b

  • The sender wants to ensure that the receiver receives only one of the two messages.

12 / 21
  • to understand garbled gate we need to understand way of communication first

  • A garbled boolean circuit is a collection of garbled boolean gates.

  • First understand what a single garbled gate is and then easily generalize to the entire garbled circuit.

  • Once input values a, b are selected, we want to compute o(a, b) securely.

Garbled Computation Table

13 / 21
  • example: let's say two people are trying to decide if they should work together

  • problem can be understood as a AND gate

  • W_0_G, W_0_E, W_1_G, W_1_E, O_0, O_1 are labels for g, e and output

  • View W_0_G, W_0_E, W_1_G, W_1_E, O_0, O_1 as encryption keys and encrypt O_0, O_1 under appropriate pairs of input keys.

  • Given W_a_G, W_b_E only one row can be decrypted

  • sender encrypts each row of the table, creating the GCT

  • sender permutes it (rearranges it), so that the key’s position reveals nothing about the value that it is associated with.

  • sender sends it over to receiver, along with her input key W_a_prime_G, with a_prime her input value.

  • receiver still needs his own key to decrypt the GCT so sender must send it to him.

  • If sender sends both W_0_E, W_1_E to receiver, then receiver can decrypt more and if receiver asks for sender the key that corresponds to his input, then sender learns receiver’s input

  • In order to calculate the circuit, receiver needs to garble his own input as well.

  • To this end, receiver needs sender to help him, because only the garbler knows how to encrypt.

  • Finally, receiver can encrypt his input through oblivious transfer.

  • receiver now has both keys and the GCT and he can compute the gate by decrypting GCT.

  • receiver can decrypt only one line of the GCT, exactly because of its construction.

  • receiver Sends output to sender and the computation of the garbled gate is over

Homomorphic Encryption

  • Form of encryption that permits users to perform computations directly on its encrypted data

  • The encryption and decryption functions can be thought of as homomorphisms between plaintext and ciphertext spaces

  • Trivial example: Anonymous polling

14 / 21
  • types of homomorphic encryption are partially homomorphic, somewhat homomorphic, leveled fully homomorphic, and fully homomorphic encryption

  • Poll: if CF should be held or not? (not on slack) tobias will generate public keys and you can encrypt your answer (yes or no i.e. 0/1)

  • Send encrypted vote to tobias, tobias will add up all the encrypted votes and obtain the encrypted result

  • Finally tobias will decrpyt the result using the private key and check if total is more than half of total number of people in BB

GWAS using Secure MPC

  • Based on Jagadeesh et al., Cho et al

  • Study of age related macular generation on 21k individuals and 220k SNPs

  • Extrapolated runtime for 100k individuals and 0.5 mil SNPs: 193 hrs

15 / 21
  • secure MPC: split data using sophisticated techniques & secret share -> send to severs (servers don't communicate) -> reconstruct results

  • The studies achieved very high accuracy on statistical tests

  • but extrapolated was not efficient

GWAS using Secure MPC

  • Based on Jagadeesh et al., Cho et al

  • Study of age related macular generation on 21k individuals and 220k SNPs

  • Extrapolated runtime for 100k individuals and 0.5 mil SNPs: 193 hrs

  • Why not use HE?

  • Cho et al. predicts that HE is not viable for large scale GWAS

  • Jagadeesh et al. predicts that HE would be 5-10k slower that secure MPC for GWAS

15 / 21
  • secure MPC: split data using sophisticated techniques & secret share -> send to severs (servers don't communicate) -> reconstruct results

  • The studies achieved very high accuracy on statistical tests

  • but extrapolated was not efficient

  • HE looks like a good candidates to ensure privacy

  • Cho et al.: HE will take many years or computation or petabytes of communication at the scale of million genomes

GWAS using Homomorphic Encryption

  • Paper4 proposes toolbox of statistical techniques that

  • uses HE to perform large scale GWAS without decryption

4: https://www.pnas.org/content/pnas/117/21/11608.full.pdf

16 / 21
  • in contrast to the claim that HE is not viable for large-scale GWASs

HE GWAS schema

  • Encrypt and send the data

  • Process in the "Cloud", return the encrypted results

  • Decrypt the result

17 / 21
  • First, study participants obtain a public key from the GWAS coordinator (not shown in fig)

  • Each of them encrypts their data using the public key, and sends the encrypted data to the Encrypted Data Bank, storing all encrypted individual-level data from many study participants.

  • When a specific study is initiated by the GWAS coordinator, the encrypted data for the individuals in the study get transmitted to the HE Compute Cloud for a noninteractive secure computation.

  • The HE Compute Cloud computes the results and sends them in encrypted form to the GWAS coordinator.

  • Finally, the GWAS coordinator decrypts the results and routes them to one of the viewers.

Findings

  • Could run statistical tests (Alleclic chi-sqaure, logisitc regression) for

  • Dataset: Approx 26k individuals, 260k number of SNPs

  • Extrapolated runtime for 100k individuals and 0.5 million SNPs: 5.6 hrs, 234 hrs

18 / 21
  • Extrapolated runtime: chi-square(prior 193 hrs) and logistic regression(practically impossible)

  • prior study using multi-party computation

Linear run time scaling and extrapolation to 100K individuals

19 / 21
  • findings shows that there in NO loss in accuracy overall

  • one of the important results concerns scaling i.e. what happens when you 100k or more individuals involved

Future work

  • HE can be used for cases beyond GWAS

  • Parallelism

  • Open sources packages

20 / 21
  • wide variety of use cases for large scale data

  • this should become norm in future

  • improve efficiency using parallelism and other optimization techniques

  • build community and usages with open source

Thank You!

Thank You!

Thank You!

21 / 21

Information

2 / 21
  • Practical information about updates to index page and talks on the github repo

  • Original talk had three parts, explain each part

  • Cryptography inspired models and results to address three challenges that emerge when worst-case adversaries enter the machine learning landscape.

  • These challenges include verification of machine learning models given limited access to good data

  • training at scale on private training data

  • robustness against adversarial examples controlled by worst case adversaries.

  • robustness - accurate predictions when test distribution adversarial deviates from training distribution

Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow