ML2021 Talk 15 - NIPS2020

class: theme
layout: true
---

class: center, middle

# Talk 15 - NeurIPS 2020

## Privacy during training: from prototype to Large Scale Data

![:scale 75%](intro.jpg)

17/09/2021 - Talk 15 - Saurabh B

---
class: middle

## Information

* Github Repository - https://github.com/bakkenbaeck/ml-seminar-information-theory

* Original talk - "Robustness, Verification, Privacy: Addressing Machine Learning Adversaries"

* Focus on "Privacy"

???
- Practical information about updates to index page and talks on the github repo

- Original talk had three parts, explain each part

- Cryptography inspired models and results to address three challenges that emerge when worst-case adversaries enter the machine learning landscape.

- These challenges include verification of machine learning models given limited access to good data

- training at scale on private training data

- robustness against adversarial examples controlled by worst case adversaries.

- robustness -  accurate predictions when test distribution adversarial deviates from training distribution
---
class: img-left

## Author Bio1

.left-column[
![:scale 100%](author.jpeg) Shafrira Goldwasser
] 
.right-column[
* Professor of Electrical Engineering and Computer Science at MIT

* Professor of mathematical sciences at the Weizmann Institute of Science, Israel

* Winner of the Turing Award in 2012, Gödel Prize (1993, 2001)

* Area of interest include: Theoretical computer science, Cryptography
]

.footnote[.red.bold[1:] https://simons.berkeley.edu/people/shafi-goldwasser]
???
- Prestigious Turing award (Nobel prize of Computing!) and many more awards

- One of the inventor of Zero knowledge proofs

- Presentation and paper co-author with other researchers
---
class: middle

## Addressing Machine Learning Adversaries: Privacy

* Based on the Paper "Secure large-scale genome-wide association studies using homomorphic encryption" by Blatt et al.2

* Concerned with large-scale statistical analyses over encrypted data

.footnote[.red.bold[2:] https://www.pnas.org/content/pnas/117/21/11608.full.pdf]
???
- secure and efficient multi-party computation over large scale data
---
class: middle

## Privacy: Why?

* Quality/Quantity of data usually determines the performance of the model

* In many cases the data comes from/belongs to an individual

* Data goes to third party
???
- data privacy is Data used in both training and testing

- Third parties may or may not be trusted

- Data is power! ensuring privacy to make sure that individuals has power and not the big companies
---
class: middle

## Data privacy at Scale

* Example: Data driven analytics in Health Care and Drug development

* Collaborative research 
???
- there are methods which can be used to ensure privacy but real challenge in practical ML world concerns big data/scale

- good examples is analytics in Health Care and Drug development requires large data sets and even combining data sets from different sources

- Collaborative research is quite useful but bring in problem of data at scale
---
class: middle

## Real world Example: Genome Wide Association Studies

* Large scale observation based study

* Association between genetic variants and trait/disease

* These variants are represented by Single-Nucleotide Polymorphisms (SNPs)
???
- each genotype includes roughly 0.5 - 5 million SNP, i.e. larrgee data

- GWAS evaluates over one SNP at a time
---
class: middle

## GWAS for COVID-19

* Labs with private genomic data doing GWAS to identify genetic variants associated with response/proneness to COVID

* Understand who's most at risk, prioritize for treatment/vaccine

* Find new drugs/Vaccines
???
- study participant includes those we have and have not tested positive for the virus

- AND those who have had different levels(mild/severe/no) symptoms

- compare DNA of people with severe symptoms to those had mild/no symptoms

- effectively looking for genetic variant common in risk group

- this sounds and great but the issue of privacy prevails
---
class: middle

## Privacy issues prevails

* Individual level privacy is at stake

* Identifying Personal Genomes by Surname Inference by Gymrek et al.3

.footnote[.red.bold[3:] https://pubmed.ncbi.nlm.nih.gov/23329047/]
???
- you can learn a lot about a person by looking at their genomic data even if data doesn't contain their name or other identifying information

- paper showed that by looking at Y-chromosome data and combining other meta data such as age one can identify the person

- additionally many times data is not stored at one place which increases the probability of data leak leading to severe privacy problems
---
class: middle

## Cryptography to the rescue!

* Garbled circuits

* Arithmetic MPC

* Homomorphic Encryption
???
- using cryptography to ensure privacy of data

- there has lot of development since 1980 in this area

- similar to zero knowledge proofs in last talk

- these methods are great but has there pros and cons, pick based on the problem
---
class: middle

## Garbled circuits

* Inspired from Yao's Millionaires' problem

* Enables secure two party secure computation
???
- The problem discusses two millionaires, Alice and Bob, who are interested in knowing which of them is richer without revealing their actual wealth.

- introduced in 1982 by computer scientist and computational theorist Andrew Yao

- two mistrusting parties can jointly evaluate a function over their private inputs without the presence of a trusted third party

- In the garbled circuit protocol, the function has to be described as a Boolean circuit.

- convert ML model to boolean circuits
---
class: middle

## Oblivious Transfer

* In the garbled circuit protocol, oblivious transfer is used to communicate between two parties involved

* The sender has two messages \\( m_0 \\) and \\( m_1 \\), and the receiver has a bit b

* The receiver wishes to receive \\( m_b \\), without the sender learning b

* The sender wants to ensure that the receiver receives only one of the two messages. 
???

- to understand garbled gate we need to understand way of communication first

- A garbled boolean circuit is a collection of garbled boolean gates.

- First understand what a single garbled gate is and then easily generalize to the entire garbled circuit.

- Once input values a, b are selected, we want to compute o(a, b) securely.
---
class: middle

## Garbled Computation Table
![:scale 100%](yao_circuit.png)

???
- example: let's say two people are trying to decide if they should work together

- problem can be understood as a AND gate

- W_0_G, W_0_E, W_1_G, W_1_E, O_0, O_1 are labels for g, e and output

- View W_0_G, W_0_E, W_1_G, W_1_E, O_0, O_1 as encryption keys and encrypt O_0, O_1 under appropriate pairs of input keys.

- Given W_a_G, W_b_E only one row can be decrypted

- sender encrypts each row of the table, creating the GCT

- sender permutes it (rearranges it), so that the key’s position reveals nothing about the value that it is associated with.

- sender sends it over to receiver, along with her input key W_a_prime_G, with a_prime her input value.

- receiver still needs his own key to decrypt the GCT so sender must send it to him.

- If sender sends both W_0_E, W_1_E to receiver, then receiver can decrypt more and if receiver asks for sender the key that corresponds to his input, then sender learns receiver’s input

- In order to calculate the circuit, receiver needs to garble his own input as well.

- To this end, receiver needs sender to help him, because only the garbler knows how to encrypt.

- Finally, receiver can encrypt his input through oblivious transfer.

- receiver now has both keys and the GCT and he can compute the gate by decrypting GCT.

- receiver can decrypt only one line of the GCT, exactly because of its construction.

- receiver Sends output to sender and the computation of the garbled gate is over
---
class: middle

## Homomorphic Encryption

* Form of encryption that permits users to perform computations directly on its encrypted data

* The encryption and decryption functions can be thought of as homomorphisms between plaintext and ciphertext spaces

* Trivial example: Anonymous polling
???

- types of homomorphic encryption are partially homomorphic, somewhat homomorphic, leveled fully homomorphic, and fully homomorphic encryption

- Poll: if CF should be held or not? (not on slack) tobias will generate public keys and you can encrypt your answer (yes or no i.e. 0/1)

- Send encrypted vote to tobias, tobias will add up all the encrypted votes and obtain the encrypted result

- Finally tobias will decrpyt the result using the private key and check if total is more than half of total number of people in BB
---
class: middle

## GWAS using Secure MPC

* Based on Jagadeesh et al., Cho et al

* Study of age related macular generation on 21k individuals and 220k SNPs

* Extrapolated runtime for 100k individuals and 0.5 mil SNPs: 193 hrs
???
- secure MPC: split data using sophisticated techniques & secret share -> send to severs (servers don't communicate) -> reconstruct results

- The studies achieved very high accuracy on statistical tests

- but extrapolated was not efficient 
--

* Why not use HE?

* Cho et al. predicts that HE is not viable for large scale GWAS

* Jagadeesh et al. predicts that HE would be 5-10k slower that secure MPC for GWAS

???
- HE looks like a good candidates to ensure privacy

- Cho et al.: HE will take many years or computation or petabytes of communication at the scale of million genomes
---
class: middle

## GWAS using Homomorphic Encryption

* Paper4 proposes toolbox of statistical techniques that

* uses HE to perform large scale GWAS without decryption

.footnote[.red.bold[4:] https://www.pnas.org/content/pnas/117/21/11608.full.pdf]

???
- in contrast to the claim that HE is not viable for large-scale GWASs

---
class: middle

.right-column[
![:scale 100%](gwa_homomorphic_encryption.png)
]
.left-column[
### HE GWAS schema

* Encrypt and send the data

* Process in the "Cloud", return the encrypted results

* Decrypt the result
]
???
- First, study participants obtain a public key from the GWAS coordinator (not shown in fig)

- Each of them encrypts their data using the public key, and sends the encrypted data to the Encrypted Data Bank, storing all encrypted individual-level data from many study participants.

- When a specific study is initiated by the GWAS coordinator, the encrypted data for the individuals in the study get transmitted to the HE Compute Cloud for a noninteractive secure computation.

- The HE Compute Cloud computes the results and sends them in encrypted form to the GWAS coordinator.

- Finally, the GWAS coordinator decrypts the results and routes them to one of the viewers.
---
class: middle

## Findings

* Could run statistical tests (Alleclic chi-sqaure, logisitc regression) for

* Dataset: Approx 26k individuals, 260k number of SNPs

* Extrapolated runtime for 100k individuals and 0.5 million SNPs: 5.6 hrs, 234 hrs

???

- Extrapolated runtime: chi-square(prior 193 hrs) and logistic regression(practically impossible)

- prior study using multi-party computation
---
class: middle

### Linear run time scaling and extrapolation to 100K individuals

![:scale 100%](extrapolated_time.png)

???
- findings shows that there in NO loss in accuracy overall

- one of the important results concerns scaling i.e. what happens when you 100k or more individuals involved
---
class: middle

## Future work

* HE can be used for cases beyond GWAS

* Parallelism

* Open sources packages
???
- wide variety of use cases for large scale data

- this should become norm in future

- improve efficiency using parallelism and other optimization techniques

- build community and usages with open source
---

# Thank You!