by **Pratyush Ranjan Tiwari, PhD student @Johns Hopkins**
ML as a service (MLaaS) is a business: the model owner holds a model of high value, either because it leverages sensitive data available only to the model owner, or because it is the best model for that task, which can stem from data asymmetry or from the model owner's specialized ability to train high-accuracy models. In either case, ML models are very valuable, so it is not ideal to expose model parameters publicly. Hence, the model stays private, and the only way to use it is to submit an inference request to the model owner.
However, current systems assume that the model owner is honest. The model owner self-certifies its own model's accuracy, and the fact that said model was actually used for inference. In the pay-per-inference model, people pay for every query, yet it is not clear how to ensure that the model owner does not cheat: the owner promises that their model achieves over x% accuracy, and there is no verification process.
Taking the Amazon Forecast feature as an example, assume Amazon promises 90% accuracy when forecasting business expenses for certain types of businesses. A customer currently has no way to verify that promise.
Zero-knowledge (zk) proofs can help us resolve this issue. Zero-knowledge proof protocols allow a prover to convince a verifier that a given statement is true without revealing any other information. Interested readers can refer to the following material to learn more about zk proofs: non-technical introduction and relevance, theory of zk. In this case, the prover's statement has two main parts; the prover proves that:
1. I own a model for classification/inference for a particular task with high accuracy.
2. I used this same model to perform classification/inference on the given input $x$, and it produced the output $y$.
At a high level, the protocol proceeds as follows: the prover commits to the model, proves the accuracy claim about the committed model, and then proves that each inference was computed with that same committed model.
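The two statements above can be sketched with a toy commitment scheme. This is only an illustration of how a commitment binds the accuracy claim and the inference claim to the same model; a bare hash is not hiding or zero-knowledge, and all names, parameters, and the linear classifier below are made up for the example.

```python
import hashlib
import json

def commit(model_params: dict) -> str:
    """Bind the prover to a fixed model by hashing its parameters.
    (A real protocol would use a hiding commitment plus a zk proof.)"""
    serialized = json.dumps(model_params, sort_keys=True).encode()
    return hashlib.sha256(serialized).hexdigest()

def infer(model, x):
    # Toy linear classifier: sign of w.x + b (illustrative only).
    s = sum(wi * xi for wi, xi in zip(model["w"], x)) + model["b"]
    return 1 if s >= 0 else 0

# --- Prover side ---
model = {"w": [0.5, -1.2], "b": 0.3}   # private parameters
c = commit(model)                       # published commitment
x = [2.0, 1.0]                          # client's input
y = infer(model, x)                     # returned output

# Statement 1: "the committed model reaches the claimed accuracy"
# Statement 2: "y was computed from x with the committed model"
# A zk proof would certify both WITHOUT revealing `model`; here the
# verifier can only re-check them if the prover opens the commitment.
def verify_opening(c, opened_model, x, y):
    return commit(opened_model) == c and infer(opened_model, x) == y

assert verify_opening(c, model, x, y)
```

The point of the commitment is that the prover cannot answer the accuracy proof with one model and then quietly serve inferences with a different one.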
Naively designing proof circuits that convert an ML model to an arithmetic circuit is not always a feasible solution. In most cases such circuits are too large and, consequently, the prover's computation is very taxing. To resolve this issue, many research papers and projects have aimed to come up with specialized protocols for particular ML models. We know how to build (reasonably) efficient zk proof protocols to prove model classification and accuracy for the following ML models:
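A back-of-the-envelope count shows why naive arithmetization is taxing. The sketch below counts only the multiplication gates from the dense layers of a small, illustrative MLP (the layer sizes are assumptions, not from the text), ignoring additions and non-linearities, which typically need even more gates (e.g., bit decompositions for ReLU or comparisons):

```python
# Illustrative layer widths for a small MNIST-sized classifier.
layers = [784, 128, 64, 10]

# Each dense layer with `a` inputs and `b` outputs contributes a*b
# multiplication gates to the arithmetic circuit.
mult_gates = sum(a * b for a, b in zip(layers, layers[1:]))
print(mult_gates)             # gates for a single inference: 109184

# Proving an accuracy claim over a 10,000-sample test set multiplies
# the circuit size again, pushing it past a billion gates.
test_set = 10_000
print(mult_gates * test_set)  # 1091840000
```

Even this tiny model yields a circuit in the billions of gates for one accuracy proof, which is why specialized protocols per model family are attractive.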
In linear regression, we are given a data set where each input is of the form $(x_i, y_i)$ such that $x_i$ is a vector of size $p$. The goal is to find a linear relationship between the inputs and outputs that minimizes a loss function; mean squared error is a common choice for this purpose. The model therefore takes the form: