In traditional machine learning, data is often collected in a central location and used to train a model. This process typically involves making multiple copies of the data, which are stored on various servers or cloud platforms. These copies are then used to train the model, and the resulting model may also be copied and shared with multiple parties. The copy problem arises when sensitive data is inadvertently copied, stored, or transmitted to unauthorized parties, or when copies of the data are not properly secured. For example, if a copy of the data is transmitted over an insecure network, it could be intercepted by malicious actors. But even if we disregard deliberate, malicious actions, over time we might lose control of who has access to copies of data. All in all, it is good security practice to adhere to data minimization principles and seek to avoid unnecessarily moving or disclosing data.
The copy problem is at the core of what we do at Scaleout: our federated learning technology provides what is needed to “bring machine learning to data instead of data to machine learning”.
There are traditional approaches to protecting your input data while copying and centralizing it, most prominently data anonymization. This can require quite extensive work, and there is a risk that the utility of the data is reduced. Federated machine learning offers a pragmatic (and in our opinion quite elegant!) alternative, by simply leaving the data where it is. As always, there is no free lunch: there are technical challenges with building a production-grade federated learning system (we have written about that in a previous post). In the end, you can of course combine federated learning with more traditional approaches, stacking layers of security to improve data protection.
Taking a broader perspective, federated machine learning sits in the wider context of privacy-enhancing technologies (PETs). In addition to FL, privacy-preserving AI involves a range of approaches, including differential privacy, homomorphic encryption, and synthetic data generation. These are complementary technologies, each addressing different aspects of privacy and security in the end-to-end machine learning information flow. But while the copy problem is relatively intuitive, other challenges in AI security require a closer look at the concepts of input privacy and output privacy.
If an unauthorized party gains access to the input data during the process of computation, this is a breach of input privacy. In our context, this translates to an attacker gaining access to the training data (or its sensitive properties) during the forward computational process of creating the machine learning model. It also applies to the data sent to the finished model when making predictions. Federated machine learning fundamentally deals with input privacy, excelling in the scenario where training data is spread over different devices or locations.
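To make this concrete, here is a minimal, illustrative sketch of the idea behind federated averaging. The function names and the plain linear model are assumptions made for this post, not any particular framework's API: each client trains on its own data, and only the resulting model parameters are sent for aggregation, so the raw training data never leaves its location.

```python
import numpy as np

# Illustrative sketch: raw data stays with each client, only model parameters
# (here, a weight vector for a linear model) travel to the server.

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """Train locally with gradient descent on a least-squares loss; return new weights."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
        w -= lr * grad
    return w

def federated_round(global_weights, clients):
    """One round of federated averaging: clients train locally, the server averages weights."""
    updates = [local_update(global_weights, X, y) for X, y in clients]
    return np.mean(updates, axis=0)  # only parameters are aggregated, never data

# Two clients, each holding a private dataset that never leaves their "device".
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])

def make_client(n=50):
    """Generate a private dataset (features X, targets y) for one client."""
    X = rng.normal(size=(n, 3))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

clients = [make_client() for _ in range(2)]
weights = np.zeros(3)
for _ in range(20):
    weights = federated_round(weights, clients)
print("estimated weights:", weights)  # approaches true_w over the rounds
```

In a real deployment the aggregation would typically be weighted by client data size and communicated over secure channels; the point here is only that parameters, not data, cross the network.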
If an attacker is able to reverse-engineer private data from the output of a computation, this is a breach of output privacy. In our setting, this translates to attempts at inferring training data (or properties of that data) from model output. This is a potential threat to any machine learning system, since in theory it only requires knowledge of the model structure and access to either the model itself or a hosted API for making predictions. The most fundamental layer of protection against reverse-engineering attempts is therefore to control access to model endpoints as tightly as possible. However, advanced techniques such as differential privacy can add another layer of protection, at the expense of some degradation in model performance.
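As a rough illustration of the idea (a sketch only, not a production-grade mechanism, with sensitivity and epsilon values chosen purely for the example), the snippet below applies the Laplace mechanism, one of the basic building blocks of differential privacy, to a model's aggregate output. Noise calibrated to how much a single record can change the result makes it harder to reverse-engineer individual training examples, at the cost of a noisier answer.

```python
import numpy as np

# Sketch of a basic differential-privacy building block: add noise calibrated
# to the query's sensitivity so individual records are harder to infer.

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return a differentially private version of a numeric query result."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # larger scale => stronger privacy, lower accuracy
    return true_value + rng.laplace(0.0, scale)

# Example: releasing an average prediction computed over a private dataset.
predictions = np.array([0.82, 0.91, 0.40, 0.65])  # hypothetical per-record outputs
true_mean = predictions.mean()
# For outputs bounded in [0, 1], one record can shift the mean by at most 1/n.
private_mean = laplace_mechanism(true_mean, sensitivity=1.0 / len(predictions), epsilon=0.5)
print(true_mean, private_mean)
```

The trade-off mentioned above shows up directly in the scale parameter: a smaller epsilon gives stronger privacy guarantees but adds more noise, reducing the utility of the released output.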
In future posts, we will dig deeper into security challenges for machine learning systems, with a particular focus on extending secure MLOps to the decentralized setting that federated learning represents.