Contributing to OpenMined

Summing up 3 months with the community

Published in metaflow-ai · 9 min read · Mar 2, 2018

If you are new to the OpenMined community, this article provides an overview of where things currently stand. I hope it will help you see the purpose and challenges of the community more clearly, so you don’t get overwhelmed by the project’s complexity 👍🏻

As OpenMined is evolving fast, the date of the publication of this article is very important — March 01, 2018.

TL;DR:

OpenMined is a very ambitious community building open-source tools to ease the development of applications that leverage machine learning on private/confidential data, automatically complying with legal restrictions thanks to cryptography and automatically remunerating data/compute/expertise contributors thanks to cryptocurrencies.

If you feel like it, please get involved! A mind-boggling and rich experience awaits you.

What is OpenMined?

OpenMined is a project that was started 8 months ago by Trask, after he observed that no open-source tools existed for private ML. Between that simple observation and today, the OpenMined community has had to face a mountain of technical challenges, which gave birth to the current, incredibly ambitious project.

I’ve been following and contributing to it for the last 3 months through code contributions on their GitHub and by organising hackathons in Paris (feel free to join, it’s really fun and interesting! 🍺). I’ve been amazed by the community itself, which is very open, positive and eager to make the project succeed.

Many developers contribute every day to the different repositories. Hell, many repositories have been created and/or deprecated in the last 3 months! It’s been pretty wild, and that’s amazing: it shows how lively the project is!

While reading this, if you feel the urge to contribute, please reach out on our Slack!

What are we trying to do?

OpenMined is currently concerned with 3 problems surrounding artificial intelligence:

How to deal with sensitive/confidential data?
How to deal with the exploding computational needs of training machine learning models?
How to handle remuneration for any contribution, whether it be data, computation or expertise?

Those 3 concerns are at the heart of OpenMined and shaped the project into its current form, one after the other. The community realised pretty fast that a project lacking any of those 3 pillars would not fit the bill and would not be ambitious enough to support the overall vision:

Building open-source tools to ease the creation and training of machine learning pipelines for private/confidential data while automatically remunerating the different contributors.

Think about it: companies and researchers will soon face a lot more regulation from governments (we can already see many initiatives like #GDPR and growing concerns about AI in general), and this will hinder their capacity to innovate if they don’t have the right tools to thrive in the upcoming regulatory space.

3 problems, 3 solutions

Those 3 problems can be solved with 3 overall solutions:

Encryption: which allows one to securely keep models and data private against malicious actors.
Federated learning: which allows one to train computationally greedy machine learning models in a decentralised manner on less computationally efficient devices.
Gradient marketplace: which allows one to be remunerated for one’s contribution to the training process.

Each one of those 3 pillars supports the whole ecosystem in its own way by giving OpenMined the opportunity to create a self-reinforcing positive loop:

Encryption is beneficial to the whole ecosystem: it works as insurance for the data providers and the model owner in a decentralised and anonymous environment. It also alleviates the concerns one might have about sharing confidential information.
Federated learning gives model owners without enough computational power and/or enough data access to a decentralised cluster of computational nodes and data providers to train their model.
The gradient marketplace creates financial incentives for data owners, expertise holders and computation providers to contribute to a training pipeline.

3 solutions, dozens of possible technical implementations

The exciting part of those 3 pillars is that they are actually 3 whole research fields. One could see that as a burden but we don’t. It would be legitimate to ask why on earth one would try to merge 3 research fields and their respective state of the art solutions instead of going for proven and solid existing ones.

The answer is pretty simple: machine learning itself is a field of research where many parts still need to be understood. Good luck finding proven engineering solutions that scale in this field once you add a constraint like privacy on top of it.

This is why OpenMined is still in an exploration phase for the underlying technology that will power the ecosystem. And yet, this is exactly what makes OpenMined valuable, as all those research fields have made some very interesting advances in the last few years. Maybe it’s time to merge all those advances, don’t you think?

Let me point you at some articles that are valuable to get familiar with the different subjects:

Encryption: OpenMined is currently investigating secure multi-party computation (MPC) and homomorphic encryption. To get a good overview of those encryption schemes, I invite you to read this article by Morten Dahl.
Federated learning: OpenMined is currently investigating IPFS and its pub-sub features to build its own federated grid. To get a good overview of federated learning, I invite you to read this Google blog post.
Gradient marketplace: OpenMined is currently investigating ways to incorporate a friendly and powerful remuneration scheme. This is one of the most unique and interesting parts of OpenMined. Let me detail it for you.

A gradient marketplace for everyone

All kinds of machine learning practitioners should have an interest in OpenMined, and actually even non-practitioners (like non-ML companies sponsoring Kaggle competitions).

Every technology used by OpenMined is there to power this “Gradient marketplace”. It is, at the same time, the core of the decentralised training process and the part that end-users interact with. It glues together all our technology: IPFS, Blockchain, MPC, Trillian, Unity, etc.

As a high-level abstraction we have 4 different profiles that will interact in the marketplace:

Task sponsors: they fund a bounty for training a model
Data scientists: they propose models, initialisation and training process
Data owners: they provide the data
Computation providers: they provide computation power (but only in the case of a public model and public data, more on this below 👇🏻)

Nothing prevents these 4 categories from overlapping; this is just the most generic case. Task sponsors and data scientists could belong to the same organisation, and that would not make much of a difference.

Let’s explore the process in more detail now.

First, task sponsors (startups, big Co., random individuals, etc.) need a model. They have no expertise, no data and no computational capacity, but they usually have a business incentive for the model. In that case, they provide a bounty to the network so that the other actors get interested in contributing. They must also provide a validation set so we have an objective metric to compare contributions.

Second, data scientists propose model architectures (including the training and initialisation process) for the task. All those architectures are now ready to compete for the bounty.

Third, every data owner who has data compliant with the task can contribute to the training. There are 2 possibilities here. Either the data is public, and data owners can just provide their data to the network. Or the data is private, and in this case they download a model, train it on their private data and upload the updated model back.

Notice that in the private case, the data never leaves the data owner even if it’s encrypted. This is why computation providers cannot play a role in the private setting.
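To make the private case more concrete, here is a minimal sketch of one training round in which each data owner downloads the weights, trains locally and sends back only the updated weights. Everything in it is an assumption made for illustration: the linear model, the function names `local_update` and `federated_round`, and the example-count weighted averaging (in the spirit of federated averaging). It is not OpenMined's actual implementation.

```python
from typing import List, Tuple

Weights = List[float]
Dataset = List[Tuple[List[float], float]]  # (features, target) pairs


def local_update(weights: Weights, data: Dataset, lr: float = 0.1) -> Weights:
    """One pass of SGD on a linear model, run on the data owner's machine."""
    w = list(weights)  # work on a copy of the downloaded model
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        # Gradient of 0.5 * err**2 with respect to weight i is err * x[i].
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def federated_round(weights: Weights, owners: List[Dataset]) -> Weights:
    """Each owner trains locally; only the updated weights are averaged."""
    updates = [local_update(weights, data) for data in owners]
    total = sum(len(data) for data in owners)
    # Weight each owner's model by the number of examples it was trained on.
    return [
        sum(len(data) * u[i] for data, u in zip(owners, updates)) / total
        for i in range(len(weights))
    ]


# Two owners hold private samples of y = 2 * x; the samples never leave them.
owners = [[([1.0], 2.0), ([2.0], 4.0)], [([3.0], 6.0)]]
model = [0.0]
for _ in range(20):
    model = federated_round(model, owners)
```

After a few rounds the shared weight approaches 2, even though no party ever saw another party's samples. Note that the uploaded weights can still leak information about the data, which is one reason the encryption layer matters.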

Technically, this process creates a tree of models: the root node is the task, the first layer holds the model descriptions, each deeper layer holds a further round of trained models, and the leaves are the most recently trained models.

Occasionally, the task sponsor computes the validation accuracy on all existing leaves to check whether a given threshold has been reached (this could be a duration for the competition, an accuracy threshold, etc.).
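The tree and the periodic check on its leaves can be sketched as follows. The class `ModelNode`, the helper `best_leaf_over` and the accuracy numbers are all hypothetical, and the sketch only covers the accuracy-threshold variant of the stopping condition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelNode:
    name: str
    accuracy: Optional[float] = None  # validation accuracy, once evaluated
    children: List["ModelNode"] = field(default_factory=list)

    def add(self, child: "ModelNode") -> "ModelNode":
        """Attach a child (a newer version of the model) and return it."""
        self.children.append(child)
        return child

    def leaves(self) -> List["ModelNode"]:
        """Collect the nodes with no children, i.e. the latest models."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]


def best_leaf_over(root: ModelNode, threshold: float) -> Optional[ModelNode]:
    """Return the best evaluated leaf if it reaches the threshold, else None."""
    scored = [n for n in root.leaves() if n.accuracy is not None]
    if not scored:
        return None
    best = max(scored, key=lambda n: n.accuracy)
    return best if best.accuracy >= threshold else None


task = ModelNode("task")                        # root: the sponsored task
arch_a = task.add(ModelNode("architecture A"))  # first layer: proposals
arch_b = task.add(ModelNode("architecture B"))
a1 = arch_a.add(ModelNode("A after owner 1", accuracy=0.81))
a2 = a1.add(ModelNode("A after owner 2", accuracy=0.92))
b1 = arch_b.add(ModelNode("B after owner 3", accuracy=0.88))

winner = best_leaf_over(task, threshold=0.90)
```

Here `winner` is the node `a2`, and the path from the root down to it lists every contribution that led to the winning model.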

If the threshold has been reached, the “model winner” is declared and all contributors to the given “model winner” are remunerated:

for data through improvement of the validation accuracy
for computation and RAM usage, billed per operation
for expertise by the number of model architectures proposed for a given problem.

To do so we use micro-payments and cryptocurrencies. As a first step, we will use an existing crypto-platform API to simplify the process and leverage the existing ecosystem. In the long term, OpenMined will just work in a fully decentralised manner leveraging existing cryptocurrencies directly.
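As a rough illustration of the data part of this remuneration, the sketch below splits a bounty among the data owners on the winning path in proportion to the validation accuracy gain each one brought. The article only states that data is rewarded "through improvement of the validation accuracy", so the proportional rule, the function `split_data_bounty` and the numbers are assumptions.

```python
from typing import Dict, List, Tuple


def split_data_bounty(
    bounty: float,
    steps: List[Tuple[str, float]],
    baseline: float,
) -> Dict[str, float]:
    """Split a bounty by accuracy gain along the winning path.

    steps lists (owner, validation accuracy after that owner's update),
    in training order; baseline is the accuracy before any training.
    """
    gains: Dict[str, float] = {}
    previous = baseline
    for owner, accuracy in steps:
        # Only improvements are rewarded; a drop in accuracy earns nothing.
        gain = max(0.0, accuracy - previous)
        gains[owner] = gains.get(owner, 0.0) + gain
        previous = accuracy
    total = sum(gains.values())
    if total == 0:
        return {owner: 0.0 for owner in gains}
    return {owner: bounty * g / total for owner, g in gains.items()}


payouts = split_data_bounty(
    100.0,
    [("alice", 0.70), ("bob", 0.90), ("carol", 0.85)],
    baseline=0.50,
)
```

In this example alice and bob each improved the accuracy by about 0.20 and share the bounty roughly equally, while carol's update lowered the accuracy and earns nothing.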

All of this happens, if needed, in a fully encrypted manner, so that all parties can stay compliant with applicable privacy laws.

Ease of use, a core value

We’ve explored the 3 core technologies of OpenMined, but I would also like to highlight one of its core values. Even in the most technical environments, a lack of ease of use is usually a deal breaker for potential users, and I’m really amazed by how OpenMined tries to keep the balance between user experience and technical challenges.

Remember that the overall goal is to simplify the development process of private ML applications. This simplicity is the bond that unifies the whole architecture: abstracting away the complexity at every step of the development is a necessity for us.

I could list all the efforts that have been made in the past in the different repositories, but that would mean very little. The important message I want to convey is that we want to give everyone access to private machine learning, avoiding overly steep learning curves that would leave non-technical people out of the loop.

The current roadmap

If you want a more dynamic setting than just reading, you can hear it directly from Trask by watching this YouTube video made at the last worldwide hackathon.

Let me just sum up the roadmap here:

1. Open grid for public models on a public centralised dataset (done)

The first and simple step is to be able to use the overall workflow designed for the gradient marketplace on the very simple case of one user holding the data and one user training a model. Consider this achieved!

2. Federated learning: an open grid for public models on public distributed datasets (ongoing)

We are implementing the first federated learning algorithm, with everything public for now. The focus is on handling all the technical details brought by asynchronous and decentralised machine learning pipelines (asynchronous SGD, delayed gradients, etc.). This is currently happening!
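To give a feel for the delayed-gradient problem, here is a minimal sketch of one common heuristic: a gradient computed against an old version of the weights is applied with a smaller step. The `ParameterServer` class and the `1 / (1 + staleness)` scaling are assumptions chosen for illustration, not something the OpenMined repositories are known to use.

```python
from typing import List, Tuple


class ParameterServer:
    """Holds the shared weights and applies gradients as they arrive."""

    def __init__(self, weights: List[float], lr: float = 0.1) -> None:
        self.weights = list(weights)
        self.lr = lr
        self.version = 0  # incremented on every applied update

    def pull(self) -> Tuple[List[float], int]:
        """A worker downloads the weights and remembers their version."""
        return list(self.weights), self.version

    def push(self, grad: List[float], computed_at: int) -> float:
        """Apply a gradient that was computed at version `computed_at`."""
        staleness = self.version - computed_at
        step = self.lr / (1 + staleness)  # older gradients get a smaller step
        self.weights = [w - step * g for w, g in zip(self.weights, grad)]
        self.version += 1
        return step


server = ParameterServer([1.0])
w_a, v_a = server.pull()
w_b, v_b = server.pull()  # both workers start from version 0

# Each worker computes the gradient of w**2, which is 2 * w.
step_a = server.push([2.0 * w_a[0]], v_a)  # fresh gradient, full step
step_b = server.push([2.0 * w_b[0]], v_b)  # one version late, half step
```

The second worker's gradient arrives after the weights have already moved, so it is applied with half the learning rate.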

3. Open grid for public models on private and distributed datasets (soon)

We add the cryptographic layer to the whole technological stack! As the first step on that side, we encrypt only the data, arguably the most important and crucial part of the machine learning training process on confidential data.
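As an illustration of what computing on hidden data can look like, here is a minimal sketch of additive secret sharing, a standard building block of the MPC family mentioned earlier. The modulus, the three-party setup and the function names are assumptions for the example; it does not describe the scheme OpenMined will end up using.

```python
import secrets
from typing import List

PRIME = 2**61 - 1  # all arithmetic is done modulo this prime


def share(value: int, n_parties: int = 3) -> List[int]:
    """Split a value into random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: List[int]) -> int:
    """Recombine all the shares to recover the hidden value."""
    return sum(shares) % PRIME


def add_shared(a: List[int], b: List[int]) -> List[int]:
    """Each party adds its own two shares locally, without communicating."""
    return [(x + y) % PRIME for x, y in zip(a, b)]


a_shares = share(25)  # one data owner's private value
b_shares = share(17)  # another data owner's private value
total_shares = add_shared(a_shares, b_shares)
result = reconstruct(total_shares)
```

Any subset of fewer than all the shares looks like random noise, yet `result` equals 42, the sum of the two private values. Additions come for free in this scheme; multiplications need extra protocol steps that are left out here.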

4. Adding reputation and remuneration

As things start to get private, the question of trust starts to kick in. A reputation system (closely linked to the remuneration system) using the blockchain will be bootstrapped. The more you train/provide data/provide funds in an honest way on the network, the more reputation you get, the more nodes are willing to work with you!

5. Open grid for private models on private data

The final step of the cryptographic setting. Models get encrypted too, improving the incentive for data scientists and private companies to train their models on OpenMined.

6. Backend API for private ML applications

Finally, we can achieve the grand vision and build a set of plug-and-play APIs for many different contexts (games, web apps, mobile apps, etc.).

7. Meet at your own local pub worldwide on live stream to celebrate the work done 🍻

Arguably the most important step of the OpenMined community!

To finish this piece of writing, I would like to highlight the fact that we keep facing technical challenges and open questions about the different possible implementations of the above technologies. We are currently investigating many papers on those subjects to help us move forward, but if you have any knowledge in those areas, please reach out on our Slack!

If you have read this far, thank you for your attention. I hope it was enlightening! Have a good day ☕️