10 steps to bootstrap your machine learning project (part 1)
Starting a new machine learning project is always exciting. As a technical person, your first move would probably be to open your favorite code editor and start the magical stuff. Not so fast ✋.
At MetaFlow, we found that in order to gain intuition and be able to work faster and faster on AI projects, following some methodology helps. Here is ours!
1. Define your task 🏁
You should define the goal of your project, its inputs and outputs, and the constraints it has to respect.
Let’s take an example. Suppose the goal of your project is to predict what someone will order for dessert in a restaurant, based on what they have already eaten.
If the meals on the menu are already available as a list of classes, the input of the task is what the client has already eaten (several class ids) and its output is what they are most likely to order (one class id). What about constraints? The model should run on the tablets the waiters already have (memory efficiency) and give an answer in less than a second (speed efficiency).
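Writing the task down as a small contract can help before any model exists. Here is a minimal sketch in Python; the menu, the `TaskSpec` name and the memory budget are assumptions made up for illustration, not values from a real project.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical menu: each meal is a class, identified by its index.
MENU = ["soup", "salad", "steak", "pasta", "tiramisu", "ice cream"]


@dataclass(frozen=True)
class TaskSpec:
    """Written-down contract for the dessert prediction task."""
    num_classes: int       # size of the menu
    max_latency_s: float   # speed constraint: answer in under a second
    max_memory_mb: int     # memory constraint (assumed budget for a tablet)

    def check_input(self, eaten: List[int]) -> bool:
        """Input: ids of the meals the client has already eaten."""
        return all(0 <= i < self.num_classes for i in eaten)

    def check_output(self, predicted: int) -> bool:
        """Output: a single class id, the most likely dessert."""
        return 0 <= predicted < self.num_classes


spec = TaskSpec(num_classes=len(MENU), max_latency_s=1.0, max_memory_mb=200)
print(spec.check_input([0, 2]))  # True
print(spec.check_output(4))      # True
```

Any model you build later has to fit this contract, which makes the constraints hard to forget.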
2. Define the dataset
How will you feed your neural network? You can look for a public dataset, license an existing one, or build your own.
Getting an existing dataset
There are plenty of public datasets. Here are a few examples:
- YouTube-8M: 8 million labeled YouTube videos
- Facebook’s bAbI: Q&A dataset
- Wikitext: more than 100 million tokens extracted from Wikipedia
- Data.gov: U.S. Government’s open data (more than 190k datasets!)
Your own dataset
In some cases, the data you want exists on the internet but has never been packaged as an official dataset. This is where you’ll have to write a proper web crawler. Many libraries let you create one in no time. At MetaFlow, we’re big fans of Casper.js, but use the one you like the most.
Example: let’s say your project is to create a chatbot capable of talking like Eminem would. One dataset you would need is a compilation of all his songs’ lyrics. There is no such public dataset. Thankfully, you can easily find his lyrics on Genius and write a crawler that builds your dataset.
Another way to get a dataset is to save your own data. If your goal is to automate an existing product, you probably already have it. Otherwise, you could create a first app, powered by humans (or workers from Mechanical Turk), whose goal is to collect the data you need.
Let’s get back to our initial example: predicting what the client will order for dessert. The dataset we need here is the history of everything the restaurant’s clients have ordered. We probably already have that data somewhere, and simply reformatting it could be enough.
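Once a page is downloaded, most of a crawler's work is pulling the useful text out of the HTML. Here is a minimal sketch of that step using only Python's standard library. The sample page and the `<div class="lyrics">` container are hypothetical; a real site's markup will differ and has to be inspected first.

```python
from html.parser import HTMLParser
from typing import List


class LyricsExtractor(HTMLParser):
    """Collect the text found inside <div class="lyrics"> (hypothetical markup)."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0               # > 0 while inside the lyrics container
        self.lines: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0 and tag == "div":
            self.depth += 1          # nested div inside the container
        elif tag == "div" and ("class", "lyrics") in attrs:
            self.depth = 1           # entering the container

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.lines.append(text)


# Stand-in for a downloaded page, so the sketch runs without network access.
SAMPLE_PAGE = """
<html><body>
  <h1>Song title</h1>
  <div class="lyrics">First line<br>Second line</div>
  <div class="footer">Not part of the lyrics</div>
</body></html>
"""

parser = LyricsExtractor()
parser.feed(SAMPLE_PAGE)
parser.close()
print(parser.lines)  # ['First line', 'Second line']
```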
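As an illustration of that formatting step, here is a minimal sketch that turns raw orders into (input, output) pairs. The menu, the order history and the rule "the first dessert of an order is the label" are assumptions made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical menu and order history, for illustration only.
MENU = ["soup", "salad", "steak", "pasta", "tiramisu", "ice cream"]
DESSERTS = {"tiramisu", "ice cream"}
CLASS_ID: Dict[str, int] = {name: i for i, name in enumerate(MENU)}

order_history = [
    ["soup", "steak", "tiramisu"],
    ["salad", "pasta", "ice cream"],
    ["steak"],  # no dessert ordered: skipped below
]


def to_examples(history: List[List[str]]) -> List[Tuple[List[int], int]]:
    """Turn each order into (ids of meals eaten, id of the dessert ordered)."""
    examples = []
    for order in history:
        desserts = [meal for meal in order if meal in DESSERTS]
        if not desserts:
            continue  # nothing to predict for this order
        eaten = [CLASS_ID[meal] for meal in order if meal not in DESSERTS]
        examples.append((eaten, CLASS_ID[desserts[0]]))
    return examples


examples = to_examples(order_history)
print(examples)  # [([0, 2], 4), ([1, 3], 5)]
```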
At this point, you should not spend too much time checking the size of your data. This is something we’ll check with our first results.
Something interesting to do with your dataset is data-visualization: t-SNE / trees if you are working with words, histograms showing the distribution of your data if you are working with classes. To know more about t-SNE, read this excellent article from Distill.
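For classes, a histogram does not need a plotting library to be useful. Here is a minimal text-only sketch; the labels are invented for the example.

```python
from collections import Counter

# Hypothetical dessert labels, one per past order.
labels = ["tiramisu", "ice cream", "tiramisu", "fruit salad", "tiramisu", "ice cream"]

counts = Counter(labels)
total = sum(counts.values())

# One bar per class, most frequent first.
for name, count in counts.most_common():
    share = count / total
    print(f"{name:<12} {'#' * count} {share:.0%}")
```

A strongly skewed histogram is an early hint that accuracy alone will be a misleading metric.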
3. Split your dataset ✂️
Ok, this one is easy. Just split your dataset into three groups: the training set, the dev (or cross validation) set and the test set. A classic split is 70%, 15% and 15% of your whole dataset.
One point that might be tricky is how your classes are distributed across the three sets. Your training set, dev set and test set should each contain enough examples of every class you are trying to predict. Also make sure that your dev and test sets come from the same statistical distribution.
For example, if you are trying to predict whether there’s a dog or a cat in a picture, your model should be able to learn what a dog is and what a cat is, so the training set must contain examples of both. The same goes for the cross validation and test sets.
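One way to keep every class present in each set is to split class by class (a stratified split). Here is a minimal sketch with the standard library only; the `stratified_split` helper and the toy labels are assumptions for illustration, and libraries such as scikit-learn offer their own tools for this.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    ratios: Tuple[float, float, float] = (0.70, 0.15, 0.15),
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Split example indices into train/dev/test, one class at a time."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    split: Dict[str, List[int]] = {"train": [], "dev": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_dev = round(ratios[1] * len(indices))
        split["train"] += indices[:n_train]
        split["dev"] += indices[n_train:n_train + n_dev]
        split["test"] += indices[n_train + n_dev:]  # whatever is left
    return split


# Hypothetical, imbalanced dataset: 80 cats and 20 dogs.
labels = ["cat"] * 80 + ["dog"] * 20
split = stratified_split(labels)
print({name: len(indices) for name, indices in split.items()})
# {'train': 70, 'dev': 15, 'test': 15}
```

With a purely random split, a rare class can end up missing from the dev or test set, which a per-class split avoids.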
4. Define your metrics 📊
Now, you should have your problem defined and your dataset ready to fuel your neural network. However, it’s still not the time for you to code (I know you wanted to). It’s time to define the metrics you want to achieve with your project: what should be the quality of the answer given by your algorithm?
It can be difficult to have a sense of what is possible or not. That’s why it’s important to take a look at recent academic papers (on arXiv or Google Scholar) in the domain you’re working on and check which metrics are being used and what results they report.
For example, if you are working on a summarization problem, check the ROUGE metric.
Here is a cheat sheet of binary classification metrics you can use for your project: http://www.damienfrancois.be/blog/files/modelperfcheatsheet.pdf.
Sometimes one kind of mistake is more costly than another. Let’s say you are working on a rare disease classifier. This disease affects only 0.0001% of your patients. Thus, if we create a simple model that always predicts that the disease is not present, we’ll have a very good accuracy. However, this model is really dangerous for the patients who are actually ill. Rather than measuring the error rate of our classifier, we should measure its Recall and its Precision.
The Recall of our classifier is the number of correctly predicted positive results divided by the total number of elements that are actually positive.
It measures how many positive items we have missed during our prediction.
To achieve a 100% Recall rate, we should never miss an ill patient: we should never be wrong when predicting that someone is not ill.
The Precision of our classifier is the number of correctly predicted positive results divided by the total number of predicted positive results.
It measures how many negative items we have wrongly flagged as positive during our prediction.
To achieve a 100% Precision rate, we should never be wrong when predicting that someone is ill.
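It is easy to reach a 100% Precision or a 100% Recall rate separately, by ignoring the other one. That’s why we usually try to maximize an F score, which combines both. Its most common form, the F1 score, is the harmonic mean of Precision and Recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)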
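The rare disease example can be checked on toy data. This is a minimal sketch that assumes the F score in question is F1 (the harmonic mean of Precision and Recall) and that an undefined ratio is reported as 0.0, which is a convention chosen here rather than a rule. The data is invented.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute Precision, Recall and F1 for binary labels (1 = ill)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention assumed here: an undefined ratio counts as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy data: 2 ill patients out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that always answers "not ill" looks fine on accuracy...
always_healthy = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_healthy)) / len(y_true)
print(accuracy)                                      # 0.8
# ...but finds none of the ill patients.
print(precision_recall_f1(y_true, always_healthy))   # (0.0, 0.0, 0.0)

# A model that finds one ill patient and raises one false alarm.
some_model = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(precision_recall_f1(y_true, some_model))       # (0.5, 0.5, 0.5)
```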
You can also allow your model not to answer every time. This can be useful if you would rather get no answer than a wrong one from your machine learning algorithm.
For example, if you are working on a chatbot answering your customers’ questions, this chatbot will not always be able to select the right template to use to answer. What should it do? You can define a confidence threshold under which your chatbot will simply not answer.
The percentage of inputs your algorithm does answer is called its “coverage”.
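Here is a minimal sketch of that idea, assuming the model outputs one probability per answer template and that the highest probability is used as its confidence. The outputs, the threshold of 0.6 and the function names are invented for illustration.

```python
from typing import Optional, Sequence, Tuple


def answer_or_abstain(probabilities: Sequence[float], threshold: float) -> Optional[int]:
    """Return the most likely class id, or None when confidence is too low."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best if probabilities[best] >= threshold else None


def coverage_and_accuracy(
    predictions: Sequence[Optional[int]], truths: Sequence[int]
) -> Tuple[float, float]:
    """Coverage: share of inputs answered. Accuracy: measured on answered inputs only."""
    answered = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    coverage = len(answered) / len(truths)
    accuracy = sum(p == t for p, t in answered) / len(answered) if answered else 0.0
    return coverage, accuracy


# Hypothetical model outputs: one probability per answer template.
outputs = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]
truths = [0, 1, 1, 2]

predictions = [answer_or_abstain(o, threshold=0.6) for o in outputs]
print(predictions)                                 # [0, None, 1, None]
print(coverage_and_accuracy(predictions, truths))  # (0.5, 1.0)
```

Raising the threshold usually trades coverage for accuracy on the answers that remain, so both numbers should be reported together.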
5. Establish a baseline
A baseline is the simplest method you can think of, that solves your problem. It can use heuristics, simple statistics, logistic regression, randomness, regular expressions, etc. The goal of the baseline is to have a reference metric you should beat with your ML algorithm.
Scikit-learn’s dummy estimators give a few examples for a classification problem:
- Stratified: generates random predictions by respecting the training set class distribution.
- Most frequent: always predicts the most frequent label in the training set.
- Prior: always predicts the class that maximizes the class prior.
- Uniform: generates predictions uniformly at random.
- Constant: always predicts a constant label that is provided by the user.
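To show how little code such a baseline needs, here is a minimal standard-library sketch of the "most frequent" and "uniform" strategies. It does not use scikit-learn itself, and the labels are invented.

```python
import random
from collections import Counter
from typing import List, Sequence


def most_frequent_baseline(train_labels: Sequence[str], n: int) -> List[str]:
    """Always predict the most frequent label seen in the training set."""
    label = Counter(train_labels).most_common(1)[0][0]
    return [label] * n


def uniform_baseline(train_labels: Sequence[str], n: int, seed: int = 0) -> List[str]:
    """Predict a label uniformly at random among the known classes."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    return [rng.choice(classes) for _ in range(n)]


def accuracy(predictions: Sequence[str], truths: Sequence[str]) -> float:
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)


# Hypothetical dessert labels.
train_labels = ["tiramisu"] * 6 + ["ice cream"] * 3 + ["fruit salad"]
test_labels = ["tiramisu", "ice cream", "tiramisu", "tiramisu"]

baseline = most_frequent_baseline(train_labels, len(test_labels))
print(accuracy(baseline, test_labels))  # 0.75
```

On this toy data, a model that cannot beat 75% accuracy has learned nothing beyond the class distribution.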
If you are working on an NLP problem, the baseline can also be a regular expression predicting the class some text belongs to.
Run your baseline on your test set and compute the metrics you defined in the previous step. Keep those metrics in mind as you’ll soon want to go way beyond them!
If the baseline already beats your expectations and your defined goals, you might want to reconsider your problem 🕵.
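Here is a minimal sketch of such a regular expression baseline, for a hypothetical classifier sorting customer messages into "complaint", "question" or "other". The patterns and class names are invented for illustration.

```python
import re

# Hypothetical rules, checked in order: the first matching pattern wins.
RULES = [
    (re.compile(r"\b(refund|broken|not working|disappointed)\b", re.IGNORECASE), "complaint"),
    (re.compile(r"\?\s*$|^\s*(how|what|when|where|can|do)\b", re.IGNORECASE), "question"),
]
DEFAULT_CLASS = "other"


def regex_baseline(text: str) -> str:
    """Predict the class of a message from hand-written patterns."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return DEFAULT_CLASS


print(regex_baseline("How do I reset my password?"))  # question
print(regex_baseline("My order arrived broken"))      # complaint
print(regex_baseline("Thanks a lot"))                 # other
```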
These are our first five steps when working on a new machine learning project. You can find the last five steps here.
Share your own methodology in the comments!