We are currently immersed in vast quantities of data and information that we can use to build efficient machine learning models. The biggest drawback we find when working with supervised models is that we rarely have enough labeled data, and labeling data demands a great deal of time and cost.
In this blog post I’m going to tell you about strategies for labeling data in a smart way, so that we can build more robust models with less labeled data. This approach is called Active Learning: a branch of machine learning in which, during training, we select new data to be labeled by human annotators.
How to use this technique?
Let’s start with an example: suppose we only have the time and effort to label 500 samples to improve our model. This strategy allows us to pick those 500 samples so that they optimize the model at a lower cost. The following graphic shows the process described above.
First, we train a model with a small data set that we have previously labeled. Suppose this model achieves a metric of X%, for example 70% accuracy. To improve it we need to label much more data, and this is where Active Learning comes in. Through Active Learning we select a subset of the unlabeled data, for example 500 samples, have a human annotator label those samples, and then re-train the model. We repeat the process until we reach the desired metrics.
Active Learning can be applied in different settings, such as Membership Query Synthesis, Stream-Based Selective Sampling, and Pool-Based Sampling, among others. In this post we will look at the most commonly used one: Pool-Based Sampling.
In this setting we assume that we already have some labeled data (a subset L) and another set of unlabeled data U. In each iteration, a subsample belonging to U is selected to be labeled by a human annotator.
There are different strategies for selecting the subset of data to be labeled, each suited to different problems:
- Uncertainty sampling: selects the data whose predictions the model is most uncertain about, typically using entropy as the measure of uncertainty. Example: in a binary classification model that outputs a probability, uncertainty sampling selects the samples whose predicted probability is closest to 0.5.
- Variance Reduction: selects a subset of unlabeled data so as to reduce the variance of the final model, thereby indirectly minimizing its generalization error. However, this strategy can be computationally very expensive.
- Expected Error Reduction: selects the subset of data that is expected to most decrease the model’s generalization error once labeled, which in turn can reduce, for example, the false positive rate.
- Query-By-Committee (QBC): trains a set of models on the labeled data. These trained models then predict on new unlabeled data, and the samples on which their predictions disagree the most are sent to be labeled.
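As a quick illustration of the QBC idea, a common way to measure committee disagreement is the vote entropy over the models' predicted labels. The following is a minimal sketch with made-up predictions, separate from the iris example below:

```python
import numpy as np

# hypothetical hard-label predictions of a committee of 3 models
# on 4 unlabeled samples (3 possible classes)
votes = np.array([
    [0, 0, 0],  # full agreement
    [0, 1, 1],
    [0, 1, 2],  # maximal disagreement
    [1, 1, 1],  # full agreement
])

def vote_entropy(row, n_classes=3):
    # entropy of the committee's vote distribution for one sample
    freqs = np.bincount(row, minlength=n_classes) / len(row)
    nonzero = freqs[freqs > 0]
    return -np.sum(nonzero * np.log(nonzero))

disagreement = np.array([vote_entropy(r) for r in votes])
query_idx = int(np.argmax(disagreement))  # the sample the committee disagrees on most
```

The sample with the highest vote entropy (here, the one where every model predicted a different class) is the one sent to the annotator.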
There are more strategies than those mentioned above, such as Expected Model Change and Density-Weighted Methods. For the example below we will use the first strategy, uncertainty sampling.
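To make the uncertainty sampling idea concrete, here is a minimal sketch that scores predicted class distributions by their entropy and queries the most uncertain sample. The probabilities are made up for illustration:

```python
import numpy as np

def entropy(probs):
    # Shannon entropy of one predicted class distribution
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs + 1e-12))

# hypothetical predicted probabilities for three unlabeled samples
predictions = np.array([
    [0.95, 0.05],  # confident prediction -> low entropy
    [0.55, 0.45],  # close to 0.5 -> high entropy
    [0.80, 0.20],
])

scores = np.array([entropy(p) for p in predictions])
query_idx = int(np.argmax(scores))  # index 1: the sample closest to 0.5
```

The sample whose probability is nearest to 0.5 has the highest entropy, so it is the one selected for labeling.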
In this toy example, I will use the iris dataset to validate how Active Learning helps improve models. I will use a library called modAL to train a model with the uncertainty sampling strategy, and later I will train another model with a random sampling strategy.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# load dataset
iris = load_iris()
X_raw = iris['data']
y_raw = iris['target']
Let’s split the dataset to simulate the labeling done by a human annotator.
# split dataset
n_labeled_examples = X_raw.shape[0]
training_indices = np.random.randint(low=0, high=n_labeled_examples, size=5)

X_train = X_raw[training_indices]
y_train = y_raw[training_indices]

# Isolate the non-training examples we'll be querying.
X_pool = np.delete(X_raw, training_indices, axis=0)
y_pool = np.delete(y_raw, training_indices, axis=0)
I created a utility function, loop, with the following input parameters: a learner object (which encapsulates the query strategy), the unlabeled dataset, a list to store metrics, and a maximum number of iterations.
First, I train a Random Forest estimator whose query strategy is uncertainty sampling.
from modAL.models import ActiveLearner
from sklearn.ensemble import RandomForestClassifier

# uncertainty sampling is modAL's default query strategy
learner = ActiveLearner(estimator=RandomForestClassifier(),
                        X_training=X_train, y_training=y_train)

unqueried_score = learner.score(X_raw, y_raw)
The training loop:
clf, list_history = loop(learner, X_pool, y_pool, performance_history=[unqueried_score])
To compare Active Learning against a random strategy, I will train a new model with a random query strategy.
def random_sampling(classifier, X_pool):
    query_idx = np.random.choice(range(X_pool.shape[0]))
    return query_idx, X_pool[query_idx]

# create a random learner object with the strategy defined above.
learner_random = ActiveLearner(estimator=RandomForestClassifier(),
                               query_strategy=random_sampling,
                               X_training=X_train, y_training=y_train)

clf, list_history_random = loop(learner_random, X_pool, y_pool, performance_history=[unqueried_score])
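The two performance histories can then be compared with a simple matplotlib sketch, where list_history and list_history_random come from the two loop calls above:

```python
import matplotlib.pyplot as plt

def plot_history(history_al, history_random):
    # accuracy after each query, for both query strategies
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(history_al, label='uncertainty sampling')
    ax.plot(history_random, label='random sampling')
    ax.set_xlabel('number of queries')
    ax.set_ylabel('accuracy')
    ax.legend()
    return fig

# e.g. plot_history(list_history, list_history_random)
```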
In the next graph we can see how Active Learning reaches better metrics in fewer iterations than the random sampling strategy.
As can be seen, using a strategy to label data more intelligently helps improve the model’s metrics with much less data.
To conclude, Active Learning is a tool that allows us to streamline our labeling workflow. Many companies are already using these strategies in real projects, and adoption keeps growing.