Overfitting and Underfitting

We may not get the performance we want from our machine learning or artificial neural network models right away. In fact, most of the time our models do not perform well on the first try. After training, we need to analyze the results and come up with solutions to the problems that the analysis reveals. If our model does not reach the success we want, the problem we face is usually either overfitting or underfitting. In this article we will discuss what these two problems are and how we can solve them.

What Are Overfitting and Underfitting?

First, let's briefly define these two problems. As we know, when training our models we split our data into training and test sets. In the case of overfitting, our model performs very well on the training set but much worse on the test set: it cannot generalize what it learned from the training set, and as a result it fails on the test set. As an analogy, think of a student who memorizes the material very well before an exam, but fails when the exam asks different kinds of questions. In the case of underfitting, our model performs poorly on both the training and the test data. The analogy here is a student who takes the exam without studying at all and, naturally, does badly.

Let's work through an example to picture these two problems more concretely. Suppose we have two classes, as in the figure below, each described by two features, and suppose the samples of the two classes are distributed in two dimensions as follows.

[Figure: Class Distributions]

Before we move on to the artificial neural network side of things, let's classify the data with logistic regression, a classic machine learning algorithm, and visualize the models we obtain in different situations. If we are facing underfitting, we might get a model like the following.

[Figure: Underfitting]

As the figure shows, the model is not very successful at classification even on the training set. In particular, it has trouble classifying the samples that belong to class 1. For example, the accuracy on the training and test sets might be 61.5% and 59.8%, respectively.
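As a minimal sketch of how such a comparison can be produced in practice (the data set and numbers below are synthetic and illustrative, not the ones in the figures), we can fit a logistic regression model with scikit-learn and compare its training and test accuracy:

```python
# A minimal sketch: fit logistic regression on a synthetic two-class,
# two-feature data set and compare training vs. test accuracy.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two classes, two features; the noise level is illustrative.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Two low scores point to underfitting; a large gap between the two
# scores points to overfitting.
print("Training accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
```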

If the problem we face is overfitting, we are very likely to obtain a model like the one below. Here the model learns the training set very well by drawing a very sharp, winding decision boundary, but when we move to the test data it cannot classify properly. Example accuracy values are shown in the figure.

[Figure: Overfitting]

The ideal model we want to achieve is shown below. This model generalizes well what it has learned from the training set and, as a result, performs successfully on both the training and the test set.

[Figure: Ideal Situation]

Solutions

Now that we have covered what the two problems are, we can move on to the solutions in the context of artificial neural networks. Although we phrase things in terms of neural networks, most of what we will cover also applies to classical machine learning algorithms.

Solutions for Underfitting

Underfitting is an easier problem to deal with than overfitting. To address it, we need to increase the capacity of our model, which we do by changing its structure: for example, we can increase the number of neurons in the layers or add more layers. This way our model learns the training data better, although if we overdo it we may run into overfitting.
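As a rough sketch of what increasing capacity looks like in practice (the framework, layer counts and sizes below are illustrative assumptions, not values from this article), we can go from a small Keras network to a wider and deeper one:

```python
# A sketch of increasing model capacity in Keras; layer counts and sizes are illustrative.
from tensorflow import keras
from tensorflow.keras import layers

# A small model that may underfit a complex problem.
small_model = keras.Sequential([
    layers.Dense(8, activation="relu", input_shape=(20,)),
    layers.Dense(1, activation="sigmoid"),
])

# A higher-capacity model: more neurons per layer and more layers.
larger_model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
```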

Solutions for Overfitting

We can approach this problem in two ways:

  • Training the model on more data
  • Reducing the complexity of the model

The reason our model overfits is that its capacity is too high for the data set. In this case, we must either enrich our data set, which we can achieve by collecting more data or by data augmentation, or we must reduce the complexity of our model. The methods used for this purpose are generally called regularization techniques. Let's take a look at them.

Data Augmentation

Collecting more data is not always easy, but for data sets built for computer vision problems such as image classification or object detection we can expand the data with certain transformations. For example, we can create new versions of our images by flipping them horizontally or vertically, decreasing or increasing the brightness, randomly cropping them, rotating them at different angles, or adding noise, and add these versions to the data set.

[Figure: Horizontal and Upside-Down Flip]
[Figure: Random Crop]
[Figure: Rotating at Different Angles]
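As a sketch of how these transformations can be applied on the fly during training (the parameter values below are illustrative), Keras' ImageDataGenerator covers most of them:

```python
# A sketch of on-the-fly image augmentation with Keras; all parameter values are illustrative.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    horizontal_flip=True,         # random horizontal flips
    vertical_flip=True,           # random upside-down flips
    brightness_range=(0.7, 1.3),  # decrease or increase brightness
    rotation_range=30,            # rotate up to 30 degrees
    width_shift_range=0.1,        # random shifts, a rough stand-in for random crops
    height_shift_range=0.1,
)

# Assuming x_train and y_train are already-loaded image arrays and labels:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=20)
```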

L2 Regularization

The purpose of this regularization is to keep the values of our network's parameters (the weights) small. Large weights are a sign of an unstable model: they cause sharp transitions in the neurons' functions, so very small changes in the inputs have a huge impact on the outputs. We therefore need to keep the weights small somehow, and we can achieve this with a small change to our loss function. For example, let's modify the mean squared error loss as follows.

[Figure: Mean Squared Error with L2 Regularization]

Here we add a term to our loss function: we sum the squares of all the weights and scale that sum by λ/2m, so the regularized loss becomes J = MSE + (λ/2m) Σ w². This way, if the weight values grow too large, the loss increases. During optimization the weights are therefore continually pushed toward smaller values, because large weights would increase the loss.
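In Keras, for example, this penalty can be attached to a layer's weights through its kernel_regularizer argument; the coefficient 0.01 and the layer sizes below are illustrative choices:

```python
# A sketch of L2 regularization in Keras; the coefficient 0.01 and layer sizes are illustrative.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,),
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(1, activation="sigmoid"),
])
# The sum of the squared weights, scaled by the coefficient, is added to the
# training loss, so large weights are penalized during optimization.
```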

L1 Regularization

The only difference from L2 regularization is that the absolute values of the weights are summed instead of their squares. With this regularization, the weight values tend to be driven all the way to 0. In general, L2 regularization is preferred.

[Figure: Mean Squared Error with L1 Regularization]
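In the Keras sketch above, switching to L1 regularization only means swapping the regularizer attached to the layer; the coefficient is again illustrative:

```python
# Same idea with L1 regularization instead of L2; the coefficient is illustrative.
from tensorflow.keras import layers, regularizers

dense_l1 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.01))
```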

Dropout

In this technique, during training, the neurons in each layer are randomly ignored in every iteration according to a given probability. By dropping some neurons we effectively reduce the number of neurons and obtain a simpler model. From another point of view, since different neurons are dropped in each iteration, the weight values end up distributed more evenly across all neurons and no single weight grows out of balance. In the example below you can see how dropout in different iterations yields different networks.

[Figure: Dropout]
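In Keras, for instance, dropout is applied by placing a Dropout layer after the layer whose outputs should be dropped; the rate 0.5 and the layer sizes below are illustrative choices:

```python
# A sketch of dropout in Keras; the drop rate 0.5 and layer sizes are illustrative.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(20,)),
    layers.Dropout(0.5),   # each neuron's output is zeroed with probability 0.5 during training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
```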

Early Stopping

When training our model, it is important to plot the loss values on the training and test sets and monitor them at each iteration. The point where the loss on the test set starts to increase is where we begin to overfit, and we can prevent overfitting by stopping training there. Looking at the graph below, we could stop training at around the 50th iteration and avoid a likely overfitting problem.

[Figure: Early Stopping]
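In Keras this observation can be automated with the EarlyStopping callback; the monitored quantity and patience value below are illustrative choices:

```python
# A sketch of early stopping in Keras; monitor and patience values are illustrative.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",         # watch the loss on the held-out set
    patience=5,                 # stop if it has not improved for 5 epochs
    restore_best_weights=True,  # roll back to the weights from the best epoch
)

# Assuming a compiled model and the data arrays are already defined:
# model.fit(x_train, y_train, validation_data=(x_test, y_test),
#           epochs=200, callbacks=[early_stop])
```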