Optimization Techniques

Gradient descent is the basis of many machine learning algorithms and artificial neural networks. Although more advanced versions of the gradient descent algorithm are available, it is important to understand how it works, since it establishes the basic logic behind all of them.

Gradient Descent

Gradient means slope. In other words, the purpose of the gradient descent algorithm is to follow the slope of our loss function down to its minimum point. To understand this, let's examine the function f(x) = x^4 - 5x^3 + x^2 + 21x - 18, which is much simpler than a real loss function. You can see the graph of the function below.

[Figure: graph of f(x) = x^4 - 5x^3 + x^2 + 21x - 18]

The minimum value of our function is f(-1) = -32, located at x = -1. Our goal is to find the x value that minimizes f(x). Let's choose x = -2 as the starting point; our goal is to descend to x = -1. At x = -2, the value of our function is f(-2) = 0.

[Figure: starting point x = -2]

Now we take the first derivative of our function: f'(x) = 4x^3 - 15x^2 + 2x + 21. At x = -2, the derivative's value is f'(-2) = -75. The sign of this value tells us the direction in which the function increases. Since f'(-2) = -75 is negative, the function increases as we move in the - direction from x = -2 and decreases as we move in the + direction. We must therefore always move in the direction opposite to the slope, which gives the following formula for updating our x value.
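As a quick sanity check, this sign rule can be verified numerically. The snippet below is a small illustrative sketch of our own (the function names `f` and `f_prime` are chosen for the example):

```python
# Illustrative check of the derivative's sign at x = -2.
def f(x):
    return x**4 - 5*x**3 + x**2 + 21*x - 18

def f_prime(x):
    return 4*x**3 - 15*x**2 + 2*x + 21

print(f_prime(-2))  # -75

# A small step in the - direction increases f; in the + direction it decreases f.
h = 1e-3
print(f(-2 - h) > f(-2))  # True
print(f(-2 + h) < f(-2))  # True
```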

x_new = x - f'(x)

We update our x value in each iteration. According to the formula, our new x is x = -2 - (-75) = 73. The value of our function at x = 73 is as large as 26,460,000. Our mistake here was taking a very big step from x = -2: we overshot the point x = -1 and landed on an even greater function value.

[Figure: overshoot to x = 73]
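The overshoot described above is easy to reproduce in a few lines (an illustrative sketch; the function names are our own):

```python
# Update rule without a learning rate: x_new = x - f'(x)
def f(x):
    return x**4 - 5*x**3 + x**2 + 21*x - 18

def f_prime(x):
    return 4*x**3 - 15*x**2 + 2*x + 21

x = -2
x_new = x - f_prime(x)  # -2 - (-75) = 73
print(x_new)            # 73
print(f(x_new))         # 26460000
```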

To avoid making such a mistake, we must move toward our goal with smaller but more confident steps. We therefore make a slight change to the formula and multiply f'(x) by a learning rate coefficient. Our new formula is as follows.

x_new = x - α · f'(x)   (α: learning rate)

The learning rate is a parameter that we must choose when training artificial neural networks. Let's choose 0.01 for now. Going back to x = -2, when we update our x value according to the new formula, we get x = -2 - 0.01 · (-75) = -1.25. As you can see, we are moving toward our goal with more confident steps, without overshooting x = -1. We calculate f(-1.25) ≈ -30.48 and f'(-1.25) = -12.75.

[Figure: x = -1.25]
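This first step with the learning rate can also be reproduced directly (again an illustrative sketch with our own function names):

```python
# One update step with a learning rate of 0.01.
def f(x):
    return x**4 - 5*x**3 + x**2 + 21*x - 18

def f_prime(x):
    return 4*x**3 - 15*x**2 + 2*x + 21

learning_rate = 0.01
x = -2
x_new = x - learning_rate * f_prime(x)  # -2 - 0.01 * (-75) = -1.25
print(x_new)               # -1.25
print(round(f(x_new), 2))  # -30.48
print(f_prime(x_new))      # -12.75
```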

Instead of calculating the remaining steps manually, let's minimize our function with Python code. You can review the code below and also try it yourself with different starting x values and different learning rates. For example, if you start at x = 6, you can get stuck in the local minimum at x = 3. When training artificial neural networks, the process can likewise stall at such local minimum points; in a following article we will discuss the solutions that can be applied in these situations.

def function(x):
    # f(x) = x^4 - 5x^3 + x^2 + 21x - 18
    return pow(x, 4) - 5*pow(x, 3) + pow(x, 2) + 21*x - 18

def turev(x):
    # f'(x) = 4x^3 - 15x^2 + 2x + 21
    return 4*pow(x, 3) - 15*pow(x, 2) + 2*x + 21

def guncelle(x, ogrenme_hizi):
    # Gradient descent update: x_new = x - learning_rate * f'(x)
    return x - turev(x) * ogrenme_hizi

x = -2

iteration = 50
ogrenme_hizi = 0.01  # learning rate

print("X initial value : ", x, ", f({}) : ".format(x), function(x))

# We update our x value over 50 iterations
for i in range(iteration):
    x = guncelle(x, ogrenme_hizi)
    print(i + 1, ". iteration result x : ", x, " f({}) : ".format(x), function(x))

As for the artificial neural network side of things, the value of our loss function depends on the parameters of our model. These parameters are the weights and the bias (threshold) values of the neurons. For example, there are a total of 41 parameters in the artificial neural network below.

[Figure: artificial neural network with 41 parameters]

In our example above, our function depended only on the x variable, which is why we could visualize it in 2 dimensions. It is impossible, however, to fully visualize a 41-variable loss function in 2 or 3 dimensions; think about the complexity of models with millions of parameters. At this point, we must find the parameter values that minimize our loss function. This is calculated with the help of more complex mathematical formulas that we will not cover here, but the logic works the same as in the example above. This is where deep learning libraries come to our rescue and handle the optimization of our model in the background.

In the gradient descent algorithm, optimization is performed after processing every instance in the training data set and calculating the model's loss value. This approach has some drawbacks: it is very slow and consumes a lot of memory on large data sets. Alongside plain gradient descent, there are also the stochastic gradient descent and mini-batch gradient descent algorithms.

Stochastic Gradient Descent

In the stochastic gradient descent algorithm, a random sample is selected in each step, its loss value is calculated, and the parameters are updated. Once these operations have been performed for every instance in the training set, the next epoch begins. The advantage of this approach is that the parameters are updated more frequently, so the model can approach the minimum point faster, and it requires less memory since we process one instance at a time. The disadvantages are that the parameters keep being updated even after the loss has reached a minimum, which can move us away from the minimum value, and that the learning rate must be gradually adjusted during training.
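As an illustration, here is a minimal sketch of stochastic gradient descent on a toy linear-regression problem. The data and all names are made up for the example; this is not the article's original code:

```python
import random

# Toy data: 11 samples from the line y = 2x + 1; goal: recover w = 2, b = 1.
random.seed(0)
data = [(x, 2*x + 1) for x in range(-5, 6)]

w, b = 0.0, 0.0
learning_rate = 0.01

for epoch in range(200):
    random.shuffle(data)              # visit the samples in a random order
    for x, y in data:                 # one parameter update per sample
        error = (w*x + b) - y         # gradient of 0.5 * (prediction - y)^2
        w -= learning_rate * error * x
        b -= learning_rate * error

print(round(w, 2), round(b, 2))       # approaches 2.0 and 1.0
```

Note how the parameters are updated once per sample, which is exactly why SGD makes frequent, cheap updates but follows a noisier path toward the minimum.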

Mini-Batch Gradient Descent

This is the most preferred of the three gradient descent variants. In this algorithm, the training set is divided into parts of a fixed size. For example, a training set with 128 instances can be divided into 8 mini-batches of 16. The model's loss value is calculated for each part, and then the parameters are updated. We must decide on the batch size before training; powers of 2 such as 2, 4, 8, 16, 32, 64 are usually preferred.
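A minimal mini-batch sketch on the same kind of toy problem, using the example above of 128 instances split into 8 mini-batches of 16 (again, illustrative code of our own, not the article's):

```python
import random

# 128 toy samples of y = 2x + 1, processed in 8 mini-batches of 16.
random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(128)]
data = [(x, 2*x + 1) for x in xs]

batch_size = 16
w, b = 0.0, 0.0
learning_rate = 0.01

for epoch in range(300):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Average the gradient over the mini-batch, then do one update.
        grad_w = sum(((w*x + b) - y) * x for x, y in batch) / len(batch)
        grad_b = sum((w*x + b) - y for x, y in batch) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))   # approaches 2.0 and 1.0
```

Averaging the gradient over 16 samples smooths out the noise of single-sample updates while still updating far more often than full-batch gradient descent.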

Finally, you can compare below how these three techniques approach the minimum. You can picture this chart as a descent from the top of a mountain. With gradient descent, updates are made less often: the number of steps is small but each step is large, and there are usually no deviations from the target, though we move more slowly. With stochastic gradient descent, we descend quickly, with small but frequent steps, but we sometimes deviate from the target. Mini-batch gradient descent can be thought of as the middle ground between the other two: we move with a medium step frequency, medium step size, and medium speed.

[Figure: comparison of the three gradient descent techniques]