What Are Loss Functions?

Loss functions measure how successful an artificial neural network model is on a data set. If the model's predictions are accurate, the value of the loss function is low; if they are not, it is high. During training we track the values of the loss function: if they are decreasing, we know we are on the right track, and if they rise consistently, we may suspect that something is going wrong and that we need to make some changes to our model.

Loss functions gather all aspects of the model under one roof and reduce them to a single number, and we reach more successful models through optimizations made on this value. For this reason, it is very important to choose the loss function that best represents our model in order to get the best performance from it. We choose the loss function according to the type of problem, which can be regression, binary classification or multi-class classification.

Regression Loss Functions

Mean Absolute Error

It measures the difference between predictions and actual values, regardless of direction (positive/negative). It is better suited to data sets that contain outliers, and it is easier to interpret than the other functions.

[Figure: Mean Absolute Error formula]
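
As a minimal sketch of the calculation (the actual and predicted values below are made up for illustration), the mean absolute error can be computed with NumPy as follows:

import numpy as np

# hypothetical actual and predicted values, for illustration only
gercek = np.array([3.0, -0.5, 2.0, 7.0])
tahmin = np.array([2.5, 0.0, 2.0, 8.0])

# average of the absolute differences, regardless of sign
mae = np.mean(np.abs(gercek - tahmin))
print("MAE:", mae)  # (0.5 + 0.5 + 0.0 + 1.0) / 4 = 0.5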

Mean Squared Error

It is calculated by squaring the difference between the predicted values and the actual values and then averaging them. The squaring step makes values with a large error contribute much more. For example, if a value whose actual value is 8.2 is predicted as 8 by the model, the error is 0.2 and contributes 0.04 units to the loss function after squaring. If the model's prediction is 2.2, the error is 6 and its contribution after squaring is 36 units. There is a 900-fold difference between the two samples.

[Figure: Mean Squared Error formula]
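
A small sketch of the same idea, using the numbers from the example above, shows how squaring amplifies large errors:

import numpy as np

gercek = 8.2          # actual value from the example above
iyi_tahmin = 8.0      # error 0.2
kotu_tahmin = 2.2     # error 6.0

print((gercek - iyi_tahmin) ** 2)   # ≈ 0.04
print((gercek - kotu_tahmin) ** 2)  # ≈ 36

# MSE over a (made-up) set of actual values and predictions
gercek_degerler = np.array([8.2, 3.0, 5.5])
tahminler = np.array([8.0, 2.5, 6.0])
mse = np.mean((gercek_degerler - tahminler) ** 2)
print("MSE:", mse)  # (0.2² + 0.5² + 0.5²) / 3 ≈ 0.18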

Root Mean Squared Error

As the name suggests, it is obtained by taking the square root of the mean squared error. It is the standard deviation of the prediction errors, and it indicates how closely the values in the data set are clustered around our predictions.

[Figure: Root Mean Squared Error formula]
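
A minimal sketch, again with made-up values, showing that RMSE is simply the square root of the mean squared error:

import numpy as np

gercek = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical actual values
tahmin = np.array([2.5, 0.0, 2.0, 8.0])    # hypothetical predictions

mse = np.mean((gercek - tahmin) ** 2)      # mean squared error
rmse = np.sqrt(mse)                        # its square root
print("RMSE:", rmse)  # sqrt(0.375) ≈ 0.61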

Multi-Class Classification Loss Functions

Categorical Cross Entropy Loss

It is used in multi-class classification; an example is classifying the fruit in a picture as an apple, a pear or a banana. It is usually used after the softmax activation function, which is why it is also called softmax loss. As a brief reminder, the softmax activation function returns a probability value for each class as output, and the sum of these probability values equals 1. The formula for the cross entropy function is as follows.

[Figure: Cross entropy formula, L = -Σ_j y_j * log(p_j)]

The reason we call it categorical cross entropy here is that we use it for multi-class classification; the cross entropy function comes in different variations for different classification types. In the formula, the y_j values are our actual values and the p_j values are our predicted values. We represent the actual values as one-hot encoded vectors when calculating the loss. For example, with the classes apple, pear and banana, we take the vector [1, 0, 0] to belong to the apple class. Our model outputs a 3-element vector through the softmax activation; if our model is well trained, we might get the vector [0.956, 0.003, 0.041] as output. According to the formula, the loss is affected only by the class the sample actually belongs to, because the other entries of the one-hot encoded vector are 0 and therefore have no effect. You can review the following calculation as an example. The base of the logarithm in the formula can be taken as e or as 2; this only scales the loss by a constant factor and does not change which predictions are penalized more.

[Figure: Categorical cross entropy example calculation]
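
As a quick numeric check of the example above (taking the base of the logarithm as e), the loss for the one-hot vector [1, 0, 0] and the prediction [0.956, 0.003, 0.041] reduces to -log(0.956):

from math import log

y = [1, 0, 0]                 # actual class (apple), one-hot encoded
p = [0.956, 0.003, 0.041]     # softmax output of a well-trained model

# only the term of the true class survives; the others are multiplied by 0
kayip = -sum(y[i] * log(p[i]) for i in range(len(y)))
print(kayip)  # -log(0.956) ≈ 0.045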

Then, after the loss values are calculated for all samples in the data set, the loss value of the model on the data set is obtained by taking their average. The value m in the formula represents the number of samples.

[Figure: Categorical cross entropy over the data set, L = (1/m) * Σ_i L_i]

You can also review the Python code below.

from math import log
import numpy as np

def cross_entropy(p, q):
    # p: one-hot encoded actual values, q: predicted probabilities
    return -sum([p[i]*log(q[i]) for i in range(len(q))])

# one-hot encoded actual classes (apple, pear, banana)
coklu_sinif = [[1,0,0],[0,1,0],[0,0,1],[1,0,0],[1,0,0],[0,0,1],[0,1,0]]
# softmax outputs of the model for each sample
coklu_sinif_tahmin = [[0.85,0.1,0.05],[0.03,0.8,0.17],[0.02,0.01,0.97],[0.75,0.21,0.04],[0.63,0.02,0.35],[0.02,0.1,0.88],[0.05,0.75,0.2]]

coklu_sinif_sonuclar = []

for i in range(len(coklu_sinif)):
    # we calculate the loss value for each sample
    coklu_yitim = cross_entropy(coklu_sinif[i], coklu_sinif_tahmin[i])
    coklu_sinif_sonuclar.append(coklu_yitim)

# the loss on the data set is the average of the per-sample losses
average = np.mean(coklu_sinif_sonuclar)

print("Loss value on data set : ", average)

Binary Classification Loss Functions

Binary Cross Entropy Loss

It is used in binary classification, for example classifying a tumor as benign or malignant. It is usually used after the sigmoid activation function.

[Figure: Binary cross entropy formula, L = -(y1 * log(p1) + (1 - y1) * log(1 - p1))]

In the formula, the p1 value is our model's prediction and the y1 value is the actual value. The reason we use cross entropy in this form is that we do not keep the values one-hot encoded in binary classification; a single value is enough to express the probabilities of both classes. For example, if our sigmoid activation function outputs 0.83, the probability of the other class is 1 - 0.83 = 0.17. Let's examine the mathematics behind the formula. If y1 (the actual value) is 1, the term on the right of the formula automatically becomes ineffective, because 1 - y1 is 0. What remains is -log(p1). You can see the graph of -log(p1) below.

[Figure: Graph of -log(p1)]

Since p1 can only take values between 0 and 1, we only need to look at the [0, 1] range of the chart. We can see from the chart that as p1 approaches 0, the function goes to infinity. Our actual value was 1, so as our prediction moves away from the actual value, we add very large values to the loss function. In a way, we penalize the model more and more as its prediction moves away from the actual value. The loss decreases as p1 moves towards 1 and becomes 0 at p1 = 1. We are not interested in the rest of the chart. Now let's examine the case where the actual value is 0. This time the y1*log(p1) term is 0, leaving -log(1 - p1). You can see the graph of -log(1 - p1) below.

[Figure: Graph of -log(1 - p1)]

Again, we focus only on the [0, 1] range of the chart. We read this chart as follows: as p1 approaches 1, the function goes to infinity. Our actual value was 0, so the same logic as before applies: as our prediction moves away from what it should be, the loss increases, and it becomes 0 when the prediction equals the actual value. To calculate the loss value of the model on the data set, the loss values must be averaged after they are calculated for each sample. You can review the Python code written for the binary cross entropy loss function below. Here you can compare the loss values of good and bad predictions on the same data.

from math import log
import numpy as np

def binary_cross_entropy(p, q):
    # p: actual value (0 or 1), q: predicted probability of class 1
    return -p*log(q) - (1 - p)*log(1 - q)

# binary classification: actual values
gercek_degerler = [1, 1, 0, 0, 1, 1, 0, 0]
# good prediction values
tahmin_degerleri = [0.9, 0.8, 0.3, 0.1, 0.95, 0.8, 0.23, 0.05]
# bad prediction values
tahmin_degerleri_2 = [0.4, 0.2, 0.91, 0.8, 0.3, 0.15, 0.6, 0.7]

sonuclar = []

for i in range(len(gercek_degerler)):
    # loss for each sample of the good predictions
    ikili_yitim = binary_cross_entropy(gercek_degerler[i], tahmin_degerleri[i])
    sonuclar.append(ikili_yitim)

ortalama_1 = np.mean(sonuclar)
print("Loss value on data set : ", ortalama_1)

sonuclar_2 = []

for i in range(len(gercek_degerler)):
    # loss for each sample of the bad predictions; predictions far from
    # the actual class push the loss up sharply
    ikili_yitim = binary_cross_entropy(gercek_degerler[i], tahmin_degerleri_2[i])
    sonuclar_2.append(ikili_yitim)

ortalama_2 = np.mean(sonuclar_2)
print("Loss value on data set : ", ortalama_2)

[Figure: Output of the code above]

In our next article, we will examine how to optimize loss functions.