Performance Metrics

In this article, we look at how to measure the success of machine learning models on data sets, and at the different performance metrics used to do so. First, why do these measurements matter? When working on a problem, we usually have several candidate models and want to choose the most successful one. To do this, we measure the performance of each model individually and continue with the one that scores best. We also want to compare a model's performance on the training set and on the test set. In this way, we can recognize the problems of our model and work on solutions to them. Regression and classification use different performance metrics.

Regression Performance Metrics

As we mentioned in our article What Are Loss Functions, loss functions are one measure of how successful our models are. In addition to loss functions, we will cover the R-squared and adjusted R-squared metrics.

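The R-squared metric measures how good our model's predictions are. Its maximum value is 1, and it has no lower bound, so it can be negative. The closer the R-squared value is to 1, the better the model fits the data. The formula is as follows.

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

Here $y_i$ are the actual values, $\hat{y}_i$ are the predictions, and $\bar{y}$ is the mean of the actual values.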

The sum of squared differences between the actual values and their mean (the total sum of squares in the formula) is the error of a baseline model, which serves as the worst acceptable model. What is this model? It takes the average of the dependent variable (output) in the data set, without training any model, and presents this average as the prediction for every sample. If the sum of squared errors of the model we trained is greater than this value, the R-squared value is negative: we have trained a model that is worse than simply predicting the mean. As the ratio of the two sums decreases, the R-squared value increases and approaches 1. You can compare the R-squared values of good and poor predictions in the following sample code.

from sklearn.metrics import r2_score

# actual values
gercek_degerler = [10,11.2,13,20,9,8.5,7.3]

# good predictions: illustrative values chosen close to the actual ones
iyi_tahminler = [10.5,11,12.7,19.5,9.3,8.2,7.5]
# poor predictions
kotu_tahminler = [15,9,17,15.3,5.5,6.3,11.5]

print("R^2 score of good predictions : ", r2_score(gercek_degerler,iyi_tahminler))
print("R^2 score of poor predictions : ", r2_score(gercek_degerler,kotu_tahminler))
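
The link between R-squared and the mean baseline can also be checked by hand. The following is a minimal sketch in plain Python, not the sklearn implementation. The function name r2_manual and the lists mean_baseline and worse are assumptions introduced only for illustration.

```python
def r2_manual(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot with plain Python."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared errors of the predictions being evaluated
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Sum of squared errors of the "always predict the mean" baseline
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


actual = [10, 11.2, 13, 20, 9, 8.5, 7.3]

# Baseline: predict the mean of the actual values for every sample
mean_baseline = [sum(actual) / len(actual)] * len(actual)

# The poor predictions used in the sklearn example
poor = [15, 9, 17, 15.3, 5.5, 6.3, 11.5]

# Hypothetical predictions that are worse than the mean baseline
worse = [20, 5, 3, 2, 18, 19, 25]

print("mean baseline :", r2_manual(actual, mean_baseline))  # approximately 0
print("poor          :", r2_manual(actual, poor))           # small but positive
print("worse         :", r2_manual(actual, worse))          # negative
```

The baseline scores approximately 0 by construction, so any model with a negative score does worse than predicting the mean.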


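The R-squared value has a drawback. Sometimes we add new independent variables (features) to improve our model, but the R-squared value on the training data never decreases when a variable is added, even if that variable does not help the model. Using the adjusted R-squared at this point will benefit us, because it penalizes the number of independent variables. The formula is as follows.

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

Here $n$ is the number of samples and $p$ is the number of independent variables.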
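
To see the penalty at work, here is a small sketch of the adjusted R-squared computation. The function name adjusted_r2 and the numbers (50 samples, 3 versus 10 independent variables, R-squared values of 0.80 and 0.81) are hypothetical and chosen only for illustration.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical scenario: adding 7 more variables raises R^2 only slightly
small_model = adjusted_r2(r2=0.80, n_samples=50, n_features=3)
large_model = adjusted_r2(r2=0.81, n_samples=50, n_features=10)

print("3 variables  :", round(small_model, 4))
print("10 variables :", round(large_model, 4))
```

In this scenario the plain R-squared rises from 0.80 to 0.81, while the adjusted value falls, which signals that the extra variables did not earn their place.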

Classification Performance Metrics

Confusion Matrix

The table you see below is a confusion matrix.

[Figure: Confusion Matrix]

Using the values in this matrix, we can calculate various metrics. But first, let's go over what the terms in the table mean. A true positive (TP) means that our model predicted positive and the prediction is correct, so the actual value is also positive. Terms such as true and positive may seem confusing at first, so let's reinforce them with an example. Suppose we classify a tumor as benign or malignant, taking the benign class as positive (1) and the malignant class as negative (0). Then a true positive means that the model predicted benign and the tumor is indeed benign. A false positive (FP) means that the model predicted benign but the actual value is malignant. A true negative (TN) means that the model predicted malignant and the actual value is malignant. A false negative (FN) means that the model predicted malignant but the actual value is benign. The following code example shows how to obtain the confusion matrix with the sklearn library.

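from sklearn.metrics import confusion_matrix

# 4 TP, 2 TN, 3 FP, 2 FN
gercek_degerler = [0,0,1,0,1,1,1,1,0,1,0]   # actual values
tahmin_degerleri = [1,0,1,1,0,1,1,0,1,1,0]  # predicted values

# Rows are actual classes, columns are predicted classes, in label order 0, 1:
# [[TN, FP], [FN, TP]]
print(confusion_matrix(gercek_degerler, tahmin_degerleri))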
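
The four terms described above can also be counted directly. The following is a minimal sketch in plain Python. The helper confusion_counts and the sample labels are assumptions for illustration, with 1 standing for the positive (benign) class and 0 for the negative (malignant) class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Hypothetical labels: 1 = benign (positive), 0 = malignant (negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```

With sklearn, the same four counts are commonly unpacked for a binary problem as tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel().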



Accuracy

Accuracy is the ratio of the number of correct predictions to the total number of predictions. The formula is as follows.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

For example, the accuracy for the confusion matrix shown above is (40 + 50) / 100 = 90%. Accuracy does not always give a reliable picture, however. Suppose we have an imbalanced data set, that is, one in which the samples of one class are heavily outnumbered by the samples of the other classes. Say the data set contains 95 positive samples and 5 negative samples. If our algorithm labels everything as positive without learning anything, its accuracy on this data set is 95%. This value looks very successful, but it is misleading: the algorithm has learned nothing, and on the samples of the negative class it fails completely.
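
The 95/5 scenario above is easy to reproduce. This is a small sketch in plain Python with made-up labels. The variable names, including negative_recall for the share of negative samples that are found, are assumptions for illustration.

```python
# Imbalanced data set: 95 positive samples (1) and 5 negative samples (0)
y_true = [1] * 95 + [0] * 5

# A "model" that has learned nothing and always answers positive
y_pred = [1] * len(y_true)

# Accuracy: share of predictions that match the actual label
accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)

# Share of the actual negative samples that were predicted as negative
negatives_found = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 0)
negative_recall = negatives_found / y_true.count(0)

print("accuracy        :", accuracy)
print("negative recall :", negative_recall)
```

The accuracy is high even though not a single negative sample is identified, which is why accuracy alone is not enough on imbalanced data.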


Precision

Precision is the proportion of the samples we predicted as positive (TP + FP) that are actually positive. The formula is as follows.

$$\text{Precision} = \frac{TP}{TP + FP}$$


Recall (Sensitivity)

Recall, also called sensitivity, is the proportion of the samples that should have been predicted as positive (TP + FN) that we correctly predicted as positive. The formula is as follows.

$$\text{Recall} = \frac{TP}{TP + FN}$$

The following code example shows how to calculate accuracy, precision, and recall (sensitivity).

from sklearn.metrics import recall_score, precision_score, accuracy_score

# 4 TP, 2 TN, 3 FP, 2 FN
gercek_degerler = [0,0,1,0,1,1,1,1,0,1,0]   # actual values
tahmin_degerleri = [1,0,1,1,0,1,1,0,1,1,0]  # predicted values

print("Accuracy : ", accuracy_score(gercek_degerler,tahmin_degerleri))
print("Recall (sensitivity) : ", recall_score(gercek_degerler,tahmin_degerleri))
print("Precision : ", precision_score(gercek_degerler,tahmin_degerleri))

As an example of an imbalanced data set, consider the following confusion matrix.

Unbalanced Data

                    Predicted positive    Predicted negative
Actual positive     80 (TP)               80 (FN)
Actual negative     20 (FP)               820 (TN)

According to this data set, there are 160 positive samples and 840 negative samples. First, let's calculate the accuracy: (820 + 80) / 1000 = 90%. This looks very successful, but the sensitivity (recall) is 80 / (80 + 80) = 50%, which is an unsatisfactory result for us. The precision is 80 / (80 + 20) = 80%. There is a critical point here: when building a confusion matrix for an imbalanced data set, we should choose the minority class as the positive class. If we swap the classes in the example above, we get the following matrix.

Reversed Matrix

                    Predicted positive    Predicted negative
Actual positive     820 (TP)              20 (FN)
Actual negative     80 (FP)               80 (TN)

The accuracy remains unchanged at 90%. The precision, however, rises to 820 / (820 + 80) = 91.1% and the sensitivity (recall) rises to 820 / (820 + 20) = 97.6%, which gives a misleadingly good impression of the model.
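
The arithmetic for both matrices can be reproduced with a few lines of Python. This is a sketch only. The helper name metrics and its argument order are assumptions, and the counts are the ones used in the example above.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Minority class (160 samples) chosen as the positive class
minority_positive = metrics(tp=80, fp=20, fn=80, tn=820)

# Classes swapped: the majority class (840 samples) is now the positive class
majority_positive = metrics(tp=820, fp=80, fn=20, tn=80)

for name, (acc, prec, rec) in [("minority as positive", minority_positive),
                               ("majority as positive", majority_positive)]:
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

The accuracy is identical in both cases, while precision and recall change sharply depending on which class is treated as positive.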


F1 Score

The F1 score combines precision and recall (sensitivity) into a single number. We use it to choose the best of several models when we have both a precision value and a recall value for each of them. The highest possible F1 score is 1 and the lowest is 0. The formula is as follows.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
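
As an illustration of choosing between models with the F1 score, here is a minimal sketch. The three models and their precision and recall values are hypothetical, and the helper f1_from is an assumption rather than a library function (sklearn offers f1_score, which works on label lists).

```python
def f1_from(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both values are 0
    return 2 * precision * recall / (precision + recall)


# Hypothetical models as (precision, recall) pairs
models = {
    "model_a": (0.80, 0.50),
    "model_b": (0.70, 0.68),
    "model_c": (0.95, 0.30),
}

scores = {name: f1_from(p, r) for name, (p, r) in models.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 3))
print("best model by F1:", best)
```

Because the score is only high when precision and recall are both high, a balanced model is preferred here over the one with the highest precision.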

In addition to these classification metrics, there is the ROC curve and the area under it (AUC). We will examine them in more detail in upcoming articles.