# House Price Prediction with Artificial Neural Networks

In today's article, we will work on a regression problem, in contrast to the classification task of our previous iris project. If you have not read that article yet, I suggest you do so before continuing. The data set we will use in this project contains information about houses in Boston. You can access it at https://www.kaggle.com/vikrishnan/boston-house-prices.

To tell you a little about our data set: it includes 13 features of Boston homes, such as the nitric oxide concentration in the air, the crime rate, the student-teacher ratio in surrounding neighborhoods, property tax values, and the average number of rooms per residence. Alongside these features, each house has a price. The beauty of this data set for us is that it contains no categorical data and consists entirely of numerical data, which makes the data preprocessing part very easy. We will carry out our project under four headings: data preprocessing, building the structure of the model, training, and performance measurement.
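For readability, the 14 columns can be given names. The names below follow the conventional column order of the Boston housing file; this order is an assumption about the Kaggle file, so verify it against the dataset description. A minimal sketch using a stand-in row (with the real file, you would pass `names=columns` to `pd.read_csv`):

```python
import numpy as np
import pandas as pd

# Conventional column order of the Boston housing data (an assumption
# about the Kaggle file; check the dataset page to confirm).
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
           "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# A tiny stand-in frame just to illustrate the naming; the real call
# would be pd.read_csv("housing.csv", sep=r"\s+", header=None, names=columns).
df = pd.DataFrame(np.zeros((1, len(columns))), columns=columns)
print(df.columns[-1])  # MEDV, the target: median home value
```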

## Data Preprocessing

We start by placing the "housing.csv" file from the archive we downloaded from Kaggle into our working folder. First we import the necessary libraries, then we read our data. Because each feature in this file is separated by whitespace, we read it accordingly, and then separate our independent variables (X) from our dependent variable (Y).

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# Read the whitespace-separated file; it has no header row.
dataset = pd.read_csv('housing.csv', sep=r'\s+', header=None)

# We separate the independent variables and the dependent variable.
X = dataset.iloc[:, 0:13]
Y = dataset.iloc[:, 13]
```

You can check out our data set in the "variable explorer" section. There are 506 examples in our data set.

We then scale our data to make the learning process easier; we explained why in more detail in our previous article on the Iris data set. We complete the data preprocessing section by reserving 80% of our data for training and the remaining 20% for testing.

```python
# Scaling process
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)

# We split the data into training and test sets.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
```

## Creating an Artificial Neural Network Model

Since we have 13 independent variables, we will have 13 neurons in our input layer. The input layer will be connected to the 1st hidden layer with 8 neurons, the 1st hidden layer to the 2nd hidden layer with 8 neurons, and the 2nd hidden layer to the 3rd hidden layer with 4 neurons. We will use the "ReLU" activation function in each hidden layer. Finally, the 3rd hidden layer will be connected to an output layer with a single neuron. We will not use an activation function in the output layer because we are dealing with a regression problem. Since our model contains 3 or more hidden layers, we can classify it as a deep learning model. You can examine the structure of the model that we will create in the image below.

First, we determine the number of independent variables and the number of neurons in the output layer. Then we create the structure of our model and print its summary to the console with the "summary" function.

```python
# Number of neurons in the input layer
input_num = x_train.shape[1]
# Number of neurons in the output layer
output_num = 1

# Create the structure of the model
model = Sequential()
model.add(Dense(8, input_dim=input_num, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(output_num))

model.summary()
```

There are a total of 225 parameters in our model. Of these, 204 are the weight values of the neurons and the remaining 21 are the bias (threshold) values.
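The count can be checked by hand. Each Dense layer has `inputs × units` weights plus one bias per unit; a quick sketch of the arithmetic for the 13→8→8→4→1 architecture:

```python
# (inputs, units) for each Dense layer in the 13 -> 8 -> 8 -> 4 -> 1 model
layers = [(13, 8), (8, 8), (8, 4), (4, 1)]

weights = sum(n_in * n_out for n_in, n_out in layers)  # weight values
biases = sum(n_out for _, n_out in layers)             # bias (threshold) values

print(weights)            # 204
print(biases)             # 21
print(weights + biases)   # 225 parameters in total
```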

## Model Training

We choose "Adam" as the optimization technique for training our model. This technique is a more advanced version of gradient descent; we may cover Adam in more detail in a future article. As the loss function, we choose the mean squared error (MSE), which is suitable for regression problems. We train our model for 500 iterations and pass validation_data during training so that we can compare the performance of our model on the training and test sets as it learns, and see whether we run into situations such as overfitting or underfitting.

```python
model.compile(optimizer='adam', loss='mse')
history = model.fit(x_train, y_train, batch_size=16, validation_data=(x_test, y_test), epochs=500)
```

Let's chart the loss values of our model in training and test sets.

```python
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epochs')

plt.legend(['train', 'test'], loc='upper right')
plt.show()
```

The graph that we will get when we run the above code will be as follows.


As you can see, after about the 150th iteration the performance of our model on the test data starts to deteriorate, meaning the test loss values begin to rise. This is a sign of overfitting. Although the training loss values continue to fall, we should have stopped training at the 150th iteration to prevent overfitting. Since we cannot stop a finished run retroactively, it is enough to retrain the model from scratch for 150 iterations: change the epochs parameter in the "fit" function from 500 to 150 and re-run the code starting from Sequential(). We have to start from Sequential() because the weight values of our model remain in their final state and need to be reset; if we only change the fit call and run the code from there, our model will be trained for 150 more iterations on top of the 500, for a total of 650 iterations. After training our model for 150 iterations from the beginning, the loss values on the training and test sets will be as follows.

As you can see, the test loss values no longer increase, and we prevent overfitting.
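Instead of re-running training with a hand-picked epoch count, Keras can stop training automatically when the validation loss stops improving, via the `EarlyStopping` callback. A minimal sketch with synthetic stand-in data (the real run would use the housing splits and model defined above):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# Synthetic stand-in data with the same shapes as the housing split.
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(404, 13)), rng.normal(size=404)
x_test, y_test = rng.normal(size=(102, 13)), rng.normal(size=102)

model = Sequential()
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

# Stop once validation loss has not improved for 20 epochs, and roll the
# weights back to the best epoch seen, instead of fixing epochs by hand.
early_stop = EarlyStopping(monitor='val_loss', patience=20,
                           restore_best_weights=True)

history = model.fit(x_train, y_train, batch_size=16,
                    validation_data=(x_test, y_test),
                    epochs=500, callbacks=[early_stop], verbose=0)
print(len(history.history['loss']))  # epochs actually run, at most 500
```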

## Performance Measurement

In this section, we will examine the performance of our model on the training and test sets. Among the performance metrics, we will use the R2 score, which is suitable for regression problems. With our model's "predict" function, we obtain the predicted value for each sample and calculate R2 scores for training and testing.

```python
train_preds = model.predict(x_train)
print("Training R2 Score : ", r2_score(y_train, train_preds))

test_preds = model.predict(x_test)
print("Test R2 Score : ", r2_score(y_test, test_preds))
```

When we run the above code, the values we get will be as follows.

Looking at the R2 scores, we can see that our model performs well. At the same time, the closeness of the scores on the training and test data is further evidence that we have prevented overfitting.

You can find all the code we have written in one piece below. I suggest you create different model structures and try to achieve better performance.

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

dataset = pd.read_csv('housing.csv', sep=r'\s+', header=None)

X = dataset.iloc[:, 0:13]
Y = dataset.iloc[:, 13]

scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

input_num = x_train.shape[1]
output_num = 1

model = Sequential()
model.add(Dense(8, input_dim=input_num, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(output_num))

model.summary()

model.compile(optimizer='adam', loss='mse')
history = model.fit(x_train, y_train, batch_size=16, validation_data=(x_test, y_test), epochs=150)

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epochs')

plt.legend(['train', 'test'], loc='upper right')
plt.show()

train_preds = model.predict(x_train)
print("Training R2 Score : ", r2_score(y_train, train_preds))

test_preds = model.predict(x_test)
print("Test R2 Score : ", r2_score(y_test, test_preds))
```