In recent times, Deep Learning has had a significant impact on the fields of computer vision, natural language processing, and speech recognition. The large amounts of data being generated every day can be used to train Deep Neural Networks, which is one reason they are often preferred over traditional machine learning. But simply picking data and feeding it into a neural network won’t produce a state-of-the-art model. That’s where hyperparameters come into the picture.
Even though Deep Learning has made such an impact, choosing the optimal hyperparameters for your neural network is still something of a black box for us.
You need to understand that Applied Deep Learning is a highly iterative process. While training the model, there are various hyperparameters you need to keep in mind. Finding good values for them is largely a matter of trial and error, and it can take years of experience to develop a feel for what works for a given model.
In this guide, we are going to describe powerful and effective ways of choosing the optimal hyperparameters for your state-of-the-art model.
I am going to divide this guide into two parts. In this part, I will talk about the learning rate and the batch size.
After reading this post, you will also understand important machine learning concepts such as overfitting, underfitting, and the bias-variance tradeoff and, most importantly, how to choose these hyperparameters.
Consider a scenario where we have trained a classification model on 10,000 images with their labels. When we test the model on that same dataset, we get a mind-boggling accuracy of 99%.
But the catch is that when we try the same model on a new, unseen dataset, we only get an accuracy of 50%: the model fails to perform well.
This is a perfect example of overfitting as our model wasn’t able to generalize well on the unseen dataset.
Our goal while training a network is to have it generalize well beyond the training data, which means capturing the true signal in the data rather than memorizing the noise. In statistics, this is termed “Goodness of Fit”, which refers to how well our predicted values match the true (ground-truth) values.
Bias-Variance Tradeoff is one of the important aspects of Applied Machine Learning. It has simple and powerful implications around the model complexity and its performance.
We say a model has high bias when the algorithm is not flexible enough to capture the underlying pattern in the data. Linear parametric algorithms such as Regression and Naive Bayes tend to have high bias.
A model has high variance when the algorithm is highly flexible and therefore sensitive to the particular training data it sees. Nonlinear algorithms such as Decision Trees and Neural Networks tend to have high variance.
There are various ways to find the balance of bias and variance for each algorithm family by using methods such as regularization, pruning, etc.
But in this post, I’m only going to talk about finding that balance for Neural Networks.
Overfitting vs Underfitting
Do you know the one mistake that could single-handedly ruin the performance of your model?
It’s one of the trickiest problems one faces in applied machine learning.
It is overfitting. When your machine learning model is used in the “real” world, it will come across data that wasn’t in the training data. Therefore, it is necessary for the model to generalize well to unseen data.
Overfitting occurs when the model memorizes the noise in the training data instead of learning the true signal. Complex algorithms such as Neural Networks are usually prone to overfitting.
When models are not able to capture the true signal from the data, they tend to underfit. Underfitted models usually have poor accuracy on the training data as well as on the test data because they haven’t learned any pattern in the training data.
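The difference between the two failure modes can be sketched as a small check on training and test accuracy. The function name `diagnose_fit` and the thresholds `min_acc` and `max_gap` are assumptions made up for this illustration; sensible values depend on the task.

```python
def diagnose_fit(train_acc, test_acc, min_acc=0.80, max_gap=0.10):
    """Roughly label a model from its training and test accuracy.

    min_acc and max_gap are hypothetical thresholds chosen for illustration.
    """
    gap = train_acc - test_acc
    if train_acc < min_acc:
        return "underfitting"    # poor even on the data it was trained on
    if gap > max_gap:
        return "overfitting"     # good on training data, poor on unseen data
    return "reasonable fit"


print(diagnose_fit(0.99, 0.50))  # the 99% vs 50% scenario from earlier
print(diagnose_fit(0.55, 0.52))  # bad everywhere
print(diagnose_fit(0.92, 0.90))  # small gap, decent accuracy
```

A large gap between training and test accuracy points to overfitting, while low accuracy on both points to underfitting.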
Gradient Descent Algorithms
Gradient Descent is one of the most popular and widely used algorithms for optimizing the weights of a model. It minimises the cost function by updating the parameters (the weights) in the direction opposite to the gradient of the cost function. It has a hyperparameter, the learning rate, which controls how much the weights are adjusted with respect to the loss gradient. In other words, the learning rate is the size of the steps gradient descent takes towards a local optimum, and finding the optimal weights is our goal. We will talk about choosing the right learning rate for our network later, as it plays an important role.
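As a minimal sketch of the update rule, here is gradient descent on a toy one-parameter cost function. The cost `(w - 3) ** 2`, the starting point, and the learning rate of 0.1 are all assumptions picked for illustration, not values from any real network.

```python
def cost(w):
    # Toy cost function with its minimum at w = 3.
    return (w - 3.0) ** 2


def gradient(w):
    # Derivative of the toy cost with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        # Move the weight in the direction opposite to the gradient.
        w = w - learning_rate * gradient(w)
    return w


w_final = gradient_descent(w=0.0)
print(w_final, cost(w_final))
```

Each step moves `w` a fraction of the way towards the minimum; the learning rate decides how big that fraction is.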
There are three variants of Gradient Descent: batch, stochastic, and mini-batch. They differ in how much data they use to compute the gradient of the cost function. Depending on the variant we choose, there is a tradeoff between the accuracy of each weight update and the time it takes to perform it.
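One way to see the three variants (commonly called batch, stochastic, and mini-batch gradient descent) is as the same training loop with a different batch size. The sketch below fits a toy model `y = w * x` in plain Python; the data, the `train` function, and its default values are assumptions for illustration only.

```python
import random


def train(xs, ys, batch_size, epochs=5, learning_rate=0.1, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w, updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w * x - y) ** 2 over the current batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
            updates += 1
    return w, updates


n = 20
xs = [(i + 1) / n for i in range(n)]
ys = [2.0 * x for x in xs]  # the true weight is 2

for name, batch_size in [("batch", n), ("stochastic", 1), ("mini-batch", 5)]:
    w, updates = train(xs, ys, batch_size)
    print(f"{name:>10}: batch_size={batch_size:2d} updates={updates:3d} w={w:.3f}")
```

With the full dataset per update there is one update per epoch, each using an exact gradient; with a single sample there are as many updates as samples, each using a noisy gradient. Mini-batches sit in between.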
Our main goal while training a neural network is to find the optimal weights to get accurate results. A distinctive advantage of Deep Learning over traditional machine learning is that it needs far less manual feature engineering.
Weights are the parameters we can control to optimize the performance of the model. To find the optimal weights, we need to focus on the network architecture, the activation functions, and the hyperparameters.
The learning rate lets us tune how much the weights of our network change at each step of gradient descent.
Training will progress very slowly if you keep the learning rate low, because each update makes only a tiny adjustment to the weights, and it could also lead to overfitting. With a high learning rate, on the other hand, the updates overshoot the local optimum and the loss keeps bouncing around instead of settling. That said, a reasonably large learning rate can have a regularizing effect that helps the model capture the true signal.
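The effect of the learning rate is easy to see on a toy problem. The sketch below runs gradient descent on `f(w) = w ** 2` with three learning rates; the function and the specific rates are assumptions chosen so that one run is too slow, one converges, and one diverges.

```python
def run(learning_rate, steps=50, w=1.0):
    """Gradient descent on f(w) = w ** 2, whose gradient is 2 * w."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w


for lr in (0.001, 0.4, 1.1):
    print(f"learning_rate={lr}: w after 50 steps = {run(lr):.3e}")
```

With 0.001 the weight has barely moved from its starting value of 1 after 50 steps, with 0.4 it is essentially at the minimum of 0, and with 1.1 every step overshoots further than the last, so the weight grows instead of shrinking.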
We could run a grid search or a random search to find a learning rate that converges, but this is computationally heavy and time-consuming, and there are easier ways.
Learning Rate Schedule
A learning rate schedule, as you might have guessed, changes the learning rate during training according to a predefined schedule. Common types of learning rate schedules are time-based decay, exponential decay, and step decay.
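The three schedules can be sketched as plain functions of the epoch number. These are common textbook formulations rather than the exact behavior of any particular framework, and the parameter names (`decay`, `k`, `drop`, `epochs_per_drop`) are assumptions for this example.

```python
import math


def time_based_decay(initial_lr, decay, epoch):
    # The learning rate shrinks in proportion to 1 / (1 + decay * epoch).
    return initial_lr / (1.0 + decay * epoch)


def exponential_decay(initial_lr, k, epoch):
    # The learning rate shrinks by a constant factor exp(-k) every epoch.
    return initial_lr * math.exp(-k * epoch)


def step_decay(initial_lr, drop, epochs_per_drop, epoch):
    # The learning rate is multiplied by `drop` once every `epochs_per_drop` epochs.
    return initial_lr * drop ** (epoch // epochs_per_drop)


for epoch in (0, 10, 20, 30):
    print(
        epoch,
        round(time_based_decay(0.1, 0.1, epoch), 5),
        round(exponential_decay(0.1, 0.1, epoch), 5),
        round(step_decay(0.1, 0.5, 10, epoch), 5),
    )
```

All three start at the same initial learning rate and reduce it as training progresses; they differ only in the shape of the decay.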
Constant Learning Rate
We can train the network with the default learning rate and use the result as a baseline for comparing performance and for getting a rough idea of how the network behaves. We can leave the momentum and the time decay at their default values, which are zero.
Adaptive Learning Rate Method
The drawback of learning rate schedules is that we have to define them before training starts, and we can’t choose the right one by intuition alone, since it depends entirely on the model we have and the problem domain we are working on.
Adaptive optimizers such as Adagrad, RMSprop, and Adam adjust the learning rate during training, and many deep learning frameworks make them easy to use. We recommend keeping the hyperparameters of these optimizers at their defaults.
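To give a feel for what an adaptive optimizer does, here is a minimal plain-Python sketch of the Adam update rule on a toy cost function. Adam is used here only as one well-known example of an adaptive method. The values of `beta1`, `beta2`, and `epsilon` are the defaults suggested in the Adam paper, while the toy cost `(w - 3) ** 2` and the learning rate of 0.1 are assumptions for this illustration; in practice you would use your framework’s built-in optimizer.

```python
import math


def adam_minimize(gradient, w, steps=500, learning_rate=0.1,
                  beta1=0.9, beta2=0.999, epsilon=1e-8):
    m, v = 0.0, 0.0  # running averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Correct the bias caused by initialising m and v at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # The step is scaled per parameter by the recent gradient magnitude.
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

The effective step size is adjusted automatically from the gradient history, which is why such optimizers tend to need less manual tuning of the learning rate.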
Learning rate Range test
A learning rate range test gives us valuable information from a single run. We start training with a small learning rate, at which the cost function converges steadily, and gradually increase it. At some point we cross a threshold beyond which the validation loss starts increasing and the accuracy of the model drops. That threshold gives us an idea of the optimal learning rate for our model and data.
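A range test can be sketched on a toy problem as follows. The function `lr_range_test`, the exponential growth of the learning rate, and the `stop_factor` stopping rule are assumptions for this illustration, and the toy loss `w ** 2` stands in for the validation loss of a real model.

```python
def lr_range_test(loss_fn, grad_fn, w, lr_min=1e-4, lr_max=10.0,
                  steps=100, stop_factor=4.0):
    """Increase the learning rate every step and record the resulting loss."""
    growth = (lr_max / lr_min) ** (1.0 / (steps - 1))
    lr, best_loss, history = lr_min, float("inf"), []
    for _ in range(steps):
        w = w - lr * grad_fn(w)
        loss = loss_fn(w)
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > stop_factor * best_loss:
            break  # the loss has clearly started to blow up
        lr *= growth
    return history


history = lr_range_test(lambda w: w * w, lambda w: 2.0 * w, w=1.0)
best_lr, _ = min(history, key=lambda pair: pair[1])
last_lr, _ = history[-1]
print(f"loss was lowest at lr={best_lr:.3f}, test stopped at lr={last_lr:.3f}")
```

On this noise-free toy problem the loss keeps falling right up to the divergence threshold. On real data it is safer to pick a learning rate somewhat below the point where the loss starts to rise.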
The amount of regularization we do should be balanced depending on the data and the model we are implementing.
Batch size refers to the number of training samples that are propagated through the neural network at a time.
Consider a scenario where we have a training dataset of 1000 samples. If we set a batch size of 100, the network will train on 100 samples at a time, so it takes 10 such batches to go through the whole dataset once.
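The scenario with 1000 samples and a batch size of 100 can be sketched in a few lines. The helper `iterate_batches` is a hypothetical name for this example; deep learning frameworks ship their own data loaders for this.

```python
def iterate_batches(samples, batch_size):
    # Yield consecutive slices of at most batch_size samples.
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


dataset = list(range(1000))  # stand-in for 1000 training samples
batches = list(iterate_batches(dataset, batch_size=100))
print(len(batches), len(batches[0]))  # prints: 10 100
```

One pass over all 10 batches is one epoch, and the weights are updated once per batch.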
While practicing deep learning, our goal is to obtain the maximum performance while minimizing the computational time required. The learning rate ideally doesn’t affect the time each training step takes, but the batch size can have a significant impact on training time.
Some papers recommend using the largest batch size that your system’s memory can support. Many researchers suggest adjusting the batch size rather than changing the learning rate.
Larger batch sizes tend to have lower losses early in training, whereas the final loss values tend to be lower when the batch size is reduced.
Hyperparameters play a significant role as they can directly control the behavior of the training algorithm. Choosing suitable hyperparameters plays a crucial role in the success of our neural network architecture. If we choose them effectively, it will eventually lead to the better performance of our machine learning model. Choose wisely!
You can also check out our post: Loss Functions and Optimization