You might have wondered about the terms Batch Size, Epoch and Iteration in your code, as the three of them look pretty similar to each other. Understanding them properly requires some background on the optimisation algorithm that uses them.
After reading this, you will get:
- A better idea of gradient descent.
- A clear understanding of how the three terms (Batch Size, Epoch and Iteration) differ from each other.
But before diving into the details of those three terms, we first need to understand what Gradient Descent is and what role each of them plays in it.
What is Gradient Descent?
Optimization plays a vital role in machine learning. Almost every machine learning algorithm implements an optimization algorithm at its core. Gradient descent is an iterative optimization algorithm used to find the values of parameters, i.e., coefficients of a function, that minimize a cost function. It repeats its update step many times to approach the optimal result, which is what makes it iterative in nature. It is best used when the parameters cannot be calculated analytically, for instance using linear algebra.
The gradient of a function defines its slope: it measures how much the error changes with respect to a change in the weights. The higher the gradient, the steeper the slope and the faster a model can learn. When the gradient becomes zero, the model stops learning. In mathematical terms, the gradient is the vector of partial derivatives of the function with respect to its inputs.
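The idea above can be sketched in a few lines of Python. This is a minimal, illustrative example (not from the original post): we minimize the toy function f(w) = (w - 3)², whose derivative is 2(w - 3) and whose minimum is at w = 3. The starting point and learning rate are arbitrary choices.

```python
# Minimal gradient descent sketch on f(w) = (w - 3)**2, minimum at w = 3.
# All names and values here are illustrative.

def grad(w):
    # derivative of (w - 3)**2 with respect to w
    return 2 * (w - 3)

w = 0.0              # initial guess for the parameter
learning_rate = 0.1  # step size (more on this below)

for step in range(100):
    w = w - learning_rate * grad(w)  # move against the gradient

print(round(w, 4))  # → 3.0
```

Each pass through the loop moves `w` a little further downhill, which is exactly the repeated, iterative updating described above.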
Learning Rate in Gradient Descent
Gradient descent has a hyperparameter called the learning rate, which scales the size of each update step. With a fixed learning rate, the steps are bigger at first because the gradient is large, and they shrink as the point moves down towards the minimum and the gradient, along with the cost, decreases. For gradient descent to reach the local minimum, the learning rate should be set to an appropriate value, neither too low nor too high.
This is very important because if the steps it takes are too big, gradient descent may never reach the local minimum: it bounces back and forth across the valley of the convex cost function (refer to the left image below). If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but it may take a long time (refer to the right image).
So, the learning rate should be neither too high nor too low.
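As an illustration of this trade-off (my own toy example, not from the original post), here is what happens on f(w) = w², whose minimum is at w = 0, with three different learning rates. The specific values are chosen only to make the contrast visible.

```python
# Compare step sizes on f(w) = w**2 (minimum at 0); gradient is 2*w.
# Learning rates below are illustrative, not recommendations.

def run_gd(learning_rate, steps=20):
    w = 1.0
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

print(abs(run_gd(1.1)))    # too high: |w| grows, the iterates diverge
print(abs(run_gd(0.001)))  # too low: still far from 0 after 20 steps
print(abs(run_gd(0.1)))    # appropriate: lands close to the minimum
```

With the too-high rate each step overshoots the minimum by more than it corrects, so the iterates bounce to ever-larger values; with the too-low rate progress is safe but very slow.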
Why do we use batch size, epoch and iteration?
The terms batch size, epoch and iteration come into the picture when the dataset is too big to be passed through the network in a single pass. To overcome this problem, you divide the data into smaller parts, feed them to the computer one by one, and update the weights of the neural network at the end of each step so that it fits the given data.
So let’s understand these terms in detail.
1. Epoch
One Epoch is basically when the ENTIRE dataset is passed once through the forward and backward passes of the neural network. If the dataset is too big to feed to the computer at once, you will need to divide it into several smaller batches.
Why do we use more than one Epoch?
Remember that we are using Gradient Descent to optimise the learning, which is an iterative process, and that we are working with a limited dataset. So updating the weights with a single pass, i.e. one epoch, is not enough.
As we increase the number of epochs, the weights of the neural network are updated more times, and the fitted curve moves from underfitting to optimal to overfitting. The right number of epochs is the one that results in the highest accuracy of our model.
The number of epochs is usually large, commonly hundreds or thousands, allowing the learning algorithm to run until the error from the model has been sufficiently minimized. You may see examples of the number of epochs set to 10, 100, 500, 1000, and larger.
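The relationship between epochs and weight updates can be sketched as a plain training loop. The toy dataset and counter below are purely illustrative stand-ins for real data and a real update step.

```python
# Sketch: the epoch count controls how many times the full dataset is seen.
# The dataset and "update" are illustrative placeholders.

data = list(range(10))   # a toy "dataset" of 10 examples
num_epochs = 3

updates = 0
for epoch in range(num_epochs):
    for example in data:  # one full pass over the data = one epoch
        updates += 1      # a weight update would happen here

print(updates)  # → 30 (3 epochs * 10 examples)
```

More epochs means more passes over the same data, and therefore more weight updates, which is why the curve can move from underfitting towards overfitting as epochs grow.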
2. Batch Size
The total number of training examples present in a single batch is known as the batch size. Since the entire dataset cannot be fed into the neural net at once, it is divided into a number of batches or parts.
When all the training samples are used to create a single batch, then this learning algorithm is known as batch gradient descent. When the batch has only one sample in it, the learning algorithm is known as stochastic gradient descent. When the batch size is more than one sample but less than the size of the entire training dataset, the learning algorithm is known as mini-batch gradient descent.
- Batch Gradient Descent. Batch size = Size of training set
- Stochastic Gradient Descent. Batch size = 1
- Mini-Batch Gradient Descent. 1 < Batch size < Size of training set
In mini-batch gradient descent, popular batch sizes that are used include 32, 64, and 128 samples.
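The splitting itself is simple to sketch. In this illustrative example, a toy dataset of 100 examples is cut into mini-batches of 32; setting `batch_size` to the dataset size would correspond to batch gradient descent, and setting it to 1 to stochastic gradient descent.

```python
# Sketch of splitting a dataset into mini-batches.
# dataset and batch_size are illustrative choices.

dataset = list(range(100))  # 100 toy training examples
batch_size = 32

batches = [dataset[i:i + batch_size]
           for i in range(0, len(dataset), batch_size)]

print(len(batches))           # → 4 batches
print([len(b) for b in batches])  # → [32, 32, 32, 4]
```

Note that when the batch size does not divide the dataset evenly, the last batch is smaller, as with the final batch of 4 here.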
3. Iteration
An iteration is one pass of a single batch through the network, so the number of iterations needed to complete one epoch is equal to the number of batches. In other words, each iteration is a pass that uses a number of examples equal to the batch size.
Understand better using an Example
Consider an example where we have 2000 training examples. If we divide the 2000 examples into batches of 500, it will take 4 iterations to complete 1 epoch.
- 1 epoch = one forward pass and one backward pass of all the training examples in the dataset.
- Batch size = the number of training examples in one forward or backward pass.
- Number of iterations = number of passes, each pass using a number of examples equal to the batch size.
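The worked example above comes down to one line of arithmetic: iterations per epoch is the dataset size divided by the batch size (assuming, as in the example, that it divides evenly).

```python
# The 2000-example / 500-batch example from the text as arithmetic.
num_examples = 2000
batch_size = 500

iterations_per_epoch = num_examples // batch_size
print(iterations_per_epoch)  # → 4 iterations to complete 1 epoch
```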
You need to specify the batch size and the number of epochs for a learning algorithm; the number of iterations follows from them. There are no magic rules for configuring these hyperparameters: you need to try different values and see what works best for your problem. They play a vital role in the performance of your learning model, so choose them wisely.
Hopefully you now have a clear understanding of these three terms. You can also check out our post on: Loss Function and Optimization Function