Loss Functions and Optimization Algorithms

Choosing optimization algorithms and loss functions for a model, especially a deep learning model, plays a major role in producing accurate results quickly. In this blog we are going to discuss various loss functions and optimization algorithms.

If you are new to deep learning, I strongly recommend going through the basics of the perceptron, how weights and biases affect the output, and the various activation functions.

We will start with some basic terms

Error is a measure of how well an algorithm can predict values for previously unseen data. It is calculated as the difference between the actual output and the output predicted by the model.

The function we intend to minimize or maximize is known as the objective function. We may also call it the cost function, loss function, or error function when our focus is to minimize it.


The loss function J is the function used to compute the error. Different loss functions will result in different errors for the same prediction, and thus have a considerable effect on the performance of the model. It is a function of the model's internal parameters, the weights and biases. For a model to be accurate we need to minimize the error. This can be done in many ways, for example through backpropagation in neural networks. The weights are modified using what are called optimization functions, for example gradient descent. To know more about gradients I recommend you read my previous post (Understanding RNN).

Loss function: defined for a single training example.
Cost function: the average loss over the entire training dataset.

Types of Loss Functions –

  1. Regression Loss Functions
    1. Squared Error Loss
    2. Absolute Error Loss
    3. Huber Loss
  2. Binary Classification Loss Functions
    1. Binary Cross-Entropy
    2. Hinge Loss
  3. Multi-class Classification Loss Functions
    1. Multi-class Cross Entropy Loss
    2. Kullback Leibler Divergence Loss

Regression Loss Functions –

These are used in regression problems, where we have to find a best-fit line that gives accurate predictions. Here we use gradient descent as an optimization strategy to find the best-fit line.

Steps to be followed in regression loss functions –

  1. f(X) is the predictive function; we need to find its parameters
  2. Compute the loss for each training sample
  3. Average the loss over all samples to get the cost
  4. Find the gradient of the cost function
  5. Fix a learning rate and update the weights
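The steps above can be sketched for a simple linear model f(x) = w·x + b trained with squared error loss. The toy data and learning rate below are illustrative assumptions:

```python
# Gradient descent for f(x) = w*x + b with mean squared error.
# Toy data (an assumption for illustration), generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0
lr = 0.05  # step 5: fixed learning rate
n = len(xs)

for epoch in range(500):
    # Steps 2-4: average the per-sample loss gradients over all samples.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Step 5: move the parameters in the direction that reduces the cost.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # converges toward w = 2, b = 1
```

Each iteration uses the gradient of the cost (the average loss) to nudge w and b toward the best-fit line.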
1. Squared Error Loss

It is the square of the difference between the actual and predicted values, calculated for each training sample, and is also known as the L2 loss. This cost function is less robust to outliers.

2. Absolute Error Loss

It is the distance between the predicted and actual values, irrespective of sign, and is also known as the L1 loss.

3. Huber Loss

It combines the best parts of the above two loss functions: it behaves like the squared error for small errors and like the absolute error for large ones, making it more robust to outliers.
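The three regression losses can be written down directly; the delta threshold in Huber loss below is an illustrative default:

```python
def squared_error(y, y_hat):
    # L2 loss: squares the residual, so large errors dominate (outlier-sensitive).
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    # L1 loss: magnitude of the residual, sign ignored.
    return abs(y - y_hat)

def huber(y, y_hat, delta=1.0):
    # Quadratic near zero (like L2), linear for large residuals (like L1).
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)

print(squared_error(3.0, 2.5))   # 0.25
print(absolute_error(3.0, 2.5))  # 0.5
print(huber(3.0, 8.0))           # 4.5 — grows linearly, not quadratically
```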

Binary Classification Loss Functions

These are used in classification problems. In a binary classification problem we have to assign an object to one of two classes according to similar behavior.

On an example (x, y), the margin is defined as y·f(x); it is a measure of how confident the prediction is. Some loss formulations used in classification are:
1. Binary Cross Entropy
2. Negative Log Likelihood
3. Margin Classifier
4. Soft Margin Classifier

1. Binary Cross Entropy Loss

Entropy – It is a degree of uncertainty.

A greater value of entropy for a probability distribution indicates a greater uncertainty in the distribution. Similarly, a small value indicates a more certain distribution.

We want to minimize the value of uncertainty.

This is also called log-loss. To calculate the probability P we can make use of the sigmoid function applied to z, where z is a function of our input features.
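A minimal sketch of log-loss with a sigmoid-derived probability; the score z = 2.0 is an arbitrary example value:

```python
import math

def sigmoid(z):
    # Maps a raw score z (a function of the input features) to a probability.
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    # Log-loss for a single example: y is the true label in {0, 1},
    # p is the predicted probability P(y = 1).
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = sigmoid(2.0)
print(round(p, 3))                           # ≈ 0.881
print(round(binary_cross_entropy(1, p), 3))  # low loss: confident and correct
print(round(binary_cross_entropy(0, p), 3))  # high loss: confident and wrong
```

Confident wrong predictions are punished much harder than uncertain ones, which is exactly the behavior we want when minimizing uncertainty.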

2. Hinge Loss

It is mostly used in SVM problems, which have class labels of -1 and 1 instead of 0 and 1. The hinge loss penalizes not only wrong predictions but also correct predictions that are not confident.
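The hinge loss is zero only when the margin y·f(x) is at least 1; the scores below are illustrative:

```python
def hinge_loss(y, score):
    # y is the true label in {-1, +1}; score = f(x) is the raw classifier output.
    # The loss is zero only when the margin y * score reaches 1.
    return max(0.0, 1.0 - y * score)

print(hinge_loss(1, 2.0))   # 0.0 — correct and confident
print(hinge_loss(1, 0.3))   # 0.7 — correct but not confident (inside the margin)
print(hinge_loss(-1, 0.5))  # 1.5 — wrong side of the boundary
```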

Multi-Class Classification Loss Functions

These are used in problems where a particular object belongs to one of more than two classes.

1. Multi-Class Cross Entropy Loss

It is a generalization of the binary cross-entropy loss to more than two classes.
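A common way to form the class probabilities is a softmax over raw scores; the three-class scores below are an illustrative assumption:

```python
import math

def softmax(scores):
    # Turns raw class scores into a probability distribution.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(true_class, scores):
    # Negative log of the probability assigned to the true class.
    probs = softmax(scores)
    return -math.log(probs[true_class])

scores = [2.0, 1.0, 0.1]
print(cross_entropy(0, scores))  # small loss: class 0 has the highest score
print(cross_entropy(2, scores))  # larger loss: class 2 has the lowest score
```

With two classes and scores (z, 0) this reduces to the binary cross-entropy with a sigmoid.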

2. KL-Divergence

The Kullback-Leibler divergence is used to measure how one probability distribution differs from another. A KL divergence of zero denotes that the distributions are identical.
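A minimal sketch of the discrete KL divergence; the two distributions below are illustrative:

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); zero iff P and Q are identical.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
print(kl_divergence(p, p))             # 0.0 — identical distributions
print(kl_divergence(p, [0.4, 0.4, 0.2]))  # > 0 once the distributions differ
```

Note that it is not symmetric: D_KL(P || Q) generally differs from D_KL(Q || P).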

Embedding loss functions:

These deal with problems where we have to measure whether two inputs are similar or dissimilar.
1. L1 Hinge Error – calculates the L1 distance between two inputs.
2. Cosine Error – the cosine distance between two inputs.
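The two underlying distances can be sketched directly; the example vectors are illustrative embeddings:

```python
import math

def l1_distance(a, b):
    # Sum of absolute coordinate differences between two embeddings.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical directions, up to 2 for opposite.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

a, b = [1.0, 0.0], [0.0, 1.0]
print(l1_distance(a, b))      # 2.0
print(cosine_distance(a, b))  # 1.0 — orthogonal vectors
```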

Optimisation Algorithms

Optimisation algorithms are used to update the weights and biases so as to reduce the error.

Constant Learning Rate Algorithms:

Stochastic Gradient Descent falls under this category.

Here η is called the learning rate, which is difficult to choose; we need to find an optimum value.
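The basic update rule w ← w − η·∇J(w) can be sketched as follows; the weights, gradients, and learning rate are illustrative:

```python
def sgd_step(weights, grads, lr=0.01):
    # w <- w - lr * grad: one update with a constant learning rate (eta).
    return [w - lr * g for w, g in zip(weights, grads)]

w = [0.5, -0.2]
g = [1.0, -2.0]  # gradients of the cost w.r.t. each weight
print(sgd_step(w, g, lr=0.1))  # each weight moves against its gradient
```

Too large an η overshoots the minimum; too small an η makes training painfully slow, which is why tuning it is hard.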

Adaptive Learning Algorithms:

As an alternative to classical SGD, adaptive gradient descent algorithms come into the picture, e.g. Adagrad, Adadelta, RMSprop, and Adam. They use per-parameter learning rates, which provide a heuristic approach without the expensive work of manually tuning the learning-rate schedule.
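As one concrete instance of a per-parameter learning rate, here is a minimal sketch of an Adagrad-style update; the weights and gradients are illustrative:

```python
import math

def adagrad_step(weights, grads, cache, lr=0.1, eps=1e-8):
    # Each parameter gets its own effective rate lr / sqrt(sum of squared grads),
    # so parameters with consistently large gradients take smaller steps.
    new_w = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        cache[i] += g * g  # accumulated squared gradient, kept per parameter
        new_w.append(w - lr * g / (math.sqrt(cache[i]) + eps))
    return new_w

w = [1.0, 1.0]
cache = [0.0, 0.0]
# A large gradient on the first parameter and a small one on the second
# still produce steps of comparable size, thanks to the per-parameter scaling.
w = adagrad_step(w, [10.0, 0.1], cache)
print(w)
```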

Thanks for reading! Follow our website to learn the latest technologies and concepts. Xpertup with us.
You can also check out our post on: Unsupervised learning with Python
