In our previous post we discussed what loss functions are, how they are broadly classified, and the different types of classification loss functions used in deep learning.
In this post we will focus on the different types of regression loss functions.
1. L2 Loss | Mean Squared Error Loss
Mean Squared Error, or L2 loss, is one of the most commonly used loss functions in machine learning. If you have some prior experience with machine learning, you have almost certainly heard the name before.
As the name suggests, Mean Squared Error is the mean of the squared differences between the target variable and the predicted values. It is a regression loss function.
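As a minimal sketch using NumPy (the function name `mse_loss` is my own, not a standard API):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean Squared Error: the mean of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Errors of 0, -0.5 and 1 give (0 + 0.25 + 1) / 3 ≈ 0.4167
print(mse_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```

Because the errors are squared before averaging, a single large error dominates the result, which matters in the comparison with L1 loss below.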
2. L1 Loss | Mean Absolute Error
Mean Absolute Error (MAE), or L1 loss, is another loss function used for regression models, where the output is a continuous value. MAE is the mean of the absolute differences between the target and predicted variables; it measures the average magnitude of the errors, irrespective of their direction. Since the error is absolute, the range of L1 loss is also 0 to ∞.
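A matching sketch for MAE, under the same assumptions (NumPy; `mae_loss` is my own name):

```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Mean Absolute Error: the mean of the absolute differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Errors of 0, -0.5 and 1 give (0 + 0.5 + 1) / 3 = 0.5
print(mae_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```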
You might be wondering when to prefer one of these two loss functions over the other.
Squared error is easier to optimize, but large errors have a much greater influence on MSE than on MAE. Absolute error, since it does not square the residuals, is therefore more robust to outliers.
In short: L1 loss is more robust to outliers, while L2 loss is sensitive to them. L2 loss yields a stable, closed-form solution, but the derivative of L1 loss is not continuous at zero, which can make optimization harder.
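To make the outlier point concrete, here is a small illustrative comparison (the data values are made up for the example):

```python
import numpy as np

y_true    = np.array([1.0, 2.0, 3.0, 4.0])
y_clean   = np.array([1.1, 1.9, 3.2, 3.8])   # reasonable predictions
y_outlier = y_clean.copy()
y_outlier[3] = 40.0                           # one wildly wrong prediction

mse = lambda t, p: np.mean((t - p) ** 2)
mae = lambda t, p: np.mean(np.abs(t - p))

print(mse(y_true, y_clean), mse(y_true, y_outlier))  # 0.025 vs 324.015
print(mae(y_true, y_clean), mae(y_true, y_outlier))  # 0.15  vs 9.1
```

One corrupted prediction inflates MSE by a factor of more than ten thousand here, while MAE grows far more modestly.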
3. Huber Loss | Smooth Mean Absolute Error
Huber loss is a loss function used in robust regression. It addresses the problems faced by the L1 and L2 losses: it is less sensitive to outliers than squared error loss, and unlike L1 loss it is differentiable at 0.
The function is quadratic when the error a is small (|a| ≤ δ, for some threshold δ) and linear when a is large.
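A sketch of that piecewise definition in NumPy (`huber_loss` is my own name; the default threshold δ = 1.0 is an arbitrary choice for illustration):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    a = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    quadratic = 0.5 * a ** 2
    # The linear branch is shifted so the two pieces join smoothly at |a| = delta.
    linear = delta * (np.abs(a) - 0.5 * delta)
    return np.mean(np.where(np.abs(a) <= delta, quadratic, linear))

# Small error (0.5): quadratic regime, 0.5 * 0.5**2 = 0.125
print(huber_loss([0.0], [0.5]))
# Large error (3.0): linear regime, 1.0 * (3.0 - 0.5) = 2.5
print(huber_loss([0.0], [3.0]))
```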
Reason for using Huber Loss?
One big problem with MAE is its constant, large gradient during gradient descent training. This can cause the optimizer to overshoot the minimum at the end of training. With MSE, by contrast, the gradient decreases as the loss approaches its minimum, making convergence more precise.
Huber loss can help here, as it curves around the minimum, which decreases the gradient.
4. Log-Cosh Loss
Log-cosh loss is the logarithm of the hyperbolic cosine of the prediction error. It is another loss function used for regression tasks.
log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x, and to abs(x) - log(2) for large x. This means that log-cosh works much like mean squared error, but is not strongly affected by the occasional wildly incorrect prediction. It has the advantages of Huber loss while also being twice differentiable everywhere, unlike Huber loss.
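A minimal sketch (`log_cosh_loss` is my own name; note that `np.cosh` overflows for very large errors, so production implementations use a numerically stabler formulation):

```python
import numpy as np

def log_cosh_loss(y_true, y_pred):
    """Mean of log(cosh(error)): behaves like MSE near 0, like MAE far from it."""
    x = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(np.log(np.cosh(x)))

# log(cosh(0)) = 0, and log(cosh(1)) ≈ 0.4338, close to 1**2 / 2 = 0.5
print(log_cosh_loss([0.0], [0.0]))
print(log_cosh_loss([0.0], [1.0]))
```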
I recommend reading this post, which includes a nice study comparing the performance of a regression model using L1 loss and L2 loss, both in the presence and absence of outliers.
Also Check our previous post on Loss Functions:
Loss Functions : Part 1