Activation functions help determine the output of a neural network. An activation function is attached to each neuron in the network and determines whether that neuron should be activated, based on whether its input is relevant to the model’s prediction.
Some activation functions also help normalize the output of each neuron: sigmoid bounds it between 0 and 1, and tanh between -1 and 1. Others, such as ReLU, are unbounded above.
In a neural network, inputs are fed into the neurons of the input layer. Each neuron in the next layer multiplies its inputs by its weights, sums the results and adds a bias. This value, passed through the activation function, is the output sent on to the next layer.
The activation function sits between the weighted input arriving at a neuron and the output that neuron sends to the next layer. It can be as simple as a step function that turns the neuron output on or off depending on a threshold value. Neural networks use non-linear activation functions so that they can model complex, non-linear relationships in the data. Without them, a stack of layers would collapse into a single linear transformation.
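To make the idea concrete, here is a minimal sketch of a single neuron with a step activation. The names `step` and `neuron_output`, and the input, weight and bias values, are made up for illustration and do not come from any particular library.

```python
def step(z, threshold=0.0):
    """Step activation: 1.0 if z reaches the threshold, otherwise 0.0."""
    return 1.0 if z >= threshold else 0.0


def neuron_output(inputs, weights, bias, activation=step):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values only.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]

on = neuron_output(inputs, weights, bias=0.1)    # weighted sum is slightly positive
off = neuron_output(inputs, weights, bias=-0.1)  # weighted sum is slightly negative
print(on, off)
```

Swapping `step` for any of the functions discussed below changes how the same weighted sum is turned into an output.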
Activation functions control the outputs of neural networks across many domains: object recognition and classification, speech recognition, segmentation, scene understanding and description, machine translation, text-to-speech systems, cancer detection, fingerprint detection, weather forecasting, and self-driving cars. In all of these, a proper choice of activation function can improve the results.
There are many kinds of activation functions. This post discusses the following:
- Sigmoid Function
- Tanh Function
- ReLU Function
- Leaky ReLU
- ELU Function
- PReLU Function
- Softplus
- Maxout
- Swish
- Softmax
1. Sigmoid Activation Function
A sigmoid function is a type of activation function, more specifically a squashing function. Squashing functions limit the output to a range between 0 and 1, which makes them useful for predicting probabilities. The formula for this function is: f(x) = 1 / (1 + exp(-x))
The sigmoid function was the most frequently used activation function in the early days of deep learning. It is a smooth function that is easy to differentiate.
The output of the sigmoid function lies in the open interval (0, 1). It is tempting to read it as a probability, but strictly speaking it should not be treated as one. It can also be thought of as the firing rate of a neuron.
Advantages of Sigmoid Function:
- Easy to understand, and mostly used in shallow networks.
- Output values range between 0 and 1, normalizing the output of each neuron.
- Gives a clear prediction: for inputs far from 0, the output is very close to 0 or 1.
Sigmoid has three major disadvantages:
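- Vanishing gradients: for large positive or negative inputs the gradient becomes very small, which damps the error signal during backpropagation.
- Function output is not centered on 0, which reduces the efficiency of the weight updates.
- Requires an exponential operation, which is relatively slow to compute.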
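The sigmoid and its gradient can be sketched in a few lines of plain Python. The derivative used here is the standard identity f'(x) = f(x) * (1 - f(x)). The function names are our own, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written so that exp() never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 for x = 0 and shrinks quickly as the input moves away from zero, which is the vanishing gradient behaviour listed above.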
2. Tanh or Hyperbolic tangent Activation Function
The hyperbolic tangent function, known as tanh, is a smooth, zero-centered function whose range lies between -1 and 1. Its output is given by: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
The main advantage of tanh is that it produces zero-centered output, which aids the back-propagation process. The function is differentiable. The curves of the tanh and sigmoid functions are relatively similar, so let’s compare them. For both, when the input is very large or very small, the curve is almost flat and the gradient is small, which is not conducive to weight updates. The difference is the output interval.
The output interval of tanh is (-1, 1), and the whole function is zero-centered, which is better than sigmoid.
In binary classification problems, tanh is commonly used for the hidden layers and sigmoid for the output layer. However, this is not a fixed rule: the activation function should be chosen by analyzing the specific problem, or by experimentation.
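The similarity between the two curves is more than visual: tanh is a shifted and rescaled sigmoid, tanh(x) = 2 * sigmoid(2x) - 1. The short sketch below checks this identity numerically and also computes the tanh gradient, 1 - tanh(x)^2. The helper names are our own.

```python
import math


def sigmoid(x):
    """Plain logistic sigmoid (fine for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """tanh expressed as a rescaled, re-centered sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, math.tanh(x), tanh_from_sigmoid(x), tanh_grad(x))
```

The gradient of tanh is 1 at the origin (against 0.25 for sigmoid), but it still shrinks toward zero for large positive or negative inputs.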
3. ReLU Activation Function
The ReLU (Rectified Linear Unit) function is currently more popular than the sigmoid and tanh functions. ReLU is a non-linear activation function used in multi-layer and deep neural networks. It can be represented as: f(x) = max(0, x)   (1)
According to equation 1, the output of ReLU is the maximum of zero and the input value. The output is zero when the input is negative and equal to the input when the input is positive. Thus, we can rewrite equation 1 as follows: f(x) = x if x ≥ 0, and f(x) = 0 if x < 0
The ReLU function simply takes the maximum of zero and its input. Note that it is not differentiable at x = 0, but we can take a sub-gradient there, as shown in the figure above. Although ReLU is simple, it has been an important advance in recent years, and it is now widely used in place of sigmoid and tanh to compute activation values in neural networks. The reasons for replacing sigmoid and hyperbolic tangent with ReLU include:
- Computation saving: ReLU can speed up the training of deep neural networks compared to traditional activation functions, because its derivative is 1 for positive input. Since this derivative is a constant, the network does not spend additional time evaluating it when computing error terms during training.
- Mitigating the vanishing gradient problem: ReLU does not saturate for positive inputs, so it is much less prone to the vanishing gradient problem as the number of layers grows. The earliest layer (the first hidden layer) is therefore able to receive the errors coming from the last layers and adjust all weights between layers. By contrast, a traditional activation function like sigmoid is restricted between 0 and 1 and has a small gradient, so the errors become small by the time they reach the first hidden layer. This leads to a poorly trained neural network.
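A rough back-of-the-envelope sketch shows why depth matters here. It assumes, purely for illustration, that every layer contributes the same activation derivative and it ignores the weights entirely, so the numbers are only an indication. The function name `gradient_scale` is hypothetical.

```python
def gradient_scale(local_derivative, num_layers):
    """Product of identical per-layer activation derivatives (weights ignored)."""
    return local_derivative ** num_layers


# 0.25 is the largest value the sigmoid derivative can take (at x = 0).
sigmoid_best_case = gradient_scale(0.25, 10)

# ReLU has derivative 1 wherever its input is positive.
relu_active_path = gradient_scale(1.0, 10)

print(sigmoid_best_case, relu_active_path)
```

Even in the best case for sigmoid, ten layers scale the error signal by less than one millionth, while a path of active ReLU units leaves it unchanged.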
Of course, there are a few disadvantages:
- When the input is negative, ReLU is completely inactive: it outputs zero. This is not a problem in forward propagation, where some regions are simply sensitive and others are not. In backpropagation, however, the gradient for a negative input is exactly zero, so a neuron whose input stays negative stops updating and is said to die. This is similar to the saturation problem of the sigmoid and tanh functions.
- The output of the ReLU function is either 0 or a positive number, which means that ReLU is not a zero-centered function.
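Both the function and its sub-gradient fit in a few lines. This is a minimal sketch with our own function names. The gradient value returned at exactly x = 0 is a convention (0 is chosen here), since the function is not differentiable at that point.

```python
def relu(x):
    """ReLU: the maximum of zero and the input."""
    return max(0.0, x)


def relu_grad(x):
    """Sub-gradient of ReLU; returning 0 at x == 0 is a convention, not a derivative."""
    return 1.0 if x > 0 else 0.0


for x in (-3.0, 0.0, 2.5):
    print(x, relu(x), relu_grad(x))
```

The zero gradient for every negative input is exactly what causes the dying ReLU problem described above.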
4. Leaky ReLU Activation Function
The leaky ReLU, proposed in 2013, introduces a small negative slope to the ReLU to keep the weight updates alive during the entire propagation process. The parameter α was introduced as a solution to ReLU’s dead neuron problem, so that the gradient is never zero during training. For negative inputs, the leaky ReLU uses a small constant slope α, typically around 0.01. The leaky ReLU activation function is computed as: f(x) = x if x > 0, and f(x) = αx if x ≤ 0
The leaky ReLU performs almost identically to the standard ReLU, except that it has non-zero gradients over the entire training duration. Compared to standard ReLU and tanh, no significant improvement has been observed except in sparsity and dispersion.
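As a minimal sketch, leaky ReLU differs from ReLU only in the negative branch. The default of 0.01 follows the value mentioned above, and the function names are our own.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, small slope alpha otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 for positive inputs and alpha otherwise, so it is never zero."""
    return 1.0 if x > 0 else alpha


for x in (-2.0, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```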
5. ELU (Exponential Linear Units) Function
Exponential linear units (ELUs) are another type of activation function, proposed in 2015, and are used to speed up the training of deep neural networks. The exponential linear unit (ELU) is given by: f(x) = x if x > 0, and f(x) = α(exp(x) - 1) if x ≤ 0
ELU was also proposed to solve the problems of ReLU. It keeps the advantages of ReLU and adds the following:
- No Dead ReLU issues.
- The mean of the output is close to 0, zero-centered.
One small problem is that it is slightly more computationally intensive. As with Leaky ReLU, although ELU is theoretically better than ReLU, there is currently no good evidence that it is always better in practice.
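A small sketch of ELU in plain Python, assuming α = 1.0 as a default (a common choice, but an assumption here). It uses `math.expm1`, which computes exp(x) - 1 accurately for inputs close to zero.

```python
import math


def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x, alpha=1.0):
    """Gradient is 1 for positive inputs and alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in (-5.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_grad(x))
```

For large negative inputs the output levels off near -α instead of growing without bound, and the gradient stays positive, which is why ELU units do not die the way ReLU units can.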
6. PReLU (Parametric Rectified Linear Units) Function
PReLU is also an improved version of ReLU. In the negative region, PReLU has a small slope, which also avoids the dying ReLU problem. Compared to ELU, PReLU is linear in the negative region. Although the slope is small, it does not tend to 0, which is an advantage. The PReLU is given by: f(yᵢ) = yᵢ if yᵢ > 0, and f(yᵢ) = aᵢyᵢ if yᵢ ≤ 0
In this formula, the slope parameter (often written α) is generally a relatively small number between 0 and 1. When it is fixed at 0.01, PReLU reduces to Leaky ReLU, which can be regarded as a special case of PReLU.
Here, yᵢ is the input for the ith channel, and aᵢ is the parameter controlling the negative slope, which is learnable during training with back-propagation.
- if aᵢ=0, f becomes ReLU
- if aᵢ is a small, fixed positive value, f becomes leaky ReLU
PReLU was reported to perform better than ReLU in large-scale image recognition, and these results were the first to surpass human-level performance on a visual recognition challenge.
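What makes PReLU different from leaky ReLU is that the slope is trained. The sketch below performs one hand-written gradient-descent step on the slope for a single negative input with a toy squared error. The starting slope, input, target and learning rate are arbitrary illustrative values, and the function names are our own.

```python
def prelu(y, a):
    """PReLU: identity for positive inputs, learnable slope a otherwise."""
    return y if y > 0 else a * y


def prelu_grad_a(y):
    """Derivative of prelu(y, a) with respect to the slope a."""
    return 0.0 if y > 0 else y


# Toy setup: one negative input and a squared-error loss (values are illustrative).
a, y, target, lr = 0.25, -2.0, -1.0, 0.1

error = prelu(y, a) - target          # how far the output is from the target
grad = 2.0 * error * prelu_grad_a(y)  # chain rule for the loss (output - target)^2
a_updated = a - lr * grad             # one gradient-descent step on the slope

print(a, a_updated, prelu(y, a), prelu(y, a_updated))
```

After the update the output for the same input is closer to the target, which is the sense in which the slope is learned by back-propagation. In a real network there would be one such slope per channel, updated together with the weights.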
7. Softplus Function
The Softplus activation function is a smooth version of the ReLU function. It has smoothing and nonzero gradient properties, which enhance the stability and performance of deep neural networks designed with Softplus units.
The Softplus is given by: f(x) = ln(1 + exp(x))
The Softplus function has mostly been applied in statistical applications. However, a comparison of the Softplus function with the ReLU and Sigmoid functions showed improved performance with Softplus, with fewer epochs to convergence during training.
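Computed literally, ln(1 + exp(x)) overflows for large x. The sketch below uses a standard algebraic rearrangement, max(x, 0) + ln(1 + exp(-|x|)), which gives the same value without ever exponentiating a large positive number. The function names are our own.

```python
import math


def softplus(x):
    """ln(1 + exp(x)), rearranged so that exp() never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relu(x):
    return max(0.0, x)


for x in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(f"x={x:6.1f}  softplus={softplus(x):.6f}  relu={relu(x):.6f}")
```

Away from zero the two functions are almost indistinguishable. Near zero, Softplus rounds off the corner that ReLU has at x = 0.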
8. Maxout Function
The maxout activation function is defined as follows: f(x) = max(w₁ᵀx + b₁, w₂ᵀx + b₂)
where w = weights, b = biases, T = transpose.
Maxout generalizes ReLU and leaky ReLU: the neuron inherits their properties, but no dying neurons or saturation exist in the network computation. It is a learnable activation function.
Maxout can be seen as adding a layer of activation function to the deep learning network, which contains a parameter k. Compared with ReLU, sigmoid, etc., this layer is special in that it computes k linear units for each neuron and then outputs the largest of their values.
The major drawback of the Maxout function is that it is computationally expensive: it doubles the number of parameters used in every neuron, increasing the total number of parameters the network has to compute.
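The following sketch treats a maxout unit as the maximum over a list of (weights, bias) pairs, which covers the two-piece case as well as a general k. It also shows how ReLU and leaky ReLU fall out as special cases when the pieces are chosen by hand. The names `affine` and `maxout` and all the numbers are illustrative assumptions. In a real network the pieces would be learned.

```python
def affine(w, x, b):
    """Dot product of weights and inputs, plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def maxout(x, pieces):
    """pieces is a list of (weights, bias) pairs; return the largest affine output."""
    return max(affine(w, x, b) for w, b in pieces)


# ReLU on the first input: one piece passes it through, the other is constantly zero.
relu_like = [([1.0, 0.0], 0.0), ([0.0, 0.0], 0.0)]

# Leaky ReLU on the first input: the second piece is the first scaled by 0.01.
leaky_like = [([1.0, 0.0], 0.0), ([0.01, 0.0], 0.0)]

print(maxout([1.5, -2.0], relu_like), maxout([-1.5, 2.0], relu_like))
print(maxout([-2.0, 0.0], leaky_like))
```

Each extra piece brings its own weight vector and bias, which is where the additional parameters come from.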
9. Swish (A Self-Gated) Function
The Swish activation function is one of the first compound activation functions, formed by combining the sigmoid activation function with the input itself to obtain a hybrid activation function. Its properties include smoothness, non-monotonicity, being bounded below, and being unbounded above. The Swish function is given by: y = x * sigmoid(x)
Swish’s design was inspired by the use of sigmoid functions for gating in LSTMs and highway networks. We use the same value for gating to simplify the gating mechanism, which is called self-gating.
The advantage of self-gating is that it only requires a simple scalar input, while normal gating requires multiple scalar inputs. This feature enables self-gated activation functions such as Swish to easily replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or number of parameters.
- Being unbounded above helps prevent the gradient from gradually approaching 0 during slow training, which causes saturation. At the same time, being bounded below has advantages, because bounded activation functions can have strong regularization, and large negative inputs are mapped to values near zero.
- Smoothness also plays an important role in optimization and generalization.
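These properties are easy to check numerically. The sketch below implements y = x * sigmoid(x) with an overflow-safe sigmoid and prints a few values. The function names are our own.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written so that exp() never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x):
    """Swish: the input gated by its own sigmoid."""
    return x * sigmoid(x)


for x in (-10.0, -5.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  swish={swish(x):.5f}")
```

The output dips slightly below zero for moderately negative inputs and then climbs back toward zero as the input becomes more negative, which is the non-monotonic, bounded-below behaviour. For large positive inputs it grows like the input itself.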
10. Softmax Function
The softmax function turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1 so that they can be interpreted as probabilities. If one of the inputs is small or negative, the softmax turns it into a small probability, and if the input is large, it turns it into a large probability, but the result always remains between 0 and 1. The softmax function is computed as: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ), where i = 1, …, K and the sum runs over j = 1, …, K
Here z is the input vector and zᵢ are its elements, which can take any real value. The term in the denominator is the normalization term, which ensures that all the output values of the function sum to 1, thus constituting a valid probability distribution.
The softmax function is used in multi-class models, where it returns a probability for each class, ideally with the target class having the highest probability. It appears in the output layer of almost all deep learning architectures used for classification.
The main difference between the sigmoid and softmax activation functions is that sigmoid is used in binary classification, while softmax is used for multi-class classification tasks.
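A minimal softmax sketch in plain Python is shown below. Subtracting the largest input before exponentiating is a standard trick: it does not change the result, because the common factor cancels in the ratio, but it keeps exp() from overflowing. The example scores are made up.

```python
import math


def softmax(z):
    """Softmax of a list of real scores, shifted by the maximum for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative scores for three classes.
probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))

# With two classes and the second score fixed at 0, the first output is sigmoid(x).
x = 0.3
print(softmax([x, 0.0])[0], 1.0 / (1.0 + math.exp(-x)))
```

The last two printed values agree, which shows the link between the two functions: sigmoid is what softmax reduces to when there are only two classes.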