# Activation Function

An **activation function** helps determine the output of a **neural network**. These functions are attached to each neuron in the network and determine whether the neuron should be activated or not, based on whether that neuron’s input is relevant for the model’s prediction.

Activation functions also help with normalization: the output of a typical activation function ranges between 0 and 1 or between -1 and 1.

In a neural network, inputs are fed into the neurons in the **input layer**. Each neuron’s weights are multiplied by the *input* number, which gives the *output* passed to the next layer.

The activation function is placed after the **hidden layer**, between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on some threshold value.

*Neural networks* use *non-linear activation* functions so that they can learn any kind of complex data and always try to give an accurate result.

**Activation functions** are used to control the outputs of our neural networks across different domains, from **object recognition** and *classification* to *speech recognition*, *segmentation*, *scene understanding and description*, *machine translation*, *text-to-speech systems*, *cancer detection systems*, *fingerprint detection*, *weather forecasting*, and *self-driving cars*, validating that a proper choice of activation function improves results in neural network computing.

### There are many different kinds of activation functions; we discuss a few of them:

- Sigmoid Function
- Tanh Function
- ReLU Function
- Leaky ReLU
- ELU Function
- PReLU
- Softplus
- Maxout
- Swish
- Softmax

### 1. Sigmoid Activation Function

A **sigmoid function** is a type of activation function, and more specifically defined as a squashing function. The output range of *squashing functions* is between 0 and 1, making these functions useful in the prediction of *probabilities*. The formula for this function is:

**f(x) = 1 / (1 + exp(-x))**

The **sigmoid function** was the most frequently used activation function at the beginning of *deep learning*. It is a smooth function that is easy to differentiate.

In the sigmoid function, we can see that its output is in the open interval (0, 1). We can think of the output as a probability, but in the strict sense, it should not be treated as one. The sigmoid function was once very popular; its output can be thought of as the firing rate of a **neuron**.

Advantages of the Sigmoid Function:

- Easy to understand and mostly used in shallow networks.
- Output values range between 0 and 1, *normalizing* the output of each neuron.
- Gives a clear prediction.

Sigmoid has three major disadvantages:

- *Sharply damped gradients* during backpropagation (the vanishing gradient problem).
- The function output is not centered on 0, which reduces the efficiency of the weight updates.
- It performs *exponential operations*, which are slower to compute.
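The properties and disadvantages above can be illustrated with a minimal NumPy sketch (the function names here are my own, not from any particular library):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + exp(-x)): squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative f'(x) = f(x) * (1 - f(x)); it peaks at 0.25 when x = 0,
    # so gradients shrink sharply for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)
```

Evaluating `sigmoid_grad` at a large input such as x = 10 gives a value close to zero, which is exactly the vanishing-gradient disadvantage listed above.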

### 2. Tanh or Hyperbolic tangent Activation Function

The hyperbolic tangent function is known as the **tanh function**, a smoother, zero-centered function whose range lies between -1 and 1. The output of the tanh function is given by:

**f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))**

The main advantage provided by the function is that it produces zero-centered output, thereby aiding the back-propagation process. The function is **differentiable**. Tanh is a *hyperbolic tangent function*. The curves of the tanh function and the *sigmoid function* are relatively similar, so let’s compare them. First of all, when the input is very large or very small, the output is almost flat and the gradient is small, which is not conducive to weight updates. The difference is the output interval: the output interval of tanh is (-1, 1), and the whole function is zero-centric, which is better than **sigmoid**.
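One way to see the similarity to the sigmoid is the identity tanh(x) = 2·sigmoid(2x) − 1, which a short NumPy check confirms (function names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_sigmoid(x):
    # tanh is a rescaled, recentered sigmoid: 2 * sigmoid(2x) - 1,
    # which is why its curve looks similar but is zero-centered
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.linspace(-3.0, 3.0, 7)
```

This identity also makes the zero-centering obvious: shifting sigmoid’s (0, 1) range down by 0.5 and doubling it yields the (-1, 1) range of tanh.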

In general binary classification problems, the tanh function is used for the **hidden layers** and the *sigmoid function* is used for the *output layer*. However, these are not static rules; the specific *activation function* to be used must be chosen according to the specific problem, or determined by experimentation.

### 3. ReLU Activation Function

The **ReLU (Rectified Linear Unit)** function is an activation function that is currently more popular than the sigmoid function and the tanh function. ReLU is a non-linear *activation function* that is used in multi-layer *neural networks* or deep neural networks. This function can be represented as:

**f(x) = max(0, x)  (1)**

According to equation 1, the output of ReLU is the maximum value between zero and the input value. The output is equal to zero when the input value is negative, and equal to the input value when the input is positive. Thus, we can rewrite equation 1 as follows:

**f(x) = x, if x > 0; f(x) = 0, if x ≤ 0**

The ReLU function is actually a maximum-taking function. Note that it is not differentiable over its entire domain (it has a kink at zero), but we can take a sub-gradient there. Although ReLU is simple, it is an important achievement of recent years. Recently, the *ReLU function* has been used instead of *sigmoid* and the *hyperbolic tangent* to calculate the activation values in traditional neural networks or deep neural network paradigms. The reasons for this replacement consist of:

- **Computation saving** – The *ReLU function* is able to accelerate the training speed of deep neural networks compared to traditional activation functions, since the derivative of *ReLU* is 1 for positive input. Because this derivative is constant, *deep neural networks* do not need additional time for computing error terms during the training phase.
- **Solving the vanishing gradient problem** – The *ReLU function* does not trigger the *vanishing gradient problem* when the number of layers grows, because it has no *asymptotic upper and lower bound*. Thus, the earliest layer (the first hidden layer) is able to receive the errors coming from the last layers to adjust all weights between layers. By contrast, a traditional activation function like sigmoid is restricted between 0 and 1, so the errors become small for the first hidden layer. This scenario leads to a poorly trained neural network.

Of course, there are a few disadvantages:

- When the input is negative, *ReLU* is completely inactive, which means that once a negative number is entered, the neuron will die. In the *forward propagation process* this is not a problem: some areas are sensitive and some are insensitive. But in the *backpropagation* process, if you enter a negative number, the gradient will be completely zero, which is the same problem as with the *sigmoid function* and the *tanh function*.
- The output of the ReLU function is either 0 or a positive number, which means that ReLU is not a *0-centric function*.

### 4. Leaky ReLU Activation Function

The **leaky ReLU**, proposed in 2013, is an *activation function* that introduces a small negative slope to the *ReLU* to sustain and keep the weight updates alive during the entire *propagation* process. The *alpha parameter* was introduced as a solution to *ReLU’s dead neuron* problem, such that the *gradients* will not be zero at any time during training. The *leaky ReLU* computes the output with a very small constant slope α (typically 0.01) for negative inputs; thus the *leaky ReLU* activation function is computed as:

**f(x) = x, if x > 0; f(x) = αx, if x ≤ 0**
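A minimal NumPy sketch of this piecewise definition, with α fixed at the 0.01 mentioned above (the function name is my own):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # like ReLU, but keeps a small slope alpha for negative inputs,
    # so the gradient is never exactly zero
    return np.where(x > 0, x, alpha * x)
```

Unlike plain ReLU, even a strongly negative input still produces a (small) non-zero output and gradient, which is what keeps the weight updates alive.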

*leaky ReLU*The ** leaky ReLU** has an identical result when compared to the standard

**with an exception that it has non-zero gradients over the entire duration so when compared to standard**

*ReLU***and**

*ReLU***there is no significant improvement observed except in**

*tanh***and**

*sparsity***.**

*dispersion*### 5. ELU (Exponential Linear Units) Function

The exponential linear unit (ELU) is another type of **activation function**, proposed in 2015, used to speed up the training of *deep neural networks*. The exponential linear unit is given by:

**f(x) = x, if x > 0; f(x) = α(exp(x) - 1), if x ≤ 0**

**ELU** is also proposed to solve the problems of *ReLU*. Obviously, *ELU* has all the advantages of *ReLU*:

- No dead *ReLU* issues.
- The mean of the output is close to 0 (*zero-centered*).
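The piecewise formula above can be sketched directly in NumPy, with the common default α = 1 (an assumption for illustration; the name is my own):

```python
import numpy as np

def elu(x, alpha=1.0):
    # linear for x > 0; for x <= 0 the output smoothly approaches -alpha,
    # so negative inputs still carry gradient and the mean output is near 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```

For strongly negative inputs the output saturates toward -α rather than dying to exactly zero, which is the key difference from ReLU.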

One small problem is that ELU is slightly more computationally intensive. Similar to **Leaky ReLU**, although theoretically better than *ReLU*, there is currently no good evidence in practice that *ELU* is always better than *ReLU*.

### 6. PReLU (Parametric Rectified Linear Units) Function

**PReLU** is also an improved version of *ReLU*. In the negative region, *PReLU* has a small slope, which can also avoid the problem of *ReLU* death. Compared to *ELU*, *PReLU* is a linear operation in the negative region. Although the slope is small, it does not tend to 0, which is a certain advantage. The *PReLU* is given by:

**f(yᵢ) = yᵢ, if yᵢ > 0; f(yᵢ) = aᵢyᵢ, if yᵢ ≤ 0**

Looking at the formula of **PReLU**, the parameter aᵢ is generally a number between 0 and 1, and usually relatively small (on the order of a few hundredths). When aᵢ = 0.01, *PReLU* is called *Leaky ReLU*, which is regarded as a special case of *PReLU*.

Above, yᵢ is the input for the iᵗʰ channel, and aᵢ is the negative-slope controlling parameter, which is learnable during training with back-propagation.

- If aᵢ = 0, f becomes *ReLU*.
- If aᵢ > 0 and fixed, f becomes *leaky ReLU*.
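The forward pass of PReLU can be sketched as follows; in a real framework aᵢ would be a trained parameter per channel, but here it is passed in explicitly for illustration (names are my own):

```python
import numpy as np

def prelu(y, a):
    # y: channel input(s); a: learnable negative-slope parameter a_i
    # a = 0 reduces to ReLU, a = 0.01 reduces to leaky ReLU
    return np.where(y > 0, y, a * y)
```

During training, back-propagation would update `a` just like any other weight, letting each channel learn how much signal to leak through its negative region.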

It was reported that the performance of **PReLU** was better than *ReLU* in large-scale image recognition, and these *PReLU* results were the first to surpass human-level performance on a visual recognition challenge.

### 7. Softplus Function

The Softplus activation function is a smooth version of the **ReLU function**, which has *smoothing* and *nonzero gradient* properties, thereby enhancing the stabilization and performance of *deep neural networks* designed with softplus units.

The **softplus** is given by:

**f(x) = ln(1 + exp(x))**
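The formula above translates directly into NumPy (the name is illustrative; `log1p` is used for numerical accuracy near zero):

```python
import numpy as np

def softplus(x):
    # f(x) = ln(1 + exp(x)): a smooth approximation of ReLU whose
    # gradient (the sigmoid) is nonzero everywhere
    return np.log1p(np.exp(x))
```

For large positive x the output tracks x closely, like ReLU, but for negative x it stays slightly above zero instead of clipping exactly to zero.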

The **Softplus function** has mostly been applied in *statistical applications*; however, a comparison of the Softplus function with the *ReLU* and *Sigmoid* functions showed improved performance, with fewer epochs to convergence during training, when using Softplus.

### 8. Maxout Function

The **maxout** activation function is defined as follows:

**f(x) = max(w₁ᵀx + b₁, w₂ᵀx + b₂)**

where w = weights, b = biases, and T = transpose.

Maxout, as proposed, generalizes the leaky ReLU and ReLU: the neuron inherits the properties of ReLU and leaky ReLU, so no dying neurons or saturation exist in the network computation. The **Maxout activation function** is a generalization of the *ReLU* and *leaky ReLU* functions. It is a learnable *activation function*.

**Maxout** can be seen as adding a layer of activation function to the *deep learning network*, containing a parameter k. Compared with *ReLU*, *sigmoid*, etc., this layer is special in that it adds *k* *neurons* and then outputs the largest activation value.
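A minimal sketch of one maxout unit with k = 2 linear pieces, using hand-picked weights for illustration (all names and values here are my own, not from the original):

```python
import numpy as np

def maxout(x, W, b):
    # W has one row of weights per piece (k rows), b one bias per piece;
    # the unit outputs the largest of the k linear responses w_j . x + b_j
    return np.max(W @ x + b)

# Setting the second piece's weights and bias to zero recovers
# ReLU-like behavior: max(w1 . x + b1, 0)
W = np.array([[1.0, 1.0],
              [0.0, 0.0]])
b = np.array([0.0, 0.0])
```

Because each of the k pieces has its own weights and biases, the unit learns its own activation shape, at the cost of multiplying the parameter count.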

The major drawback of the **Maxout function** is that it is *computationally expensive*, as it doubles the parameters used in all neurons, thereby increasing the number of parameters the network must compute.

### 9. Swish (A Self-Gated) Function

The **Swish activation function** is one of the first compound *activation functions*, proposed by combining the *sigmoid activation function* with the input itself to achieve a *hybrid AF*. The properties of the Swish function include *smoothness*, being *non-monotonic*, *bounded* below, and unbounded above. The Swish function is given by:

**y = x * sigmoid(x)**

**Swish’s** design was inspired by the use of *sigmoid functions* for gating in *LSTMs* and highway networks. Swish uses the same value for gating that it gates, which simplifies the gating mechanism; this is called *self-gating*.

The advantage of **self-gating** is that it only requires a simple scalar input, while normal gating requires multiple scalar inputs. This feature enables self-gated activation functions such as *Swish* to easily replace activation functions that take a single scalar as input (such as *ReLU*) without changing the hidden capacity or number of parameters.

*ReLU*- Unboundedness (unboundedness) is helpful to prevent gradient from gradually approaching 0 during slow training, causing saturation. At the same time, being bounded has advantages, because bounded active functions can have strong
, and larger negative inputs will be resolved.*regularization* - At the same time,
also plays an important role in*smoothness*and*optimization*.*generalization*
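The formula y = x * sigmoid(x) and the properties just listed can be checked with a short sketch (the function name is my own):

```python
import numpy as np

def swish(x):
    # y = x * sigmoid(x): smooth, non-monotonic,
    # bounded below and unbounded above
    return x / (1.0 + np.exp(-x))
```

Evaluating it at a few points shows the behavior: near-linear for large positive x, a small negative dip for moderately negative x (the non-monotonic part), and outputs approaching 0 from below as x goes to negative infinity (bounded below).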

### 10. Softmax Function

The **softmax function** turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but the *softmax transforms* them into values between 0 and 1 so that they can be interpreted as probabilities. If one of the inputs is small or negative, the softmax turns it into a small probability, and if an input is large, it turns it into a large probability, but it will always remain between 0 and 1. The *Softmax function* is computed using the relationship:

**softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)**

where zᵢ are the components of the real-valued input vector. The term on the bottom of the formula is the **normalization** term, which ensures that all the output values of the function sum to 1, thus constituting a valid *probability distribution*.

*probability distribution*The ** Softmax function** is used in

**where it returns probabilities of each class, with the target class having the highest probability. The**

*multi-class models***mostly appears in almost all the output layers of the**

*Softmax function***, where they are used.**

*deep learning architectures*The main difference between the Sigmoid and Softmax activation function is that the Sigmoid is used in binary classification while the Softmax is used for ** multivariate classification tasks**.
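The normalization described above can be sketched in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result (the function name is my own):

```python
import numpy as np

def softmax(z):
    # shift by max(z) for numerical stability, then
    # exponentiate and divide by the normalization term
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
```

The resulting vector `p` is non-negative, sums to 1, and assigns the highest probability to the largest input, exactly as described above.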
