Multilayer Perceptron

We already know that a perceptron can only produce linear decision boundaries. However, many interesting real-world problems, such as image classification, object detection, speech recognition, and text summarization, are not linearly separable. These problems need non-linear boundaries.

So, the way to fit more complex (i.e., non-linear) models with neural networks is to use a network that consists of more than a single perceptron. The take-home message from the perceptron is that all of the learning happens by adapting the synaptic weights until the predictions are satisfactory. Here we need multiple layers, each fully connected to the next, so that the input signal propagates through the network in a forward direction, layer by layer. Such neural networks are commonly referred to as multilayer perceptrons (MLPs).

Characteristics of Multilayer Perceptron

1. Each neuron in the network includes a non-linear activation function. The important point to emphasize here is that the non-linearity is smooth (i.e., differentiable everywhere). A commonly used form of non-linearity that satisfies this requirement is the sigmoidal non-linearity defined by the logistic function: σ (z) = 1 / (1 + exp (–z)). The presence of non-linearities is important because otherwise the input-output relation of the network could be reduced to that of a single-layer perceptron. Moreover, the use of the logistic function is biologically motivated, since it attempts to account for the refractory phase of real neurons.
2. The network contains one or more layers of hidden neurons that are not part of the input or output of the network. These hidden neurons enable the network to learn more complex tasks by extracting progressively more meaningful features from the input patterns.
3. The network exhibits a high degree of connectivity, determined by the synapses of the network. A change in the connectivity of the network requires a change in the population of synaptic connections or their weights.
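To see why the non-linearity matters, here is a minimal NumPy sketch. The layer sizes, the random inputs, and the variable names are assumptions made purely for illustration. It checks that two stacked layers without an activation function reduce to one linear layer, while adding the logistic function between them does not.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 inputs -> 4 hidden units -> 2 outputs, 5 instances.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
x = rng.normal(size=(5, 3))

# Two layers with no activation function in between...
two_linear_layers = (x @ W1 + b1) @ W2 + b2
# ...are exactly one linear layer with merged weights and biases.
single_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)
linear_layers_collapse = np.allclose(two_linear_layers, single_layer)

# With a logistic activation in the hidden layer, the merged layer no longer matches.
with_sigmoid = sigmoid(x @ W1 + b1) @ W2 + b2
nonlinear_differs = not np.allclose(with_sigmoid, single_layer)

print(linear_layers_collapse, nonlinear_differs)
```

Both flags should come out true: the purely linear stack is equivalent to a single layer, and the non-linear one is not.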

How does a multilayer perceptron work?

An MLP is composed of one input layer, one or more hidden layers, and one final layer which is called an output layer.

The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer is fully connected to the next layer.
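As a rough sketch of this layered structure, the snippet below pushes a mini-batch through a stack of fully connected layers. The layer sizes, the `forward` helper, and the use of the logistic function in every layer are assumptions for illustration, not a prescribed design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate a mini-batch through a list of (weights, bias) pairs.

    Each layer is fully connected to the next: every output of one layer
    feeds every neuron of the following layer.
    """
    activation = x
    for weights, bias in layers:
        activation = sigmoid(activation @ weights + bias)
    return activation

rng = np.random.default_rng(0)

# Hypothetical architecture: 4 inputs, two hidden layers (5 and 3 units), 2 outputs.
layer_sizes = [4, 5, 3, 2]
layers = [
    (rng.normal(size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

batch = rng.normal(size=(8, 4))   # 8 instances with 4 features each
output = forward(batch, layers)
print(output.shape)
```

The lower layers are the first entries of `layers` and the upper layers are the last ones.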

Now let’s walk through the training algorithm for a multilayer perceptron:

1. Suppose we have a multi-class classification problem with many training instances per class. The algorithm handles one mini-batch of instances at a time, and it goes through the full training set multiple times. Each pass is called an epoch.
2. Each mini-batch is passed to the network’s input layer, which just sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the forward pass.
3. Next, the algorithm measures the network’s output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
4. Then it computes how much each output connection contributed to the error. This is done analytically by simply applying the chain rule, which makes this step fast and precise.
5. The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and so on until the algorithm reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm, backpropagation).
6. Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.

This algorithm is so important that it is worth summarizing again: for each training instance, the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
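The six steps can be sketched in a few lines of NumPy. Everything specific here is an assumption chosen for illustration: the toy XOR data set, the 2-8-1 network, the half mean squared error loss, the learning rate, and the mini-batch size. Treat it as a sketch of the idea rather than a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Toy XOR data set (assumed for illustration; it is not linearly separable).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])

# Random initial weights for a hypothetical 2-8-1 network.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def dataset_loss():
    """Half mean squared error over the whole training set."""
    out = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    return 0.5 * np.mean((out - T) ** 2)

learning_rate, batch_size, n_epochs = 0.5, 2, 5000
initial_loss = dataset_loss()

for epoch in range(n_epochs):                        # one epoch = one full pass
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):       # step 1: one mini-batch at a time
        idx = order[start:start + batch_size]
        x, t = X[idx], T[idx]

        a1 = sigmoid(x @ W1 + b1)                    # step 2: forward pass
        a2 = sigmoid(a1 @ W2 + b2)

        error = a2 - t                               # step 3: output error

        delta2 = error * a2 * (1.0 - a2)             # step 4: chain rule at the output layer
        delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)   # step 5: propagate to the layer below

        # Step 6: Gradient Descent step, averaged over the mini-batch.
        W2 -= learning_rate * (a1.T @ delta2) / len(idx)
        b2 -= learning_rate * delta2.mean(axis=0)
        W1 -= learning_rate * (x.T @ delta1) / len(idx)
        b1 -= learning_rate * delta1.mean(axis=0)

final_loss = dataset_loss()
print(initial_loss, final_loss)
```

The loss on the full training set should be lower after training than before. How low it gets depends on the random seed and the hyperparameters chosen above.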

Backpropagation

Backpropagation is a method for efficiently computing the gradient of the cost function of a neural network with respect to its parameters. These partial derivatives can then be used to update the network’s parameters using, e.g., gradient descent.

This may be the most common method for training neural networks. Deriving backpropagation involves numerous clever applications of the chain rule for functions of vectors.

Notation

• ${z_j}$: weighted input to node j
• ${g_j}$: activation function of node j
• $a_j=g_j(z_j)$: output (activation) of node j
• ${w_{ij}}$: weight of the connection from node i to node j
• ${b_{j}}$: bias of unit j in layer l
• ${t_{k}}$: target value for output node k
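With this notation, the chain-rule computation can be written out compactly. The equations below are a sketch under stated assumptions: the text does not fix a cost function, so the squared error $E = \frac{1}{2}\sum_k (a_k - t_k)^2$ is assumed; $w_{ij}$ is taken to be the weight from node i to node j; and the symbol $\delta_j = \partial E / \partial z_j$ is introduced here for convenience.

$$z_j = \sum_i w_{ij}\, a_i + b_j, \qquad a_j = g_j(z_j)$$

$$\delta_k = (a_k - t_k)\, g_k'(z_k) \quad \text{(output node } k\text{)}$$

$$\delta_j = g_j'(z_j) \sum_k w_{jk}\, \delta_k \quad \text{(hidden node } j\text{, summing over the nodes } k \text{ it feeds)}$$

$$\frac{\partial E}{\partial w_{ij}} = a_i\, \delta_j, \qquad \frac{\partial E}{\partial b_j} = \delta_j$$

The second equation is the output-layer step, the third is the backward propagation through the hidden layers, and the last line gives the partial derivatives used by gradient descent.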

Note: It is important to initialize the connection weights of all hidden layers randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer are identical, backpropagation affects them in exactly the same way, and they remain identical. The network then behaves as if it had only one neuron per layer, so it cannot minimize the loss effectively. If you initialize the weights randomly instead, you break the symmetry and allow backpropagation to train a diverse team of neurons.
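The symmetry problem is easy to check numerically. The sketch below is only an illustration under assumptions: a tiny 2-3-1 network without biases, toy OR-style targets, and a hypothetical `train` helper. It compares zero initialization with random initialization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed for illustration): OR-style targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [1.0]])

def train(W1, W2, steps=500, learning_rate=0.5):
    """Hypothetical helper: full-batch gradient descent on a 2-3-1 network without biases."""
    W1, W2 = W1.copy(), W2.copy()
    for _ in range(steps):
        a1 = sigmoid(X @ W1)
        a2 = sigmoid(a1 @ W2)
        delta2 = (a2 - T) * a2 * (1.0 - a2)
        delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)
        W2 -= learning_rate * (a1.T @ delta2) / len(X)
        W1 -= learning_rate * (X.T @ delta1) / len(X)
    return W1, W2

# Zero initialization: every hidden unit starts out identical.
W1_zero, _ = train(np.zeros((2, 3)), np.zeros((3, 1)))
# Each column of W1 holds the incoming weights of one hidden unit.
zero_units_identical = np.allclose(W1_zero, W1_zero[:, :1])
zero_weights_moved = not np.allclose(W1_zero, 0.0)

# Random initialization breaks the symmetry.
rng = np.random.default_rng(0)
W1_rand, _ = train(rng.normal(size=(2, 3)), rng.normal(size=(3, 1)))
random_units_identical = np.allclose(W1_rand, W1_rand[:, :1])

print(zero_units_identical, zero_weights_moved, random_units_identical)
```

With zero initialization the weights do change during training, but all three hidden units stay copies of each other. With random initialization they end up different.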

For this algorithm to work properly, a key change was made to the MLP’s architecture: the step function was replaced with the logistic function, σ (z) = 1 / (1 + exp (–z)). This was essential because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step. In fact, the backpropagation algorithm works well with many other activation functions, not just the logistic function. Two other popular activation functions are:

The hyperbolic tangent function tanh (z) = 2σ (2z) – 1

Just like the logistic function, tanh is S-shaped (sigmoidal), continuous, and differentiable, so the two curves look quite similar. Let’s compare them. For both functions, when the input is very large or very small the output saturates (the curve is almost flat) and the gradient is small, which is not conducive to weight updates. The difference is the output interval.

The output of tanh ranges from –1 to 1 (instead of 0 to 1 for the logistic function), so the whole function is zero-centered. This is an advantage over the sigmoid and often helps speed up convergence.
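A quick numerical check of these statements, as a sketch: the grid of input values is an arbitrary choice, and the derivative formulas used are the standard ones for the two functions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)   # arbitrary grid of inputs, symmetric around 0

tanh_out = np.tanh(z)
sig_out = sigmoid(z)

# The identity from the text: tanh(z) = 2σ(2z) - 1.
identity_holds = np.allclose(tanh_out, 2.0 * sigmoid(2.0 * z) - 1.0)

# Derivatives: σ'(z) = σ(z)(1 - σ(z)) and tanh'(z) = 1 - tanh(z)^2.
sigmoid_grad = sig_out * (1.0 - sig_out)
tanh_grad = 1.0 - tanh_out ** 2

print("identity holds:", identity_holds)
print("tanh range on grid:", tanh_out.min(), tanh_out.max())
print("sigmoid range on grid:", sig_out.min(), sig_out.max())
print("largest gradients:", tanh_grad.max(), sigmoid_grad.max())
print("gradients at z = 5:", tanh_grad[-1], sigmoid_grad[-1])
```

On this grid the tanh outputs average out to about 0 while the sigmoid outputs average out to about 0.5, and both gradients become very small at the ends of the grid.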

The Rectified Linear Unit function: ReLU (z) = max (0, z)

It is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent bounce around), and its derivative is 0 for z < 0. However, in practice, it works very well and has the advantage of being fast to compute. Most importantly, it does not saturate for positive inputs, which helps mitigate the vanishing gradient problem.
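Here is a small sketch of ReLU and its gradient next to the logistic gradient. The input values are arbitrary, and setting the ReLU gradient to 0 at z = 0 is a common convention assumed here, since the true derivative does not exist at that point.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Convention assumed here: the gradient at z = 0 is taken to be 0.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0, 50.0])   # arbitrary sample inputs

relu_out = relu(z)
relu_gradient = relu_grad(z)
sigmoid_gradient = sigmoid(z) * (1.0 - sigmoid(z))

print(relu_out)
print(relu_gradient)
print(sigmoid_gradient)
```

For large positive inputs the ReLU gradient stays at 1, while the logistic gradient shrinks towards 0.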

We can basically solve two kinds of problems using MLPs:

Regression MLPs

First, MLPs can be used for regression tasks. If you want to predict a single value (e.g., the price of a house given many of its features), then you just need a single output neuron: its output is the predicted value.

For multivariate regression (i.e., to predict multiple values at once), you need one output neuron per output dimension. For example, to locate the center of an object on an image, you need to predict 2D coordinates, so you need two output neurons. If you also want to place a bounding box around the object, then you need two more numbers: the width and the height of the object. So you end up with 4 output neurons.

In general, when building an MLP for regression, you do not want to use an activation function for the output neurons, so they are free to output any range of values. However, if you want to guarantee that the output will always be positive, then you can use the ReLU activation function or the softplus activation function in the output layer. Finally, if you want to guarantee that the predictions will fall within a given range of values, then you can use the logistic function or the hyperbolic tangent, and scale the labels to the appropriate range: 0 to 1 for the logistic function, or –1 to 1 for the hyperbolic tangent.
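These output-layer choices can be sketched as follows. The raw output values, the label range (`low`, `high`), and the variable names are hypothetical; softplus is written with NumPy's `logaddexp` as log(1 + exp(z)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    # softplus(z) = log(1 + exp(z)), computed in a numerically stable way.
    return np.logaddexp(0.0, z)

z = np.array([-3.0, 0.0, 3.0])    # hypothetical raw values of an output neuron

unbounded = z                     # no activation: any real value is possible
positive = softplus(z)            # always strictly positive
bounded = sigmoid(z)              # always between 0 and 1

# To predict within a known range, scale the labels to [0, 1] for training...
low, high = 10.0, 50.0            # hypothetical label range
y = np.array([10.0, 30.0, 50.0])
y_scaled = (y - low) / (high - low)

# ...and map the network's outputs back to the original range afterwards.
predictions = low + (high - low) * bounded

print(positive, y_scaled, predictions)
```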

The loss function to use during training is typically the mean squared error, but if you have a lot of outliers in the training set, you may prefer to use the mean absolute error instead. Alternatively, you can use the Huber loss, which is a combination of both.
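To make the difference concrete, here is a small sketch of the three losses on made-up data with one outlier. The values and the Huber threshold `delta=1.0` are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))

# Made-up targets and predictions; the last target is an outlier.
y_true = np.array([1.0, 2.0, 3.0, 100.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

print("MSE:  ", mse(y_true, y_pred))     # 2304.125
print("MAE:  ", mae(y_true, y_pred))     # 24.25
print("Huber:", huber(y_true, y_pred))   # 23.9375
```

The single outlier dominates the mean squared error, while the mean absolute error and the Huber loss grow only linearly with it.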

Classification MLPs

MLPs can also be used for classification tasks. For a binary classification problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. Obviously, the estimated probability of the negative class is equal to one minus that number.

MLPs can also easily handle multilabel binary classification tasks. For example, you could have an email classification system that predicts whether each incoming email is ham or spam, and simultaneously predicts whether it is an urgent or non-urgent email. In this case, you would need two output neurons, both using the logistic activation function: the first would output the probability that the email is spam, and the second would output the probability that it is urgent. More generally, you would dedicate one output neuron for each positive class. Note that the output probabilities do not necessarily add up to one. This lets the model output any combination of labels: you can have non-urgent ham, urgent ham, non-urgent spam, and perhaps even urgent spam (although that would probably be an error).
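A minimal sketch of such an output layer for the email example: the raw output values (logits) are made up, and the 0.5 decision threshold is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw outputs for two emails; columns are [spam, urgent].
logits = np.array([[2.0, -1.0],
                   [-3.0, 0.5]])

probs = sigmoid(logits)      # each output neuron gets its own independent probability
labels = probs >= 0.5        # assumed decision threshold of 0.5

print(probs)
print(probs.sum(axis=1))     # the rows do not have to add up to one
print(labels)
```

The first email comes out as non-urgent spam and the second as urgent ham. Each column is decided independently of the other.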

If each instance can belong only to a single class, out of 3 or more possible classes (e.g., classes 0 through 9 for digit image classification), then you need to have one output neuron per class, and you should use the softmax activation function for the whole output layer. The softmax function will ensure that all the estimated probabilities are between 0 and 1 and that they add up to one (which is required if the classes are exclusive). This is called multiclass classification.
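For illustration, here is a small softmax sketch. The logits are made up, and subtracting the row maximum is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis."""
    shifted = z - z.max(axis=-1, keepdims=True)   # stability shift
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up raw outputs for two instances and three exclusive classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])

probs = softmax(logits)
predicted_class = probs.argmax(axis=-1)

print(probs)
print(probs.sum(axis=-1))    # each row adds up to one
print(predicted_class)
```

Unlike the independent logistic outputs used for multilabel tasks, raising the probability of one class here necessarily lowers the others.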

That’s all about the multilayer perceptron. Hope you enjoyed learning it. You can also check out this amazing post on Loss Functions and Optimization functions