# Multilayer Perceptron

As we already know, a **perceptron** can only produce linear decision boundaries. However, many interesting real-world problems, such as image classification, object detection, speech recognition, and text summarization, are not linearly separable. These problems need non-linear decision boundaries.

So, to fit more complex (*i.e.*, non-linear) models with neural networks, we use a more complex network that consists of more than a single perceptron. The take-home message from the perceptron is that all of the learning happens by adapting the synaptic weights until the prediction is satisfactory. Here we need multiple layers, each fully connected to the next, so that the input signal propagates through the network in a forward direction, layer by layer. Such neural networks are commonly referred to as **multilayer perceptrons (MLPs)**.

- Each neuron in the network includes a *non-linear activation function*. The important point to emphasize here is that the non-linearity is *smooth* (i.e., differentiable everywhere). A commonly used form of non-linearity that satisfies this requirement is the *sigmoidal non-linearity* defined by the *logistic function*: **σ (z) = 1 / (1 + exp (–z))**. The presence of non-linearities is important because otherwise the input-output relation of the network could be reduced to that of a single-layer perceptron. Moreover, the use of the logistic function is biologically motivated, since it attempts to account for the refractory phase of real neurons.
- The network contains one or more layers of *hidden neurons* that are not part of the input or output of the network. These hidden neurons enable the network to learn more complex tasks by extracting progressively more meaningful features from the input patterns.
- The network exhibits a high degree of *connectivity*, determined by the synapses of the network. A change in the connectivity of the network requires a change in the population of synaptic connections or their weights.
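To see why the non-linear activation in the first point matters, here is a minimal NumPy sketch. The array shapes, the random seed, and the names `W1`, `W2`, and `x` are assumptions chosen only for illustration. It checks that two stacked layers without an activation collapse into a single linear layer, while inserting the logistic function between them breaks that equivalence.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))    # 5 hypothetical inputs with 3 features each
W1 = rng.normal(size=(3, 4))   # input -> hidden weights (illustrative)
W2 = rng.normal(size=(4, 2))   # hidden -> output weights (illustrative)

# Without an activation, two layers equal one layer with weights W1 @ W2.
two_linear_layers = (x @ W1) @ W2
one_linear_layer = x @ (W1 @ W2)
collapses = np.allclose(two_linear_layers, one_linear_layer)

# With a sigmoid between the layers, the single-layer shortcut no longer matches.
with_sigmoid = sigmoid(x @ W1) @ W2
still_linear = np.allclose(with_sigmoid, one_linear_layer)

print("linear stack collapses:", collapses)
print("sigmoid stack collapses:", still_linear)
```

The first check should hold and the second should not, which is the reduction to a single-layer perceptron that the non-linearity prevents.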

## Characteristics of Multilayer Perceptron

### How does a multilayer perceptron work?

An MLP is composed of one **input layer**, one or more **hidden layers**, and one final layer which is called the **output layer**. The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer is fully connected to the next layer.
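The layer structure described above can be sketched in a few lines of NumPy. This is only an illustration under assumed settings: the layer sizes, the seed, the helper name `forward`, and the use of the logistic function in every layer (including the output) are choices made here, not something fixed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate x through a list of (W, b) pairs, lowest layer first.

    Returns the activations of every layer, starting with the input itself.
    """
    activations = [x]
    a = x
    for W, b in layers:
        a = sigmoid(a @ W + b)   # fully connected layer followed by the activation
        activations.append(a)
    return activations

rng = np.random.default_rng(42)
layer_sizes = [4, 5, 3, 2]   # hypothetical: 4 inputs, hidden layers of 5 and 3, 2 outputs
layers = [
    (rng.normal(scale=0.5, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

batch = rng.normal(size=(6, 4))   # 6 hypothetical instances
activations = forward(batch, layers)
shapes = [a.shape for a in activations]
print(shapes)
```

Each weight matrix has shape `(units in this layer's input, units in this layer)`, which is what "fully connected to the next layer" means in matrix form.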

## Now let’s run the algorithm for the Multilayer Perceptron

- Suppose we have a multi-class classification problem whose training set contains several classes, each with many instances. The algorithm handles one *mini-batch* at a time, and it goes through the full training set multiple times. Each pass is called an *epoch*.
- Each mini-batch is passed to the network’s input layer, which just sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the **forward pass**.
- Next, the algorithm measures the network’s output error (i.e., it uses a *loss function* that compares the desired output and the actual output of the network, and returns some measure of the error).
- Then it computes how much each output connection contributed to the error. This is done analytically by applying the *chain rule*, which makes this step fast and precise.
- The algorithm then measures how much of these *error* contributions came from each connection in the layer below, again using the chain rule, and so on until the algorithm reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm).
- Finally, the algorithm performs a *Gradient Descent* step to tweak all the connection weights in the network, using the error gradients it just computed.
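The steps above can be sketched as a small NumPy training loop. Everything specific here is an assumption made for illustration: the XOR-like toy data, one hidden layer of 8 logistic units, the mean squared error loss, and the values of `learning_rate`, `batch_size`, and `n_epochs`. It is a sketch of the general procedure, not a tuned implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, params):
    """Forward pass: returns hidden activations and network outputs."""
    W1, b1, W2, b2 = params
    hidden = sigmoid(X @ W1 + b1)
    return hidden, sigmoid(hidden @ W2 + b2)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)

# Hypothetical toy data: label 1 when both coordinates share a sign (XOR-like).
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

# Randomly initialised weights, zero biases.
params = [rng.normal(size=(2, 8)), np.zeros(8), rng.normal(size=(8, 1)), np.zeros(1)]
learning_rate, batch_size, n_epochs = 0.5, 20, 200
initial_loss = mse(predict(X, params)[1], y)

for epoch in range(n_epochs):                      # one epoch = one pass over the data
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):     # one mini-batch at a time
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        W1, b1, W2, b2 = params

        # Forward pass.
        hidden, out = predict(xb, params)

        # Reverse pass: chain rule from the loss back to each weight.
        d_out = 2.0 * (out - yb) / len(xb)         # d(MSE)/d(out)
        d_z2 = d_out * out * (1.0 - out)           # through the output sigmoid
        d_hidden = d_z2 @ W2.T                     # error passed to the hidden layer
        d_z1 = d_hidden * hidden * (1.0 - hidden)  # through the hidden sigmoid
        grads = [xb.T @ d_z1, d_z1.sum(axis=0), hidden.T @ d_z2, d_z2.sum(axis=0)]

        # Gradient Descent step.
        params = [p - learning_rate * g for p, g in zip(params, grads)]

final_loss = mse(predict(X, params)[1], y)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The loss printed after training should be lower than the loss before training; how much lower depends on the assumed hyperparameters.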

This algorithm is so important that it’s worth summarizing it again: for each training instance the **Backpropagation** algorithm first makes a prediction (*forward pass*), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (*reverse pass*), and finally slightly tweaks the connection weights to reduce the error (*Gradient Descent step*).

## Backpropagation

Backpropagation is a method for efficiently computing the gradient of the cost function of a neural network with respect to its parameters. These partial derivatives can then be used to update the network’s parameters using, e.g., gradient descent.

This may be the most common method for training neural networks. Deriving backpropagation involves numerous clever applications of the chain rule for functions of vectors.

### Notation

- : input value of node j
- : activation function
- : output
- : weights
- : a bias for the unit j in layer l
- : target value

**Note:** It is important to initialize the connection weights of all hidden layers randomly, or else training will fail. For example, if you initialize all weights and biases to zero, all neurons in a given layer start out perfectly identical, so backpropagation affects them in exactly the same way and they remain identical. The network then cannot learn anything useful, and we are not able to minimize the loss. If you instead initialize the weights randomly, you break the symmetry and allow backpropagation to train a diverse team of neurons.
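The symmetry problem can be checked numerically. The sketch below assumes a bias-free network with one hidden layer of 4 logistic units and a squared-error loss; the helper `hidden_weight_gradient` and all shapes are hypothetical names and values used only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_weight_gradient(W1, W2, x, t):
    """Gradient of 0.5 * sum((out - t)^2) w.r.t. W1 for a bias-free one-hidden-layer net."""
    h = sigmoid(x @ W1)
    out = sigmoid(h @ W2)
    d_z2 = (out - t) * out * (1.0 - out)
    d_z1 = (d_z2 @ W2.T) * h * (1.0 - h)
    return x.T @ d_z1   # one column per hidden unit

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 3))     # hypothetical inputs
t = rng.uniform(size=(10, 1))    # hypothetical targets

# All-zero weights: no error reaches the hidden layer at all.
grad_zero = hidden_weight_gradient(np.zeros((3, 4)), np.zeros((4, 1)), x, t)

# Constant weights: every hidden unit receives exactly the same gradient.
grad_const = hidden_weight_gradient(np.full((3, 4), 0.5), np.full((4, 1), 0.5), x, t)

# Random weights: hidden units receive different gradients.
grad_rand = hidden_weight_gradient(rng.normal(size=(3, 4)), rng.normal(size=(4, 1)), x, t)

columns_identical_const = np.allclose(grad_const, grad_const[:, :1])
columns_identical_rand = np.allclose(grad_rand, grad_rand[:, :1])
print(columns_identical_const, columns_identical_rand)
```

With identical gradients, identical hidden units stay identical after every update, which is why random initialization is needed to break the symmetry.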

In order for this algorithm to work properly, a key change was made to the **MLP’s** architecture: the step function was replaced with the logistic function, **σ (z) = 1 / (1 + exp (–z))**. This was essential because the step function contains only flat segments, so there is no gradient to work with (**Gradient Descent** cannot move on a flat surface), while the logistic function has a well-defined nonzero derivative everywhere, allowing *Gradient Descent* to make some progress at every step. In fact, the backpropagation algorithm works well with many other activation functions, not just the logistic function. Two other popular activation functions are:

**The hyperbolic tangent function tanh (z)** = **2σ (2z) – 1**

**Tanh** (the *hyperbolic tangent function*) is like the logistic sigmoid, but it often works better. It is also *sigmoidal (s-shaped)* and *differentiable*. The curves of the tanh function and the *sigmoid function* are relatively similar, so let’s compare them. In both cases, when the input is very large or very small, the output is almost flat and the gradient is small, which is not conducive to weight updates. The difference is the output interval: the output of tanh ranges from –1 to 1 (instead of 0 to 1 for the sigmoid), so the whole function is zero-centered, which is better than **sigmoid**. This often helps speed up convergence.
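The comparison between tanh and the sigmoid can be verified numerically. This short sketch, with an arbitrarily chosen grid `z`, checks the identity tanh(z) = 2σ(2z) – 1 and evaluates both derivatives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6.0, 6.0, 13)   # illustrative grid that includes z = 0

# tanh is a rescaled, shifted logistic function.
identity_holds = np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

# Derivatives: both shrink toward 0 for large |z| (saturation).
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))   # maximum 0.25 at z = 0
tanh_grad = 1.0 - np.tanh(z) ** 2                # maximum 1.0 at z = 0

print("identity holds:", identity_holds)
print("sigmoid output range:", sigmoid(z).min(), sigmoid(z).max())
print("tanh output range:", np.tanh(z).min(), np.tanh(z).max())
```

The sigmoid outputs stay inside (0, 1), while the tanh outputs are spread symmetrically around 0. At the ends of the grid both gradients are close to zero.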

**The Rectified Linear Unit function: ReLU (z)** = **max (0, z)**

It is continuous but unfortunately not differentiable at **z = 0** (the slope changes abruptly, which can make Gradient Descent bounce around), and its derivative is 0 for **z < 0**. However, in practice, it works very well and has the advantage of being fast to compute. Most importantly, because it does not saturate for positive inputs, it helps mitigate the vanishing gradient problem.
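A minimal sketch of ReLU and its derivative follows. The choice of derivative 0 at exactly z = 0 is a convention assumed here (the function is not differentiable at that point), and the sample inputs are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 0 for z < 0 and 1 for z > 0; at z == 0 we return 0 by convention (an assumption).
    return (z > 0).astype(float)

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

print("relu:        ", relu(z))
print("relu grad:   ", relu_grad(z))
print("sigmoid grad:", sigmoid_grad(z))
```

For the large positive input the ReLU gradient is still 1, whereas the sigmoid gradient has almost vanished. For negative inputs the ReLU gradient is 0, as noted above.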

## We can basically solve two kinds of problems using MLP

### Regression MLPs

First, MLPs can be used for regression tasks. If you want to predict a single value (e.g., the price of a house given many of its features), then you just need a single output neuron: its output is the predicted value.

For multivariate regression (i.e., to predict multiple values at once), you need one output neuron per output dimension. For example, to locate the center of an object on an image, you need to predict 2D coordinates, so you need two output neurons. If you also want to place a bounding box around the object, then you need two more numbers: the width and the height of the object. So you end up with 4 output neurons.

In general, when building an **MLP** for regression, you do not want to use any activation function for the output neurons, so they are free to output any range of values. However, if you want to guarantee that the output will never be negative, then you can use the *ReLU* activation function or the *softplus* activation function in the output layer. Finally, if you want to guarantee that the predictions will fall within a given range of values, then you can use the logistic function or the *hyperbolic tangent*, and scale the labels to the appropriate range: 0 to 1 for the *logistic function*, or –1 to 1 for the hyperbolic tangent.

The **loss function** to use during training is typically the mean squared error, but if you have a lot of outliers in the training set, you may prefer to use the mean absolute error instead. Alternatively, you can use the Huber loss, which is a combination of both.
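To make the difference between these losses concrete, here is a small NumPy sketch. The function names, the Huber threshold `delta=1.0`, and the sample values (with one deliberate outlier) are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    error = np.abs(y_true - y_pred)
    quadratic = 0.5 * error ** 2
    linear = delta * (error - 0.5 * delta)
    return float(np.mean(np.where(error <= delta, quadratic, linear)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 14.0])   # the last prediction is an outlier (error of 10)

print("MSE:  ", mse(y_true, y_pred))
print("MAE:  ", mae(y_true, y_pred))
print("Huber:", huber(y_true, y_pred))
```

The single outlier dominates the mean squared error because its error is squared, while the mean absolute error and the Huber loss grow only linearly with it.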

### Classification MLPs

MLPs can also be used for **classification** tasks. For a binary classification problem, you just need a single output neuron using the *logistic activation function*: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. Obviously, the estimated *probability* of the negative class is equal to one minus that number.

MLPs can also easily handle **multilabel binary classification** tasks. For example, you could have an email classification system that predicts whether each incoming email is ham or spam, and simultaneously predicts whether it is an urgent or non-urgent email. In this case, you would need two output *neurons*, both using the *logistic activation function*: the first would output the probability that the email is spam, and the second would output the probability that it is urgent. More generally, you would dedicate one output neuron for each positive class. Note that the output probabilities do not necessarily add up to one. This lets the model output any combination of labels: you can have non-urgent ham, urgent ham, non-urgent spam, and perhaps even urgent spam (although that would probably be an error).

If each instance can belong only to a single class, out of 3 or more possible classes (e.g., classes 0 through 9 for **digit image classification**), then you need to have one output neuron per class, and you should use the softmax activation function for the whole output layer. The *softmax function* will ensure that all the estimated probabilities are between 0 and 1 and that they add up to one (which is required if the classes are exclusive). This is called *multiclass classification*.

That’s all about the multilayer perceptron. Hope you enjoyed learning it. You can also check out this amazing post on Loss Functions and Optimization functions.
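As a closing illustration of the three classification output layers described above (binary, multilabel, and multiclass), here is a minimal NumPy sketch. The `logits` values are hypothetical raw output-layer values, and subtracting the row maximum inside `softmax` is a common numerical-stability choice assumed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw output-layer values for two instances.
logits = np.array([[2.0, -1.0, 0.5],
                   [0.0, 0.0, 0.0]])

p_binary = sigmoid(logits[:, :1])       # one logistic neuron: P(positive class)
p_multilabel = sigmoid(logits[:, :2])   # two independent logistic neurons (e.g. spam, urgent)
p_multiclass = softmax(logits)          # three exclusive classes

print("binary:    ", p_binary.ravel())
print("multilabel:", p_multilabel, "row sums:", p_multilabel.sum(axis=1))
print("multiclass:", p_multiclass, "row sums:", p_multiclass.sum(axis=1))
```

The softmax rows always sum to one, while the multilabel rows are free to sum to any value between 0 and 2, since each logistic neuron is independent.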