
Neural Networks

Comprehensive study notes on Neural Networks for GATE DA preparation. This chapter covers key concepts, formulas, and examples needed for your exam.


Overview

In our preceding studies of machine learning, we have primarily concerned ourselves with models that assume a specific underlying structure in the data, such as linearity. We now advance to a class of models inspired by biological neural systems, which are capable of learning highly complex and non-linear relationships directly from data. Neural networks form the foundational basis of modern deep learning and represent a significant paradigm shift in how we approach problems of prediction and classification. Their power lies in their hierarchical structure, where simple computational units are organized into layers to learn progressively more abstract features.

This chapter is designed to provide a rigorous and principled introduction to the core concepts of neural networks, with a specific focus on the architectures most relevant to the GATE examination. A thorough command of these fundamentals is indispensable, as questions frequently test not only the conceptual understanding of network architecture but also the computational mechanics of information flow. We will systematically dissect the components of a neuron, the arrangement of these neurons into layers, and the mechanism by which these networks process input to produce an output. Our objective is to build a firm theoretical and practical foundation for tackling problems related to these powerful models.

We shall begin by examining the simplest of these architectures, the Feed-Forward Neural Network, to establish the core principles of network computation. Subsequently, we will extend this framework to the Multi-Layer Perceptron (MLP), introducing the concepts of hidden layers and non-linear activation functions. It is this extension that endows neural networks with the ability to approximate any continuous function, making them a universal tool for machine learning tasks.

---

Chapter Contents

| # | Topic | What You'll Learn |
|---|-------|-------------------|
| 1 | Feed-Forward Neural Network | The fundamental architecture and signal propagation. |
| 2 | Multi-Layer Perceptron (MLP) | Introducing hidden layers and non-linear activation. |

---

Learning Objectives

By the End of This Chapter

After completing this chapter, you will be able to:

  • Explain the components of an artificial neuron, including weights, bias, and the activation function.

  • Describe the architecture of a Multi-Layer Perceptron (MLP), differentiating between input, hidden, and output layers.

  • Perform the forward propagation calculation to determine the output of a given neural network for a specific input.

  • Define the role of the backpropagation algorithm and the gradient descent optimization process in network training.

---

We now turn our attention to the Feed-Forward Neural Network.
## Part 1: Feed-Forward Neural Network

Introduction

The Feed-Forward Neural Network (FFNN), of which the Multi-Layer Perceptron (MLP) is the canonical example, represents a foundational architecture in the study of neural networks. These networks are characterized by a unidirectional flow of information: data moves from the input layer, through one or more hidden layers, to the output layer without forming any cycles. This acyclic structure distinguishes them from recurrent neural networks. FFNNs are universal function approximators, meaning that a sufficiently large network can approximate any continuous function to an arbitrary degree of accuracy.

In the context of the GATE examination, a firm understanding of FFNNs is paramount. This includes the mechanics of how an input signal is processed to produce an output, a process known as forward propagation, and the method by which the network's parameters (weights and biases) are optimized, which relies on calculating gradients via backpropagation. Questions frequently test the computational aspects of these processes, demanding both conceptual clarity and procedural accuracy. We shall explore the mathematical underpinnings of these networks, focusing on the principles necessary for solving competitive examination problems.

📖 Feed-Forward Neural Network (FFNN)

A Feed-Forward Neural Network is an artificial neural network where connections between the nodes do not form a cycle. It consists of an input layer, one or more hidden layers, and an output layer. Each neuron in a layer receives inputs from the neurons in the preceding layer, computes a weighted sum, adds a bias, and then passes the result through a non-linear activation function to produce its output.

---

Key Concepts

## 1. The Artificial Neuron

The fundamental processing unit of a neural network is the artificial neuron, or node. It is a mathematical function conceived as a model of a biological neuron.

A neuron receives one or more inputs, computes their weighted sum, adds a bias term, and passes this result through an activation function. Let us consider a neuron that receives $n$ inputs, denoted by the vector $\mathbf{x} = [x_1, x_2, \dots, x_n]^T$. Each input $x_i$ is associated with a weight $w_i$. The neuron also has a bias term, $b$.

First, we compute the net input, $z$, which is the affine transformation of the inputs:

$$z = (w_1 x_1 + w_2 x_2 + \dots + w_n x_n) + b = \mathbf{w}^T \mathbf{x} + b$$

Next, the net input $z$ is passed through a non-linear activation function, $\phi(z)$, to produce the neuron's output, $a$:

$$a = \phi(z) = \phi(\mathbf{w}^T \mathbf{x} + b)$$

The bias term $b$ allows the activation function to be shifted to the left or right, which can be critical for successful learning. The weights $\mathbf{w}$ and bias $b$ are the learnable parameters of the neuron.










*Figure: an artificial neuron. Inputs $x_1, \dots, x_n$ are scaled by weights $w_1, \dots, w_n$ and summed ($\Sigma$) together with the bias $b$ to give the net input $z$, which passes through the activation $\phi$ to produce the output $a = \phi(z)$.*

## 2. Activation Functions

The activation function introduces non-linearity into the network, enabling it to learn complex patterns that a purely linear model could not. Without non-linear activation functions, a deep neural network would be mathematically equivalent to a single-layer linear model.

### Rectified Linear Unit (ReLU)

The most commonly used activation function in modern neural networks is the Rectified Linear Unit, or ReLU.

📐 ReLU Activation Function
$$\phi(z) = \text{ReLU}(z) = \max(0, z)$$

Variables:

    • $z$ = the net input to the neuron ($\mathbf{w}^T \mathbf{x} + b$)


When to use: ReLU is the default activation function for hidden layers due to its computational efficiency and ability to mitigate the vanishing gradient problem.

A critical property for backpropagation is the derivative of the activation function. The derivative of ReLU is straightforward:

$$\phi'(z) = \frac{d}{dz} \text{ReLU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}$$

The derivative is undefined at $z = 0$, but in practice it is typically set to $0$ or $1$. For GATE problems, this discontinuity is rarely the focus; the key is that the gradient is $1$ for positive inputs and $0$ for negative inputs.
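A minimal sketch of ReLU and its derivative as used in backpropagation, adopting the common convention of a zero gradient at $z = 0$:

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z), element-wise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 where z > 0, else 0 (convention: 0 at z = 0)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```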

## 3. Forward Propagation

Forward propagation is the process of computing the output of the neural network, given a set of inputs and parameters (weights and biases). The calculation proceeds layer by layer, from the input layer to the output layer.

Let us denote the activation of neuron $j$ in layer $l$ as $a_j^{(l)}$, and its net input as $z_j^{(l)}$. The weight connecting neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$ is $w_{jk}^{(l)}$, and the bias of neuron $j$ in layer $l$ is $b_j^{(l)}$.

The computation for a single neuron is:

$$z_j^{(l)} = \sum_k w_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}$$

$$a_j^{(l)} = \phi\left(z_j^{(l)}\right)$$

This process is repeated for all neurons in a layer, and then for all subsequent layers until the final output is produced. For the input layer, the activations $a_k^{(0)}$ are simply the input features $x_k$.

Worked Example:

Problem: Consider a simple network with 2 input neurons, one hidden layer with 2 neurons, and one output neuron. All neurons use the ReLU activation function. The biases are all 0.

  • Inputs: $x_1 = 1$, $x_2 = -2$.

  • Weights from input to hidden layer: $w_{11}^{(1)} = 2$, $w_{12}^{(1)} = -1$, $w_{21}^{(1)} = 3$, $w_{22}^{(1)} = 1$.

  • Weights from hidden to output layer: $w_{11}^{(2)} = 4$, $w_{12}^{(2)} = -3$.


Calculate the final output of the network.

Solution:

Let $h_1, h_2$ be the outputs of the two hidden neurons and $y$ be the final output.

Step 1: Calculate the net inputs to the hidden layer neurons, $z_1^{(1)}$ and $z_2^{(1)}$.

$$z_1^{(1)} = w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 = (2)(1) + (-1)(-2) = 2 + 2 = 4$$
$$z_2^{(1)} = w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 = (3)(1) + (1)(-2) = 3 - 2 = 1$$

Step 2: Apply the ReLU activation function to find the outputs of the hidden layer, $h_1$ and $h_2$.

$$h_1 = \text{ReLU}(z_1^{(1)}) = \max(0, 4) = 4$$
$$h_2 = \text{ReLU}(z_2^{(1)}) = \max(0, 1) = 1$$

Step 3: Calculate the net input to the output neuron, $z_1^{(2)}$.

$$z_1^{(2)} = w_{11}^{(2)} h_1 + w_{12}^{(2)} h_2 = (4)(4) + (-3)(1) = 16 - 3 = 13$$

Step 4: Apply the ReLU activation function to find the final output, $y$.

$$y = \text{ReLU}(z_1^{(2)}) = \max(0, 13) = 13$$

Answer: The final output of the network is $13$.
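The worked example above can be checked with a short NumPy forward pass. The matrices below simply restate the example's numbers; row $j$ of each weight matrix holds the weights into neuron $j$.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# Parameters from the worked example (all biases are 0)
W1 = np.array([[2.0, -1.0],   # weights into hidden neuron 1
               [3.0,  1.0]])  # weights into hidden neuron 2
W2 = np.array([[4.0, -3.0]])  # weights into the output neuron
x  = np.array([1.0, -2.0])

h = relu(W1 @ x)   # hidden activations: [4. 1.]
y = relu(W2 @ h)   # final output: [13.]
print(h, y)
```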

---

## 4. Backpropagation and Gradient Calculation

Backpropagation is the algorithm used to train neural networks. It efficiently computes the gradient of the loss function with respect to the network's weights. At its core, backpropagation is a practical application of the chain rule from calculus. For GATE, questions often focus on finding the partial derivative of the output with respect to a specific weight.

Let us consider finding the derivative of the final output $y$ with respect to a weight $w_{ij}$ connecting neuron $i$ to neuron $j$. The key is to trace the influence of $w_{ij}$ on $y$: the weight first affects the net input $z_j$ of neuron $j$, which in turn affects its activation $a_j$, which then propagates through the network to affect the final output $y$.

Using the chain rule, we can express this relationship:

$$\frac{\partial y}{\partial w_{ij}} = \frac{\partial y}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}}$$

Let's break down each term:

  • $\frac{\partial z_j}{\partial w_{ij}}$: With $w_{ij}$ denoting the weight from neuron $i$ into neuron $j$, the net input is $z_j = \sum_k w_{kj} a_k + b_j$. The derivative with respect to the specific weight $w_{ij}$ is simply the corresponding input activation: $\frac{\partial z_j}{\partial w_{ij}} = a_i$.

  • $\frac{\partial a_j}{\partial z_j}$: This is the derivative of the activation function of neuron $j$, $\phi'(z_j)$. For ReLU, this is either $1$ or $0$.

  • $\frac{\partial y}{\partial a_j}$: This term represents how the activation of neuron $j$ affects the final output $y$. It may itself require a chain rule calculation, depending on the network's structure downstream of neuron $j$.


Worked Example:

Problem: Consider a network with two inputs $u, v$, a hidden layer with a top neuron $h_1$ and a bottom neuron $h_2$, and one output $y$. The inputs are $u = 2$, $v = 3$, and the weights are $a = 1, b = 1$ (into $h_1$), $c = 1, d = -1$ (into $h_2$), and $e = 4, f = -1$ (from $h_1, h_2$ to the output). The activation function is ReLU throughout. Calculate $\frac{\partial y}{\partial a}$.

Solution:

First, let's write the equations for the network, with $R$ denoting the ReLU function.

  • Net input to $h_1$: $z_{h1} = a \cdot u + b \cdot v$

  • Output of $h_1$: $h_1 = R(z_{h1})$

  • Net input to $h_2$: $z_{h2} = c \cdot u + d \cdot v$

  • Output of $h_2$: $h_2 = R(z_{h2})$

  • Net input to the output neuron: $z_y = e \cdot h_1 + f \cdot h_2$

  • Final output: $y = R(z_y)$


We need to compute $\frac{\partial y}{\partial a}$. Using the chain rule, we trace the path from $y$ back to $a$: $y \leftarrow z_y \leftarrow h_1 \leftarrow z_{h1} \leftarrow a$.

$$\frac{\partial y}{\partial a} = \frac{\partial y}{\partial z_y} \cdot \frac{\partial z_y}{\partial h_1} \cdot \frac{\partial h_1}{\partial z_{h1}} \cdot \frac{\partial z_{h1}}{\partial a}$$

Step 1: Perform forward propagation to find the values of all intermediate variables. This is crucial to determine the derivatives of the ReLU functions.

$$z_{h1} = (1)(2) + (1)(3) = 5$$
$$h_1 = R(5) = 5$$
$$z_{h2} = (1)(2) + (-1)(3) = -1$$
$$h_2 = R(-1) = 0$$
$$z_y = (4)(5) + (-1)(0) = 20$$
$$y = R(20) = 20$$

Step 2: Calculate each term in the chain rule expression.

  • $\frac{\partial z_{h1}}{\partial a}$: Since $z_{h1} = a \cdot u + b \cdot v$, the derivative with respect to $a$ is $u$: $\frac{\partial z_{h1}}{\partial a} = u = 2$.

  • $\frac{\partial h_1}{\partial z_{h1}}$: This is the derivative of ReLU at $z_{h1}$. Since $z_{h1} = 5 > 0$, the derivative is $1$: $\frac{\partial h_1}{\partial z_{h1}} = R'(5) = 1$.

  • $\frac{\partial z_y}{\partial h_1}$: Since $z_y = e \cdot h_1 + f \cdot h_2$, the derivative with respect to $h_1$ is $e$: $\frac{\partial z_y}{\partial h_1} = e = 4$.

  • $\frac{\partial y}{\partial z_y}$: This is the derivative of ReLU at $z_y$. Since $z_y = 20 > 0$, the derivative is $1$: $\frac{\partial y}{\partial z_y} = R'(20) = 1$.

Step 3: Multiply the terms together.

$$\frac{\partial y}{\partial a} = (1) \cdot (4) \cdot (1) \cdot (2) = 8$$

Answer: $\frac{\partial y}{\partial a} = 8$.
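The chain-rule result above can be sanity-checked numerically with a central finite difference. This is a quick verification sketch, not part of the original problem.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def forward(a, b=1.0, c=1.0, d=-1.0, e=4.0, f=-1.0, u=2.0, v=3.0):
    """Forward pass of the two-hidden-neuron ReLU network from the example."""
    h1 = relu(a * u + b * v)
    h2 = relu(c * u + d * v)
    return relu(e * h1 + f * h2)

# Central finite difference approximates dy/da at a = 1
eps = 1e-6
grad = (forward(1.0 + eps) - forward(1.0 - eps)) / (2 * eps)
print(grad)  # ≈ 8.0, matching the chain-rule calculation
```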

## 5. Network Equivalence and Simplification

Under certain conditions, a complex neural network can be mathematically equivalent to a simpler one. This is an important concept for understanding the expressive power of networks.

A key scenario arises with linear activation functions. If all neurons in a multi-layer network have a linear activation function, $\phi(z) = z$, the entire network collapses into a single linear transformation: the composition of linear functions is itself a linear function.

A more subtle case, as seen in GATE questions, involves the ReLU function when inputs are constrained. If the net input $z$ to a ReLU neuron is guaranteed to be positive, then $\text{ReLU}(z) = \max(0, z) = z$. In this specific domain, the ReLU function behaves identically to a linear (identity) function. This allows for the simplification of network layers.

Consider two consecutive layers (without bias, for simplicity) with weight matrices $W_1$ and $W_2$. If the activation function $\phi$ is linear, the output is $y = \phi(W_2 \, \phi(W_1 x)) = W_2 (W_1 x) = (W_2 W_1) x$. The two layers are equivalent to a single layer with weight matrix $W_{\text{equiv}} = W_2 W_1$.
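The collapse of two linear layers into one can be verified directly. This is a small sketch using arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))  # first layer: 2 inputs -> 3 units
W2 = rng.standard_normal((1, 3))  # second layer: 3 units -> 1 output
x  = rng.standard_normal(2)

# With identity (linear) activations, two layers collapse into one
y_two_layers = W2 @ (W1 @ x)
W_equiv      = W2 @ W1            # single equivalent weight matrix
y_one_layer  = W_equiv @ x

print(np.allclose(y_two_layers, y_one_layer))  # True
```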

---

Problem-Solving Strategies

💡 GATE Strategy

  • Forward Pass First: When asked to compute a gradient (backpropagation), always perform a full forward pass first. You need the activation values and net inputs at each neuron to determine the derivatives of the activation functions (e.g., whether $R'(z)$ is $0$ or $1$).

  • Trace the Path: For gradient calculations, identify the weight in question and trace the computational path from the final output back to that weight. Apply the chain rule by multiplying the local derivatives along this path.

  • Check Input Constraints: In network equivalence problems, carefully check for any constraints on the input values (e.g., "when $x_1, x_2, x_3$ are positive"). Such constraints can cause non-linear activation functions like ReLU to behave linearly, which is often the key to solving the problem.

---

Common Mistakes

⚠️ Avoid These Errors
    • Ignoring Activation Derivatives: Forgetting that the gradient calculation must include the derivative of the activation function. For ReLU, if the net input was negative during the forward pass ($z < 0$), the neuron's output was $0$, and the gradient flowing backward through it is multiplied by $R'(z) = 0$, effectively blocking that gradient path.
Correct Approach: Always compute the net input $z$ in the forward pass to determine the value of $\phi'(z)$ for the backward pass.
    • Incorrect Chain Rule Application: Summing gradients from different paths incorrectly, or multiplying local derivatives in the wrong places.
Correct Approach: The total gradient with respect to a node is the sum of the gradients arriving along all paths from the output; the gradient along a single path is the product of the local derivatives on that path.
    • Assuming Linearity: Treating ReLU as a linear function in all cases. It is a piecewise linear function and is non-linear overall.
Correct Approach: Only treat ReLU as linear ($\text{ReLU}(z) = z$) if you can prove its argument $z$ will always be positive given the problem's constraints.

---

Practice Questions

:::question type="NAT" question="A neural network has a single hidden layer with one neuron and one output neuron. The input is $x=3$. The weight from input to hidden neuron is $w_1 = 2$. The bias of the hidden neuron is $b_1 = -7$. The weight from the hidden neuron to the output neuron is $w_2 = 5$. The bias of the output neuron is $b_2 = -1$. Both neurons use the ReLU activation function. What is the final output of the network?" answer="0" hint="Perform a forward pass step-by-step. Calculate the output of the hidden neuron first, then use it as input for the output neuron." solution="
Step 1: Calculate the net input to the hidden neuron, $z_1$.

$$z_1 = w_1 \cdot x + b_1 = (2)(3) + (-7) = 6 - 7 = -1$$

Step 2: Calculate the activation of the hidden neuron, $h_1$.

$$h_1 = \text{ReLU}(z_1) = \max(0, -1) = 0$$

Step 3: Calculate the net input to the output neuron, $z_2$.

$$z_2 = w_2 \cdot h_1 + b_2 = (5)(0) + (-1) = -1$$

Step 4: Calculate the final output, $y$.

$$y = \text{ReLU}(z_2) = \max(0, -1) = 0$$

Result: The final output is $0$.
"
:::

:::question type="MCQ" question="Consider a neuron with two inputs $x_1=2, x_2=1$ and weights $w_1=3, w_2=-4$. The bias is $b=1$. The activation function is ReLU. If the output of this neuron is $y$, what is the value of the partial derivative $\frac{\partial y}{\partial w_1}$?" options=["0", "1", "2", "3"] answer="2" hint="First, compute the net input $z$ and the output $y$. Then, apply the chain rule: $\frac{\partial y}{\partial w_1} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w_1}$. Remember that $\frac{\partial y}{\partial z}$ depends on whether $z$ is positive or negative." solution="
Step 1: Calculate the net input $z$.

$$z = w_1 x_1 + w_2 x_2 + b = (3)(2) + (-4)(1) + 1 = 6 - 4 + 1 = 3$$

Step 2: Calculate the output $y$.

$$y = \text{ReLU}(z) = \text{ReLU}(3) = 3$$

Step 3: Set up the chain rule expression for $\frac{\partial y}{\partial w_1}$.

$$\frac{\partial y}{\partial w_1} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w_1}$$

Step 4: Calculate the components of the chain rule.

The first component is the derivative of the activation function. Since $z = 3 > 0$, the derivative of ReLU is $1$:

$$\frac{\partial y}{\partial z} = \text{ReLU}'(3) = 1$$

The second component is the derivative of the net input with respect to $w_1$:

$$\frac{\partial z}{\partial w_1} = \frac{\partial}{\partial w_1} \left( w_1 x_1 + w_2 x_2 + b \right) = x_1 = 2$$

Step 5: Compute the final partial derivative.

$$\frac{\partial y}{\partial w_1} = 1 \cdot 2 = 2$$

Result: The value of the partial derivative is 2.
"
:::

:::question type="MSQ" question="Which of the following statements about a standard Feed-Forward Neural Network with ReLU activation in its hidden layers are correct?" options=["The network can model non-linear decision boundaries.", "The derivative of the activation function is constant for all non-zero inputs.", "If all weights and biases are positive, and all inputs are positive, the network behaves as a purely linear model.", "The output of any hidden neuron is always non-negative."] answer="A,C,D" hint="Analyze each property of ReLU and its implications for the network. Consider the definition, derivative, and behavior under specific input conditions." solution="

  • A. The network can model non-linear decision boundaries. This is correct. The ReLU function is non-linear (specifically, piecewise linear), and stacking layers with non-linear activations allows the network to approximate complex, non-linear functions.


  • B. The derivative of the activation function is constant for all non-zero inputs. This is incorrect. The derivative is $1$ for positive inputs ($z > 0$) and $0$ for negative inputs ($z < 0$). It is not constant for all non-zero inputs.


  • C. If all weights and biases are positive, and all inputs are positive, the network behaves as a purely linear model. This is correct. If inputs, weights, and biases are all positive, the net input $z = \mathbf{w}^T\mathbf{x} + b$ at every neuron will also be positive. For any positive $z$, $\text{ReLU}(z) = z$. Thus, every activation function becomes an identity function, and the entire network collapses into a linear transformation.


  • D. The output of any hidden neuron is always non-negative. This is correct. By definition, $\text{ReLU}(z) = \max(0, z)$, so the output is always greater than or equal to zero.

"
:::

:::question type="MCQ" question="A neural network layer is defined by the transformation $h = \text{ReLU}(Wx+b)$, where $W = \begin{pmatrix} 2 & 1 \\ -1 & 3 \end{pmatrix}$, $b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$, and the input is $x = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$. What is the output vector $h$?" options=["$\begin{pmatrix} 2 \\ -3 \end{pmatrix}$", "$\begin{pmatrix} 2 \\ 0 \end{pmatrix}$", "$\begin{pmatrix} 1 \\ -4 \end{pmatrix}$", "$\begin{pmatrix} 0 \\ 5 \end{pmatrix}$"] answer="$\begin{pmatrix} 2 \\ 0 \end{pmatrix}$" hint="First, compute the matrix-vector product $Wx$, then add the bias vector $b$ to get the net input vector $z$. Finally, apply the ReLU function element-wise to $z$." solution="
Step 1: Compute the matrix-vector product $Wx$.

$$Wx = \begin{pmatrix} 2 & 1 \\ -1 & 3 \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} (2)(1) + (1)(-1) \\ (-1)(1) + (3)(-1) \end{pmatrix} = \begin{pmatrix} 1 \\ -4 \end{pmatrix}$$

Step 2: Add the bias vector $b$ to get the net input vector $z$.

$$z = Wx + b = \begin{pmatrix} 1 \\ -4 \end{pmatrix} + \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ -3 \end{pmatrix}$$

Step 3: Apply the ReLU function element-wise to the vector $z$.

$$h = \text{ReLU}(z) = \begin{pmatrix} \max(0, 2) \\ \max(0, -3) \end{pmatrix} = \begin{pmatrix} 2 \\ 0 \end{pmatrix}$$

Result: The output vector is $h = \begin{pmatrix} 2 \\ 0 \end{pmatrix}$.
"
:::

---

Summary

Key Takeaways for GATE

  • Forward Propagation is Sequential Calculation: Master the layer-by-layer computation of net inputs ($z = Wx + b$) and activations ($a = \phi(z)$). This is the foundation for all FFNN problems.

  • Backpropagation is Applied Chain Rule: To find the gradient of the output with respect to a weight, you must trace the path of influence backwards and multiply the local derivatives. The derivative of the activation function is a critical component.

  • ReLU's Derivative is Key: The derivative of $\text{ReLU}(z)$ is $1$ if $z > 0$ and $0$ if $z < 0$. A forward pass is mandatory before backpropagation to determine the sign of the net inputs and thus the value of these derivatives.

  • Recognize Network Simplification: Be alert for conditions (like all positive inputs to a ReLU network) that make non-linear activations behave linearly, allowing complex networks to be simplified into equivalent single-layer models.

---

What's Next?

💡 Continue Learning

This topic connects to:

    • Gradient Descent and Optimization Algorithms: The gradients computed via backpropagation are the essential inputs for optimization algorithms like Stochastic Gradient Descent (SGD), Adam, and RMSprop, which are used to update the network's weights during training. Understanding FFNNs is the first step; understanding how they learn is the next.

    • Convolutional Neural Networks (CNNs): CNNs are a specialized type of feed-forward network, primarily used for image and grid-like data. They build upon the concepts of layers, weights, and activation functions but introduce specialized layers like convolutional and pooling layers.

    • Recurrent Neural Networks (RNNs): While FFNNs process data in one direction, RNNs introduce cycles, allowing them to maintain a state or memory. This makes them suitable for sequential data like time series or natural language. A solid grasp of FFNNs is necessary before tackling the more complex data flow of RNNs.

---

💡 Moving Forward

Now that you understand the Feed-Forward Neural Network, let's explore the Multi-Layer Perceptron (MLP), which builds on these concepts.

---

## Part 2: Multi-Layer Perceptron (MLP)

Introduction

The Multi-Layer Perceptron (MLP) represents a foundational architecture in the field of artificial neural networks. While simpler models like the single-layer perceptron are limited to solving linearly separable problems, the MLP overcomes this fundamental limitation by incorporating one or more intermediate, or "hidden," layers between its input and output. This architectural enhancement grants the MLP the capacity to learn complex, non-linear relationships within data.

The true power of the MLP lies in its ability to serve as a universal function approximator. With a sufficient number of hidden neurons and appropriate non-linear activation functions, an MLP can approximate any continuous function to an arbitrary degree of accuracy. This makes it an exceptionally versatile tool for a wide range of supervised learning tasks, including classification and regression. In our study for the GATE examination, a thorough understanding of the MLP's structure, the forward propagation of signals, and the backpropagation algorithm for training is of paramount importance.

📖 Multi-Layer Perceptron (MLP)

A Multi-Layer Perceptron is a class of feedforward artificial neural network (ANN) that consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node, or neuron, in one layer is connected with a certain weight to every neuron in the following layer. Except for the input nodes, each neuron is a processing unit with a non-linear activation function.

---

Key Concepts

## 1. From the Single Perceptron to the MLP

To appreciate the necessity of the MLP, we must first consider the limitations of its predecessor, the single-layer perceptron. A single perceptron computes a linear combination of its inputs and applies an activation function. For an input vector $x \in \mathbb{R}^d$, the output $y$ is given by:

$$y = \phi(w^T x + b)$$

Here, $w$ is the weight vector, $b$ is the bias, and $\phi$ is the activation function. If $\phi$ is a step function (like the sign function), the perceptron acts as a linear classifier, defining a hyperplane as its decision boundary.

The critical limitation is that such a model can only classify data that is linearly separable. A classic example of a problem that a single perceptron cannot solve is the XOR problem.



*Figure: the AND problem is linearly separable — a single line separates the two classes — whereas the XOR problem is not: $(0,0)$ and $(1,1)$ belong to one class, $(0,1)$ and $(1,0)$ to the other, and no single hyperplane can separate them.*
The MLP overcomes this by stacking layers of neurons. The outputs of one layer become the inputs to the next. This layered composition of non-linear functions allows the MLP to construct complex, non-linear decision boundaries.
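As an illustration of this extra expressive power, here is a hand-crafted 2-2-1 ReLU network that computes XOR — something no single-layer perceptron can do. The weights below are one of many possible choices, not taken from the text.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# Hand-picked parameters: h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1),
# and the output combines them as y = h1 - 2*h2.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = relu(W1 @ np.array(x, dtype=float) + b1)
    y = W2 @ h
    print(x, "->", int(y))  # prints 0, 1, 1, 0 respectively
```

The hidden layer "folds" the input space so that the two XOR classes become linearly separable for the output neuron.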



*Figure: a Multi-Layer Perceptron with an input layer $x_1, \dots, x_d$, a hidden layer $h_1, \dots, h_H$, and an output neuron $y$; every neuron in one layer is connected to every neuron in the next.*
## 2. Activation Functions

The choice of activation function is critical. If we were to use a linear activation function in the hidden layers, the entire MLP would collapse into an equivalent single-layer linear model, thereby losing its ability to model non-linearity. Therefore, we require non-linear activation functions.

Sigmoid (Logistic):
The sigmoid function maps any real-valued number into the range $(0, 1)$.

📐 Sigmoid Function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Variables:

    • $z$ = the weighted sum of inputs plus bias ($w^T x + b$)


When to use: Historically used in hidden layers and commonly in the output layer for binary classification problems to interpret the output as a probability.

Hyperbolic Tangent (tanh):
The tanh function is similar to the sigmoid but maps inputs to the range $(-1, 1)$.

📐 Hyperbolic Tangent (tanh)
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Variables:

    • $z$ = the weighted sum of inputs plus bias ($w^T x + b$)


When to use: Often preferred over sigmoid for hidden layers as its zero-centered output can help in faster convergence during training.

Rectified Linear Unit (ReLU):
The ReLU function is one of the most widely used activation functions in modern neural networks.

📐 Rectified Linear Unit (ReLU)
$$\text{ReLU}(z) = \max(0, z)$$

Variables:

    • $z$ = the weighted sum of inputs plus bias ($w^T x + b$)


When to use: The default choice for hidden layers in most applications due to its computational efficiency and its ability to mitigate the vanishing gradient problem.
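The three activation functions above can be compared side by side in a minimal NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Maps any real number into (-1, 1); zero-centred."""
    return np.tanh(z)

def relu(z):
    """max(0, z): zero for negative inputs, identity for positive ones."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # values in (0, 1); sigmoid(0) = 0.5
print(tanh(z))     # values in (-1, 1); tanh(0) = 0
print(relu(z))     # [0. 0. 2.]
```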

## 3. The Forward Pass

The forward pass is the process of computing the network's output for a given input vector $x$. We proceed layer by layer, from the input to the output.

Consider an MLP with one hidden layer. Let:

  • $X$ be the input vector.

  • $W^{(1)}$ and $b^{(1)}$ be the weight matrix and bias vector for the hidden layer.

  • $W^{(2)}$ and $b^{(2)}$ be the weight matrix and bias vector for the output layer.

  • $\phi$ be the activation function.


The computation proceeds as follows:

  1. Calculate the pre-activation for the hidden layer: $Z^{(1)} = W^{(1)} X + b^{(1)}$

  2. Calculate the activation of the hidden layer: $A^{(1)} = \phi(Z^{(1)})$

  3. Calculate the pre-activation for the output layer: $Z^{(2)} = W^{(2)} A^{(1)} + b^{(2)}$

  4. Calculate the final output: $\hat{y} = A^{(2)} = \phi(Z^{(2)})$

(Note: the output layer may use a different activation function, e.g., softmax for multi-class classification.)

    Worked Example:

    Problem:
    Consider a simple MLP with 2 input neurons, a hidden layer with 2 neurons, and 1 output neuron. The activation function for all neurons is ReLU. The weights and biases are given as:

    W^{(1)} = \begin{bmatrix} 0.5 & -1.0 \\ 0.8 & 0.2 \end{bmatrix}, \quad b^{(1)} = \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}

    W^{(2)} = \begin{bmatrix} 0.7 & -0.4 \end{bmatrix}, \quad b^{(2)} = \begin{bmatrix} 0.2 \end{bmatrix}

    Calculate the output of the network for the input vector X = \begin{bmatrix} 2 \\ 3 \end{bmatrix}.

    Solution:

    Step 1: Calculate the pre-activation for the hidden layer, Z^{(1)}.

    Z^{(1)} = W^{(1)}X + b^{(1)} = \begin{bmatrix} 0.5 & -1.0 \\ 0.8 & 0.2 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}
    = \begin{bmatrix} (0.5 \times 2) + (-1.0 \times 3) \\ (0.8 \times 2) + (0.2 \times 3) \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix} = \begin{bmatrix} -2.0 \\ 2.2 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix} = \begin{bmatrix} -1.9 \\ 1.9 \end{bmatrix}

    Step 2: Apply the ReLU activation function to get the hidden layer's output, A^{(1)}.

    A^{(1)} = \text{ReLU}(Z^{(1)}) = \begin{bmatrix} \max(0, -1.9) \\ \max(0, 1.9) \end{bmatrix} = \begin{bmatrix} 0 \\ 1.9 \end{bmatrix}

    Step 3: Calculate the pre-activation for the output layer, Z^{(2)}.

    Z^{(2)} = W^{(2)}A^{(1)} + b^{(2)} = \begin{bmatrix} 0.7 & -0.4 \end{bmatrix} \begin{bmatrix} 0 \\ 1.9 \end{bmatrix} + [0.2]
    = [(0.7 \times 0) + (-0.4 \times 1.9)] + [0.2] = [-0.76] + [0.2] = [-0.56]

    Step 4: Apply the ReLU activation function to get the final output, \hat{y}.

    \hat{y} = \text{ReLU}(Z^{(2)}) = \max(0, -0.56) = 0

    Answer: The final output of the network is 0.
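The hand calculation can be verified with a few lines of NumPy, reproducing the weights, biases, and input of the worked example:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Weights and biases from the worked example.
W1 = np.array([[0.5, -1.0],
               [0.8,  0.2]])
b1 = np.array([0.1, -0.3])
W2 = np.array([[0.7, -0.4]])
b2 = np.array([0.2])
x  = np.array([2.0, 3.0])

# Forward pass, layer by layer.
z1 = W1 @ x + b1   # hidden pre-activation: [-1.9, 1.9]
a1 = relu(z1)      # hidden activation:     [0.0, 1.9]
z2 = W2 @ a1 + b2  # output pre-activation: [-0.56]
y_hat = relu(z2)   # final output:          [0.0]
print(y_hat)
```

The intermediate vectors match Steps 1 through 4 of the solution exactly, which makes this a convenient way to double-check a forward-pass computation during practice.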

    ## 4. Backpropagation and Gradient Descent

    Training an MLP involves adjusting its weights and biases to minimize a loss function, which measures the discrepancy between the predicted outputs (\hat{y}) and the true target values (y). The most common algorithm for this is backpropagation combined with an optimization algorithm like gradient descent.

    The foundational idea of gradient descent is to update the parameters (weights w) in the opposite direction of the gradient of the loss function L:

    w_{new} = w_{old} - \eta \frac{\partial L}{\partial w}

    Here, \eta is the learning rate, a hyperparameter that controls the step size.

    Backpropagation is an efficient algorithm for computing these gradients, \frac{\partial L}{\partial w}, for all weights in the network. It works by applying the chain rule of calculus, starting from the output layer and moving backward through the network.

    • First, the gradient of the loss with respect to the output layer's weights is computed.
    • Then, this error is "propagated" backward to the previous layer. The gradient for the hidden layer's weights is calculated based on the error signal from the output layer.
    • This process continues until the gradients for all weights have been computed.
    This method is more complex than the simple update rule of the single-layer perceptron (which only applies to specific loss functions and models), but it is a general mechanism that allows for the training of deep, complex networks.
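The three steps above can be sketched for a one-hidden-layer ReLU network, assuming a squared-error loss L = 0.5(ŷ − y)², ReLU on the output as well, and illustrative weights, target, and learning rate (none of these specific values come from the chapter):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # Subgradient of ReLU: 1 where z > 0, else 0.
    return (z > 0).astype(float)

# A 2-2-1 network; weights and target are illustrative.
W1 = np.array([[0.5, -1.0],
               [0.8,  0.2]])
b1 = np.array([0.1, -0.3])
W2 = np.array([[0.7, 0.4]])
b2 = np.array([0.2])
x = np.array([2.0, 3.0])
y = np.array([1.0])
eta = 0.05  # learning rate

# Forward pass.
z1 = W1 @ x + b1
a1 = relu(z1)
z2 = W2 @ a1 + b2
y_hat = relu(z2)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: chain rule applied from the output layer inward.
delta2 = (y_hat - y) * relu_grad(z2)      # dL/dz2 at the output layer
dW2 = np.outer(delta2, a1)                # dL/dW2
db2 = delta2                              # dL/db2
delta1 = (W2.T @ delta2) * relu_grad(z1)  # error propagated to hidden layer
dW1 = np.outer(delta1, x)                 # dL/dW1
db1 = delta1                              # dL/db1

# Gradient descent update: w_new = w_old - eta * dL/dw
W2 -= eta * dW2; b2 -= eta * db2
W1 -= eta * dW1; b1 -= eta * db1
```

Note how the hidden-layer error `delta1` is built from the output-layer error `delta2`; this reuse of already-computed quantities is exactly what makes backpropagation efficient.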

    ---

    Problem-Solving Strategies

    💡 GATE Strategy: Dimensionality Check

    When solving MLP forward pass problems, always verify the dimensions of your matrices. If the input layer has d neurons and the hidden layer has h neurons, the weight matrix W^{(1)} must have dimensions h \times d, and the bias vector b^{(1)} has dimension h \times 1. This check can quickly identify calculation errors.

    For an input X of size d \times 1:

      • W^{(1)} is h \times d.

      • W^{(1)}X results in an h \times 1 vector.

      • b^{(1)} is h \times 1.

      • Z^{(1)} = W^{(1)}X + b^{(1)} is a valid operation, resulting in an h \times 1 vector.
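The same check can be scripted as a quick sanity test (a sketch with zero-filled placeholder arrays; the sizes d = 3 and h = 4 are arbitrary):

```python
import numpy as np

d, h = 3, 4               # input size d, hidden size h
X = np.zeros((d, 1))      # column-vector input, d x 1
W1 = np.zeros((h, d))     # weight matrix: (destination) x (source)
b1 = np.zeros((h, 1))     # one bias per hidden neuron, h x 1

Z1 = W1 @ X + b1          # valid only if the dimensions line up
print(Z1.shape)           # (h, 1)
```

If the weight matrix were mistakenly created as d x h instead, the matrix product `W1 @ X` would raise a shape error immediately, which is precisely the point of the dimensionality check.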

    ---

    Common Mistakes

    ⚠️ Avoid These Errors
      • Forgetting Non-Linearity: Using a linear activation function (or no activation function) in hidden layers. This makes the entire MLP equivalent to a single linear model, defeating its purpose.
    Correct Approach: Always use a non-linear activation function like ReLU, Sigmoid, or tanh in the hidden layers.
      • Incorrect ReLU Application: Applying ReLU incorrectly, for instance, by taking the absolute value instead of the maximum of zero and the input.
    Correct Approach: Remember that \text{ReLU}(z) = z if z > 0, and \text{ReLU}(z) = 0 if z \le 0.
      • Mixing up Weight Matrix Dimensions: Confusing the row and column dimensions of the weight matrices (e.g., using d \times h instead of h \times d).
    Correct Approach: Use the dimensionality check strategy. The number of rows in a weight matrix must equal the number of neurons in the destination layer, and the number of columns must equal the number of neurons in the source layer.
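The first mistake can be seen numerically: stacking layers with identity (linear) activations collapses into a single linear map (a sketch with random weights; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(42)
W1 = rng.normal(size=(4, 3))  # "hidden layer" weights
W2 = rng.normal(size=(2, 4))  # "output layer" weights
x = rng.normal(size=3)

# Two layers with identity (linear) activation...
deep = W2 @ (W1 @ x)
# ...equal one linear layer with a merged weight matrix.
shallow = (W2 @ W1) @ x

print(np.allclose(deep, shallow))  # True: no expressive power gained
```

This is why depth alone buys nothing without a non-linearity between the layers.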

    ---

    Practice Questions

    :::question type="MCQ" question="An MLP has an input layer with 3 neurons, a single hidden layer with 4 neurons, and an output layer with 2 neurons. What are the dimensions of the weight matrix for the hidden layer (W^{(1)}) and the output layer (W^{(2)}) respectively?" options=["W^{(1)}: 3 \times 4, W^{(2)}: 4 \times 2","W^{(1)}: 4 \times 3, W^{(2)}: 2 \times 4","W^{(1)}: 3 \times 4, W^{(2)}: 2 \times 4","W^{(1)}: 4 \times 3, W^{(2)}: 4 \times 2"] answer="W^{(1)}: 4 \times 3, W^{(2)}: 2 \times 4" hint="The dimensions of a weight matrix W connecting layer A to layer B are (number of neurons in B) x (number of neurons in A)." solution="Step 1: Analyze the connection from the input layer to the hidden layer.
    The source layer (input) has 3 neurons.
    The destination layer (hidden) has 4 neurons.
    Therefore, the dimension of the weight matrix W^{(1)} is (destination size) x (source size), which is 4 \times 3.

    Step 2: Analyze the connection from the hidden layer to the output layer.
    The source layer (hidden) has 4 neurons.
    The destination layer (output) has 2 neurons.
    Therefore, the dimension of the weight matrix W^{(2)} is (destination size) x (source size), which is 2 \times 4.

    Result: The dimensions are W^{(1)}: 4 \times 3 and W^{(2)}: 2 \times 4."
    :::

    :::question type="NAT" question="A neuron in a hidden layer uses the ReLU activation function. It receives inputs from two neurons with values x_1 = -2 and x_2 = 5. The corresponding weights are w_1 = 1.5 and w_2 = 0.5. The bias for this neuron is b = -0.5. What is the output of this neuron?" answer="0" hint="Calculate the weighted sum plus bias, z = w_1x_1 + w_2x_2 + b, and then apply the ReLU function, \max(0, z)." solution="Step 1: Calculate the weighted sum of the inputs.

    \text{Sum} = w_1x_1 + w_2x_2 = (1.5 \times -2) + (0.5 \times 5) = -3.0 + 2.5 = -0.5

    Step 2: Add the bias to get the pre-activation value, z.

    z = \text{Sum} + b = -0.5 + (-0.5) = -1.0

    Step 3: Apply the ReLU activation function to z.

    \text{Output} = \text{ReLU}(z) = \max(0, -1.0) = 0

    Result: The output of the neuron is 0."
    :::
    :::question type="NAT" question="A neuron in a hidden layer uses the ReLU activation function. It receives inputs from two neurons with values x_1 = 4 and x_2 = -1. The corresponding weights are w_1 = 0.5 and w_2 = -2. The bias for this neuron is b = 1.0. Calculate the output of this neuron." answer="5.0" hint="Calculate the pre-activation z = w_1x_1 + w_2x_2 + b, and then apply the ReLU function, \text{output} = \max(0, z)." solution="Step 1: Calculate the weighted sum of inputs.

    \text{Sum} = w_1x_1 + w_2x_2 = (0.5 \times 4) + (-2 \times -1) = 2 + 2 = 4.0

    Step 2: Add the bias to get the pre-activation value z.

    z = \text{Sum} + b = 4.0 + 1.0 = 5.0

    Step 3: Apply the ReLU activation function.

    \text{Output} = \text{ReLU}(z) = \max(0, 5.0) = 5.0

    Result: The output of the neuron is 5.0."
    :::

    :::question type="MSQ" question="Which of the following statements about activation functions in MLPs are correct?" options=["The ReLU function is linear for all inputs z < 0.","The sigmoid function's output is always in the range [0, 1].","The tanh function is zero-centered, meaning its range is symmetric around zero.","Using a linear activation function in all hidden layers allows the MLP to model complex non-linear data."] answer="The ReLU function is linear for all inputs z < 0.,The tanh function is zero-centered, meaning its range is symmetric around zero." hint="Evaluate the properties of each activation function mentioned. Recall the output ranges and shapes of their graphs." solution="Option A: The ReLU function is defined as \max(0, z). For all z < 0, its output is a constant 0. A constant function is a form of a linear function (y = 0z + 0). Thus, this statement is correct.

    Option B: The sigmoid function \sigma(z) = 1 / (1 + e^{-z}) outputs values in the range (0, 1). It approaches 0 and 1 asymptotically but never strictly reaches them. Therefore, the range is (0, 1), not [0, 1]. This statement is incorrect.

    Option C: The tanh function outputs values in the range (-1, 1). This range is symmetric around 0, making it zero-centered. This property can be beneficial for optimization. This statement is correct.

    Option D: If all hidden layers use a linear activation function, the composition of these linear functions is itself a linear function. The entire network collapses to a single linear model and cannot learn non-linear patterns. This statement is incorrect.

    Result: The correct statements are A and C."
    :::

    :::question type="MCQ" question="The primary motivation for using Multi-Layer Perceptrons over single-layer perceptrons is their ability to:" options=["Converge faster during training.","Solve non-linearly separable problems.","Require less memory.","Use a simpler weight update rule."] answer="Solve non-linearly separable problems." hint="Consider the fundamental limitation of a single-layer perceptron's decision boundary." solution="A single-layer perceptron can only form a linear decision boundary (a hyperplane). This means it can only solve problems where the classes are linearly separable. The introduction of hidden layers with non-linear activation functions in an MLP allows the model to learn complex, non-linear decision boundaries. This is the key advantage and primary reason for their development and use. While other aspects like convergence speed can vary, the core capability that distinguishes MLPs is their ability to handle non-linear separability."
    :::

    ---

    Summary

    Key Takeaways for GATE

    • Overcoming Linear Separability: The fundamental purpose of an MLP is to solve problems that are not linearly separable by using hidden layers to create complex, non-linear decision boundaries.

    • Role of Non-Linear Activations: Non-linear activation functions (ReLU, Sigmoid, tanh) are essential components of hidden layers. Without them, an MLP would be functionally equivalent to a single-layer linear model.

    • Forward Pass Calculation: Be proficient in calculating the output of an MLP step-by-step. This involves matrix multiplications, addition of biases, and application of activation functions, layer by layer. Pay close attention to matrix dimensions.

    • Backpropagation is Key to Learning: Training is performed using backpropagation to compute the gradients of a loss function with respect to the network's weights, which are then updated via an optimization algorithm like gradient descent.

    ---

    What's Next?

    💡 Continue Learning

    This topic serves as a gateway to more advanced neural network architectures. Understanding the MLP is crucial before proceeding to:

      • Convolutional Neural Networks (CNNs): These are specialized MLPs that use convolutional layers, primarily for processing grid-like data such as images. They build upon the concepts of layers, activation functions, and backpropagation.

      • Recurrent Neural Networks (RNNs): While MLPs are feedforward, RNNs have connections that form cycles, allowing them to process sequences of data. They share the concepts of neurons and learned weights but introduce the idea of a hidden state.

      • Optimization Algorithms: The gradient descent used to train MLPs is the simplest optimizer. Explore more advanced methods like Adam, RMSprop, and Momentum, which are commonly used to train deep networks more efficiently.


    Master these connections for a comprehensive understanding of neural networks for the GATE examination.

    ---

    Chapter Summary

    📖 Neural Networks - Key Takeaways

    In this chapter, we have explored the foundational principles of feed-forward neural networks, with a particular focus on the Multi-Layer Perceptron (MLP). As we conclude our discussion, it is essential to consolidate the most critical concepts for examination purposes.

    • The Artificial Neuron as a Computational Unit: The fundamental building block of a neural network is the artificial neuron. It computes a weighted sum of its inputs, adds a bias, and then passes the result through a non-linear activation function to produce its output.

    • The Role of Non-Linear Activation Functions: We have seen that non-linear activation functions (such as Sigmoid, Tanh, and ReLU) are indispensable. Without them, a multi-layer network, regardless of its depth, would be mathematically equivalent to a single-layer linear model, severely limiting its ability to learn complex, non-linear relationships in data.

    • The Multi-Layer Perceptron (MLP) Architecture: An MLP consists of an input layer, one or more hidden layers, and an output layer. The "depth" of the network refers to the number of hidden layers. The Universal Approximation Theorem provides the theoretical underpinning that even a single hidden layer can, in principle, approximate any continuous function.

    • Forward Propagation: This is the process of passing an input signal through the network, layer by layer, from input to output, to generate a prediction. At each layer, the computation involves a linear transformation (matrix multiplication with weights) followed by a non-linear activation.

    • The Backpropagation Algorithm: This is the cornerstone of training neural networks. Backpropagation is an efficient algorithm for computing the gradient of the loss function with respect to every weight and bias in the network. It applies the chain rule of calculus recursively, starting from the output layer and moving backward.

    • Gradient-Based Optimization: The gradients calculated via backpropagation are used by an optimization algorithm, most commonly Gradient Descent or its variants (e.g., SGD), to iteratively adjust the network's parameters (weights and biases) in the direction that minimizes the loss function. The learning rate, \eta, is a critical hyperparameter that controls the step size of these adjustments.

    ---

    Chapter Review Questions

    :::question type="MCQ" question="Consider a Multi-Layer Perceptron (MLP) with 10 neurons in the input layer, a single hidden layer with 8 neurons, and an output layer with 3 neurons for a multi-class classification problem. The hidden layer uses a ReLU activation function and the output layer uses a Softmax function. What is the total number of trainable parameters (weights and biases) in this network?" options=["104", "115", "117", "124"] answer="B" hint="Remember to account for both the weights connecting the layers and the bias term for each neuron in the hidden and output layers." solution="We calculate the parameters for each connection and layer sequentially.

  • Parameters between Input and Hidden Layer:
    - The number of weights connecting the 10 input neurons to the 8 hidden neurons is 10 \times 8 = 80.
    - Each of the 8 neurons in the hidden layer has its own bias term. So, there are 8 biases.
    - Total parameters for the hidden layer: 80 + 8 = 88.

  • Parameters between Hidden and Output Layer:
    - The number of weights connecting the 8 hidden neurons to the 3 output neurons is 8 \times 3 = 24.
    - Each of the 3 neurons in the output layer has its own bias term. So, there are 3 biases.
    - Total parameters for the output layer: 24 + 3 = 27.

  • Total Trainable Parameters:
    - The total number of parameters in the network is the sum of the parameters calculated above.

    \text{Total Parameters} = (10 \times 8 + 8) + (8 \times 3 + 3) = 88 + 27 = 115

    Thus, the total number of trainable parameters is 115."
    :::

    :::question type="NAT" question="A neuron uses the Rectified Linear Unit (ReLU) activation function, defined as f(z) = \max(0, z). The neuron receives two inputs, x_1 = -3 and x_2 = 4, with corresponding weights w_1 = 0.5 and w_2 = 0.8. The bias for this neuron is b = -1.5. Calculate the output of this neuron." answer="0.2" hint="First, compute the weighted sum of the inputs plus the bias, z = w_1 x_1 + w_2 x_2 + b. Then, apply the ReLU activation function to this sum." solution="The process involves two steps: calculating the net input z and then applying the activation function.

  • Calculate the net input z:

    The net input is the linear combination of inputs and weights, plus the bias.
    z = w_1 x_1 + w_2 x_2 + b = (0.5)(-3) + (0.8)(4) + (-1.5) = -1.5 + 3.2 - 1.5 = 0.2

  • Apply the ReLU activation function:

    \text{Output} = f(0.2) = \max(0, 0.2) = 0.2

    The output of the neuron is 0.2."
    :::

    :::question type="MCQ" question="What is the primary motivation for using the backpropagation algorithm in training neural networks?" options=["To implement a non-linear decision boundary", "To prevent the network from overfitting the training data", "To efficiently compute the gradient of the loss function with respect to the network weights", "To initialize the weights of the network in an optimal manner"] answer="C" hint="Think about the core challenge in using gradient descent for a complex, multi-layered function." solution="The core of training a neural network is to minimize a loss function E by adjusting its weights and biases w. Gradient descent requires calculating the partial derivative of the loss function with respect to each weight, \frac{\partial E}{\partial w}.

    • Option A is incorrect. Non-linear decision boundaries are achieved by using non-linear activation functions, not by the training algorithm itself.
    • Option B is incorrect. Preventing overfitting is the role of regularization techniques (e.g., L2 regularization, dropout), not backpropagation.
    • Option D is incorrect. Weight initialization is a separate, important step, but it is not the purpose of backpropagation.
    • Option C is correct. For a deep network, the loss function is a highly complex, nested function of millions of parameters. Calculating the gradient for each parameter naively would be computationally intractable. Backpropagation is a dynamic programming approach that systematically applies the chain rule of calculus to compute these gradients in a single backward pass (from output to input), making the training of deep networks feasible. It is fundamentally an algorithm for efficient gradient computation."
    :::

    :::question type="NAT" question="In a neural network, a particular weight w has a current value of 0.6. During a training step, the gradient of the Mean Squared Error loss with respect to this weight is calculated to be \frac{\partial E}{\partial w} = 1.5. If the learning rate \eta is set to 0.05, calculate the updated value of the weight after one step of standard gradient descent." answer="0.525" hint="The standard gradient descent update rule is w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial E}{\partial w}." solution="We apply the standard gradient descent update rule to find the new value of the weight.

    The update rule is given by:

    w^{(t+1)} = w^{(t)} - \eta \frac{\partial E}{\partial w^{(t)}}

    Here, we are given:

    • The current weight, w^{(t)} = 0.6

    • The learning rate, \eta = 0.05

    • The gradient of the loss with respect to the weight, \frac{\partial E}{\partial w} = 1.5

    Substituting these values into the formula:

    w^{(t+1)} = 0.6 - (0.05)(1.5) = 0.6 - 0.075 = 0.525

    The updated value of the weight after one step is 0.525."
    :::

    ---

    What's Next?

    💡 Continue Your GATE Journey

    Having completed this chapter on Neural Networks, you have established a firm foundation in one of the most powerful areas of machine learning. The principles of layered architecture, non-linear transformations, and gradient-based learning are fundamental and will reappear in more advanced topics.

    Key connections to your learning so far:

      • Linear & Logistic Regression: We can now view these simpler models as special cases of a neural network. A single neuron with a linear activation function is equivalent to linear regression, while a single neuron with a sigmoid activation function is equivalent to logistic regression. The MLP is a powerful generalization of these ideas.

      • Linear Algebra & Calculus: Our entire discussion has been built upon concepts from these fields. Forward propagation is essentially a sequence of matrix multiplications, and backpropagation is a sophisticated application of the chain rule from multivariate calculus.


      Future chapters that build on these concepts:
      • Convolutional Neural Networks (CNNs): The next logical step is to explore CNNs, which are specialized neural networks for processing grid-like data such as images. They build directly on the concepts of layers, weights, and backpropagation but introduce new layer types like convolutional and pooling layers.

      • Recurrent Neural Networks (RNNs): For sequential data like time series or natural language, you will study RNNs. These networks modify the feed-forward architecture to include loops, allowing information to persist, but they are still trained using a variant of backpropagation.

      • Advanced Optimization: We briefly discussed gradient descent. Future topics will delve into more advanced optimizers like Adam, RMSprop, and Adagrad, which are essential for efficiently training the deep and complex architectures found in CNNs and RNNs.

    🎯 Key Points to Remember

    • Master the core concepts in Neural Networks before moving to advanced topics
    • Practice with previous year questions to understand exam patterns
    • Review short notes regularly for quick revision before exams
