Updated: Mar 2026

Classification Models

Comprehensive study notes on Classification Models for GATE DA preparation. This chapter covers key concepts, formulas, and examples needed for your exam.

Overview

Classification represents a cornerstone of supervised machine learning, addressing the fundamental task of assigning a predefined categorical label to an input instance. Given a training dataset of observations with known class memberships, the objective is to construct a model that can accurately predict the class for new, unseen data points. The successful application of these techniques is critical in numerous domains, including pattern recognition, medical diagnosis, and financial risk assessment. A thorough grasp of classification is therefore indispensable for the modern data analyst and computer scientist.

For the purposes of the GATE examination, a deep conceptual understanding of various classification algorithms is paramount. Questions are designed not merely to test rote memorization but to probe the theoretical underpinnings, computational trade-offs, and practical applicability of these models. This chapter is structured to build this requisite level of mastery. We shall systematically dissect the architecture and mathematical foundations of several canonical classifiers, equipping you with the analytical tools necessary to solve complex problems.

Our exploration will proceed from simple, intuitive models to more mathematically sophisticated ones. We will begin with instance-based learning, move to rule-based and probabilistic frameworks, and conclude with powerful discriminative models that define decision boundaries. Throughout this chapter, the emphasis will be placed on the core principles and comparative analysis essential for excelling in the examination.

---

Chapter Contents

| # | Topic | What You'll Learn |
|---|-------|-------------------|
| 1 | k-Nearest Neighbors (k-NN) | Instance-based learning using distance metrics. |
| 2 | Decision Trees | Building a hierarchical, rule-based model. |
| 3 | Naive Bayes Classifier | Applying conditional probability for classification. |
| 4 | Linear Discriminant Analysis (LDA) | Finding linear projections for class separability. |
| 5 | Support Vector Machine (SVM) | Maximizing the margin between data classes. |

---

Learning Objectives

❗ By the End of This Chapter

After completing this chapter, you will be able to:

  • Explain the fundamental principles, assumptions, and working mechanisms of k-NN, Decision Trees, Naive Bayes, LDA, and SVM.

  • Compare and contrast the performance characteristics, computational complexity, and limitations of different classification models.

  • Apply these classification algorithms to solve numerical problems typical of the GATE examination.

  • Analyze the mathematical formulations that underpin each classifier, including distance metrics, probabilistic theorems, and optimization objectives.

---

We now turn our attention to k-Nearest Neighbors (k-NN)...

Part 1: k-Nearest Neighbors (k-NN)

Introduction

The k-Nearest Neighbors (k-NN) algorithm represents one of the most intuitive and fundamental approaches to supervised machine learning. It is classified as a non-parametric, instance-based learning method. The term "instance-based" signifies that the algorithm does not construct a general internal model from the training data; instead, it stores the entire training dataset and makes predictions by referencing it directly. "Non-parametric" implies that it makes no assumptions about the underlying data distribution, a characteristic that lends it flexibility in handling complex, real-world data structures.

At its core, k-NN operates on the principle of feature similarity, positing that data points with similar features are likely to belong to the same class. For a given unclassified data point, the algorithm identifies the $k$ most similar instances (the "nearest neighbors") from the training set and assigns the new point to the class that is most common among those neighbors. This process, known as majority voting, makes k-NN a conceptually simple yet powerful tool for classification tasks. Its performance is critically dependent on the choice of $k$ and the distance metric used to quantify similarity.

📖 k-Nearest Neighbors (k-NN) Classifier

Let $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n)\}$ be a training dataset, where $\mathbf{x}_i \in \mathbb{R}^d$ is a feature vector in a $d$-dimensional space and $y_i$ is its corresponding class label. Given a new, unclassified data point $\mathbf{x}_q$, the k-NN algorithm predicts its class label, $\hat{y}_q$, by identifying the set $N_k(\mathbf{x}_q) \subset D$ containing the $k$ training instances closest to $\mathbf{x}_q$ according to a chosen distance metric. The predicted class is then determined by majority vote among the labels of the instances in $N_k(\mathbf{x}_q)$.

Mathematically, the predicted class $\hat{y}_q$ is given by:

$$\hat{y}_q = \underset{v \in \text{Classes}}{\arg\max} \sum_{(\mathbf{x}_i, y_i) \in N_k(\mathbf{x}_q)} I(v = y_i)$$

where $I(\cdot)$ is the indicator function, which is 1 if the condition is true and 0 otherwise.

---

Key Concepts

1. The k-NN Classification Algorithm

The operational procedure of the k-NN algorithm is straightforward and can be broken down into a distinct sequence of steps. Let us consider the task of classifying a new query point, $\mathbf{x}_q$, using a pre-existing labeled training dataset.

The algorithm proceeds as follows:

  • Choose the value of $k$: The number of neighbors, $k$, is a hyperparameter that must be selected prior to classification.

  • Calculate Distances: For the query point $\mathbf{x}_q$, compute the distance to every training data point $\mathbf{x}_i$ in the dataset. A suitable distance metric, such as Euclidean distance, must be employed.

  • Identify Nearest Neighbors: Sort the computed distances in ascending order and identify the $k$ training data points corresponding to the $k$ smallest distances. This set constitutes the nearest neighbors.

  • Conduct Majority Voting: Among these $k$ neighbors, count the number of points belonging to each class.

  • Assign Class: Assign the class label that has the highest frequency (the majority class) among the $k$ neighbors to the query point $\mathbf{x}_q$. In the event of a tie, a common strategy is to select the class of the single nearest neighbor or to reduce $k$ until the tie is broken. For binary classification, choosing an odd $k$ preemptively avoids such ties.
    Worked Example:

    Problem:
    Consider the following 2D dataset with two classes, Class A (●) and Class B (■).

    • Class A: $A_1(1, 2)$, $A_2(2, 3)$

    • Class B: $B_1(5, 4)$, $B_2(6, 5)$, $B_3(5, 6)$

    A new query point $Q(3, 4)$ needs to be classified using the k-NN algorithm with $k=3$. Use the Euclidean distance metric.

    Solution:

    We will classify the query point $Q(3, 4)$ by finding its 3 nearest neighbors.

    Step 1: Calculate the squared Euclidean distance from $Q(3, 4)$ to each training point. We use the squared distance for comparison, as it preserves the order of distances and avoids computationally expensive square root operations during the sorting phase. The squared Euclidean distance between $(x_1, y_1)$ and $(x_2, y_2)$ is $(x_2-x_1)^2 + (y_2-y_1)^2$.

    Distance from $Q$ to $A_1(1, 2)$:

    $$d(Q, A_1)^2 = (1-3)^2 + (2-4)^2 = (-2)^2 + (-2)^2 = 4 + 4 = 8$$

    Distance from $Q$ to $A_2(2, 3)$:

    $$d(Q, A_2)^2 = (2-3)^2 + (3-4)^2 = (-1)^2 + (-1)^2 = 1 + 1 = 2$$

    Distance from $Q$ to $B_1(5, 4)$:

    $$d(Q, B_1)^2 = (5-3)^2 + (4-4)^2 = (2)^2 + (0)^2 = 4 + 0 = 4$$

    Distance from $Q$ to $B_2(6, 5)$:

    $$d(Q, B_2)^2 = (6-3)^2 + (5-4)^2 = (3)^2 + (1)^2 = 9 + 1 = 10$$

    Distance from $Q$ to $B_3(5, 6)$:

    $$d(Q, B_3)^2 = (5-3)^2 + (6-4)^2 = (2)^2 + (2)^2 = 4 + 4 = 8$$

    Step 2: Rank the training points by their squared distance to $Q$ in ascending order.

  • $A_2(2, 3)$: distance squared = 2

  • $B_1(5, 4)$: distance squared = 4

  • $A_1(1, 2)$: distance squared = 8

  • $B_3(5, 6)$: distance squared = 8

  • $B_2(6, 5)$: distance squared = 10

    Step 3: Identify the $k=3$ nearest neighbors.

    The 3 nearest neighbors are $A_2$, $B_1$, and $A_1$.

    Step 4: Perform majority voting on the classes of the neighbors.

    The classes of the 3 nearest neighbors are:

    • $A_2$: Class A

    • $B_1$: Class B

    • $A_1$: Class A

    The count is: Class A = 2, Class B = 1.

    Step 5: Assign the majority class to the query point.

    The majority class is Class A.

    Answer: The query point $Q(3, 4)$ is classified as Class A.
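The procedure above can be sketched in Python. This is a minimal illustration using the worked example's data; the function name `knn_classify` and its tie-breaking by label order are choices made here, not prescribed by the algorithm:

```python
from collections import Counter

def knn_classify(query, points, labels, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    Ranking uses squared Euclidean distance, since omitting the square
    root does not change the ordering of distances.
    """
    sq_dists = [sum((q - p) ** 2 for q, p in zip(query, pt)) for pt in points]
    ranked = sorted(zip(sq_dists, labels))      # ascending by distance
    top_k = [label for _, label in ranked[:k]]  # labels of the k nearest
    return Counter(top_k).most_common(1)[0][0]  # majority class

# Data from the worked example: Class A at (1,2), (2,3); Class B at (5,4), (6,5), (5,6)
points = [(1, 2), (2, 3), (5, 4), (6, 5), (5, 6)]
labels = ["A", "A", "B", "B", "B"]
print(knn_classify((3, 4), points, labels, k=3))  # prints A
```

Note that `Counter.most_common` resolves a hypothetical tie by insertion order, which is one of several reasonable tie-breaking conventions.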

    ---

    2. Distance Metrics

    The notion of "closeness" in k-NN is quantified by a distance metric. The choice of metric is crucial as it defines the shape of the neighborhood and can significantly impact classification outcomes. While numerous metrics exist, the Euclidean distance is the most prevalent for real-valued vector spaces and is most relevant for the GATE examination.

    πŸ“ Euclidean Distance

    For two points p=(p1,p2,…,pd)\mathbf{p} = (p_1, p_2, \dots, p_d) and q=(q1,q2,…,qd)\mathbf{q} = (q_1, q_2, \dots, q_d) in a dd-dimensional space, the Euclidean distance is given by:

    d(p,q)=βˆ‘i=1d(piβˆ’qi)2d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2}

    Variables:

      • p,q\mathbf{p}, \mathbf{q}: Feature vectors of two data points.

      • dd: The number of dimensions (features).

      • pi,qip_i, q_i: The value of the ii-th feature for points p\mathbf{p} and q\mathbf{q}, respectively.


    When to use: This is the standard, default distance metric for k-NN when features are continuous and have a similar scale. It represents the straight-line distance between two points.

    Another common metric is the Manhattan distance, which measures distance by summing the absolute differences of the coordinates.

    πŸ“ Manhattan Distance (L1 Norm)

    For two points p=(p1,p2,…,pd)\mathbf{p} = (p_1, p_2, \dots, p_d) and q=(q1,q2,…,qd)\mathbf{q} = (q_1, q_2, \dots, q_d), the Manhattan distance is:

    d(p,q)=βˆ‘i=1d∣piβˆ’qi∣d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{d} |p_i - q_i|

    When to use: This metric is often preferred in high-dimensional settings or when features represent fundamentally different quantities (e.g., age and income), as it is less sensitive to outliers along a single dimension compared to Euclidean distance.
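Both metrics are straightforward to implement directly from their formulas; a minimal sketch (the function names are my own):

```python
import math

def euclidean(p, q):
    """L2 (straight-line) distance between two d-dimensional points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    """L1 (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p, q = (2, -1, 4), (5, 3, 2)
print(euclidean(p, q))  # sqrt(9 + 16 + 4) = sqrt(29) ≈ 5.385
print(manhattan(p, q))  # 3 + 4 + 2 = 9
```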

    ---

    3. The Role of 'k'

    The hyperparameter $k$ is the most critical parameter in the k-NN algorithm. Its value directly controls the bias-variance trade-off of the model.

    • Small $k$: A small value of $k$ (e.g., $k=1$) results in a model with low bias but high variance. The decision boundary will be highly flexible and irregular, closely following the training data. This makes the model very sensitive to noise and outliers, potentially leading to overfitting. The 1-NN classifier, for instance, creates a decision boundary defined by the Voronoi tessellation of the training data.

    • Large $k$: A large value of $k$ leads to a model with high bias but low variance. The decision boundary becomes much smoother and is less affected by individual noisy points. However, if $k$ is too large (e.g., $k=n$, where $n$ is the total number of training points), the model becomes trivial and always predicts the majority class of the entire dataset, likely leading to underfitting.

    The optimal choice of $k$ is data-dependent and is typically determined through cross-validation.

    ❗ Must Remember

    For binary classification problems, it is standard practice to choose an odd value for $k$. This prevents ties in the majority voting process. If $k$ were even (e.g., $k=2$), a query point could have one neighbor from each class, resulting in an ambiguous classification. An odd $k$ guarantees a clear majority.

    The effect of $k$ on the decision boundary is illustrated below. With $k=1$, the boundary is complex and jagged. As $k$ increases to $k=7$, the boundary becomes significantly smoother.






    [Figure: k-NN decision boundaries for k = 1 (high variance, jagged) and k = 7 (low variance, smooth)]

    ---

    Problem-Solving Strategies

    When faced with a k-NN problem in a time-constrained setting like the GATE exam, efficiency is paramount.

    💡 GATE Strategy: Efficient Calculation

    • Use Squared Distances for Ranking: To find the nearest neighbors, you only need to compare distances. Calculating the squared Euclidean distance, $(x_2-x_1)^2 + (y_2-y_1)^2$, avoids the computationally intensive square root operation. Since $d_1 > d_2$ implies $d_1^2 > d_2^2$ for non-negative distances, the ranking of neighbors remains the same. Only calculate the actual square root if the question explicitly asks for the distance value.

    • Systematic Tabulation: Create a small table to keep track of each training point, its calculated squared distance to the query point, and its class. This minimizes calculation errors and makes it easy to sort and select the top $k$ neighbors.

    • Check for Odd 'k': In binary classification problems, if the question asks for a suitable $k$, always start by considering odd values first, as they prevent ties. The PYQ from 2024.1 specifically asked for the "minimum odd value," highlighting the importance of this concept.
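The tabulation strategy can be automated as a quick sanity check. The sketch below (variable names are my own) reproduces the ranking table from the worked example in Section 1:

```python
# Reusing the worked example's data: query Q(3, 4), Class A and Class B points.
query = (3, 4)
training = [("A2", (2, 3), "A"), ("B1", (5, 4), "B"), ("A1", (1, 2), "A"),
            ("B3", (5, 6), "B"), ("B2", (6, 5), "B")]

# Build (squared distance, name, class) rows and sort ascending by distance.
table = sorted(
    (sum((a - b) ** 2 for a, b in zip(query, pt)), name, cls)
    for name, pt, cls in training
)
for sq_dist, name, cls in table:
    print(f"{name}: distance squared = {sq_dist} (Class {cls})")
```

The first three rows give the $k=3$ neighborhood directly, matching the ranking computed by hand.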

    ---

    Common Mistakes

    It is important to be aware of common pitfalls when applying the k-NN algorithm, especially under exam pressure.

    ⚠️ Avoid These Errors
      • ❌ Feature Scaling Negligence: Forgetting that k-NN is highly sensitive to the scale of features. A feature with a large range (e.g., salary in rupees) will dominate the distance calculation over a feature with a small range (e.g., years of experience).
    ✅ Correct Approach: While GATE problems often provide pre-scaled or simple integer coordinates, remember that in a practical scenario, features must be normalized (e.g., to a [0, 1] range) or standardized (to zero mean and unit variance) before applying k-NN.
      • ❌ Using Even k for Binary Classification: Selecting an even value for $k$ in a two-class problem can lead to a tie, where an equal number of neighbors belong to each class.
    ✅ Correct Approach: Always prefer an odd value for $k$ (3, 5, 7, etc.) in binary classification to ensure a clear majority winner.
      • ❌ Computational Complexity Misunderstanding: Assuming k-NN is fast. While the training phase is trivial (it just stores the data), the prediction phase is computationally expensive.
    ✅ Correct Approach: Understand that for each prediction, k-NN must compute distances to all $n$ training points, making its prediction complexity $O(nd)$, where $d$ is the number of dimensions. This makes it unsuitable for large datasets or real-time applications without specialized data structures such as k-d trees.

    ---

    Practice Questions

    :::question type="NAT" question="A dataset contains points from two classes: Plus (+) and Minus (-). Plus points are located at (2,3) and (3,4). Minus points are at (4,2), (5,1), and (6,2). Using the k-NN algorithm with Euclidean distance, what is the minimum odd value of k for which the query point Q(4,3) will be classified as Plus (+)? " answer="3" hint="Calculate the squared Euclidean distance from Q(4,3) to all points. Rank them and observe the classes of the neighbors for k=1, k=3, k=5." solution="
    Step 1: Define the query point and the training data.
    Query point $Q(4,3)$.
    Plus (+): $P_1(2,3)$, $P_2(3,4)$.
    Minus (-): $M_1(4,2)$, $M_2(5,1)$, $M_3(6,2)$.

    Step 2: Calculate the squared Euclidean distance from Q to each point.

    $$d(Q, P_1)^2 = (2-4)^2 + (3-3)^2 = (-2)^2 + (0)^2 = 4$$

    $$d(Q, P_2)^2 = (3-4)^2 + (4-3)^2 = (-1)^2 + (1)^2 = 2$$

    $$d(Q, M_1)^2 = (4-4)^2 + (2-3)^2 = (0)^2 + (-1)^2 = 1$$

    $$d(Q, M_2)^2 = (5-4)^2 + (1-3)^2 = (1)^2 + (-2)^2 = 5$$

    $$d(Q, M_3)^2 = (6-4)^2 + (2-3)^2 = (2)^2 + (-1)^2 = 5$$

    Step 3: Rank the points by their squared distance to Q in ascending order.

  • $M_1(4,2)$: dist² = 1 (Class Minus)

  • $P_2(3,4)$: dist² = 2 (Class Plus)

  • $P_1(2,3)$: dist² = 4 (Class Plus)

  • $M_2(5,1)$: dist² = 5 (Class Minus)

  • $M_3(6,2)$: dist² = 5 (Class Minus)

    Step 4: Evaluate the classification for increasing odd values of k.

    • For k=1: The single nearest neighbor is $M_1$. The class is Minus.

    • For k=3: The three nearest neighbors are $M_1$, $P_2$, and $P_1$. Their classes are {Minus, Plus, Plus}. The majority vote is 2 for Plus and 1 for Minus. The classification is Plus.

    Result: The minimum odd value of k for which the point is classified as Plus is 3.
    "
    :::

    :::question type="MCQ" question="In the k-NN algorithm, choosing a very small value for $k$ (e.g., $k=1$) typically leads to:" options=["A model with high bias and low variance","A model with low bias and high variance","A model with high bias and high variance","A model that is computationally less expensive at prediction time"] answer="A model with low bias and high variance" hint="Consider how a small k affects the model's sensitivity to individual data points, including noise." solution="A small value of $k$, such as $k=1$, makes the model's prediction highly dependent on the single closest training point. This allows the decision boundary to be very flexible and closely fit the training data, capturing intricate patterns. This corresponds to low bias. However, this extreme flexibility also means the model is very sensitive to noise and outliers in the training data, which leads to high variance. A single noisy data point can significantly alter the classification of nearby query points. Therefore, a small $k$ leads to low bias and high variance, a characteristic of overfitting."
    :::

    :::question type="MSQ" question="Which of the following statements about the k-NN algorithm are correct?" options=["k-NN is a non-parametric model.","k-NN is an eager learning algorithm.","The prediction time complexity of k-NN is independent of the size of the training set.","Performance of k-NN can be sensitive to feature scaling."] answer="k-NN is a non-parametric model.,Performance of k-NN can be sensitive to feature scaling." hint="Evaluate each statement based on the core properties of k-NN. Consider its learning style (lazy vs. eager) and its reliance on distance calculations." solution="

    • k-NN is a non-parametric model: This is correct. Non-parametric means the model does not make any assumptions about the underlying data distribution (e.g., it does not assume the data is Gaussian).

    • k-NN is an eager learning algorithm: This is incorrect. k-NN is a lazy learning algorithm because it does not build a model during the training phase; it simply stores the entire training dataset. The main computation happens during the prediction/testing phase. Eager learners, like logistic regression or SVM, build a generalized model from the training data beforehand.

    • The prediction time complexity of k-NN is independent of the size of the training set: This is incorrect. For each new point to be classified, a naive k-NN algorithm must compute the distance to every one of the $n$ points in the training set. Thus, its prediction time complexity is typically $O(nd)$, where $n$ is the number of training samples and $d$ is the number of features.

    • Performance of k-NN can be sensitive to feature scaling: This is correct. Since k-NN relies on distance metrics like Euclidean distance, features with larger scales can disproportionately influence the distance calculation. For instance, if one feature ranges from 0 to 1000 and another from 0 to 1, the first feature will dominate the distance. Therefore, it is standard practice to scale features (e.g., through normalization or standardization) before applying k-NN.

    "
    :::

    :::question type="NAT" question="Calculate the Euclidean distance between the points $P(2, -1, 4)$ and $Q(5, 3, 2)$ in 3-dimensional space. (Round off to two decimal places)" answer="5.39" hint="Use the 3D Euclidean distance formula: $\sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}$." solution="
    Step 1: Identify the coordinates of the two points.

    $P = (x_1, y_1, z_1) = (2, -1, 4)$

    $Q = (x_2, y_2, z_2) = (5, 3, 2)$

    Step 2: Apply the Euclidean distance formula.

    $$d(P, Q) = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}$$

    Step 3: Substitute the coordinate values into the formula.

    $$d(P, Q) = \sqrt{(5-2)^2 + (3-(-1))^2 + (2-4)^2}$$

    Step 4: Simplify the expression inside the square root.

    $$d(P, Q) = \sqrt{(3)^2 + (4)^2 + (-2)^2}$$

    $$d(P, Q) = \sqrt{9 + 16 + 4}$$

    $$d(P, Q) = \sqrt{29}$$

    Step 5: Calculate the final value and round to two decimal places.

    $$d(P, Q) \approx 5.38516$$

    Result:
    Rounding to two decimal places, the distance is 5.39.
    "
    :::

    ---

    Summary

    ❗ Key Takeaways for GATE

    • Lazy Learning: k-NN is an instance-based, lazy learning algorithm. It performs no computation during training, simply storing the dataset. The computation is deferred to prediction time.

    • Core Mechanism: The algorithm classifies a new point based on the majority class of its $k$ nearest neighbors in the feature space.

    • Euclidean Distance: For the GATE exam, be thoroughly prepared to calculate Euclidean distances between points in 2D or 3D space quickly and accurately. Remember the formula: $d = \sqrt{\sum (p_i - q_i)^2}$

    • The Role of $k$: The choice of $k$ controls the bias-variance trade-off. A small $k$ leads to high variance (overfitting), while a large $k$ leads to high bias (underfitting). For binary classification, an odd $k$ is strongly preferred to avoid ties.

    • Sensitivity to Scale: As a distance-based algorithm, k-NN's performance is sensitive to the scale of the features.

    ---

    What's Next?

    💡 Continue Learning

    Mastering k-Nearest Neighbors provides a foundation for understanding other machine learning concepts. This topic connects to:

      • Feature Scaling (Normalization and Standardization): Since k-NN is sensitive to the magnitude of features, understanding how to scale data is crucial. This is a vital preprocessing step for many ML algorithms.

      • The Curse of Dimensionality: Explore why distance-based methods like k-NN struggle in high-dimensional spaces. As dimensions increase, the distances between pairs of points tend to become uniform, making the concept of a "nearest neighbor" less meaningful.

      • Other Classification Algorithms: Compare k-NN's non-parametric nature with parametric models like Logistic Regression and Support Vector Machines (SVMs). Understand their different assumptions, decision boundaries, and computational trade-offs.

    ---

    💡 Moving Forward

    Now that you understand k-Nearest Neighbors (k-NN), let's explore Decision Trees, which build on these concepts.

    ---

    Part 2: Decision Trees

    Introduction

    Decision Trees represent one of the most intuitive and fundamental models in supervised machine learning. Employed for both classification and regression tasks, they partition the feature space into a set of hierarchical, conditional rules, culminating in a structure that resembles an inverted tree. At the apex of this structure is the root node, which represents the entire dataset. This dataset is recursively partitioned at each internal node based on the values of a selected attribute. This process continues until the subsets at the nodes are sufficiently pure, or some other stopping criterion is met, at which point a leaf node, or terminal node, is created to assign a class label or a continuous value.

    The core challenge in constructing an effective decision tree lies in the selection of attributes for splitting the data at each node. An optimal split is one that maximally separates the classes, resulting in child nodes that are more homogeneous, or "purer," than the parent node. The algorithm must greedily select the attribute that provides the most information about the target variable at each step. To quantify this notion of purity and the effectiveness of a split, we employ mathematical measures such as Entropy and Gini Impurity. A thorough understanding of these metrics, and the concept of Information Gain which is derived from them, is paramount for mastering the construction and interpretation of decision trees, a frequent topic of inquiry in competitive examinations like GATE.

    ---

    Key Concepts

    1. Structure of a Decision Tree

    A decision tree is a hierarchical model composed of several key components. Let us formalize these elements, as their interplay defines the model's predictive logic.

    * Root Node: The topmost node in the tree, representing the entire training dataset before any splits have been made.
    * Internal Node (or Decision Node): A node that represents a test on an attribute. It splits the data into two or more subsets based on the outcome of the test. Each internal node has one incoming branch and two or more outgoing branches.
    * Branch (or Edge): A link between two nodes, representing the outcome of the test at the parent node. Each branch is typically labeled with a value or a range of values for the attribute tested.
    * Leaf Node (or Terminal Node): A node that does not split any further. It represents a final decision or a class label. In a classification tree, the leaf node contains the predicted class for instances that traverse the path leading to it.
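The interplay of these components can be sketched as a small data structure plus a traversal routine. This is a minimal, hypothetical illustration (the attribute names A1, A2, the values v1 to v4, and the class labels are illustrative):

```python
# A tiny decision tree as nested dicts: internal nodes test an attribute
# and branch on its value; leaves carry a class label.
tree = {
    "attribute": "A1",
    "branches": {
        "v1": {"attribute": "A2",
               "branches": {"v3": {"label": "Class 1"},
                            "v4": {"label": "Class 2"}}},
        "v2": {"label": "Class 1"},
    },
}

def classify(node, instance):
    """Follow the branch matching the instance's attribute value until a leaf."""
    while "label" not in node:
        node = node["branches"][instance[node["attribute"]]]
    return node["label"]

print(classify(tree, {"A1": "v1", "A2": "v4"}))  # prints Class 2
```

Prediction is thus a single root-to-leaf path, one attribute test per internal node.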

    Consider the following visual representation of a simple decision tree.






    [Figure: a simple decision tree. The root node tests attribute A1 (branches A1 = v1, A1 = v2); an internal node tests A2 (branches A2 = v3, A2 = v4); the leaves are labeled Class 1, Class 2, and Class 1.]
    2. The Splitting Process: Measuring Impurity

    The fundamental principle guiding the construction of a decision tree is the reduction of impurity. At each node, we seek to find an attribute test that splits the data into subsets that are as homogeneous as possible with respect to the target variable. A perfectly homogeneous, or pure, subset contains instances of only one class. We require a quantitative measure of this impurity. The two most prominent measures used in classification trees are Entropy and Gini Impurity.

    📖 Impurity

    In the context of decision trees, impurity is a measure of the heterogeneity of the labels at a node. A node is considered pure (impurity = 0) if all its data samples belong to a single class, and maximally impure if the samples are evenly distributed among all classes.

    3. Entropy

    Originating from information theory, entropy quantifies the level of uncertainty or randomness in a set of data. For a given set of examples $S$ with $c$ distinct classes, the entropy is a measure of how mixed these classes are.

    📐 Entropy

    $$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$$

    Variables:

      • $S$ = The set of data samples at a given node.

      • $c$ = The number of distinct classes.

      • $p_i$ = The proportion of samples in $S$ that belong to class $i$.

    When to use: This is the core calculation for the ID3 algorithm and is fundamental to computing Information Gain.

    The value of entropy ranges from $0$ to $\log_2(c)$.

    • $Entropy(S) = 0$ if the set $S$ is perfectly pure (all samples belong to one class, so one $p_i = 1$ and all others are $0$).

    • $Entropy(S) = \log_2(c)$ if the set $S$ is maximally impure (samples are uniformly distributed among all $c$ classes, so $p_i = 1/c$ for all $i$). For a binary classification problem ($c=2$), the maximum entropy is $\log_2(2) = 1$.
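Entropy calculations are easy to verify numerically; a minimal sketch (the function name is my own, taking class counts rather than proportions):

```python
import math

def entropy(counts):
    """Entropy (base 2) of a node whose class distribution is given as counts."""
    total = sum(counts)
    ent = 0.0
    for n in counts:
        if n > 0:  # 0 * log2(0) is taken as 0 by convention
            p = n / total
            ent -= p * math.log2(p)
    return ent

print(entropy([7, 7]))   # uniform binary split -> maximum entropy 1.0
print(entropy([14, 0]))  # pure node -> 0.0
print(entropy([9, 5]))   # 9-Yes/5-No node, approximately 0.940
```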


    4. Information Gain

    Information Gain is the metric used to decide which attribute to split on at each step in building the tree. It measures the expected reduction in entropy caused by partitioning the examples according to a given attribute. The attribute that yields the highest information gain is chosen for the split.

    πŸ“ Information Gain
    IG(S,A)=Entropy(S)βˆ’βˆ‘v∈Values(A)∣Sv∣∣S∣Entropy(Sv)IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

    Variables:

      • SS = The set of data samples at the parent node.

      • AA = The attribute being tested for the split.

      • Values(A)Values(A) = The set of all possible values for attribute AA.

      • SvS_v = The subset of SS for which attribute AA has value vv.

      • ∣S∣|S| = The number of samples in set SS.

      • ∣Sv∣|S_v| = The number of samples in subset SvS_v.


    When to use: Used by algorithms like ID3 to select the best splitting attribute at any given node. The attribute with the maximum IG(S,A)IG(S, A) is selected.

    The second term in the formula, βˆ‘v∈Values(A)∣Sv∣∣S∣Entropy(Sv)\sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v), represents the weighted average entropy of the child nodes after splitting on attribute AA. Information Gain is therefore simply the difference between the parent node's entropy and the weighted average entropy of its children.
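The definition can be checked numerically. The sketch below (helper names are my own) computes the weighted-average child entropy and subtracts it from the parent entropy, using the class counts of the 'GRE Score' split from the worked example that follows:

```python
import math

def entropy(counts):
    """Entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def information_gain(parent, children):
    """IG = parent entropy minus size-weighted average entropy of the children.

    `parent` is a list of class counts; `children` is a list of such lists,
    one per branch of the candidate split.
    """
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# 'GRE Score' split: parent 9 Yes / 5 No; High = 5/0, Medium = 4/2, Low = 0/3
print(round(information_gain([9, 5], [[5, 0], [4, 2], [0, 3]]), 3))  # 0.547
```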

    Worked Example:

    Problem:
    Consider the following dataset of 14 student applications for a postgraduate program. We wish to build a decision tree to predict the 'Admission' outcome. Calculate the Information Gain for the attribute 'GRE Score'.

    | Applicant ID | GRE Score | Undergrad CGPA | Research Exp. | Admission (Target) |
    |--------------|-----------|----------------|---------------|--------------------|
    | 1 | High | > 8.5 | Yes | Yes |
    | 2 | High | > 8.5 | No | Yes |
    | 3 | Medium | > 8.5 | Yes | Yes |
    | 4 | Medium | <= 8.5 | No | No |
    | 5 | Low | <= 8.5 | No | No |
    | 6 | Low | > 8.5 | Yes | No |
    | 7 | High | <= 8.5 | Yes | Yes |
    | 8 | Medium | <= 8.5 | Yes | Yes |
    | 9 | High | > 8.5 | Yes | Yes |
    | 10 | Medium | > 8.5 | No | Yes |
    | 11 | Low | <= 8.5 | Yes | No |
    | 12 | Medium | > 8.5 | Yes | Yes |
    | 13 | Medium | <= 8.5 | No | No |
    | 14 | High | <= 8.5 | No | Yes |

    Solution:

    Let $S$ be the entire dataset of 14 applicants.
    First, we count the number of 'Yes' and 'No' outcomes for the 'Admission' target class.

    • Number of 'Yes' ($N_{Yes}$) = 9

    • Number of 'No' ($N_{No}$) = 5

    • Total samples $|S|$ = 14

    Step 1: Calculate the entropy of the parent node, $Entropy(S)$.

    The proportions are $p_{Yes} = \frac{9}{14}$ and $p_{No} = \frac{5}{14}$.

    $$Entropy(S) = - \left( \frac{9}{14} \log_2\left(\frac{9}{14}\right) + \frac{5}{14} \log_2\left(\frac{5}{14}\right) \right)$$
    $$Entropy(S) = - \left( \frac{9}{14} \times (-0.637) + \frac{5}{14} \times (-1.485) \right)$$
    $$Entropy(S) = - ( -0.4095 - 0.5304 )$$
    $$Entropy(S) = 0.9399$$

    Step 2: Partition the data based on the attribute 'GRE Score' and calculate the entropy for each subset.

    The attribute 'GRE Score' has three values: 'High', 'Medium', 'Low'.

    * For GRE Score = 'High' (S_{High}):
    * Total samples |S_{High}| = 5.
    * Outcomes: 5 'Yes', 0 'No'.
    * This subset is pure.
    * p_{Yes} = \frac{5}{5} = 1, p_{No} = \frac{0}{5} = 0.
    * Entropy(S_{High}) = - (1 \log_2(1) + 0 \log_2(0)) = 0. (Note: 0 \log_2(0) is defined as 0.)

    * For GRE Score = 'Medium' (S_{Medium}):
    * Total samples |S_{Medium}| = 6.
    * Outcomes: 4 'Yes', 2 'No'.
    * p_{Yes} = \frac{4}{6} = \frac{2}{3}, p_{No} = \frac{2}{6} = \frac{1}{3}.

    Entropy(S_{Medium}) = - \left( \frac{2}{3} \log_2\left(\frac{2}{3}\right) + \frac{1}{3} \log_2\left(\frac{1}{3}\right) \right)

    Entropy(S_{Medium}) = - \left( \frac{2}{3} \times (-0.585) + \frac{1}{3} \times (-1.585) \right)

    Entropy(S_{Medium}) = -(-0.390 - 0.528)

    Entropy(S_{Medium}) = 0.918

    * For GRE Score = 'Low' (S_{Low}):
    * Total samples |S_{Low}| = 3.
    * Outcomes: 0 'Yes', 3 'No'.
    * This subset is pure.
    * p_{Yes} = \frac{0}{3} = 0, p_{No} = \frac{3}{3} = 1.
    * Entropy(S_{Low}) = - (0 \log_2(0) + 1 \log_2(1)) = 0.

    Step 3: Calculate the weighted average entropy of the child nodes.

    \text{Weighted Avg Entropy} = \frac{|S_{High}|}{|S|} Entropy(S_{High}) + \frac{|S_{Medium}|}{|S|} Entropy(S_{Medium}) + \frac{|S_{Low}|}{|S|} Entropy(S_{Low})
    \text{Weighted Avg Entropy} = \left(\frac{5}{14} \times 0\right) + \left(\frac{6}{14} \times 0.918\right) + \left(\frac{3}{14} \times 0\right)
    \text{Weighted Avg Entropy} = 0 + 0.3934 + 0
    \text{Weighted Avg Entropy} = 0.3934

    Step 4: Compute the Information Gain for the attribute 'GRE Score'.

    IG(S, \text{'GRE Score'}) = Entropy(S) - \text{Weighted Avg Entropy}
    IG(S, \text{'GRE Score'}) = 0.9399 - 0.3934
    IG(S, \text{'GRE Score'}) = 0.5465

    Answer: The Information Gain for splitting on 'GRE Score' is approximately 0.5465. The algorithm would repeat this calculation for 'Undergrad CGPA' and 'Research Exp.' to find the attribute with the highest gain for the root node split.
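    The calculation above can also be reproduced programmatically. Below is a minimal Python sketch (the column data is taken from the table in this worked example) that computes entropy and the information gain for 'GRE Score':

    ```python
    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (base 2) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(attribute_values, labels):
        """Entropy(parent) minus the size-weighted entropy of each child subset."""
        n = len(labels)
        weighted = 0.0
        for v in set(attribute_values):
            subset = [y for x, y in zip(attribute_values, labels) if x == v]
            weighted += (len(subset) / n) * entropy(subset)
        return entropy(labels) - weighted

    # 'GRE Score' column and 'Admission' target from the worked example (14 applicants)
    gre = ["High", "High", "Medium", "Medium", "Low", "Low", "High",
           "Medium", "High", "Medium", "Low", "Medium", "Medium", "High"]
    admission = ["Yes", "Yes", "Yes", "No", "No", "No", "Yes",
                 "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes"]

    # β‰ˆ 0.5467; the worked example's 0.5465 reflects intermediate rounding of the logs
    print(round(information_gain(gre, admission), 4))
    ```

    Using unrounded logarithms, the exact gain is 0.5467; the hand calculation above lands on 0.5465 only because the individual log values were rounded to three decimals.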

    ---

    5. Gini Impurity

    Gini Impurity is an alternative measure of impurity used by the CART (Classification and Regression Tree) algorithm. It measures the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the distribution of labels in the subset.

    πŸ“ Gini Impurity
    Gini(S) = 1 - \sum_{i=1}^{c} p_i^2

    Variables:

      • S = The set of data samples at a given node.

      • c = The number of distinct classes.

      • p_i = The proportion of samples in S that belong to class i.


    When to use: This is the default impurity measure for the CART algorithm. It is computationally less intensive than entropy as it does not require logarithmic calculations.

    The Gini Impurity ranges from 0 to 1 - \frac{1}{c}.

    • Gini(S) = 0 if the set S is pure.

    • For binary classification (c = 2), the maximum Gini Impurity is 1 - (0.5^2 + 0.5^2) = 0.5.


    6. Gini Gain

    Analogous to Information Gain, Gini Gain (or Gini Index split criterion) measures the reduction in impurity achieved by splitting on an attribute. The CART algorithm selects the attribute that maximizes the Gini Gain.

    πŸ“ Gini Gain
    GiniGain(S, A) = Gini(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Gini(S_v)

    Variables:

      • S = The set of data samples at the parent node.

      • A = The attribute being tested for the split.

      • S_v = The subset of S for which attribute A has value v.


    When to use: Used by the CART algorithm to select the optimal splitting attribute.
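    As a quick sketch, the two Gini formulas above translate directly into code:

    ```python
    from collections import Counter

    def gini(labels):
        """Gini impurity: 1 minus the sum of squared class proportions."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_gain(attribute_values, labels):
        """Gini(parent) minus the size-weighted Gini of each child subset (CART criterion)."""
        n = len(labels)
        weighted = 0.0
        for v in set(attribute_values):
            subset = [y for x, y in zip(attribute_values, labels) if x == v]
            weighted += (len(subset) / n) * gini(subset)
        return gini(labels) - weighted

    # Pure node -> 0; balanced binary node -> 0.5 (the maximum for c = 2)
    print(gini(["A", "A", "A"]))       # 0.0
    print(gini(["A", "A", "B", "B"]))  # 0.5
    ```

    Note that, unlike entropy, no logarithms are needed, which is why CART's criterion is slightly cheaper to evaluate.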

    ---

    Problem-Solving Strategies

    When faced with a numerical question on decision tree splits in GATE, a systematic approach is crucial to ensure accuracy under time constraints.

    πŸ’‘ GATE Strategy: Systematic Impurity Calculation

    To compute Information Gain or Gini Gain for an attribute AA:

    1. Calculate Parent Impurity: Compute the initial impurity of the entire dataset S before any split. This is Entropy(S) or Gini(S). Count the total number of samples for each class to find the proportions p_i.

    2. Partition Data: For the given attribute A, create a frequency table. For each value v of A, count the number of samples belonging to each class. This gives you the counts for each subset S_v.

    3. Calculate Child Impurity: For each subset S_v, calculate its impurity, Entropy(S_v) or Gini(S_v), using the class counts from the previous step.

    4. Compute Weighted Average: Calculate the weighted average of the child impurities using the formula \sum \frac{|S_v|}{|S|} \times Impurity(S_v).

    5. Find the Gain: Subtract the result of Step 4 from the result of Step 1. This gives you the final gain.

    This structured process minimizes calculation errors and makes it easy to double-check your work.

    ---

    Common Mistakes

    Students often make predictable errors when calculating these metrics. Awareness of these pitfalls is the first step toward avoiding them.

    ⚠️ Avoid These Errors
      • ❌ Using Natural Logarithm: A frequent error is using \ln instead of \log_2 when calculating entropy. Information theory, and thus Information Gain, is based on bits of information, which necessitates the use of a base-2 logarithm.
    βœ… Correct Approach: Always use \log_2. If your calculator only has \ln and \log_{10}, use the change of base formula: \log_2(x) = \frac{\ln(x)}{\ln(2)} = \frac{\log_{10}(x)}{\log_{10}(2)}.
      • ❌ Forgetting the Weights: When calculating the average impurity of the children, it is easy to forget to weight each child's impurity by its relative size, \frac{|S_v|}{|S|}. Simply averaging the impurities is incorrect unless all child nodes have the same number of samples.
    βœ… Correct Approach: Always multiply the impurity of each child node by its proportion of the total samples before summing.
      • ❌ Incorrect Gini Formula: Some students calculate only the sum of squared proportions, \sum p_i^2, and forget to subtract it from 1.
    βœ… Correct Approach: The complete formula is Gini(S) = 1 - \sum p_i^2. Remember that Gini impurity measures the probability of misclassification, which is complementary to the probability of correct classification.

    ---

    Practice Questions

    :::question type="NAT" question="A dataset contains 20 samples. A split on attribute 'X' divides the data into two subsets. Subset 1 has 8 samples, with an entropy of 0.81. Subset 2 has 12 samples, with an entropy of 0.92. If the entropy of the entire dataset before the split was 0.99, what is the Information Gain of splitting on attribute 'X'? (rounded off to two decimal places)." answer="0.11" hint="Calculate the weighted average entropy of the child nodes and subtract it from the parent node's entropy." solution="
    Step 1: Identify the given values.

    • Entropy of parent node, Entropy(S) = 0.99

    • Total samples, |S| = 20

    • Subset 1: |S_1| = 8, Entropy(S_1) = 0.81

    • Subset 2: |S_2| = 12, Entropy(S_2) = 0.92


    Step 2: Calculate the weighted average entropy of the children.

    \text{Weighted Avg Entropy} = \frac{|S_1|}{|S|} Entropy(S_1) + \frac{|S_2|}{|S|} Entropy(S_2)
    \text{Weighted Avg Entropy} = \left(\frac{8}{20} \times 0.81\right) + \left(\frac{12}{20} \times 0.92\right)
    \text{Weighted Avg Entropy} = (0.4 \times 0.81) + (0.6 \times 0.92)
    \text{Weighted Avg Entropy} = 0.324 + 0.552 = 0.876

    Step 3: Calculate the Information Gain.

    IG(S, X) = Entropy(S) - \text{Weighted Avg Entropy}
    IG(S, X) = 0.99 - 0.876
    IG(S, X) = 0.114

    Result:
    Rounding to two decimal places, the Information Gain is 0.11.
    Answer: \boxed{0.11}
    "
    :::

    :::question type="MCQ" question="For a binary classification problem, a node in a decision tree contains 10 samples of Class A and 10 samples of Class B. What is the Gini Impurity of this node?" options=["0", "0.25", "0.5", "1.0"] answer="0.5" hint="Use the Gini Impurity formula, 1 - \sum p_i^2, for a maximally impure binary node." solution="
    Step 1: Determine the proportions of each class.

    • Total samples = 10 + 10 = 20

    • Proportion of Class A, p_A = \frac{10}{20} = 0.5

    • Proportion of Class B, p_B = \frac{10}{20} = 0.5


    Step 2: Apply the Gini Impurity formula.

    Gini(S) = 1 - (p_A^2 + p_B^2)
    Gini(S) = 1 - (0.5^2 + 0.5^2)
    Gini(S) = 1 - (0.25 + 0.25)
    Gini(S) = 1 - 0.5
    Gini(S) = 0.5

    Result:
    The Gini Impurity is 0.5, which is the maximum possible value for a binary classification problem.
    Answer: \boxed{0.5}
    "
    :::

    :::question type="MSQ" question="Which of the following statements about decision tree splitting criteria are correct?" options=["The ID3 algorithm uses Information Gain to select the best split.", "The CART algorithm uses Gini Impurity to select the best split.", "Information Gain can be negative if a split results in more impure child nodes.", "An attribute with higher Information Gain is preferred over an attribute with lower Information Gain."] answer="The ID3 algorithm uses Information Gain to select the best split.,The CART algorithm uses Gini Impurity to select the best split.,An attribute with higher Information Gain is preferred over an attribute with lower Information Gain." hint="Recall the standard algorithms and the definition of Information Gain." solution="

    • Option A is correct. The ID3 (Iterative Dichotomiser 3) algorithm is the classic decision tree algorithm that uses Information Gain as its splitting criterion.

    • Option B is correct. The CART (Classification and Regression Trees) algorithm uses Gini Impurity (and Gini Gain) for classification trees.

    • Option C is incorrect. Information Gain is defined as Entropy(\text{parent}) - \text{Weighted Avg Entropy}(\text{children}). Because entropy is a concave function of the class proportions, a split can never produce a weighted average child entropy greater than the parent's entropy, so Information Gain is always non-negative. At worst, a useless split yields an Information Gain of 0.

    • Option D is correct. The core principle of greedy decision tree construction is to select the attribute that maximizes the reduction in impurity. Therefore, a higher Information Gain signifies a better split and is preferred.

    Answer: \boxed{The ID3 algorithm uses Information Gain to select the best split.,The CART algorithm uses Gini Impurity to select the best split.,An attribute with higher Information Gain is preferred over an attribute with lower Information Gain.}
    "
    :::

    :::question type="NAT" question="A dataset for loan approval prediction has 10 'Approved' and 6 'Rejected' applications. What is the Gini Impurity of this dataset? Calculate the value rounded to three decimal places." answer="0.469" hint="Calculate the proportions of each class and apply the Gini Impurity formula, 1 - \sum p_i^2." solution="
    Step 1: Find the total number of samples and the proportion of each class.

    • Total samples |S| = 10 (\text{Approved}) + 6 (\text{Rejected}) = 16

    • Proportion Approved, p_{Approved} = \frac{10}{16} = 0.625

    • Proportion Rejected, p_{Rejected} = \frac{6}{16} = 0.375


    Step 2: Apply the Gini Impurity formula.

    Gini(S) = 1 - (p_{Approved}^2 + p_{Rejected}^2)
    Gini(S) = 1 - (0.625^2 + 0.375^2)
    Gini(S) = 1 - (0.390625 + 0.140625)
    Gini(S) = 1 - 0.53125
    Gini(S) = 0.46875

    Result:
    Rounding to three decimal places, the Gini Impurity is 0.469.
    Answer: \boxed{0.469}
    "
    :::

    ---


    Summary

    ❗ Key Takeaways for GATE

    • Core Principle: Decision trees are built by recursively splitting the data to maximize the purity of the resulting child nodes. This is a greedy, top-down approach.

    • Splitting Criteria: The choice of the best attribute for a split is determined by a quantitative measure. The two primary criteria are Information Gain (based on Entropy) and Gini Gain (based on Gini Impurity).

    • Formula Mastery: You must have perfect recall of the formulas for Entropy, Gini Impurity, Information Gain, and Gini Gain. Be particularly careful with the base of the logarithm (\log_2) and the weighting of child node impurities.

    • Application: Be prepared for numerical problems that provide a small dataset and ask you to compute one of these metrics for a specific attribute, as this directly tests your understanding of the tree-building process.

    ---

    What's Next?

    πŸ’‘ Continue Learning

    A solid understanding of decision trees is a gateway to more advanced and powerful machine learning concepts.

      • Ensemble Methods (Random Forest, Gradient Boosting): Decision trees serve as the fundamental building blocks (base learners) for these highly effective ensemble models. Random Forest builds many decision trees on different subsets of data and features, while Gradient Boosting builds trees sequentially to correct the errors of previous ones.
      • Pruning and Overfitting: A single decision tree can easily overfit the training data by growing too deep and capturing noise. Techniques like pre-pruning (setting stopping criteria) and post-pruning (removing branches after the tree is built) are essential for creating generalizable models.
      • Regression Trees: The decision tree framework can be adapted for regression tasks. Instead of using impurity measures like entropy, regression trees use metrics like Mean Squared Error (MSE) to evaluate splits, aiming to minimize variance in the leaf nodes.

    ---

    πŸ’‘ Moving Forward

    Now that you understand Decision Trees, let's explore Naive Bayes Classifier which builds on these concepts.

    ---

    Part 3: Naive Bayes Classifier

    Introduction

    The Naive Bayes classifier represents a family of simple, yet surprisingly powerful, probabilistic classifiers based on applying Bayes' theorem with a strong (or "naive") independence assumption between the features. It is a supervised learning algorithm predominantly used for classification tasks, such as text classification (e.g., spam detection) and medical diagnosis. Despite its simplicity and the often unrealistic nature of its core assumption, the Naive Bayes classifier frequently performs well in practice, particularly in domains where the dimensionality of the feature space is high.

    Our study of this model will focus on its theoretical underpinnings, the mathematical formulation of its decision rule, and the practical considerations for its application. We will dissect the conditional independence assumption that gives the model its name and explore how parameters are estimated from a training dataset. Understanding this classifier is fundamental, as it provides a clear entry point into the world of probabilistic machine learning and serves as a crucial baseline for more complex models. For the GATE examination, a firm grasp of its mechanics, including parameter counting and probability calculations, is essential.

    πŸ“– Naive Bayes Classifier

    A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. It calculates the probability of a data point belonging to a particular class, given a set of features. The classification decision is based on the class with the maximum posterior probability, computed using Bayes' theorem. The model's core characteristic is the "naive" assumption that all features are mutually conditionally independent, given the class label.

    ---

    Key Concepts

    1. The Probabilistic Foundation: Bayes' Theorem

    The foundation of the Naive Bayes classifier is Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. In the context of classification, we are interested in finding the probability of a class C_k given a feature vector \mathbf{x} = (x_1, x_2, \dots, x_n).

    Bayes' theorem provides a way to calculate this posterior probability, P(C_k | \mathbf{x}):

    P(C_k | \mathbf{x}) = \frac{P(\mathbf{x} | C_k) P(C_k)}{P(\mathbf{x})}

    Let us break down the components of this equation:

    • P(C_k | \mathbf{x}) is the posterior probability: the probability of class C_k after observing the feature vector \mathbf{x}. This is what we want to compute.

    • P(\mathbf{x} | C_k) is the likelihood: the probability of observing the feature vector \mathbf{x} given that the class is C_k.

    • P(C_k) is the prior probability: the initial probability of class C_k before observing any data.

    • P(\mathbf{x}) is the evidence or marginal probability: the total probability of observing the feature vector \mathbf{x}. It is constant across classes for a given input \mathbf{x}.


    For classification, our goal is to find the class C_k that is most probable given the data \mathbf{x}. This is known as the Maximum A Posteriori (MAP) decision rule:

    \hat{y} = \arg\max_{k} P(C_k | \mathbf{x})

    Since the evidence P(\mathbf{x}) is the same for all classes, it acts as a normalization constant and does not affect the relative ranking of the class probabilities. Therefore, we can simplify the decision rule by ignoring the denominator:

    \hat{y} = \arg\max_{k} P(\mathbf{x} | C_k) P(C_k)

    2. The 'Naive' Assumption: Conditional Independence

    Calculating the likelihood term P(\mathbf{x} | C_k) = P(x_1, x_2, \dots, x_n | C_k) directly is computationally intensive and requires a very large dataset to estimate the joint probability distribution of all features. To overcome this challenge, the Naive Bayes classifier makes a simplifying assumption.

    The Naive Conditional Independence Assumption: All features x_i are assumed to be conditionally independent of each other, given the class C_k.

    Mathematically, this assumption allows us to express the joint likelihood as a product of individual likelihoods for each feature:

    P(\mathbf{x} | C_k) = P(x_1, x_2, \dots, x_n | C_k) = \prod_{i=1}^{n} P(x_i | C_k)

    This assumption is "naive" because in most real-world scenarios, features are not perfectly independent. For instance, in text classification, the presence of the word "discount" might be correlated with the presence of the word "offer". However, this simplification makes the computation tractable and the model surprisingly effective.










    (Figure) Graphical model: the class C influences each feature X_1, X_2, \dots, X_n independently.

    3. The Naive Bayes Model for Classification

    By substituting the conditional independence assumption into our MAP decision rule, we arrive at the final form of the Naive Bayes classifier.

    πŸ“ Naive Bayes Classification Rule
    \hat{y} = \arg\max_{k} \left( P(C_k) \prod_{i=1}^{n} P(x_i | C_k) \right)

    Variables:

      • \hat{y} = Predicted class label.

      • C_k = The k-th class.

      • P(C_k) = The prior probability of class C_k.

      • P(x_i | C_k) = The conditional probability of observing feature x_i given class C_k.

      • \mathbf{x} = (x_1, \dots, x_n) = The feature vector of the instance to be classified.


    When to use: This formula is the core decision rule for any Naive Bayes classifier. It is applied after the prior and conditional probabilities have been estimated from the training data.

    In practice, the product of many small probabilities can lead to numerical underflow. To mitigate this, we often work with the log-transformed version of the posterior probability, which turns the product into a sum:

    \hat{y} = \arg\max_{k} \left( \log(P(C_k)) + \sum_{i=1}^{n} \log(P(x_i | C_k)) \right)

    Since the logarithm is a monotonically increasing function, maximizing the log-probability is equivalent to maximizing the probability itself.
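    To make the rule concrete, here is a minimal sketch of the MAP decision in log space (the probabilities below are toy values chosen for illustration, not taken from any dataset in this chapter):

    ```python
    import math

    def predict(priors, likelihoods):
        """MAP rule in log space: argmax_k [log P(C_k) + sum_i log P(x_i | C_k)].

        priors: dict mapping class -> P(C_k)
        likelihoods: dict mapping class -> list of P(x_i | C_k) for the observed features
        """
        scores = {
            c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
            for c in priors
        }
        return max(scores, key=scores.get)

    # Hypothetical two-class example with three observed features
    priors = {"spam": 0.4, "ham": 0.6}
    likelihoods = {"spam": [0.8, 0.7, 0.9], "ham": [0.1, 0.3, 0.2]}
    print(predict(priors, likelihoods))  # spam
    ```

    Because the logarithm is monotonic, this returns the same class as maximizing the raw product, but without risk of underflow when the feature count is large.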

    4. Parameter Estimation

    The Naive Bayes model is defined by its parameters: the prior probabilities P(C_k) and the conditional probabilities P(x_i | C_k). These are typically estimated from the training data using Maximum Likelihood Estimation (MLE).

    Estimating Priors P(C_k):
    The prior probability for a class C_k is estimated as the relative frequency of that class in the training data.

    \hat{P}(C_k) = \frac{\text{Number of samples in class } C_k}{\text{Total number of samples}}

    Estimating Conditional Probabilities P(x_i | C_k):
    The estimation of this term depends on the nature of the feature x_i.

    • For categorical/discrete features: The conditional probability is the relative frequency of feature value x_i among all samples belonging to class C_k.

    • For continuous features: A common approach is to assume that the feature values for each class are drawn from a specific probability distribution, such as a Gaussian distribution (leading to the Gaussian Naive Bayes classifier). The parameters of this distribution (e.g., mean and variance) are then estimated from the training data for each class.


    Let us now analyze the number of parameters that must be estimated, a crucial concept for GATE.

    Worked Example: Parameter Counting

    Problem:
    Consider a two-class classification problem (Class A, Class B) with a dataset having K binary-valued attributes (X_1, X_2, \dots, X_K). Determine the total number of independent probability parameters that need to be estimated to build a Naive Bayes classifier.

    Solution:

    We need to estimate two sets of parameters: the class priors and the feature conditional probabilities.

    Step 1: Estimate parameters for the class priors, P(C_k).

    For a two-class problem (A and B), we need to estimate P(\text{Class A}) and P(\text{Class B}). However, these probabilities must sum to 1.

    P(\text{Class A}) + P(\text{Class B}) = 1

    Therefore, if we estimate P(\text{Class A}), the value of P(\text{Class B}) is automatically determined as 1 - P(\text{Class A}). Thus, we only need to estimate 1 independent parameter for the priors.

    Step 2: Estimate parameters for the conditional probabilities, P(X_i | C_k).

    Each of the K attributes is binary, meaning it can take two values (e.g., 0 or 1). For each attribute X_i and for each class C_k, we need to estimate the probabilities.
    Consider attribute X_i and Class A. We need to estimate P(X_i = 0 | \text{Class A}) and P(X_i = 1 | \text{Class A}). Since these must sum to 1:

    P(X_i = 0 | \text{Class A}) + P(X_i = 1 | \text{Class A}) = 1

    We only need to estimate one of them (e.g., P(X_i = 1 | \text{Class A})). The other is determined.
    The same logic applies to Class B. So, for each attribute X_i, we need to estimate 2 parameters: P(X_i = 1 | \text{Class A}) and P(X_i = 1 | \text{Class B}).

    Step 3: Calculate the total number of conditional probability parameters.

    Since there are K independent attributes, and each requires 2 parameters (one for each class), the total number of parameters for the conditional probabilities is:

    K \times 2 = 2K

    Step 4: Sum the parameters from priors and conditional probabilities.

    Total parameters = (Parameters for priors) + (Parameters for conditional probabilities)

    \text{Total} = 1 + 2K

    Answer: The total number of parameters to be estimated is 2K + 1.
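    The counting argument generalizes: for M classes and categorical features where feature i takes d_i values, the independent-parameter count is (M - 1) + M \sum_i (d_i - 1). A small sketch of this general formula (the function name is my own, chosen for illustration):

    ```python
    def naive_bayes_param_count(num_classes, feature_cardinalities):
        """Independent parameters of a categorical Naive Bayes model.

        Priors contribute (M - 1); each feature with d possible values
        contributes (d - 1) conditional parameters per class.
        """
        priors = num_classes - 1
        conditionals = num_classes * sum(d - 1 for d in feature_cardinalities)
        return priors + conditionals

    # Worked example above: 2 classes, K binary attributes -> 2K + 1
    K = 5
    print(naive_bayes_param_count(2, [2] * K))  # 11  (= 2*5 + 1)
    ```

    Plugging in M = 2 and d_i = 2 for all K features recovers the 2K + 1 result derived above.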

    5. Making Predictions and Evaluating Misclassification

    Once the model parameters are learned, we can classify a new instance x\mathbf{x}. This involves calculating the proportional posterior for each class and selecting the class with the highest score. It is also important to understand the concept of misclassification probability.

    Worked Example: Classification and Misclassification Probability

    Problem:
    A Naive Bayes classifier is used for a binary classification problem with classes C_1 and C_2. The prior probabilities are P(C_1) = 0.2 and P(C_2) = 0.8. For a new data point with feature vector \mathbf{x}, the class-conditional probabilities (likelihoods) are found to be P(\mathbf{x}|C_1) = 0.5 and P(\mathbf{x}|C_2) = 0.25. What is the probability of misclassifying \mathbf{x} using the MAP decision rule?

    Solution:

    Step 1: Calculate the unnormalized posterior for each class.

    We use the formula \text{Score}(C_k) \propto P(\mathbf{x}|C_k) P(C_k).

    For class C_1:

    \text{Score}(C_1) = P(\mathbf{x}|C_1) P(C_1) = 0.5 \times 0.2 = 0.10

    For class C_2:

    \text{Score}(C_2) = P(\mathbf{x}|C_2) P(C_2) = 0.25 \times 0.8 = 0.20

    Step 2: Apply the MAP rule to predict the class.

    We compare the scores:

    \text{Score}(C_2) = 0.20 > \text{Score}(C_1) = 0.10

    The MAP rule predicts that the instance \mathbf{x} belongs to class C_2.

    Step 3: Calculate the evidence term P(\mathbf{x}) to normalize the posteriors.

    The evidence is the sum of the unnormalized posteriors over all classes.

    P(\mathbf{x}) = \sum_{k} P(\mathbf{x}|C_k) P(C_k) = \text{Score}(C_1) + \text{Score}(C_2)

    P(\mathbf{x}) = 0.10 + 0.20 = 0.30

    Step 4: Calculate the true posterior probabilities for each class.

    P(C_1|\mathbf{x}) = \frac{\text{Score}(C_1)}{P(\mathbf{x})} = \frac{0.10}{0.30} = \frac{1}{3}
    P(C_2|\mathbf{x}) = \frac{\text{Score}(C_2)}{P(\mathbf{x})} = \frac{0.20}{0.30} = \frac{2}{3}

    Step 5: Determine the probability of misclassification.

    The classifier predicts class C_2. The probability that this prediction is correct is the posterior probability of class C_2, which is P(C_2|\mathbf{x}) = \frac{2}{3}.
    The probability of misclassification is the probability that the instance actually belongs to the other class (C_1), which is P(C_1|\mathbf{x}).

    P(\text{misclassification}) = 1 - P(\text{predicted class} | \mathbf{x}) = 1 - P(C_2|\mathbf{x})

    P(\text{misclassification}) = 1 - \frac{2}{3} = \frac{1}{3}

    Alternatively, it is simply the sum of the posterior probabilities of all non-predicted classes. In this binary case, that is just P(C_1|\mathbf{x}) = \frac{1}{3}.

    Answer: The probability of misclassifying \mathbf{x} is approximately \boxed{0.33}.
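    The five steps above can be reproduced in a few lines of Python (numbers taken from this worked example):

    ```python
    priors = {"C1": 0.2, "C2": 0.8}
    likelihoods = {"C1": 0.5, "C2": 0.25}  # P(x | C_k) for the observed x

    # Unnormalized posteriors, evidence, and normalized posteriors
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(scores.values())
    posteriors = {c: s / evidence for c, s in scores.items()}

    # MAP prediction and its misclassification probability
    predicted = max(posteriors, key=posteriors.get)
    p_error = 1 - posteriors[predicted]

    print(predicted, round(p_error, 2))  # C2 0.33
    ```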

    ---

    Problem-Solving Strategies

    πŸ’‘ GATE Strategy: Log-Probabilities for Stability

    When a Naive Bayes problem involves many features, the product of their conditional probabilities, \prod P(x_i | C_k), can become extremely small, leading to floating-point underflow. To avoid this, always convert the decision rule to a sum of log-probabilities:

    \hat{y} = \arg\max_{k} \left( \log(P(C_k)) + \sum_{i=1}^{n} \log(P(x_i | C_k)) \right)

    This is numerically more stable and less prone to precision errors, which is critical in a time-constrained exam environment.

    πŸ’‘ GATE Strategy: Dealing with Zero Probabilities

    A potential issue arises if a feature value in the test set was not seen in the training set for a particular class. This would make P(x_i | C_k) = 0, causing the entire product for that class's posterior to become zero, regardless of the other features. To prevent this, use Laplace (or Additive) Smoothing.
    Add a small constant \alpha (usually \alpha = 1) to the numerator (the count) and d \cdot \alpha to the denominator, where d is the number of possible values for the feature.
    For a feature x_i with d possible values:

    P(x_i = v | C_k) = \frac{\text{count}(x_i = v,\, C_k) + \alpha}{\text{count}(C_k) + d \cdot \alpha}

    If a question mentions "Laplace smoothing," apply this formula.
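    A minimal sketch of the smoothed estimate (the function name and arguments are illustrative, not from any library):

    ```python
    def laplace_smoothed(count_value_and_class, count_class, num_values, alpha=1.0):
        """P(x_i = v | C_k) with additive smoothing: (n_vc + alpha) / (n_c + d * alpha)."""
        return (count_value_and_class + alpha) / (count_class + num_values * alpha)

    # An unseen value (count 0) no longer yields a zero probability:
    print(laplace_smoothed(0, 10, 3))  # 1/13 β‰ˆ 0.0769
    print(laplace_smoothed(4, 10, 3))  # 5/13 β‰ˆ 0.3846
    ```

    With \alpha = 1 and d = 3, the smoothed probabilities for all three values of the feature still sum to 1, which is the point of adding d \cdot \alpha to the denominator.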

    ---

    Common Mistakes

    ⚠️ Avoid These Errors
      • ❌ Ignoring Priors: Forgetting to multiply by the prior probability P(C_k) and only comparing the likelihoods \prod P(x_i | C_k).
    βœ… Correct Approach: Always include the prior in the calculation:
    \hat{y} = \arg\max_{k} P(C_k) \prod P(x_i | C_k)
    The prior can significantly change the outcome, especially if classes are imbalanced.
      • ❌ Miscalculating Parameters: Forgetting that for a set of M probabilities that must sum to 1 (like class priors or probabilities for a categorical feature's values), only M - 1 parameters are independent.
    βœ… Correct Approach: When counting parameters, remember to subtract 1 for each constrained set of probabilities. For M classes, there are M - 1 prior parameters. For a binary feature, there is 1 conditional probability parameter per class.
      • ❌ Confusing Misclassification Probability: Stating the predicted class's posterior probability as the misclassification probability.
    βœ… Correct Approach: The misclassification probability is 1 - P(\text{predicted class} | \mathbf{x}), which equals the sum of the posterior probabilities of all other classes.

    ---

    Practice Questions

    :::question type="MCQ" question="For a multi-class classification problem with 4 classes and 10 features, where each feature can take one of 3 distinct categorical values, what is the total number of independent parameters required for a Naive Bayes classifier?" options=["123", "83", "120", "80"] answer="83" hint="Calculate parameters for priors and conditional probabilities separately. Remember that for N outcomes that sum to 1, there are N - 1 independent parameters." solution="
    Step 1: Calculate independent parameters for class priors.
    There are 4 classes. The probabilities P(C_1), P(C_2), P(C_3), P(C_4) must sum to 1.
    So, the number of independent prior parameters is 4 - 1 = 3.

    Step 2: Calculate independent parameters for the conditional probabilities of one feature.
    Each feature can take 3 distinct values. For a given class C_k, the probabilities P(X_i = v_1 | C_k), P(X_i = v_2 | C_k), P(X_i = v_3 | C_k) must sum to 1.
    So, for each feature and each class, there are 3 - 1 = 2 independent parameters.

    Step 3: Calculate total conditional probability parameters.
    There are 10 features and 4 classes.
    Total conditional parameters = (Number of features) \times (Number of classes) \times (Independent parameters per feature per class)
    Total conditional parameters = 10 \times 4 \times 2 = 80.

    Step 4: Calculate the total number of parameters.
    Total parameters = (Prior parameters) + (Conditional parameters)
    Total parameters = 3 + 80 = 83.
    Answer: \boxed{83}
    "
    :::

    :::question type="NAT" question="In a binary classification task (classes Y=0,Y=1Y=0, Y=1), the prior probabilities are P(Y=0)=0.6P(Y=0)=0.6 and P(Y=1)=0.4P(Y=1)=0.4. For a data point xx, the likelihoods are P(x∣Y=0)=0.3P(x|Y=0)=0.3 and P(x∣Y=1)=0.7P(x|Y=1)=0.7. The Naive Bayes classifier predicts the class with the maximum a posteriori probability. What is the probability that this prediction is wrong? (Round off to two decimal places)" answer="0.39" hint="First, determine the predicted class using the MAP rule. Then, calculate the posterior probability of the other class." solution="
    Step 1: Calculate the unnormalized posteriors (proportional to P(x∣Y)P(Y)P(x|Y)P(Y)).

    For class Y=0Y=0:

    Score⁑(Y=0)=P(x∣Y=0)P(Y=0)=0.3Γ—0.6=0.18\operatorname{Score}(Y=0) = P(x|Y=0)P(Y=0) = 0.3 \times 0.6 = 0.18

    For class Y=1Y=1:

    Score⁑(Y=1)=P(x∣Y=1)P(Y=1)=0.7Γ—0.4=0.28\operatorname{Score}(Y=1) = P(x|Y=1)P(Y=1) = 0.7 \times 0.4 = 0.28

    Step 2: Determine the predicted class.
    Since Score⁑(Y=1)>Score⁑(Y=0)\operatorname{Score}(Y=1) > \operatorname{Score}(Y=0), the classifier predicts class Y=1Y=1.

    Step 3: Calculate the evidence P(x)P(x).

    P(x)=Score⁑(Y=0)+Score⁑(Y=1)=0.18+0.28=0.46P(x) = \operatorname{Score}(Y=0) + \operatorname{Score}(Y=1) = 0.18 + 0.28 = 0.46

    Step 4: Calculate the probability of misclassification.
    The prediction is Y=1Y=1. The prediction is wrong if the true class is Y=0Y=0. The probability of misclassification is therefore the posterior probability of the non-predicted class, P(Y=0∣x)P(Y=0|x).

    P(misclassification)=P(Y=0∣x)=P(x∣Y=0)P(Y=0)P(x)P(\text{misclassification}) = P(Y=0|x) = \frac{P(x|Y=0)P(Y=0)}{P(x)}

    P(misclassification)=0.180.46β‰ˆ0.3913P(\text{misclassification}) = \frac{0.18}{0.46} \approx 0.3913

    Result:
    Rounding to two decimal places, the probability of misclassification is 0.39.
    Answer: \boxed{0.39}
    "
    :::

    :::question type="MSQ" question="Which of the following statements about the Naive Bayes classifier are true?" options=["The 'naive' assumption implies that all features in the dataset are independent of each other.", "It is a generative model.", "The decision boundary learned by a Gaussian Naive Bayes classifier is always linear.", "Adding a feature that is a perfect copy of an existing feature will likely degrade its performance."] answer="B,D" hint="Carefully consider the definition of conditional independence. Think about how Naive Bayes models the data distribution. The decision boundary for GNB is quadratic in general. Consider how feature duplication violates the independence assumption." solution="

    • A is incorrect. The naive assumption is that features are conditionally independent given the class, not that they are marginally independent. For example, 'height' and 'weight' are not independent, but they might be considered conditionally independent given the class 'Male'.

    • B is correct. Naive Bayes is a generative model because it models the joint probability distribution P(x,Ck)=P(x∣Ck)P(Ck)P(\mathbf{x}, C_k) = P(\mathbf{x}|C_k)P(C_k). It learns how the data for each class is generated. In contrast, discriminative models like Logistic Regression directly model the posterior P(Ck∣x)P(C_k|\mathbf{x}).

    • C is incorrect. The decision boundary for a Gaussian Naive Bayes classifier is quadratic in the general case. It only becomes linear if the covariance matrices for all classes are assumed to be identical, which is not a standard assumption in GNB.

    • D is correct. Adding a perfect copy of a feature violates the conditional independence assumption. The model will "double-count" the evidence from that feature, giving it undue weight in the final probability calculation. This typically leads to overconfident and poorer predictions.

    Answer: \boxed{B,D}
    "
    :::

    :::question type="MCQ" question="A Naive Bayes classifier is used for spam detection. From a large dataset, it is estimated that the probability of an email being spam is P(Spam)=0.2P(\text{Spam}) = 0.2. The word 'offer' is found to appear in 80% of spam emails and 1.25% of non-spam emails. Given a new email that contains the word 'offer', what is the probability that it is spam?" options=["0.80", "0.89", "0.94", "0.99"] answer="0.94" hint="Use Bayes' theorem: P(A∣B)=[P(B∣A)P(A)]/P(B)P(A|B) = [P(B|A)P(A)] / P(B). You need to calculate the evidence term P(B)P(B) using the law of total probability." solution="
    Step 1: Define the events and list the known probabilities.
    Let SS be the event that an email is Spam, and OO be the event that an email contains the word 'offer'.
    We are given:

    • Prior probability of Spam: P(S)=0.2P(S) = 0.2

    • Prior probability of Not Spam: P(Sβ€²)=1βˆ’P(S)=0.8P(S') = 1 - P(S) = 0.8

    • Likelihood of 'offer' given Spam: P(O∣S)=0.80P(O|S) = 0.80

    • Likelihood of 'offer' given Not Spam: P(O∣Sβ€²)=0.0125P(O|S') = 0.0125

    We want to find the posterior probability P(S∣O)P(S|O).

    Step 2: Use the law of total probability to find the evidence, P(O)P(O).

    P(O)=P(O∣S)P(S)+P(O∣Sβ€²)P(Sβ€²)P(O) = P(O|S)P(S) + P(O|S')P(S')

    P(O)=(0.80Γ—0.2)+(0.0125Γ—0.8)P(O) = (0.80 \times 0.2) + (0.0125 \times 0.8)

    P(O)=0.16+0.01=0.17P(O) = 0.16 + 0.01 = 0.17

    Step 3: Apply Bayes' theorem to calculate P(S∣O)P(S|O).

    P(S∣O)=P(O∣S)P(S)P(O)P(S|O) = \frac{P(O|S)P(S)}{P(O)}

    P(S∣O)=0.160.17P(S|O) = \frac{0.16}{0.17}

    P(S∣O)β‰ˆ0.94117P(S|O) \approx 0.94117

    Result:
    The probability that the email is spam is approximately 0.94.
    Answer: \boxed{0.94}
    "
    :::

    ---

    Summary

    ❗ Key Takeaways for GATE

    • Core Principle: The Naive Bayes classifier is built on Bayes' theorem, combined with the simplifying (naive) assumption that all features are conditionally independent given the class. The decision rule is to select the class that maximizes the posterior probability:

    • y^=arg⁑max⁑kP(Ck)∏i=1nP(xi∣Ck)\hat{y} = \arg\max_{k} P(C_k) \prod_{i=1}^{n} P(x_i | C_k)

    • Parameter Estimation: For an exam problem, you must be able to count the number of independent parameters. For a problem with MM classes and KK binary features, the total number of parameters is (Mβˆ’1)+(MΓ—K)(M-1) + (M \times K). For the common binary case (M=2M=2), this simplifies to 1+2K1 + 2K.

    • Probability Calculation: Be proficient in calculating posterior probabilities and the probability of misclassification. Remember that the misclassification probability is 1βˆ’P(predictedΒ class∣x)1 - P(\text{predicted class} | \mathbf{x}). Always account for both the likelihood and the prior probability in your calculations.
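The parameter-counting rule can be wrapped in a small helper (the function name is hypothetical) and checked against both the binary case from the summary and the earlier MCQ:

```python
# Independent parameters of a Naive Bayes model with categorical features:
# (M - 1) priors plus, per feature per class, (values - 1) conditionals.

def nb_param_count(num_classes, num_features, values_per_feature):
    priors = num_classes - 1                     # M probabilities summing to 1
    conditionals = num_features * num_classes * (values_per_feature - 1)
    return priors + conditionals

# Binary case from the summary: (M-1) + M*K with M=2 and K binary features
assert nb_param_count(2, 5, 2) == 1 + 2 * 5

# The MCQ above: 4 classes, 10 features, 3 values each
assert nb_param_count(4, 10, 3) == 83
```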

    ---

    What's Next?

    πŸ’‘ Continue Learning

    This topic connects to:

      • Logistic Regression: While Naive Bayes is a generative model, Logistic Regression is a discriminative model. Comparing their decision boundaries, assumptions, and performance characteristics is a common topic. Naive Bayes's conditional independence assumption is stricter than that of Logistic Regression.

      • Bayesian Networks: Naive Bayes can be viewed as a very simple Bayesian Network, a graphical model that represents probabilistic relationships among variables. Understanding Naive Bayes provides a foundation for these more complex and expressive generative models.


    Master these connections for comprehensive GATE preparation!

    ---

    πŸ’‘ Moving Forward

    Now that you understand Naive Bayes Classifier, let's explore Linear Discriminant Analysis (LDA) which builds on these concepts.

    ---

    Part 4: Linear Discriminant Analysis (LDA)

    Introduction

    Linear Discriminant Analysis (LDA) is a classical and powerful method in supervised machine learning that serves a dual purpose: it can be employed for both dimensionality reduction and classification. As a dimensionality reduction technique, LDA projects a dataset onto a lower-dimensional space with the primary objective of maximizing the separability among categories or classes. This stands in stark contrast to Principal Component Analysis (PCA), an unsupervised method that focuses on maximizing the variance of the entire dataset without regard to class labels.

    As a classifier, LDA establishes a linear decision boundary between classes. It operates on the principle of finding a projection that best separates the data by maximizing the ratio of between-class variance to within-class variance. This ensures that in the projected space, data points from the same class are clustered closely together, while the clusters corresponding to different classes are as far apart as possible. For the GATE examination, a thorough understanding of the mathematical formulation of LDA, particularly the scatter matrices and the resulting optimization problem, is of paramount importance.

    πŸ“– Linear Discriminant Analysis (LDA)

    Linear Discriminant Analysis is a supervised learning algorithm that finds a linear combination of features, known as a discriminant, that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. The central objective is to find a projection vector ww that maximizes the ratio of the projected between-class scatter to the projected within-class scatter.

    ---

    Key Concepts

    1. The Objective of LDA: Maximizing Class Separability

    The core intuition behind LDA is to find a lower-dimensional representation of the data that preserves the maximum amount of class-discriminatory information. Let us consider a dataset with CC classes. We seek a projection vector ww that maps a high-dimensional data point x∈Rdx \in \mathbb{R}^d to a single value y=wTxy = w^T x.

    The goal is to select ww such that the projected points are well-separated. This can be achieved by satisfying two criteria simultaneously:

  • The distance between the means of the projected classes should be maximized.

  • The variance (or scatter) of the projected points within each class should be minimized.
    The following diagram illustrates this concept. Projecting the data onto the vector w1w_1 results in significant overlap between the two classes. In contrast, projecting onto the vector w2w_2 (the LDA direction) achieves excellent separation.

    [Figure: two class clusters in the (Feature 1, Feature 2) plane with two candidate projection directions, w_1 and w_2 (LDA). LDA finds the projection that maximizes class separation.]
    To formalize this, we introduce the concepts of between-class and within-class scatter matrices.

    ---

    2. Scatter Matrices

    The notions of "distance between means" and "variance within classes" are quantified using scatter matrices. Let us assume we have a dataset with CC classes. Let NkN_k be the number of samples in class CkC_k.

    The mean vector for class kk is given by:

    ΞΌk=1Nkβˆ‘xi∈Ckxi\mu_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i

    The overall mean vector of the entire dataset is:

    ΞΌ=1Nβˆ‘i=1Nxi=1Nβˆ‘k=1CNkΞΌk\mu = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{1}{N} \sum_{k=1}^{C} N_k \mu_k

    Within-Class Scatter Matrix (SWS_W)

    The within-class scatter matrix measures the scatter of data points around their respective class means. It is the sum of the per-class scatter matrices (unnormalized covariance matrices, i.e., without division by the number of samples).

    πŸ“ Within-Class Scatter Matrix (SWS_W)
    SW=βˆ‘k=1CSkS_W = \sum_{k=1}^{C} S_k
    where SkS_k is the scatter matrix for class kk:
    Sk=βˆ‘xi∈Ck(xiβˆ’ΞΌk)(xiβˆ’ΞΌk)TS_k = \sum_{x_i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^T

    Variables:

      • CC = Number of classes

      • ΞΌk\mu_k = Mean vector of class kk

      • xix_i = Data point belonging to class kk


    Application: This matrix quantifies the total variance within all classes. A smaller projected value of SWS_W is desirable.

    Between-Class Scatter Matrix (SBS_B)

    The between-class scatter matrix measures the scatter of the class means around the overall dataset mean, weighted by the number of points in each class.

    πŸ“ Between-Class Scatter Matrix (SBS_B)
    SB=βˆ‘k=1CNk(ΞΌkβˆ’ΞΌ)(ΞΌkβˆ’ΞΌ)TS_B = \sum_{k=1}^{C} N_k (\mu_k - \mu)(\mu_k - \mu)^T

    Variables:

      • CC = Number of classes

      • NkN_k = Number of samples in class kk

      • ΞΌk\mu_k = Mean vector of class kk

      • ΞΌ\mu = Overall mean vector of the dataset


    Application: This matrix quantifies the separation between classes. A larger projected value of SBS_B is desirable.

    Worked Example:

    Problem: Consider a 2D dataset with two classes.
    Class 1 (C1C_1): x1=[23]x_1 = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, x2=[33]x_2 = \begin{bmatrix} 3 \\ 3 \end{bmatrix}
    Class 2 (C2C_2): x3=[67]x_3 = \begin{bmatrix} 6 \\ 7 \end{bmatrix}, x4=[77]x_4 = \begin{bmatrix} 7 \\ 7 \end{bmatrix}
    Calculate the within-class scatter matrix SWS_W and the between-class scatter matrix SBS_B.

    Solution:

    Step 1: Calculate class means and the overall mean.

    ΞΌ1=12([23]+[33])=[2.53]\mu_1 = \frac{1}{2} \left( \begin{bmatrix} 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 3 \\ 3 \end{bmatrix} \right) = \begin{bmatrix} 2.5 \\ 3 \end{bmatrix}
    ΞΌ2=12([67]+[77])=[6.57]\mu_2 = \frac{1}{2} \left( \begin{bmatrix} 6 \\ 7 \end{bmatrix} + \begin{bmatrix} 7 \\ 7 \end{bmatrix} \right) = \begin{bmatrix} 6.5 \\ 7 \end{bmatrix}
    ΞΌ=14([23]+[33]+[67]+[77])=[4.55]\mu = \frac{1}{4} \left( \begin{bmatrix} 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 3 \\ 3 \end{bmatrix} + \begin{bmatrix} 6 \\ 7 \end{bmatrix} + \begin{bmatrix} 7 \\ 7 \end{bmatrix} \right) = \begin{bmatrix} 4.5 \\ 5 \end{bmatrix}

    Step 2: Calculate the scatter matrix for each class, S1S_1 and S2S_2.

    (x1βˆ’ΞΌ1)=[2βˆ’2.53βˆ’3]=[βˆ’0.50](x_1 - \mu_1) = \begin{bmatrix} 2 - 2.5 \\ 3 - 3 \end{bmatrix} = \begin{bmatrix} -0.5 \\ 0 \end{bmatrix}
    (x2βˆ’ΞΌ1)=[3βˆ’2.53βˆ’3]=[0.50](x_2 - \mu_1) = \begin{bmatrix} 3 - 2.5 \\ 3 - 3 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0 \end{bmatrix}
    S1=[βˆ’0.50][βˆ’0.50]+[0.50][0.50]=[0.25000]+[0.25000]=[0.5000]S_1 = \begin{bmatrix} -0.5 \\ 0 \end{bmatrix} \begin{bmatrix} -0.5 & 0 \end{bmatrix} + \begin{bmatrix} 0.5 \\ 0 \end{bmatrix} \begin{bmatrix} 0.5 & 0 \end{bmatrix} = \begin{bmatrix} 0.25 & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 0.25 & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0 \end{bmatrix}
    (x3βˆ’ΞΌ2)=[6βˆ’6.57βˆ’7]=[βˆ’0.50](x_3 - \mu_2) = \begin{bmatrix} 6 - 6.5 \\ 7 - 7 \end{bmatrix} = \begin{bmatrix} -0.5 \\ 0 \end{bmatrix}
    (x4βˆ’ΞΌ2)=[7βˆ’6.57βˆ’7]=[0.50](x_4 - \mu_2) = \begin{bmatrix} 7 - 6.5 \\ 7 - 7 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0 \end{bmatrix}
    S2=[βˆ’0.50][βˆ’0.50]+[0.50][0.50]=[0.25000]+[0.25000]=[0.5000]S_2 = \begin{bmatrix} -0.5 \\ 0 \end{bmatrix} \begin{bmatrix} -0.5 & 0 \end{bmatrix} + \begin{bmatrix} 0.5 \\ 0 \end{bmatrix} \begin{bmatrix} 0.5 & 0 \end{bmatrix} = \begin{bmatrix} 0.25 & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 0.25 & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0 \end{bmatrix}

    Step 3: Calculate the within-class scatter matrix SWS_W.

    SW=S1+S2=[0.5000]+[0.5000]=[1000]S_W = S_1 + S_2 = \begin{bmatrix} 0.5 & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 0.5 & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}

    Step 4: Calculate the between-class scatter matrix SBS_B.
    Here, N1=2N_1 = 2 and N2=2N_2 = 2.

    (ΞΌ1βˆ’ΞΌ)=[2.5βˆ’4.53βˆ’5]=[βˆ’2βˆ’2](\mu_1 - \mu) = \begin{bmatrix} 2.5 - 4.5 \\ 3 - 5 \end{bmatrix} = \begin{bmatrix} -2 \\ -2 \end{bmatrix}
    (ΞΌ2βˆ’ΞΌ)=[6.5βˆ’4.57βˆ’5]=[22](\mu_2 - \mu) = \begin{bmatrix} 6.5 - 4.5 \\ 7 - 5 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \end{bmatrix}
    SB=N1(ΞΌ1βˆ’ΞΌ)(ΞΌ1βˆ’ΞΌ)T+N2(ΞΌ2βˆ’ΞΌ)(ΞΌ2βˆ’ΞΌ)TS_B = N_1 (\mu_1 - \mu)(\mu_1 - \mu)^T + N_2 (\mu_2 - \mu)(\mu_2 - \mu)^T
    SB=2[βˆ’2βˆ’2][βˆ’2βˆ’2]+2[22][22]S_B = 2 \begin{bmatrix} -2 \\ -2 \end{bmatrix} \begin{bmatrix} -2 & -2 \end{bmatrix} + 2 \begin{bmatrix} 2 \\ 2 \end{bmatrix} \begin{bmatrix} 2 & 2 \end{bmatrix}
    SB=2[4444]+2[4444]=[8888]+[8888]=[16161616]S_B = 2 \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix} + 2 \begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix} = \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix} + \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix} = \begin{bmatrix} 16 & 16 \\ 16 & 16 \end{bmatrix}

    Answer:

    SW=[1000]andSB=[16161616]S_W = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad S_B = \begin{bmatrix} 16 & 16 \\ 16 & 16 \end{bmatrix}
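The worked example can be verified with NumPy; the code below mirrors the definitions of S_W and S_B given earlier:

```python
import numpy as np

# Reproducing the worked scatter-matrix example with NumPy.
X1 = np.array([[2.0, 3.0], [3.0, 3.0]])   # Class 1 samples (one per row)
X2 = np.array([[6.0, 7.0], [7.0, 7.0]])   # Class 2 samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
mu = np.vstack([X1, X2]).mean(axis=0)     # overall mean

def class_scatter(X, mu_k):
    D = X - mu_k                          # deviations from the class mean
    return D.T @ D                        # sum of outer products

S_W = class_scatter(X1, mu1) + class_scatter(X2, mu2)
S_B = (len(X1) * np.outer(mu1 - mu, mu1 - mu)
       + len(X2) * np.outer(mu2 - mu, mu2 - mu))

# S_W equals [[1, 0], [0, 0]] and S_B equals [[16, 16], [16, 16]],
# matching the hand calculation above.
```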

    ---

    3. Fisher's Linear Discriminant and the Optimization Problem

    With the scatter matrices defined, we can now formulate the objective of LDA precisely. For a given projection vector ww, the scatter of the projected data is a scalar value.

    • The projected between-class scatter is wTSBww^T S_B w.

    • The projected within-class scatter is wTSWww^T S_W w.


    LDA seeks the projection vector ww that maximizes the ratio of these two quantities. This ratio is known as Fisher's criterion.

    πŸ“ Fisher's Criterion
    J(w)=ProjectedΒ Between-ClassΒ ScatterProjectedΒ Within-ClassΒ Scatter=wTSBwwTSWwJ(w) = \frac{\text{Projected Between-Class Scatter}}{\text{Projected Within-Class Scatter}} = \frac{w^T S_B w}{w^T S_W w}

    Variables:

      • ww = The projection vector

      • SBS_B = The between-class scatter matrix

      • SWS_W = The within-class scatter matrix


    When to use: This is the central objective function for LDA. Questions in GATE may refer to this criterion and its maximization.

    To find the optimal ww that maximizes J(w)J(w), we must compute the derivative of J(w)J(w) with respect to ww and set it to zero.

    dJ(w)dw=0\frac{dJ(w)}{dw} = 0

    Applying the quotient rule for matrix derivatives, we have:

    ddw(wTSBwwTSWw)=(2SBw)(wTSWw)βˆ’(wTSBw)(2SWw)(wTSWw)2=0\frac{d}{dw} \left( \frac{w^T S_B w}{w^T S_W w} \right) = \frac{(2 S_B w)(w^T S_W w) - (w^T S_B w)(2 S_W w)}{(w^T S_W w)^2} = 0

    This simplifies to:

    (SBw)(wTSWw)βˆ’(wTSBw)(SWw)=0(S_B w)(w^T S_W w) - (w^T S_B w)(S_W w) = 0

    Dividing by the scalar term wTSWww^T S_W w (which we assume is non-zero), we get:

    SBwβˆ’wTSBwwTSWwSWw=0S_B w - \frac{w^T S_B w}{w^T S_W w} S_W w = 0

    Recognizing that the ratio is simply our objective function J(w)J(w), we can substitute it with a scalar Ξ»\lambda:

    SBwβˆ’Ξ»SWw=0S_B w - \lambda S_W w = 0

    This leads to the fundamental equation of LDA:

    SBw=Ξ»SWwS_B w = \lambda S_W w

    This is a generalized eigenvalue problem. If the within-class scatter matrix SWS_W is non-singular (invertible), we can rewrite the equation as a standard eigenvalue problem.

    SWβˆ’1SBw=Ξ»wS_W^{-1} S_B w = \lambda w

    ❗ Must Remember

    The optimal projection vector wβˆ—w^* for Linear Discriminant Analysis is the eigenvector corresponding to the largest eigenvalue Ξ»\lambda of the matrix SWβˆ’1SBS_W^{-1} S_B. The value of the maximized Fisher's criterion, J(wβˆ—)J(w^*), is equal to this largest eigenvalue.

    This result is precisely the concept tested in competitive exams like GATE. You must be able to recognize this equation and its relationship to the maximization of J(w)J(w).
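A minimal NumPy sketch of this eigenvalue recipe, using assumed toy scatter matrices (chosen so that S_W is invertible):

```python
import numpy as np

# Solving S_W^{-1} S_B w = lambda w for toy scatter matrices (assumed values).
S_B = np.array([[4.0, 2.0], [2.0, 1.0]])
S_W = np.array([[2.0, 0.0], [0.0, 2.0]])

M = np.linalg.inv(S_W) @ S_B
eigvals, eigvecs = np.linalg.eig(M)

top = np.argmax(eigvals.real)
w_star = eigvecs[:, top].real            # optimal LDA direction
J_max = eigvals.real[top]                # maximized Fisher criterion

# Check: J(w*) = (w*^T S_B w*) / (w*^T S_W w*) equals the top eigenvalue.
ratio = (w_star @ S_B @ w_star) / (w_star @ S_W @ w_star)
assert np.isclose(ratio, J_max)
```

For these matrices the eigenvalues of M are 2.5 and 0, so the optimal direction is the eigenvector for 2.5 (proportional to [2, 1]) and the maximized criterion value is 2.5.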

    ---

    4. LDA as a Classifier

    After finding the optimal projection vector wβˆ—w^*, we can use it to classify new, unseen data points. The projection of a new point xx is given by y=(wβˆ—)Txy = (w^*)^T x. A decision boundary, or threshold, must be established on this 1D line to separate the classes.

    A common choice for the threshold bb in a two-class problem is the midpoint of the projected class means:

    b=12((wβˆ—)TΞΌ1+(wβˆ—)TΞΌ2)b = \frac{1}{2} ((w^*)^T \mu_1 + (w^*)^T \mu_2)

    A new point xx is then classified into class 1 if (wβˆ—)Tx(w^*)^T x is closer to (wβˆ—)TΞΌ1(w^*)^T \mu_1, and into class 2 otherwise. This leads to a decision function f(x)f(x):

    f(x)=(wβˆ—)Txβˆ’bf(x) = (w^*)^T x - b

    The point is assigned to class 1 if f(x)f(x) is on one side of zero and to class 2 if it is on the other. This is a linear decision rule, and the decision boundary where f(x)=0f(x)=0 is a hyperplane.

    Let us now consider a related classifier based on minimum Euclidean distance to class means, which reveals a deep connection to LDA's linear nature. Suppose a classifier assigns a point xx to the class with the closest mean. For a two-class problem, we compare ∣∣μ1βˆ’x∣∣2||\mu_1 - x||^2 and ∣∣μ2βˆ’x∣∣2||\mu_2 - x||^2. The decision function can be written as:

    g(x)=∣∣μ1βˆ’x∣∣2βˆ’βˆ£βˆ£ΞΌ2βˆ’x∣∣2g(x) = ||\mu_1 - x||^2 - ||\mu_2 - x||^2

    A point is assigned to class 1 if g(x)<0g(x) < 0 and class 2 if g(x)>0g(x) > 0. Let us expand this expression:

    g(x)=(ΞΌ1βˆ’x)T(ΞΌ1βˆ’x)βˆ’(ΞΌ2βˆ’x)T(ΞΌ2βˆ’x)g(x) = (\mu_1 - x)^T(\mu_1 - x) - (\mu_2 - x)^T(\mu_2 - x)

    g(x)=(ΞΌ1TΞΌ1βˆ’2ΞΌ1Tx+xTx)βˆ’(ΞΌ2TΞΌ2βˆ’2ΞΌ2Tx+xTx)g(x) = (\mu_1^T \mu_1 - 2\mu_1^T x + x^T x) - (\mu_2^T \mu_2 - 2\mu_2^T x + x^T x)

    We observe that the quadratic term xTxx^T x cancels out, which is a critical insight.

    g(x)=(ΞΌ1TΞΌ1βˆ’ΞΌ2TΞΌ2)βˆ’2(ΞΌ1Tβˆ’ΞΌ2T)xg(x) = (\mu_1^T \mu_1 - \mu_2^T \mu_2) - 2(\mu_1^T - \mu_2^T)x

    Rearranging this into the standard linear form wTx+bw^T x + b:

    g(x)=2(ΞΌ2βˆ’ΞΌ1)Tx+(ΞΌ1TΞΌ1βˆ’ΞΌ2TΞΌ2)g(x) = 2(\mu_2 - \mu_1)^T x + (\mu_1^T \mu_1 - \mu_2^T \mu_2)

    This demonstrates that a classifier based on minimum Euclidean distance to the mean is a linear classifier. The weight vector is wcls=2(ΞΌ2βˆ’ΞΌ1)w_{cls} = 2(\mu_2 - \mu_1) and the bias term is bcls=ΞΌ1TΞΌ1βˆ’ΞΌ2TΞΌ2b_{cls} = \mu_1^T \mu_1 - \mu_2^T \mu_2. This form of classifier is equivalent to LDA under the assumption that the class covariances are equal and spherical (i.e., Ξ£1=Ξ£2=Οƒ2I\Sigma_1 = \Sigma_2 = \sigma^2 I).
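The cancellation of the quadratic term can be confirmed numerically; the sketch below uses assumed toy means and checks g(x) against the linear form w^T x + b at random points:

```python
import numpy as np

# Numerical check that the minimum-distance-to-mean rule is linear:
# g(x) = ||mu1 - x||^2 - ||mu2 - x||^2  equals  w^T x + b  with
# w = 2(mu2 - mu1) and b = mu1.mu1 - mu2.mu2. Means are assumed toy values.

mu1 = np.array([1.0, 2.0])
mu2 = np.array([5.0, 4.0])

w = 2.0 * (mu2 - mu1)
b = mu1 @ mu1 - mu2 @ mu2

rng = np.random.default_rng(0)
for x in rng.normal(size=(5, 2)):
    g = np.sum((mu1 - x) ** 2) - np.sum((mu2 - x) ** 2)
    assert np.isclose(g, w @ x + b)      # quadratic terms in x cancel
```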

    ---

    Problem-Solving Strategies

    πŸ’‘ GATE Strategy
      • Identify the Core Task: When a question involves maximizing a ratio of quadratic forms like uTAuuTBu\frac{u^T A u}{u^T B u}, immediately recognize it as a generalized eigenvalue problem. The solution will involve the eigenvectors of Bβˆ’1AB^{-1}A.
      • Simplify Decision Functions: For classification problems involving distances or norms, always expand the expressions algebraically. In linear models like LDA, quadratic terms in the input variable xx (e.g., ∣∣x∣∣2||x||^2 or xTxx^T x) will typically cancel, revealing an underlying linear function of the form wTx+bw^T x + b.
      • Distinguish SWS_W and SBS_B: Be meticulous when calculating scatter matrices. SWS_W involves deviations from class-specific means (ΞΌk\mu_k), while SBS_B involves deviations of class means from the overall mean (ΞΌ\mu).

    ---

    Common Mistakes

    ⚠️ Avoid These Errors
      • ❌ Confusing LDA with PCA: PCA is an unsupervised method that finds directions of maximum variance in the entire dataset. LDA is a supervised method that finds directions of maximum class separability. Their objectives are fundamentally different.
      • ❌ Mathematical Errors in Expansion: When simplifying a decision function like ∣∣μ1βˆ’x∣∣2βˆ’βˆ£βˆ£ΞΌ2βˆ’x∣∣2||\mu_1 - x||^2 - ||\mu_2 - x||^2, a common mistake is to forget the cross-term (βˆ’2ΞΌTx-2\mu^T x).
    - ❌ βˆ£βˆ£ΞΌβˆ’x∣∣2=ΞΌTΞΌ+xTx||\mu - x||^2 = \mu^T \mu + x^T x
    - βœ… βˆ£βˆ£ΞΌβˆ’x∣∣2=ΞΌTΞΌβˆ’2ΞΌTx+xTx||\mu - x||^2 = \mu^T \mu - 2\mu^T x + x^T x
      • ❌ Incorrectly Forming the Eigenvalue Problem: Remember the correct form is SWβˆ’1SBw=Ξ»wS_W^{-1} S_B w = \lambda w. A frequent error is to use SBβˆ’1SWS_B^{-1} S_W or other incorrect combinations. The matrix that quantifies within-class scatter (SWS_W) is the one that is inverted.

    ---

    Practice Questions

    :::question type="MCQ" question="In a binary classification problem, the between-class scatter matrix is

    SB=[4221]S_B = \begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix}
    and the within-class scatter matrix is
    SW=[2002]S_W = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}
    . The optimal projection vector wβˆ—w^* for LDA is an eigenvector of which of the following matrices?" options=["
    [2110.5]\begin{bmatrix} 2 & 1 \\ 1 & 0.5 \end{bmatrix}
    ","
    [0.5βˆ’0.25βˆ’0.250.125]\begin{bmatrix} 0.5 & -0.25 \\ -0.25 & 0.125 \end{bmatrix}
    ","
    [8442]\begin{bmatrix} 8 & 4 \\ 4 & 2 \end{bmatrix}
    ","
    [1001]\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
    "] answer="
    [2110.5]\begin{bmatrix} 2 & 1 \\ 1 & 0.5 \end{bmatrix}
    " hint="The optimal projection vector is an eigenvector of the matrix SWβˆ’1SBS_W^{-1}S_B." solution="
    Step 1: The optimal projection vector wβˆ—w^* is the eigenvector of the matrix M=SWβˆ’1SBM = S_W^{-1}S_B corresponding to the largest eigenvalue.

    Step 2: First, we need to find the inverse of SWS_W.

    SW=[2002]S_W = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}

    SWβˆ’1=1(2)(2)βˆ’(0)(0)[2002]=14[2002]=[0.5000.5]S_W^{-1} = \frac{1}{(2)(2) - (0)(0)} \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} = \frac{1}{4} \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}

    Step 3: Now, we compute the product M=SWβˆ’1SBM = S_W^{-1}S_B.

    M=[0.5000.5][4221]M = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix}

    M=[(0.5)(4)+(0)(2)(0.5)(2)+(0)(1)(0)(4)+(0.5)(2)(0)(2)+(0.5)(1)]M = \begin{bmatrix} (0.5)(4) + (0)(2) & (0.5)(2) + (0)(1) \\ (0)(4) + (0.5)(2) & (0)(2) + (0.5)(1) \end{bmatrix}

    M=[2110.5]M = \begin{bmatrix} 2 & 1 \\ 1 & 0.5 \end{bmatrix}

    Result: The optimal projection vector is an eigenvector of the matrix

    [2110.5]\begin{bmatrix} 2 & 1 \\ 1 & 0.5 \end{bmatrix}
    .
    Answer: \boxed{\begin{bmatrix} 2 & 1 \\ 1 & 0.5 \end{bmatrix}}
    "
    :::

    :::question type="NAT" question="Consider a dataset with two classes. The mean of Class 1 is ΞΌ1=[1,2]T\mu_1 = [1, 2]^T and the mean of Class 2 is ΞΌ2=[5,4]T\mu_2 = [5, 4]^T. A linear classifier uses the decision function f(x)=2(ΞΌ2βˆ’ΞΌ1)Tx+bf(x) = 2(\mu_2 - \mu_1)^T x + b. For a test sample x=[3,3]Tx = [3, 3]^T, the value of the term 2(ΞΌ2βˆ’ΞΌ1)Tx2(\mu_2 - \mu_1)^T x is:" answer="36" hint="First, calculate the vector difference between the means. Then compute the dot product and multiply by 2." solution="
    Step 1: Calculate the difference between the mean vectors.

    ΞΌ2βˆ’ΞΌ1=[54]βˆ’[12]=[42]\mu_2 - \mu_1 = \begin{bmatrix} 5 \\ 4 \end{bmatrix} - \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix}

    Step 2: Compute the term 2(ΞΌ2βˆ’ΞΌ1)Tx2(\mu_2 - \mu_1)^T x.

    2Γ—[42][33]2 \times \begin{bmatrix} 4 & 2 \end{bmatrix} \begin{bmatrix} 3 \\ 3 \end{bmatrix}

    Step 3: Perform the matrix multiplication (dot product).

    2Γ—((4Γ—3)+(2Γ—3))2 \times ((4 \times 3) + (2 \times 3))

    2Γ—(12+6)2 \times (12 + 6)

    2Γ—182 \times 18

    Step 4: Calculate the final result.

    3636

    Result: The value is 36.
    Answer: \boxed{36}
    "
    :::

    :::question type="MSQ" question="A binary classifier's decision function is given by f(x)=∣∣xβˆ’c1∣∣2βˆ’βˆ£βˆ£xβˆ’c2∣∣2f(x) = ||x - c_1||^2 - ||x - c_2||^2, where x,c1,c2∈Rdx, c_1, c_2 \in \mathbb{R}^d. The label is assigned based on the sign of f(x)f(x). Which of the following statements is/are ALWAYS true?" options=["The decision boundary f(x)=0f(x)=0 is a hyperplane.","The function f(x)f(x) is a quadratic function of xx.","The weight vector of the linear decision boundary is proportional to (c2βˆ’c1)(c_2 - c_1).","If c1=βˆ’c2c_1 = -c_2, the decision boundary passes through the origin."] answer="The decision boundary f(x)=0f(x)=0 is a hyperplane.,The weight vector of the linear decision boundary is proportional to (c2βˆ’c1)(c_2 - c_1).,If c1=βˆ’c2c_1 = -c_2, the decision boundary passes through the origin." hint="Expand the squared norm terms and analyze the resulting expression for f(x)f(x)." solution="
    Analysis of the function f(x)f(x):
    Let's expand the squared Euclidean distance terms.

    f(x)=(xβˆ’c1)T(xβˆ’c1)βˆ’(xβˆ’c2)T(xβˆ’c2)f(x) = (x - c_1)^T(x - c_1) - (x - c_2)^T(x - c_2)

    f(x)=(xTxβˆ’2c1Tx+c1Tc1)βˆ’(xTxβˆ’2c2Tx+c2Tc2)f(x) = (x^Tx - 2c_1^Tx + c_1^Tc_1) - (x^Tx - 2c_2^Tx + c_2^Tc_2)

    The xTxx^Tx terms cancel out.
    f(x)=βˆ’2c1Tx+c1Tc1+2c2Txβˆ’c2Tc2f(x) = -2c_1^Tx + c_1^Tc_1 + 2c_2^Tx - c_2^Tc_2

    f(x)=2(c2βˆ’c1)Tx+(c1Tc1βˆ’c2Tc2)f(x) = 2(c_2 - c_1)^T x + (c_1^Tc_1 - c_2^Tc_2)

    This is a linear function of xx in the form wTx+bw^Tx + b, where w=2(c2βˆ’c1)w = 2(c_2 - c_1) and b=c1Tc1βˆ’c2Tc2b = c_1^Tc_1 - c_2^Tc_2.

    Evaluating the options:

  • "The decision boundary f(x)=0f(x)=0 is a hyperplane."

  • Since f(x)f(x) is a linear function of xx, the equation f(x)=0f(x)=0 defines a hyperplane. This statement is correct.
  • "The function f(x)f(x) is a quadratic function of xx."

  • After expansion, the quadratic term xTxx^Tx cancels out, leaving a linear function. This statement is incorrect.
  • "The weight vector of the linear decision boundary is proportional to (c2βˆ’c1)(c_2 - c_1)."

  • The weight vector is w=2(c2βˆ’c1)w = 2(c_2 - c_1). This is directly proportional to (c2βˆ’c1)(c_2 - c_1). This statement is correct.
  • "If c1=βˆ’c2c_1 = -c_2, the decision boundary passes through the origin."

  • The decision boundary is f(x)=0f(x)=0. A point is on the boundary if wTx+b=0w^Tx + b = 0. The origin is the point x=0x=0.
    If x=0x=0, then f(0)=b=c1Tc1βˆ’c2Tc2f(0) = b = c_1^Tc_1 - c_2^Tc_2.
    If c1=βˆ’c2c_1 = -c_2, then c1Tc1=(βˆ’c2)T(βˆ’c2)=c2Tc2c_1^Tc_1 = (-c_2)^T(-c_2) = c_2^Tc_2.
    Therefore, b=c2Tc2βˆ’c2Tc2=0b = c_2^Tc_2 - c_2^Tc_2 = 0.
    Since the bias term bb is zero, the decision boundary equation becomes wTx=0w^Tx=0, which is a hyperplane that passes through the origin. This statement is correct.
    Answer: \boxed{The decision boundary f(x)=0f(x)=0 is a hyperplane.,The weight vector of the linear decision boundary is proportional to (c2βˆ’c1)(c_2 - c_1).,If c1=βˆ’c2c_1 = -c_2, the decision boundary passes through the origin.}
    "
    :::

    ---

    Summary

    ❗ Key Takeaways for GATE

    • LDA's Objective: LDA is a supervised algorithm that finds a projection vector ww to maximize class separability. It achieves this by maximizing Fisher's criterion,

    • J(w)=wTSBwwTSWwJ(w) = \frac{w^T S_B w}{w^T S_W w}

      which is the ratio of between-class scatter to within-class scatter.
    • The Eigenvalue Problem: The maximization of Fisher's criterion leads to the generalized eigenvalue problem

    • SBw=Ξ»SWwS_B w = \lambda S_W w

      The optimal projection wβˆ—w^* is the eigenvector of SWβˆ’1SBS_W^{-1} S_B corresponding to the largest eigenvalue.
    • Linearity: LDA produces a linear classifier. The decision boundary is a hyperplane. Decision functions based on squared Euclidean distances to class means, such as

    ∣∣μ1βˆ’x∣∣2βˆ’βˆ£βˆ£ΞΌ2βˆ’x∣∣2||\mu_1 - x||^2 - ||\mu_2 - x||^2

    simplify to linear functions of the form wTx+bw^T x + b.

    ---

    What's Next?

    💡 Continue Learning

    This topic connects to:

      • Principal Component Analysis (PCA): It is crucial to contrast LDA's supervised, class-based objective with PCA's unsupervised, variance-maximization objective. Many conceptual questions revolve around this difference.

      • Logistic Regression: As another fundamental linear classification model, Logistic Regression provides a probabilistic approach to classification. Comparing its discriminative model with LDA's generative assumptions will deepen your understanding of linear classifiers.

      • Support Vector Machines (SVM): SVMs offer another perspective on linear classification by finding the hyperplane that maximizes the margin between classes. Understanding the difference between LDA's mean-and-variance-based separation and SVM's margin-based separation is key.


    Master these connections for comprehensive GATE preparation!

    ---

    💡 Moving Forward

    Now that you understand Linear Discriminant Analysis (LDA), let's explore Support Vector Machine (SVM) which builds on these concepts.

    ---

    Part 5: Support Vector Machine (SVM)

    Introduction

    The Support Vector Machine (SVM) is a powerful and versatile supervised machine learning algorithm used for both classification and regression tasks. At its core, particularly in the context of binary classification, the SVM seeks to find an optimal separating hyperplane that not only correctly categorizes the training data but also maintains the maximum possible distance, or margin, from the nearest data points of each class. This principle of maximizing the margin is fundamental to the algorithm's robustness and strong generalization performance.

    For the GATE examination, a thorough understanding of the linear SVM is paramount. This includes the concepts of linear separability, the definition of the margin, the role of support vectors, and the mathematical formulation of the optimization problem that the SVM solves. We will explore both the ideal case of perfectly separable data (hard-margin SVM) and the more practical scenario involving overlapping classes (soft-margin SVM).

    📖 Support Vector Machine (SVM)

    A Support Vector Machine is a supervised learning model that constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the geometric margin), since in general the larger the margin, the lower the generalization error of the classifier.

    ---

    Key Concepts

    1. The Maximal Margin Classifier and Linear Separability

    Let us begin by considering a binary classification problem where the data points are linearly separable. This means that we can draw a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that perfectly separates the data points of the two classes.

    📖 Hyperplane

    In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p-1. For a two-dimensional feature space (p = 2), the hyperplane is a line. The equation of a hyperplane is given by:

    w^T x + b = 0

    where w is a p-dimensional weight vector (normal to the hyperplane) and b is a scalar bias term.

    For any given linearly separable dataset, there exist infinitely many hyperplanes that can separate the two classes. The central idea of SVM is to select the one that is optimal. The optimal hyperplane is defined as the one that maximizes the margin, which is the distance between the hyperplane and the closest data points from either class. This is known as the maximal margin classifier.

    The decision rule for a new data point x_{new} is based on the sign of f(x) = w^T x + b:

    • If w^T x_{new} + b > 0, we classify it as class +1.

    • If w^T x_{new} + b < 0, we classify it as class -1.
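The sign-based decision rule can be written directly. A minimal NumPy sketch follows, with an illustrative hyperplane (the values of w and b are made up):

```python
import numpy as np

def predict(X, w, b):
    """Classify each row of X as +1 or -1 from the sign of w^T x + b."""
    return np.where(X @ w + b > 0, 1, -1)

# Illustrative hyperplane x1 + x2 - 3 = 0
w = np.array([1.0, 1.0])
b = -3.0
X_new = np.array([[4.0, 4.0],    # f(x) = 5  -> class +1
                  [0.0, 0.0]])   # f(x) = -3 -> class -1
print(predict(X_new, w, b))  # [ 1 -1]
```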




    [Figure: Two panels contrasting a linearly separable dataset (separable by a single straight line) with a non-linearly separable one.]

    The two parallel hyperplanes that define the boundaries of the margin are given by:

    • w^T x + b = 1 (for the positive class)

    • w^T x + b = -1 (for the negative class)


    The region between these two hyperplanes is the margin. The width of this margin can be shown to be \frac{2}{\|w\|}. Therefore, maximizing the margin is equivalent to minimizing \|w\|, or more conveniently, minimizing \frac{1}{2}\|w\|^2.

    ---

    2. Support Vectors

    The data points that lie exactly on the margin boundaries (i.e., the points for which w^T x + b = 1 or w^T x + b = -1) are called support vectors.

    These points are critical because they alone define the position and orientation of the optimal hyperplane. If we were to remove any data point that is not a support vector, the optimal hyperplane would not change. Conversely, moving a support vector would almost certainly change the hyperplane. This property makes the SVM algorithm computationally efficient, as it only needs to consider a subset of the data for defining the decision boundary.






    [Figure: A maximal margin classifier. The separating hyperplane w^T x + b = 0 lies midway between the margin boundaries w^T x + b = 1 and w^T x + b = -1; the support vectors sit on these boundaries, and the margin is the region between them.]

    ---

    3. Mathematical Formulation (Hard-Margin SVM)

    For a dataset \{(x_i, y_i)\}_{i=1}^n where x_i \in \mathbb{R}^p and y_i \in \{-1, 1\}, the hard-margin linear SVM aims to solve the following constrained optimization problem.

    📝 Hard-Margin SVM Optimization
    \min_{w, b} \frac{1}{2} \|w\|^2
    subject to:
    y_i(w^T x_i + b) \ge 1, \quad \text{for all } i = 1, \dots, n

    Variables:

      • w: The weight vector, normal to the separating hyperplane.

      • b: The bias term, an offset.

      • x_i: The i-th feature vector.

      • y_i: The class label of the i-th data point (+1 or -1).


    When to use: This formulation is used when the training data is perfectly linearly separable.

    The constraint y_i(w^T x_i + b) \ge 1 is a compact way of ensuring that all data points are classified correctly and lie on or outside the respective margin boundary.

    • If y_i = +1, the constraint becomes w^T x_i + b \ge 1.

    • If y_i = -1, the constraint becomes -(w^T x_i + b) \ge 1, which is equivalent to w^T x_i + b \le -1.


    The geometric margin, which is the actual distance from a point to the hyperplane, is \frac{y_i(w^T x_i + b)}{\|w\|}. By setting the functional margin y_i(w^T x_i + b) to be at least 1 for all points, the width of the margin "street" becomes \frac{2}{\|w\|}. Minimizing \frac{1}{2}\|w\|^2 is therefore equivalent to maximizing this margin.

    Worked Example:

    Problem: Consider a dataset with three support vectors:

    • Class +1: x_1 = [2, 2]^T

    • Class -1: x_2 = [1, 0]^T, x_3 = [0, 1]^T


    Find the optimal hyperplane parameters w and b, and calculate the margin.

    Solution:

    Step 1: Set up the equations for the support vectors.
    Since these are support vectors, they must lie exactly on the margin boundaries.

    For x_1 (class +1):

    w^T x_1 + b = 1 \implies 2w_1 + 2w_2 + b = 1

    For x_2 (class -1):

    w^T x_2 + b = -1 \implies w_1 + b = -1

    For x_3 (class -1):

    w^T x_3 + b = -1 \implies w_2 + b = -1

    Step 2: Solve the system of linear equations.
    From the second and third equations, we can see that w_1 = w_2. Let's call this value w_c.
    So, w_c + b = -1, which gives b = -1 - w_c.

    Step 3: Substitute into the first equation.
    Substitute w_1 = w_c, w_2 = w_c, and b = -1 - w_c into the first equation:

    2w_c + 2w_c + (-1 - w_c) = 1
    4w_c - 1 - w_c = 1
    3w_c = 2
    w_c = \frac{2}{3}

    Step 4: Determine the final values of w and b.
    Since w_1 = w_2 = w_c, we have:

    w = \begin{bmatrix} 2/3 \\ 2/3 \end{bmatrix}

    Now, find b:

    b = -1 - w_c = -1 - \frac{2}{3} = -\frac{5}{3}

    Step 5: Calculate the margin.
    The margin is given by the formula \frac{2}{\|w\|}.

    First, calculate \|w\|:

    \|w\| = \sqrt{w_1^2 + w_2^2} = \sqrt{\left(\frac{2}{3}\right)^2 + \left(\frac{2}{3}\right)^2} = \sqrt{\frac{4}{9} + \frac{4}{9}} = \sqrt{\frac{8}{9}} = \frac{2\sqrt{2}}{3}

    Now, the margin is:

    \text{Margin} = \frac{2}{\|w\|} = \frac{2}{2\sqrt{2}/3} = \frac{3}{\sqrt{2}} = \frac{3\sqrt{2}}{2}

    Answer: The optimal hyperplane is defined by w = [2/3, 2/3]^T and b = -5/3. The margin is \frac{3\sqrt{2}}{2}.
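The hand computation above can be cross-checked by solving the same three margin equations as a linear system. A NumPy sketch (this solves the given support-vector equations, not the general SVM optimization):

```python
import numpy as np

# The three margin equations in unknowns (w1, w2, b):
#   2*w1 + 2*w2 + b =  1   (x1 = [2, 2], class +1)
#   1*w1 + 0*w2 + b = -1   (x2 = [1, 0], class -1)
#   0*w1 + 1*w2 + b = -1   (x3 = [0, 1], class -1)
A = np.array([[2.0, 2.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
rhs = np.array([1.0, -1.0, -1.0])
w1, w2, b = np.linalg.solve(A, rhs)

w = np.array([w1, w2])
margin = 2.0 / np.linalg.norm(w)
print(w, b, margin)  # w = [2/3, 2/3], b = -5/3, margin = 3*sqrt(2)/2 ≈ 2.121
```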

    ---

    4. The Soft-Margin Classifier

    The hard-margin SVM has a significant limitation: it requires the data to be perfectly linearly separable. In most real-world scenarios, classes overlap. Furthermore, the hard-margin approach is highly sensitive to outliers. A single outlier can drastically alter the resulting hyperplane.

    To address this, we introduce the soft-margin SVM. This formulation allows some data points to be on the wrong side of the margin, or even on the wrong side of the hyperplane (i.e., misclassified). This is achieved by introducing non-negative slack variables, \xi_i \ge 0, for each data point.

    • If \xi_i = 0, the point is correctly classified and is on or outside the margin boundary.
    • If 0 < \xi_i \le 1, the point is correctly classified but lies inside the margin.
    • If \xi_i > 1, the point is misclassified.
    The optimization problem is modified to penalize these violations.
    πŸ“ Soft-Margin SVM Optimization
    min⁑w,b,ΞΎ12βˆ₯wβˆ₯2+Cβˆ‘i=1nΞΎi\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i
    subject to:
    yi(wTxi+b)β‰₯1βˆ’ΞΎi,andΞΎiβ‰₯0,forΒ allΒ i=1,…,ny_i(w^T x_i + b) \ge 1 - \xi_i, \quad \text{and} \quad \xi_i \ge 0, \quad \text{for all } i=1, \dots, n

    Variables:

      • w,b,xi,yiw, b, x_i, y_i: Same as hard-margin SVM.

      • ΞΎi\xi_i: Slack variable for the ii-th data point.

      • CC: A non-negative regularization hyperparameter.


    When to use: When data is not linearly separable or when a more robust solution, less sensitive to outliers, is desired.

    The parameter C controls the trade-off between maximizing the margin and minimizing the classification error (represented by \sum \xi_i).

    • A small C value results in a wider margin but tolerates more margin violations. This can lead to a simpler model that may underfit.

    • A large C value places a high penalty on violations, forcing the model to classify as many points correctly as possible. This leads to a narrower margin and can result in overfitting, as the model becomes highly sensitive to individual data points.
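The slack variables and the soft-margin objective can be computed directly for a fixed hyperplane. A NumPy sketch with made-up points that illustrate the three slack regimes described above:

```python
import numpy as np

def slacks(X, y, w, b):
    """Slack values xi_i = max(0, 1 - y_i (w^T x_i + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# Illustrative hyperplane x1 + x2 - 3 = 0 and three probe points (made-up data)
w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[4.0, 4.0],    # y=+1, f(x)=5:   xi = 0   (on/outside the margin)
              [3.5, 0.0],    # y=+1, f(x)=0.5: xi = 0.5 (inside the margin)
              [1.0, 1.0]])   # y=+1, f(x)=-1:  xi = 2   (misclassified)
y = np.array([1.0, 1.0, 1.0])
xi = slacks(X, y, w, b)

# Soft-margin objective for this fixed (w, b) at a given C
C = 1.0
objective = 0.5 * (w @ w) + C * xi.sum()
print(xi, objective)  # xi = [0, 0.5, 2], objective = 3.5
```

Raising C inflates the penalty term C * xi.sum(), which is why the optimizer then prefers hyperplanes with fewer and smaller violations at the cost of a narrower margin.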


    ---

    5. The Kernel Trick for Non-linear Data

    What if the decision boundary is inherently non-linear? SVM can handle this using the kernel trick. The core idea is to map the original input features into a higher-dimensional feature space where the data becomes linearly separable. A linear hyperplane is then found in this new, higher-dimensional space.

    The "trick" is that we do not need to explicitly compute the transformation of the data points. Instead, kernel functions allow us to compute the dot products between the transformed vectors in the high-dimensional space directly, using only the original vectors. This is computationally much more efficient.

    Common kernel functions include:

    • Polynomial Kernel: K(x_i, x_j) = (x_i^T x_j + c)^d

    • Radial Basis Function (RBF) Kernel: K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)


    The use of kernels makes SVM an extremely powerful and flexible classifier, capable of learning complex, non-linear decision boundaries.
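The "trick" can be made concrete for the degree-2 polynomial kernel with c = 0, where the implicit feature map is known in closed form. A NumPy sketch (the vectors x and z are illustrative):

```python
import numpy as np

def poly_kernel(x, z, c=0.0, d=2):
    """Polynomial kernel K(x, z) = (x^T z + c)^d, computed in the input space."""
    return (x @ z + c) ** d

def phi(x):
    """Explicit degree-2 feature map for 2-D input (the c = 0 case):
    phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

k_implicit = poly_kernel(x, z)   # (1*3 + 2*4)^2 = 121, no mapping needed
k_explicit = phi(x) @ phi(z)     # same value, via the explicit 3-D map
print(k_implicit, k_explicit)  # 121.0 121.0
```

Both paths give the same number, but the kernel evaluation never constructs the higher-dimensional vectors; for the RBF kernel the implicit space is infinite-dimensional, so the explicit route is not even possible.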

    ---

    Problem-Solving Strategies

    💡 GATE Strategy

    • Check for Linear Separability: For 2D data, quickly sketch the points. If you can draw a single straight line to separate the classes, the data is linearly separable, and a hard-margin SVM is applicable.

    • Identify Candidate Support Vectors: The support vectors will always be the points from each class that are closest to the opposing class. Visually identify these points first. In a typical GATE problem, there will be 2 or 3 support vectors that define the hyperplane.

    • Use Support Vectors to Solve for w and b: If you are given candidate support vectors, plug them into the margin equations (w^T x + b = 1 for class +1, w^T x + b = -1 for class -1). This creates a system of linear equations that you can solve for w and b.

    • Verify the Solution: Once you have w and b, ensure that for all data points (not just the support vectors), the condition y_i(w^T x_i + b) \ge 1 holds. If it doesn't, your assumed support vectors were incorrect.

    • Calculate Margin Quickly: The margin is always \frac{2}{\|w\|}. This is a very common calculation. Remember that \|w\| = \sqrt{w_1^2 + w_2^2 + \dots + w_p^2}.

    ---

    Common Mistakes

    ⚠️ Avoid These Errors
      • ❌ Confusing Margin Width: Thinking the margin is \frac{1}{\|w\|}.
    ✅ The margin is the total width of the "street," which is \frac{2}{\|w\|}. The distance from the hyperplane to the margin boundary is \frac{1}{\|w\|}.
      • ❌ Assuming Symmetrical Support Vectors: Believing there must be an equal number of support vectors from each class.
    ✅ This is not necessary. It is common to have, for instance, one support vector from the positive class and two from the negative class.
      • ❌ Ignoring Non-Support Vectors: Solving for w and b using candidate support vectors but failing to check if the solution correctly classifies all other points according to the hard-margin constraint y_i(w^T x_i + b) \ge 1.
    ✅ Always verify your final hyperplane against all data points in the training set.

    ---

    Practice Questions

    :::question type="MCQ" question="A dataset is considered linearly separable if:" options=["All data points lie on a single straight line.","A hyperplane can be drawn such that all points of one class lie on one side of it, and all points of the other class lie on the other side.","The mean of the two classes is different.","The data can be clustered into two distinct groups using K-Means."] answer="A hyperplane can be drawn such that all points of one class lie on one side of it, and all points of the other class lie on the other side." hint="Recall the fundamental condition for applying a hard-margin SVM." solution="Linear separability is the property of a dataset where a single hyperplane (a line in 2D, a plane in 3D, etc.) can perfectly separate the data points belonging to different classes. The other options are incorrect: points lying on a single line is a degenerate case, different means do not guarantee separability, and K-Means is an unsupervised clustering algorithm, not a test for linear separability in a supervised context."
    :::

    :::question type="NAT" question="A hard-margin SVM classifier is trained and the resulting weight vector is w = [3, -4]^T. What is the width of the margin?" answer="0.8" hint="The margin is calculated as 2/\|w\|. First, compute the L2-norm of the weight vector." solution="Step 1: The formula for the margin of an SVM is \frac{2}{\|w\|}.

    Step 2: Calculate the L2-norm (magnitude) of the weight vector w = [3, -4]^T.

    \|w\| = \sqrt{3^2 + (-4)^2}
    \|w\| = \sqrt{9 + 16}
    \|w\| = \sqrt{25}
    \|w\| = 5

    Step 3: Calculate the margin.

    \text{Margin} = \frac{2}{\|w\|} = \frac{2}{5}

    Result:

    \text{Margin} = 0.8
    " :::

    :::question type="MSQ" question="A hard-margin linear SVM is trained on the following 2D dataset: Class +1: \{(4, 4)\}, Class -1: \{(2, 0), (0, 2)\}. Which of the following statements is/are correct?" options=["The weight vector w is [1, 1]^T.","The bias term b is -3.","The number of support vectors is 3.","The margin is \sqrt{2}."] answer="C" hint="Assume all three points are support vectors and solve the system of equations y_i(w^T x_i + b) = 1. Then verify all results." solution="
    Step 1: Determine the canonical weight vector w and bias b.
    The data points are: Class +1: x_1 = [4, 4]^T; Class -1: x_2 = [2, 0]^T, x_3 = [0, 2]^T.
    Assuming these are the support vectors, they must satisfy y_i(w^T x_i + b) = 1.
    1) w^T [4, 4]^T + b = 1 \implies 4w_1 + 4w_2 + b = 1
    2) -(w^T [2, 0]^T + b) = 1 \implies 2w_1 + b = -1
    3) -(w^T [0, 2]^T + b) = 1 \implies 2w_2 + b = -1

    From (2) and (3), 2w_1 = 2w_2 \implies w_1 = w_2.
    Substitute w_1 = w_2 into (1): 8w_1 + b = 1.
    Now solve the system:

    8w_1 + b = 1

    2w_1 + b = -1

    Subtracting the second equation from the first:
    (8w_1 + b) - (2w_1 + b) = 1 - (-1)

    6w_1 = 2

    w_1 = \frac{1}{3}

    So, w_2 = \frac{1}{3}.
    Substitute w_1 = \frac{1}{3} into 2w_1 + b = -1:
    2\left(\frac{1}{3}\right) + b = -1

    \frac{2}{3} + b = -1

    b = -1 - \frac{2}{3} = -\frac{5}{3}

    The canonical parameters are w = \begin{bmatrix} 1/3 \\ 1/3 \end{bmatrix} and b = -\frac{5}{3}.

    Step 2: Evaluate the options.

    A. The weight vector w is [1, 1]^T.
    The canonical weight vector is w = [1/3, 1/3]^T. The vector [1, 1]^T points in the same direction but is not the weight vector of the canonical hyperplane, so this statement is incorrect.

    B. The bias term b is -3.
    For the canonical hyperplane, b = -5/3. If w were scaled to [1, 1]^T (i.e., multiplied by 3), the bias term would be 3 \times (-5/3) = -5. Neither scaling gives b = -3, so this statement is incorrect.

    C. The number of support vectors is 3.
    Let's check the functional margin y_i(w^T x_i + b) for each point using the canonical w = [1/3, 1/3]^T and b = -5/3:

    • For x_1 = (4, 4), y_1 = 1: 1 \cdot \left( \frac{1}{3}(4) + \frac{1}{3}(4) - \frac{5}{3} \right) = \frac{8}{3} - \frac{5}{3} = 1.

    • For x_2 = (2, 0), y_2 = -1: -1 \cdot \left( \frac{1}{3}(2) + \frac{1}{3}(0) - \frac{5}{3} \right) = -1 \cdot \left( -\frac{3}{3} \right) = 1.

    • For x_3 = (0, 2), y_3 = -1: -1 \cdot \left( \frac{1}{3}(0) + \frac{1}{3}(2) - \frac{5}{3} \right) = -1 \cdot \left( -\frac{3}{3} \right) = 1.

    All three points have a functional margin of 1, meaning they are all support vectors. This statement is correct.

    D. The margin is \sqrt{2}.
    For the canonical w = [1/3, 1/3]^T:

    \|w\| = \sqrt{\left(\frac{1}{3}\right)^2 + \left(\frac{1}{3}\right)^2} = \sqrt{\frac{1}{9} + \frac{1}{9}} = \sqrt{\frac{2}{9}} = \frac{\sqrt{2}}{3}

    The margin is \frac{2}{\|w\|} = \frac{2}{\sqrt{2}/3} = \frac{6}{\sqrt{2}} = 3\sqrt{2}, not \sqrt{2}, so this statement is incorrect.

    Therefore, only statement C is correct.
    "
    :::

    :::question type="MCQ" question="For a soft-margin SVM, what is the effect of choosing a very large value for the regularization parameter C?" options=["It results in a very wide margin, prioritizing simplicity over accuracy.","It heavily penalizes misclassified points, leading to a narrower margin and potentially overfitting.","It has no effect on the final model.","It forces the model to use a non-linear kernel."] answer="It heavily penalizes misclassified points, leading to a narrower margin and potentially overfitting." hint="Consider the objective function: \min \frac{1}{2}\|w\|^2 + C \sum \xi_i. What happens when C is large?" solution="The objective function for a soft-margin SVM is a trade-off between maximizing the margin (minimizing \|w\|^2) and minimizing the classification errors (minimizing \sum \xi_i). The parameter C controls this trade-off. A very large C places a high penalty on the slack variables \xi_i. To minimize the objective, the algorithm will try to make the \xi_i as small as possible, even if it means choosing a hyperplane with a smaller margin. This makes the model less tolerant of misclassifications, fitting the training data very closely, which can lead to a narrow margin and overfitting."
    :::

    ---

    Summary

    ❗ Key Takeaways for GATE

    • Core Principle: SVM's primary goal is to find the maximal margin hyperplane, which is the decision boundary that is farthest from the nearest data points of both classes. This maximization of margin leads to better generalization.

    • Support Vectors are Key: The optimal hyperplane is determined exclusively by the support vectors, the data points lying on the margin boundaries. All other points are irrelevant to defining the boundary.

    • Hard vs. Soft Margin: The hard-margin SVM applies only to linearly separable data. The soft-margin SVM is more practical; it uses slack variables (\xi_i) and a regularization parameter (C) to handle non-separable data and outliers by allowing for some classification errors.

    • Margin Calculation: The width of the margin is given by the formula \frac{2}{\|w\|}, where w is the weight vector of the canonical hyperplane. Maximizing the margin is equivalent to minimizing \|w\|^2.

    ---

    What's Next?

    💡 Continue Learning

    This topic connects to:

      • Logistic Regression: Both are linear classifiers. It is insightful to compare their loss functions and how they derive their decision boundaries. SVM's boundary is determined by the "hardest" points (support vectors), while logistic regression's is influenced by all points.


      • Kernel Methods: The kernel trick is not unique to SVM. Understanding it provides a gateway to other kernel-based algorithms like Kernel PCA, which extend linear models to handle non-linear structures in data.


    Master these connections for a more comprehensive understanding of classification algorithms in machine learning.

    ---

    Chapter Summary

    In this chapter, we have undertaken a detailed examination of several fundamental classification models, each offering a distinct approach to the task of assigning labels to data. We began with instance-based learning in k-Nearest Neighbors and proceeded to tree-based, probabilistic, and margin-based models. Our exploration has revealed that no single model is universally superior; the optimal choice is invariably contingent upon the specific characteristics of the dataset and the problem at hand.

    📖 Classification Models - Key Takeaways

    • Paradigms of Classification: We have distinguished between several model types. Generative models (e.g., Naive Bayes, LDA) learn the joint probability distribution P(X, Y), whereas discriminative models (e.g., SVM, Decision Trees) directly learn the conditional probability P(Y|X) or the decision boundary itself. Furthermore, parametric models (e.g., LDA, Naive Bayes) assume a specific functional form with a fixed number of parameters, while non-parametric models (e.g., k-NN, Decision Trees) make no such assumption, allowing model complexity to grow with the data.

    • The Nature of Decision Boundaries: Each model constructs a different form of decision boundary. Linear Discriminant Analysis (LDA) and Support Vector Machines (SVMs) with a linear kernel explicitly create linear boundaries. Decision Trees produce non-linear, axis-parallel (rectilinear) boundaries. The k-NN algorithm generates a complex, non-linear boundary derived from the local proximity of training instances.

    • The Central Role of Assumptions: The performance and suitability of these models are critically dependent on their underlying assumptions. The Naive Bayes classifier relies on the strong, or "naive," assumption of conditional independence among features. LDA assumes that the features are normally distributed with a common covariance matrix across all classes. Violations of these assumptions can significantly degrade model performance.

    • Computational Trade-offs: There exists a fundamental trade-off between computational complexity during training and inference. Lazy learners like k-NN have a trivial training phase but can be computationally expensive at prediction time, as they must compute distances to all training points. Conversely, models like SVMs may have a computationally intensive training phase to find the optimal hyperplane, but are typically fast during inference.

    • Controlling Model Complexity and Overfitting: We have seen that overfitting is a primary concern for models like Decision Trees. This is addressed through techniques such as pruning. For SVMs, the soft-margin constant, C, acts as a regularization parameter that controls the trade-off between maximizing the margin and minimizing training error. For k-NN, the choice of k determines the smoothness of the decision boundary and thus the model's complexity.

    • Power of the Kernel Trick: A key concept for SVMs is the kernel trick. This powerful technique enables the model to create non-linear decision boundaries by implicitly mapping the input features into a higher-dimensional space, where a linear separation may be possible, without ever explicitly computing the transformation.

    ---

    Chapter Review Questions

    :::question type="MCQ" question="A data scientist is choosing between Linear Discriminant Analysis (LDA) and a Gaussian Naive Bayes (GNB) classifier for a binary classification problem. Both models assume features follow a Gaussian distribution. What is the key difference in their assumptions that typically leads to different decision boundaries?" options=["LDA assumes a common covariance matrix for all classes, while GNB assumes a diagonal covariance matrix for each class (feature independence).","GNB assumes a common covariance matrix for all classes, while LDA assumes a diagonal covariance matrix for each class.","LDA is a non-parametric model, whereas GNB is a parametric model.","LDA produces a linear decision boundary, whereas GNB always produces a non-linear (quadratic) decision boundary."] answer="A" hint="Recall the 'naive' assumption in Naive Bayes and the specific assumption LDA makes to ensure its decision boundary is linear." solution="
    The correct answer is A. Let us analyze the assumptions of each model.

    Linear Discriminant Analysis (LDA):
    LDA is a generative classifier that models the class-conditional densities P(\mathbf{x}|y=k) as multivariate Gaussian distributions. A key assumption of LDA is that all classes share the same covariance matrix, i.e., \Sigma_k = \Sigma for all classes k. This assumption leads to a decision boundary that is a linear function of \mathbf{x}. The log-ratio of posteriors becomes:

    \log \frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = \log \frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + \mathbf{x}^T\Sigma^{-1}(\mu_k-\mu_l)

    This is a linear equation in \mathbf{x}.

    Gaussian Naive Bayes (GNB):
    GNB also models P(\mathbf{x}|y=k) as a Gaussian. However, it incorporates the "naive" assumption of conditional independence between features. This is equivalent to assuming that the covariance matrix for each class, \Sigma_k, is a diagonal matrix. The off-diagonal elements, representing covariance between features, are zero. GNB does not assume that this diagonal covariance matrix is the same for all classes.

    • Option A correctly states this fundamental difference. LDA's shared covariance matrix assumption contrasts with GNB's feature independence assumption, which implies a diagonal covariance matrix that can differ for each class.
    • Option B incorrectly reverses the assumptions.
    • Option C is incorrect. Both LDA and GNB are parametric models, as they assume a specific (Gaussian) distribution for the data.
    • Option D is not always true. While GNB can produce a quadratic boundary (if the class-specific variances differ), it can also produce a linear one. The more fundamental distinction lies in the structure of the assumed covariance matrix.
    " :::

    :::question type="NAT" question="For a dataset with two classes, C1 and C2, a node in a decision tree contains 10 samples of C1 and 6 samples of C2. This node is split into two child nodes based on a feature. Child Node 1 contains 8 samples of C1 and 2 samples of C2. Child Node 2 contains 2 samples of C1 and 4 samples of C2. Calculate the Gini Gain (reduction in Gini Impurity) from this split. Provide the answer rounded to three decimal places." answer="0.102" hint="Gini Gain is calculated as the Gini Impurity of the parent node minus the weighted average of the Gini Impurities of the child nodes. The Gini Impurity for a node is 1 - \sum_i p_i^2." solution="
    We first calculate the Gini Impurity for the parent node and then for each child node.

    1. Gini Impurity of the Parent Node:
    The parent node has a total of 10+6=1610 + 6 = 16 samples.
    The proportions of the classes are p1=1016=58p_1 = \frac{10}{16} = \frac{5}{8} and p2=616=38p_2 = \frac{6}{16} = \frac{3}{8}.

    Giniparent=1βˆ’[(58)2+(38)2]=1βˆ’[2564+964]=1βˆ’3464=3064=0.46875Gini_{parent} = 1 - \left[ \left(\frac{5}{8}\right)^2 + \left(\frac{3}{8}\right)^2 \right] = 1 - \left[ \frac{25}{64} + \frac{9}{64} \right] = 1 - \frac{34}{64} = \frac{30}{64} = 0.46875

    2. Gini Impurity of the Child Nodes:

    • Child Node 1: Contains 8+2=108+2=10 samples. Proportions are p1=810=45p_1 = \frac{8}{10} = \frac{4}{5} and p2=210=15p_2 = \frac{2}{10} = \frac{1}{5}.

    Ginichild1=1βˆ’[(45)2+(15)2]=1βˆ’[1625+125]=1βˆ’1725=825=0.320Gini_{child1} = 1 - \left[ \left(\frac{4}{5}\right)^2 + \left(\frac{1}{5}\right)^2 \right] = 1 - \left[ \frac{16}{25} + \frac{1}{25} \right] = 1 - \frac{17}{25} = \frac{8}{25} = 0.320

    • Child Node 2: Contains 2+4=62+4=6 samples. Proportions are p1=26=13p_1 = \frac{2}{6} = \frac{1}{3} and p2=46=23p_2 = \frac{4}{6} = \frac{2}{3}.

    Ginichild2=1βˆ’[(13)2+(23)2]=1βˆ’[19+49]=1βˆ’59=49β‰ˆ0.4444...Gini_{child2} = 1 - \left[ \left(\frac{1}{3}\right)^2 + \left(\frac{2}{3}\right)^2 \right] = 1 - \left[ \frac{1}{9} + \frac{4}{9} \right] = 1 - \frac{5}{9} = \frac{4}{9} \approx 0.4444...

    3. Weighted Average Gini Impurity of Children:
    The weights are the proportion of samples from the parent that go to each child.
    Weight for Child 1: 1016\frac{10}{16}. Weight for Child 2: 616\frac{6}{16}.

    Ginichildren=(1016)Ginichild1+(616)Ginichild2Gini_{children} = \left(\frac{10}{16}\right) Gini_{child1} + \left(\frac{6}{16}\right) Gini_{child2}

    Gini_{children} = \left(\frac{5}{8}\right)(0.320) + \left(\frac{3}{8}\right)\left(\frac{4}{9}\right) = 0.200 + \frac{12}{72} = 0.200 + \frac{1}{6} \approx 0.200 + 0.1667 = 0.3667

    4. Gini Gain:
    The Gini Gain is the reduction in impurity.

    \text{Gini Gain} = Gini_{parent} - Gini_{children} = 0.46875 - 0.36667 \approx 0.10208

    Rounding to three decimal places, the answer is 0.102.
    "
    :::
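    The arithmetic above can be verified with a short, self-contained Python sketch. The function names `gini` and `gini_gain` are illustrative choices for this example, not part of any particular library:

    ```python
    def gini(counts):
        """Gini impurity 1 - sum(p_i^2) for a list of per-class sample counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def gini_gain(parent, children):
        """Parent impurity minus the sample-weighted impurity of the child nodes."""
        n = sum(parent)
        weighted = sum(sum(child) / n * gini(child) for child in children)
        return gini(parent) - weighted

    # Parent: 10 of C1, 6 of C2; children: (8, 2) and (2, 4).
    gain = gini_gain([10, 6], [[8, 2], [2, 4]])
    print(round(gain, 3))  # 0.102
    ```

    Running this confirms the NAT answer of 0.102, and the same two helpers can be reused to check any decision-tree split question of this form.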

    :::question type="MSQ" question="Which of the following statements regarding Support Vector Machines (SVMs) are correct?" options=["The decision boundary is determined only by the support vectors.","The kernel trick allows SVMs to find non-linear decision boundaries by implicitly mapping data to a higher-dimensional space.","Increasing the regularization parameter C in a soft-margin SVM generally leads to a wider margin and allows for more misclassifications of training points.","SVM is a generative model that estimates the probability distribution of the data."] answer="A,B" hint="Consider the mathematical formulation of the SVM objective function and the role of the parameter C. How does an SVM differ from a model like LDA or Naive Bayes in its modeling approach?" solution="
    Let us evaluate each statement.

    • A) The decision boundary is determined only by the support vectors.
    This statement is correct. The optimal hyperplane in an SVM is defined by the points from each class that are closest to it (in the case of a hard margin) or are on the margin or violate it (in the case of a soft margin). These points are called support vectors. The positions of all other data points do not influence the final decision boundary.
    • B) The kernel trick allows SVMs to find non-linear decision boundaries by implicitly mapping data to a higher-dimensional space.
    This statement is correct. This is the primary purpose of the kernel trick. By using a kernel function K(xi,xj)=Ο•(xi)TΟ•(xj)K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j), the SVM can learn a linear boundary in a high-dimensional feature space defined by Ο•\phi without ever having to compute the coordinates in that space. This corresponds to a non-linear boundary in the original input space.
    • C) Increasing the regularization parameter CC in a soft-margin SVM generally leads to a wider margin and allows for more misclassifications of training points.
    This statement is incorrect. The parameter CC controls the penalty for misclassified training examples. A large value of CC corresponds to a high penalty, forcing the SVM to create a model with fewer misclassifications. This typically results in a narrower margin and can lead to overfitting. Conversely, a small CC allows for more misclassifications and a wider margin, promoting generalization.
    • D) SVM is a generative model that estimates the probability distribution of the data.
    This statement is incorrect. SVM is a quintessential discriminative model. It focuses solely on finding the decision boundary that best separates the classes, without making any assumptions about or attempting to model the underlying probability distributions of the data itself. Generative models, such as Naive Bayes or LDA, do model these distributions.

    Therefore, the correct statements are A and B.
    "
    :::
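    The kernel trick in statement B can be made concrete with a small NumPy sketch (illustrative only, not tied to any SVM library). For 2-D inputs, the degree-2 polynomial kernel K(x, z) = (x · z)^2 equals an ordinary dot product after the explicit feature map \phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2), so the kernel computes inner products in the higher-dimensional space without ever constructing it:

    ```python
    import numpy as np

    def poly_kernel(x, z):
        """Degree-2 polynomial kernel: K(x, z) = (x . z)^2, computed in input space."""
        return np.dot(x, z) ** 2

    def phi(x):
        """Explicit feature map for the degree-2 kernel on 2-D inputs."""
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    # The kernel value agrees with the inner product in the mapped 3-D space.
    print(poly_kernel(x, z))  # 1.0
    print(np.isclose(poly_kernel(x, z), np.dot(phi(x), phi(z))))  # True
    ```

    This is exactly why kernelized SVMs scale with the number of support vectors rather than with the (possibly enormous) dimension of the feature space induced by \phi.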

    ---

    What's Next?

    💡 Continue Your GATE Journey

    Having completed our study of Classification Models, we have established a firm foundation in supervised machine learning. The principles and algorithms discussed here are not isolated concepts but rather integral components of a larger ecosystem of machine learning techniques.

    Key connections:

      • Relation to Previous Learning: This chapter directly builds upon foundational concepts of Supervised Learning, applying its principles to the specific task of classification. Furthermore, models like Naive Bayes and LDA are practical applications of the Probability Theory and Linear Algebra that form the mathematical bedrock of machine learning.
      • Building Blocks for Future Chapters: The concepts mastered here are prerequisites for more advanced topics.
    - Ensemble Methods: Decision Trees, which we studied here, are the fundamental building blocks for powerful ensemble models like Random Forests and Gradient Boosting Machines, which combine multiple weak learners to create a single strong learner. - Model Evaluation and Hyperparameter Tuning: Our discussion of parameters like kk in k-NN and CC in SVMs naturally leads to the next critical topic: how to systematically evaluate model performance and tune these hyperparameters using techniques like Cross-Validation and Grid Search. - Advanced Models: The classifiers we have covered serve as essential baselines against which more complex models, such as Artificial Neural Networks and Deep Learning architectures, are compared. - Unsupervised Learning: The dimensionality reduction aspect of LDA provides a conceptual link to purely unsupervised techniques like Principal Component Analysis (PCA), which also seeks to find a lower-dimensional representation of data.

    🎯 Key Points to Remember

    • βœ“ Master the core concepts in Classification Models before moving to advanced topics
    • βœ“ Practice with previous year questions to understand exam patterns
    • βœ“ Review short notes regularly for quick revision before exams
