Descriptive Statistics

Overview

In the study of engineering and computer science, we are frequently confronted with vast datasets, from system performance logs to experimental results. The ability to distill these collections of raw numbers into a meaningful and concise summary is a foundational pillar of quantitative analysis. Descriptive statistics provides the formal methodology for this process. It is the branch of statistics concerned with summarizing the features of a dataset, rather than using the data to infer properties about a larger population. The primary objective is to present quantitative descriptions in a manageable form, revealing underlying patterns and providing a clear, objective basis for interpretation.

This chapter introduces the fundamental tools required for this task. We shall explore two principal categories of descriptive measures: measures of central tendency, which identify the "center" or typical value of a distribution, and measures of dispersion, which quantify the "spread" or variability of the data points around that center. A firm command of these concepts is indispensable for the GATE examination, where questions frequently test one's ability to analyze given data distributions and compute their characteristic properties. Mastery of this material is not merely about calculation; it is about developing the analytical acumen to correctly interpret and compare datasets, a critical skill for any practicing engineer.

---

Chapter Contents

| # | Topic | What You'll Learn |
|---|-------|-------------------|
| 1 | Measures of Central Tendency | Summarizing data with a single, central value. |
| 2 | Measures of Dispersion | Quantifying the spread or variability of data. |

---

Learning Objectives

❗ By the End of This Chapter

After completing this chapter, you will be able to:

Calculate and interpret the mean, median, and mode for both grouped and ungrouped data.

Compute and analyze the range, variance, and standard deviation ( $\sigma$ ) of a dataset.

Differentiate between measures of central tendency and dispersion, and identify the appropriate measure for a given data distribution.

Apply the principles of descriptive statistics to solve numerical problems typical of the GATE examination.

---

We now turn our attention to Measures of Central Tendency...

Part 1: Measures of Central Tendency

Introduction

In the study of statistics, our primary objective is often to distill a large set of observations into a few representative numerical values. When confronted with a dataset, whether it represents student marks, network packet delays, or processor clock cycles, we first seek a single value that can be considered the "center" or "typical" value of the distribution. Such a value provides a concise summary of the entire dataset, anchoring our understanding of its overall location on the number line.

The techniques for identifying this central point are known as measures of central tendency. These measures are fundamental to descriptive statistics, forming the bedrock upon which more complex analyses, such as measures of dispersion and inferential statistics, are built. For the GATE examination, a firm grasp of the three primary measures—the mean, the median, and the mode—is essential for interpreting data distributions and solving related quantitative problems. We shall explore the definition, calculation, and appropriate application of each of these measures for both ungrouped and grouped data.

📖 Measure of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The primary measures are the mean, median, and mode.

---

Key Concepts

1. The Arithmetic Mean

The arithmetic mean, or simply the mean, is the most commonly used measure of central tendency. It is calculated by summing all the values in a dataset and dividing by the number of values. We can consider its calculation for two types of data presentations: ungrouped and grouped.

For Ungrouped Data:
If we have a set of $n$ observations $x_1, x_2, \dots, x_n$ , the mean (denoted by $\bar{x}$ ) is their sum divided by the number of observations.

📐 Arithmetic Mean (Ungrouped Data)

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

Variables:

$\bar{x}$ = The arithmetic mean

$x_i$ = The $i$ -th observation in the dataset

$n$ = The total number of observations

When to use: For finding the average of a raw list of numerical values.

For Grouped Data:
When data is presented in a frequency distribution, we use the midpoint of each class interval as a representative value for that class.

📐 Arithmetic Mean (Grouped Data)

\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{\sum f_i x_i}{N}

Variables:

$\bar{x}$ = The arithmetic mean

$f_i$ = The frequency of the $i$ -th class

$x_i$ = The midpoint of the $i$ -th class interval

$k$ = The number of classes

$N = \sum f_i$ = The total frequency

When to use: For data organized into class intervals with corresponding frequencies.
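The grouped-data formula can be sketched in a few lines of Python. The frequency table below is hypothetical (it does not come from the text), and the function name `grouped_mean` is an illustrative choice; each class is assumed to be given as a `(lower, upper, frequency)` tuple.

```python
def grouped_mean(classes):
    """Mean of grouped data; classes is a list of (lower, upper, frequency)."""
    total_freq = sum(f for _, _, f in classes)                        # N
    weighted_sum = sum(f * (lo + hi) / 2 for lo, hi, f in classes)    # sum of f_i * x_i
    return weighted_sum / total_freq

# Hypothetical frequency table, for illustration only
table = [(0, 10, 5), (10, 20, 10), (20, 30, 10), (30, 40, 5)]
print(grouped_mean(table))  # 20.0
```

Because every observation in a class is replaced by the class midpoint, the result is an approximation of the mean of the underlying raw data.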

Worked Example:

Problem: The marks obtained by 10 students in a test are: 15, 20, 25, 20, 15, 30, 45, 50, 20, 10. Calculate the mean mark.

Solution:

Step 1: Identify the observations and the total count.
The observations are $x_i = \{15, 20, 25, 20, 15, 30, 45, 50, 20, 10\}$ .
The number of observations is $n = 10$ .

Step 2: Sum all the observations.

\sum_{i=1}^{10} x_i = 15 + 20 + 25 + 20 + 15 + 30 + 45 + 50 + 20 + 10

Step 3: Compute the sum.

\sum x_i = 250

Step 4: Apply the formula for the mean.

\bar{x} = \frac{\sum x_i}{n} = \frac{250}{10}

Result:

\bar{x} = 25

Answer: The mean mark is $25$ .
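As a quick cross-check of the worked example, the same calculation can be done in Python, once by hand and once with the standard library's `statistics.mean`. The variable names are illustrative.

```python
from statistics import mean

marks = [15, 20, 25, 20, 15, 30, 45, 50, 20, 10]

manual_mean = sum(marks) / len(marks)  # sum of x_i divided by n
print(manual_mean)    # 25.0
print(mean(marks))    # 25
```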

---

2. The Median

The median is the middle value of a dataset that has been arranged in ascending or descending order. It effectively divides the dataset into two equal halves. Unlike the mean, the median is not affected by extremely large or small values (outliers), making it a more robust measure of central tendency for skewed distributions.

For Ungrouped Data:
First, we must arrange the data in order.

If the number of observations $n$ is odd, the median is the value at the $\left(\frac{n+1}{2}\right)^{th}$ position.

If the number of observations $n$ is even, the median is the average of the two middle values, which are at the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2} + 1\right)^{th}$ positions.

For Grouped Data:
For data in a frequency distribution, the median is found within the median class, which is the first class interval whose cumulative frequency reaches or exceeds $N/2$.

📐 Median (Grouped Data)

\text{Median} = L + \left( \frac{\frac{N}{2} - C}{f} \right) \times h

Variables:

$L$ = Lower limit of the median class

$N$ = Total frequency ( $\sum f$ )

$C$ = Cumulative frequency of the class preceding the median class

$f$ = Frequency of the median class

$h$ = Width of the median class interval

When to use: For finding the central value in ordered data, especially when the data is skewed or contains outliers.
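The grouped-median formula can be sketched as follows. This is a minimal illustration under stated assumptions: classes are given as `(lower, upper, frequency)` tuples sorted by lower limit, the table is hypothetical, and the function name `grouped_median` is not from the text.

```python
def grouped_median(classes):
    """Median of grouped data; classes is a sorted list of (lower, upper, frequency)."""
    total = sum(f for _, _, f in classes)   # N
    if total == 0:
        raise ValueError("frequency table is empty")
    half = total / 2
    cumulative = 0                          # C: cumulative frequency before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= half:       # this is the median class
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq

# Hypothetical frequency table, for illustration only
table = [(0, 10, 4), (10, 20, 6), (20, 30, 10), (30, 40, 5)]
print(grouped_median(table))  # 22.5
```

Here $N = 25$, so $N/2 = 12.5$; the cumulative frequencies are 4, 10, 20, which makes 20 to 30 the median class with $L = 20$, $C = 10$, $f = 10$, and $h = 10$.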

Worked Example:

Problem: Find the median of the following dataset: 9, 3, 5, 1, 8, 10, 6.

Solution:

Step 1: Arrange the data in ascending order.

1, 3, 5, 6, 8, 9, 10

Step 2: Determine the number of observations, $n$ .
Here, $n = 7$ , which is an odd number.

Step 3: Find the position of the median.
The position is $\left(\frac{n+1}{2}\right)^{th} = \left(\frac{7+1}{2}\right)^{th} = 4^{th}$ position.

Step 4: Identify the value at this position.
The value at the 4th position in the sorted list is 6.

Result:

\text{Median} = 6

Answer: The median of the dataset is $6$ .
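The odd and even cases can both be captured in one short function. The sketch below is illustrative (the name `median_manual` is an assumption) and is compared against the standard library's `statistics.median`.

```python
from statistics import median

def median_manual(values):
    """Median of ungrouped data: sort first, then pick or average the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the ((n+1)/2)-th value
    return (ordered[mid - 1] + ordered[mid]) / 2   # average of the two middle values

odd_data = [9, 3, 5, 1, 8, 10, 6]
even_data = [2, 4, 6, 8]
print(median_manual(odd_data), median(odd_data))    # 6 6
print(median_manual(even_data), median(even_data))  # 5.0 5.0
```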

---

3. The Mode

The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), more than two modes (multimodal), or no mode at all if all values appear with the same frequency.

For Grouped Data:
The mode lies in the modal class, which is the class interval with the highest frequency.

📐 Mode (Grouped Data)

\text{Mode} = L + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) \times h

Variables:

$L$ = Lower limit of the modal class

$f_1$ = Frequency of the modal class

$f_0$ = Frequency of the class preceding the modal class

$f_2$ = Frequency of the class succeeding the modal class

$h$ = Width of the modal class interval

When to use: To identify the most frequent or popular value in a dataset, applicable to both numerical and categorical data.
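A minimal sketch of the grouped-mode formula follows. The frequency table is hypothetical, the function name `grouped_mode` is illustrative, and two conventions are assumed rather than taken from the text: the first class with the highest frequency is treated as the modal class, and a missing neighbouring class is given frequency 0.

```python
def grouped_mode(classes):
    """Mode of grouped data; classes is a sorted list of (lower, upper, frequency)."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                     # index of the modal class
    lower, upper, f1 = classes[i]
    f0 = freqs[i - 1] if i > 0 else 0               # class before (assumed 0 if absent)
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # class after (assumed 0 if absent)
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)

# Hypothetical frequency table, for illustration only
table = [(0, 10, 3), (10, 20, 6), (20, 30, 10), (30, 40, 4)]
print(grouped_mode(table))
```

For this table the modal class is 20 to 30 with $f_1 = 10$, $f_0 = 6$, $f_2 = 4$, giving $20 + \frac{4}{10} \times 10 = 24$. The formula is undefined when $2f_1 - f_0 - f_2 = 0$, which happens when the modal class and both neighbours share the same frequency.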

4. Relationship between Mean, Median, and Mode

For a perfectly symmetrical distribution, the mean, median, and mode are identical. However, for skewed distributions, these measures diverge. An empirical relationship exists for moderately skewed distributions.

📐 Empirical Relationship

\text{Mode} \approx 3 \cdot \text{Median} - 2 \cdot \text{Mean}

When to use: To estimate one measure of central tendency when the other two are known, particularly for unimodal, moderately skewed distributions.
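The empirical relationship can be rearranged to solve for whichever measure is missing. The helper names and the numbers below are hypothetical and serve only to illustrate the rearrangement; the results are approximations that assume a unimodal, moderately skewed distribution.

```python
def estimate_mode(mean, median):
    """Mode ≈ 3·Median − 2·Mean."""
    return 3 * median - 2 * mean

def estimate_median(mean, mode):
    """Rearranged form: Median ≈ (Mode + 2·Mean) / 3."""
    return (mode + 2 * mean) / 3

print(estimate_mode(mean=50, median=45))   # 35
print(estimate_median(mean=50, mode=35))   # 45.0
```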

The ordering of the three measures depends on the direction of skew:

Positive Skew (tail to the right): Mean > Median > Mode

Negative Skew (tail to the left): Mean < Median < Mode

---

Problem-Solving Strategies

💡 GATE Strategy

Identify Data Type: First, determine if the data is ungrouped (a list of numbers) or grouped (a frequency table). The choice of formula depends entirely on this.

Check for Outliers: When a question asks for the "most appropriate" measure, inspect the data for extreme values (outliers). If outliers are present, the median is generally a better representative of the center than the mean.

Order the Data: For median calculation with ungrouped data, always remember the first step is to sort the data. A common error is to find the middle value of the unsorted list.

Use the Empirical Formula: If two of the three measures (mean, median, mode) are given and the third is asked, the empirical formula is the fastest way to an approximate answer, assuming a moderately skewed distribution.
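The outlier check in the strategy above can be illustrated with a small experiment. The delay values are hypothetical; the point is only that one extreme value moves the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical packet delays in milliseconds; 200 is an outlier
delays_ms = [10, 12, 11, 13, 12, 200]

print(mean(delays_ms))    # 43
print(median(delays_ms))  # 12.0
```

Five of the six values lie between 10 and 13, so the median of 12 describes a typical delay far better than the mean of 43.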

---

Common Mistakes

⚠️ Avoid These Errors

❌ Confusing Formulas: Using the ungrouped data formula ( $\sum x_i / n$ ) for grouped data, or vice versa.

✅ Always check if frequencies ( $f_i$ ) are involved. If yes, it is grouped data.

❌ Median Position vs. Value: Forgetting to sort the data before finding the median. Also, confusing the position of the median (e.g., the 5th element) with the median's actual value.

✅ Sort first. Then find the position. Finally, identify the value at that position.

❌ Mode Calculation for Grouped Data: Incorrectly identifying $f_0$ , $f_1$ , and $f_2$ . $f_1$ is the frequency of the modal class itself, $f_0$ is the one before, and $f_2$ is the one after.

✅ Clearly label the frequencies relative to the modal class before substituting them into the formula.

---

Practice Questions

:::question type="NAT" question="The scores of a batsman in 10 innings are: 38, 70, 48, 34, 42, 55, 63, 46, 54, 44. The mean score is:" answer="49.4" hint="Sum all the scores and divide by the number of innings." solution="
Step 1: Sum the scores.

\sum x_i = 38 + 70 + 48 + 34 + 42 + 55 + 63 + 46 + 54 + 44

Step 2: Compute the sum.

\sum x_i = 494

Step 3: Divide by the number of innings, $n=10$ .

\bar{x} = \frac{494}{10}

Result:

\bar{x} = 49.4

"
:::

:::question type="MCQ" question="For a moderately skewed distribution, the mean is 30 and the mode is 24. What is the approximate value of the median?" options=["26", "27", "28", "29"] answer="28" hint="Use the empirical relationship: Mode ≈ 3 Median - 2 Mean." solution="
Step 1: State the empirical formula.

\text{Mode} = 3 \cdot \text{Median} - 2 \cdot \text{Mean}

Step 2: Substitute the given values.

24 = 3 \cdot \text{Median} - 2 \cdot (30)

Step 3: Simplify the equation.

24 = 3 \cdot \text{Median} - 60

Step 4: Solve for the Median.

3 \cdot \text{Median} = 24 + 60

3 \cdot \text{Median} = 84

\text{Median} = \frac{84}{3}

Result:

\text{Median} = 28

"
:::

:::question type="MCQ" question="What is the median of the dataset: 12, 4, 15, 7, 20, 9, 11?" options=["7", "9", "11", "12"] answer="11" hint="First, arrange the data in ascending order and then find the middle element." solution="
Step 1: Arrange the data in ascending order.

4, 7, 9, 11, 12, 15, 20

Step 2: Count the number of observations, $n$ .
Here, $n=7$ , which is an odd number.

Step 3: Find the position of the median term.
The position is $\left( \frac{n+1}{2} \right)^{th} = \left( \frac{7+1}{2} \right)^{th} = 4^{th}$ .

Step 4: Identify the value at the 4th position.
The 4th value in the sorted list is 11.

Result:

\text{Median} = 11

"
:::

:::question type="MSQ" question="Which of the following statements about measures of central tendency are correct?" options=["The mean is sensitive to outliers.", "The median is always one of the data points in the dataset.", "The mode can be used for categorical data.", "For a positively skewed distribution, Mean > Median > Mode."] answer="The mean is sensitive to outliers.,The mode can be used for categorical data.,For a positively skewed distribution, Mean > Median > Mode." hint="Evaluate each statement's validity. Consider edge cases for the median." solution="

The mean is sensitive to outliers: Correct. A single very large or very small value can significantly change the mean.

The median is always one of the data points in the dataset: Incorrect. If there is an even number of data points, the median is the average of the two middle points, which may not be in the original dataset (e.g., median of {2, 4, 6, 8} is 5).

The mode can be used for categorical data: Correct. The mode is the most frequent category (e.g., the most common color in a set of cars).

For a positively skewed distribution, Mean > Median > Mode: Correct. The tail of the distribution is on the right, which pulls the mean to the right of the median.

"
:::

---

Summary

❗ Key Takeaways for GATE

Mean ( $\bar{x}$ ): The arithmetic average. It is calculated as $\frac{\sum x_i}{n}$ for ungrouped data and $\frac{\sum f_i x_i}{\sum f_i}$ for grouped data. It is highly sensitive to outliers.

Median: The middle value of an ordered dataset. It is robust to outliers and is preferred for skewed data. For grouped data, it is found using the formula involving the median class.

Mode: The most frequently occurring value. Of the three measures, it is the only one that can be used for nominal (unordered categorical) data.

Relationship: For moderately skewed distributions, remember the empirical formula: $\text{Mode} \approx 3 \cdot \text{Median} - 2 \cdot \text{Mean}$ . This relationship also dictates the order of the three measures in skewed distributions.

---

What's Next?

💡 Continue Learning

This topic provides the foundation for describing data. We can now build upon this understanding:

Measures of Dispersion: While central tendency tells us about the center of the data, measures of dispersion (like variance and standard deviation) tell us how spread out the data is around that center. A complete description of data requires both.

Probability Distributions: The concepts of mean (as expected value), median, and mode are critical parameters used to define and understand standard probability distributions like the Normal, Poisson, and Binomial distributions.

Mastering these connections is crucial for a comprehensive understanding of Probability and Statistics for the GATE examination.

---

💡 Moving Forward

Now that you understand Measures of Central Tendency, let's explore Measures of Dispersion, which build on these concepts.

---

Part 2: Measures of Dispersion

Introduction

While measures of central tendency, such as the mean or median, provide a single value to represent the center of a dataset, they offer an incomplete picture of the data's characteristics. Two datasets may possess identical means yet exhibit vastly different distributions. To fully comprehend a dataset, we must also quantify the extent to which its values are spread out or clustered together. This quantification of spread is known as dispersion.

Measures of dispersion are statistical tools that describe the variability or scatter in a set of observations. A low measure of dispersion indicates that the data points tend to be clustered closely around the center, suggesting high uniformity. Conversely, a high measure of dispersion signifies that the data points are spread out over a wider range of values. In engineering and computer science, understanding dispersion is critical for applications ranging from performance analysis of algorithms to quality control in manufacturing processes.

📖 Statistical Dispersion

Statistical dispersion (also known as variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. It is a non-negative real number that is zero if all the data are the same and increases as the data become more diverse.

---

Key Concepts

We will now examine the fundamental measures used to quantify dispersion.

1. Range

The range is the most straightforward measure of dispersion. It is defined as the difference between the maximum and minimum values in a dataset. While simple to compute, it is sensitive to outliers and considers only the two extreme values, ignoring the distribution of the remaining data points.

📐 Range

R = x_{max} - x_{min}

Variables:

$x_{max}$ = Maximum value in the dataset

$x_{min}$ = Minimum value in the dataset

When to use: For a quick, preliminary assessment of spread when the data is not prone to extreme outliers.

Worked Example:

Problem: Find the range for the dataset: $\{12, 15, 7, 22, 18, 5\}$ .

Solution:

Step 1: Identify the maximum and minimum values.

x_{max} = 22

x_{min} = 5

Step 2: Apply the formula for the range.

R = 22 - 5

Step 3: Compute the final answer.

R = 17

Answer: The range of the dataset is $17$ .
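A short sketch makes the outlier sensitivity of the range concrete. The function name `value_range` and the added outlier of 100 are illustrative assumptions.

```python
def value_range(values):
    """Range: difference between the largest and smallest value."""
    return max(values) - min(values)

data = [12, 15, 7, 22, 18, 5]
with_outlier = data + [100]          # hypothetical extreme value

print(value_range(data))             # 17
print(value_range(with_outlier))     # 95
```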

---

2. Variance and Standard Deviation

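Variance and standard deviation are the most widely used measures of dispersion. Unlike the range, they take every data point into account, although they remain sensitive to outliers because the deviations are squared. They quantify how far, on average, the data points lie from the mean.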

The variance is the average of the squared differences from the mean. Squaring the differences ensures that they are all positive and gives more weight to larger deviations. The standard deviation is simply the square root of the variance, which returns the measure of spread to the original units of the data, making it more interpretable.

Figure: two curves centered at the same mean $\mu$, one with large $\sigma$ (high dispersion) and one with small $\sigma$ (low dispersion).

📐 Population Variance and Standard Deviation

\text{Population Variance: } \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}

\text{Population Standard Deviation: } \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}

Variables:

$\mu$ = Population mean

$x_i$ = Each value in the population

$N$ = Total number of values in the population

When to use: When the dataset represents the entire population of interest.

❗ Sample vs. Population

In GATE, you will often work with a sample of data rather than the entire population. When calculating variance from a sample, we use $n-1$ in the denominator (Bessel's correction) to obtain an unbiased estimate of the population variance.

Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Here, $\bar{x}$ is the sample mean and $n$ is the sample size. Unless specified otherwise, assume you are working with a population for simplicity, but be aware of this distinction.

Worked Example:

Problem: Calculate the population variance and standard deviation for the dataset: $\{2, 4, 4, 4, 5, 5, 7, 9\}$ .

Solution:

Step 1: Calculate the population mean ( $\mu$ ).

\mu = \frac{2+4+4+4+5+5+7+9}{8} = \frac{40}{8} = 5

Step 2: Calculate the squared differences from the mean.

$(2-5)^2 = (-3)^2 = 9$

$(4-5)^2 = (-1)^2 = 1$

$(4-5)^2 = (-1)^2 = 1$

$(4-5)^2 = (-1)^2 = 1$

$(5-5)^2 = (0)^2 = 0$

$(5-5)^2 = (0)^2 = 0$

$(7-5)^2 = (2)^2 = 4$

$(9-5)^2 = (4)^2 = 16$

Step 3: Sum the squared differences.

\sum (x_i - \mu)^2 = 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32

Step 4: Calculate the variance ( $\sigma^2$ ).

\sigma^2 = \frac{32}{8} = 4

Step 5: Calculate the standard deviation ( $\sigma$ ).

\sigma = \sqrt{4} = 2

Answer: The population variance is $4$ , and the standard deviation is $2$ .
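The worked example can be cross-checked with Python's standard `statistics` module, which provides separate functions for the population formulas (`pvariance`, `pstdev`) and the sample formula with Bessel's correction (`variance`). The comparison below is a sketch using the same dataset.

```python
from statistics import pvariance, pstdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(pvariance(data))  # population variance: 32 / 8 = 4
print(pstdev(data))     # population standard deviation: 2.0
print(variance(data))   # sample variance: 32 / 7, about 4.57
```

The sample variance is larger because the same sum of squared deviations, 32, is divided by $n - 1 = 7$ instead of $N = 8$.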

---

3. Coefficient of Variation (CV)

The standard deviation is an absolute measure of dispersion. It is expressed in the same units as the data. To compare the variability of two or more datasets with different means or different units, we require a relative measure of dispersion. The Coefficient of Variation is such a measure.

📐 Coefficient of Variation

CV = \frac{\sigma}{\mu}

Variables:

$\sigma$ = Standard deviation

$\mu$ = Mean

When to use: To compare the relative variability of two datasets, especially when their means are significantly different or they are measured in different units. It is often expressed as a percentage ( $CV \times 100\%$ ).
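The following sketch shows why a relative measure is needed. The two datasets are hypothetical and were chosen so that their standard deviations are numerically equal; the function name `coefficient_of_variation` is illustrative, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = population standard deviation / mean (undefined for a zero mean)."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / mu

# Hypothetical measurements, for illustration only
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(round(cv_heights, 4), round(cv_weights, 4))
```

Both datasets have the same standard deviation ( $\sqrt{200} \approx 14.14$ ), yet the weights vary more relative to their mean of 70 than the heights do relative to their mean of 170, so the weights have the larger CV.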

---

Problem-Solving Strategies

💡 GATE Strategy

For computational problems involving variance, a shortcut formula is often faster by hand than the definitional formula, and it avoids carrying a rounded mean through every deviation when the mean is not a whole number.

The computational formula for population variance is:

\sigma^2 = \frac{\sum x_i^2}{N} - \mu^2 = \frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2

This avoids calculating each individual deviation from the mean. First, calculate the sum of the squares ( $\sum x_i^2$ ) and the sum of the values ( $\sum x_i$ ), then substitute into the formula.
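The equivalence of the two formulas can be checked numerically. The sketch below uses exact rational arithmetic from `fractions` so that no rounding enters the comparison; it assumes integer inputs, and the function names are illustrative.

```python
from fractions import Fraction

def variance_definitional(values):
    """Average of squared deviations from the mean (integer inputs assumed)."""
    n = len(values)
    mu = Fraction(sum(values), n)
    return sum((x - mu) ** 2 for x in values) / n

def variance_computational(values):
    """Mean of the squares minus the square of the mean (integer inputs assumed)."""
    n = len(values)
    return Fraction(sum(x * x for x in values), n) - Fraction(sum(values), n) ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance_definitional(data))   # 4
print(variance_computational(data))  # 4
```

For this dataset $\sum x_i^2 = 232$ and $\sum x_i = 40$, so the shortcut gives $\frac{232}{8} - 5^2 = 29 - 25 = 4$. In floating-point code, the shortcut form can lose precision when the mean is large relative to the spread, so it is best viewed as a hand-calculation aid.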

---

Common Mistakes

⚠️ Avoid These Errors

❌ Forgetting the Square Root: Confusing variance with standard deviation. Variance is in squared units, while standard deviation is in the original units.

✅ Always take the square root of the variance to find the standard deviation.

❌ Using N instead of N-1: Applying the population formula ( $N$ ) when a sample variance ( $n-1$ ) is required.

✅ Read the question carefully. If it implies the data is a sample drawn from a larger population, use the formula with Bessel's correction ( $n-1$ ).

❌ Ignoring Units: Comparing standard deviations of datasets with different units (e.g., comparing height in cm with weight in kg).

✅ Use the Coefficient of Variation (CV) for a dimensionless comparison of variability.

---

Practice Questions

:::question type="NAT" question="The scores of a student in 5 tests are 6, 7, 8, 9, 10. Calculate the population variance of these scores." answer="2" hint="First, find the mean of the scores. Then, use the formula for population variance: sum of squared differences from the mean, divided by the number of scores." solution="
Step 1: Calculate the mean ( $\mu$ ) of the scores.

\mu = \frac{6 + 7 + 8 + 9 + 10}{5} = \frac{40}{5} = 8

Step 2: Calculate the sum of the squared differences from the mean.

\sum (x_i - \mu)^2 = (6-8)^2 + (7-8)^2 + (8-8)^2 + (9-8)^2 + (10-8)^2

= (-2)^2 + (-1)^2 + (0)^2 + (1)^2 + (2)^2

= 4 + 1 + 0 + 1 + 4 = 10

Step 3: Calculate the population variance ( $\sigma^2$ ).

\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{10}{5} = 2

Result: The population variance is 2.
"
:::

:::question type="MCQ" question="A factory produces two types of microchips, A and B. For a batch of Chip A, the mean lifetime is 1000 hours with a standard deviation of 100 hours. For a batch of Chip B, the mean lifetime is 1200 hours with a standard deviation of 150 hours. Which of the following statements is correct regarding their relative variability?" options=["Chip A is more variable than Chip B","Chip B is more variable than Chip A","Both have the same variability","Variability cannot be compared"] answer="Chip B is more variable than Chip A" hint="To compare relative variability between datasets with different means, calculate the Coefficient of Variation (CV) for each." solution="
Step 1: Recall the formula for the Coefficient of Variation (CV).

CV = \frac{\sigma}{\mu}

Step 2: Calculate the CV for Chip A.

CV_A = \frac{100}{1000} = 0.10

Step 3: Calculate the CV for Chip B.

CV_B = \frac{150}{1200} = \frac{15}{120} = \frac{1}{8} = 0.125

Step 4: Compare the CV values.

CV_B (0.125) > CV_A (0.10)

Result: A higher CV indicates greater relative variability. Therefore, the lifetime of Chip B is more variable relative to its mean than that of Chip A.
"
:::

:::question type="MSQ" question="Which of the following statements about measures of dispersion are true?" options=["The standard deviation can be a negative value.","If all values in a dataset are identical, the variance is 0.","The range is highly sensitive to outliers.","The unit of variance is the same as the unit of the original data."] answer="If all values in a dataset are identical, the variance is 0.,The range is highly sensitive to outliers." hint="Analyze each statement based on the definitions of standard deviation, variance, and range." solution="

Option A: The standard deviation is the non-negative square root of the variance. Variance is an average of squared deviations, so it can never be negative. Thus, the standard deviation is always non-negative. This statement is false.

Option B: If all values are identical, say $x_i = c$ for all $i$ , then the mean $\mu = c$ . Every term $(x_i - \mu)^2$ becomes $(c-c)^2 = 0$ . The sum is 0, and therefore the variance is 0. This statement is true.

Option C: The range is calculated as $x_{max} - x_{min}$ . An extreme outlier would become either the new maximum or minimum, directly and significantly affecting the range. This statement is true.

Option D: The variance is the average of squared deviations. Therefore, its unit is the square of the original data's unit (e.g., if data is in meters, variance is in meters-squared). The standard deviation has the same unit as the original data. This statement is false.

Result: The correct statements are that variance is 0 for an identical dataset and the range is sensitive to outliers.
"
:::
---

Summary

❗ Key Takeaways for GATE

Dispersion quantifies spread: Measures of dispersion like range, variance, and standard deviation describe how spread out a dataset is.

Variance and Standard Deviation are key: Variance ( $\sigma^2$ ) is the average squared deviation from the mean. Standard Deviation ( $\sigma$ ) is its square root and is more interpretable as it is in the original units of the data.

Use Coefficient of Variation for relative comparison: To compare the variability of two datasets with different means or units, always use the Coefficient of Variation ( $CV = \sigma / \mu$ ). A higher CV implies greater relative variability.

---

What's Next?

💡 Continue Learning

Understanding dispersion is foundational for more advanced topics in statistics. This knowledge directly connects to:

Probability Distributions: Parameters of distributions like the Normal distribution are defined by a mean ( $\mu$ ) and a standard deviation ( $\sigma$ ). Dispersion is central to defining the shape and spread of these distributions.

Inferential Statistics: Concepts like confidence intervals and hypothesis testing rely heavily on the standard deviation and variance of sample data to make inferences about a population.

---

Chapter Summary

In this chapter, we have explored the fundamental principles of descriptive statistics, focusing on the methods used to summarize and describe the main features of a dataset. We began by examining measures of central tendency, which provide a single value to represent the center of a distribution. Following this, we investigated measures of dispersion, which quantify the extent to which the data points are spread out. A thorough understanding of these concepts is not merely foundational; it is essential for the quantitative analysis required in various engineering disciplines.

📖 Descriptive Statistics - Key Takeaways

Purpose of Descriptive Statistics: The primary goal is to summarize a collection of data in a clear and understandable way. We are not making inferences about a larger population, but rather describing the sample at hand.

Measures of Central Tendency: The mean ( $\mu$ or $\bar{x}$ ), median, and mode are the principal measures. It is imperative to understand their sensitivities: the mean is highly sensitive to outliers, whereas the median is robust and often preferred for skewed distributions.

Measures of Dispersion: Variance ( $\sigma^2$ ) and standard deviation ( $\sigma$ ) are the most critical measures of spread. Variance measures the average squared deviation from the mean, while the standard deviation, its square root, returns the measure of spread to the original units of the data, making it more interpretable.

Quartiles and the Interquartile Range (IQR): Quartiles divide a dataset into four equal parts. The Interquartile Range ( $IQR = Q_3 - Q_1$ ) describes the spread of the middle 50% of the data and, like the median, is a robust measure against outliers.

Effect of Linear Transformations: For a dataset $X$ , if a new dataset $Y$ is created by the transformation $Y = aX + b$ , the new mean is $\bar{y} = a\bar{x} + b$ , and the new standard deviation is $\sigma_y = |a|\sigma_x$ . The variance becomes $\sigma_y^2 = a^2\sigma_x^2$ . This property is frequently tested.

Population vs. Sample Formulas: We must distinguish between population parameters (e.g., $\mu, \sigma^2$ ) and sample statistics (e.g., $\bar{x}, s^2$ ). The formula for sample variance uses a denominator of $n-1$ to provide an unbiased estimate of the population variance, whereas the population variance formula uses $N$ .
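The linear-transformation property listed among the takeaways above can be verified numerically. The values of `a` and `b` below are arbitrary illustrative choices (a negative `a` is used to show why the absolute value appears), and the population formulas are assumed.

```python
from statistics import mean, pstdev, pvariance

x = [2, 4, 4, 4, 5, 5, 7, 9]
a, b = -3, 10                     # hypothetical transformation Y = aX + b
y = [a * xi + b for xi in x]

print(mean(y), a * mean(x) + b)              # -5 -5
print(pstdev(y), abs(a) * pstdev(x))         # 6.0 6.0
print(pvariance(y), a ** 2 * pvariance(x))   # 36 36
```

The additive constant `b` shifts the mean but leaves the spread unchanged, while the multiplier `a` scales the standard deviation by $|a|$ and the variance by $a^2$.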

---

Chapter Review Questions

:::question type="MCQ" question="A dataset consists of the following five values: {10, 20, 30, 40, 150}. Which of the following statements is the most accurate description of the effect of the outlier (150) on the measures of central tendency?" options=["The mean will be significantly greater than the median.","The median will be significantly greater than the mean.","The mode will be the most representative measure of central tendency.","The mean and median will be approximately equal."] answer="A" hint="Consider how the arithmetic mean is calculated versus how the median is determined. Which one is more affected by extreme values?" solution="Let us analyze the dataset $S = \{10, 20, 30, 40, 150\}$ .

Calculate the Median: The dataset is already sorted. The median is the middle value of the dataset. For a set of 5 items, the median is the 3rd item.

$$\text{Median} = 30$$

Calculate the Mean: The mean is the sum of the values divided by the count of the values.

$$\text{Mean} (\bar{x}) = \frac{10 + 20 + 30 + 40 + 150}{5} = \frac{250}{5} = 50$$

Compare the Measures: We observe that the mean (50) is significantly greater than the median (30). This is the classic effect of a large positive outlier (a 'right skew'). The outlier pulls the mean towards it, while the median, being position-based, does not depend on how extreme the largest value is. The mode is not defined here as no value repeats. Therefore, the mean is inflated by the outlier and is much larger than the median.

"
:::
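The comparison in this question can be reproduced in a few lines. This is a small sketch using Python's standard `statistics` module; the second list, with the outlier removed, is an added assumption used only to show how far each measure moves.

```python
import statistics

values = [10, 20, 30, 40, 150]
without_outlier = [10, 20, 30, 40]  # same data with the value 150 dropped

mean_with = statistics.mean(values)        # 250 / 5
median_with = statistics.median(values)    # middle (3rd) value

mean_without = statistics.mean(without_outlier)      # 100 / 4
median_without = statistics.median(without_outlier)  # average of 20 and 30

print("with outlier:   ", mean_with, median_with)
print("without outlier:", mean_without, median_without)
```

Dropping the single value 150 halves the mean (50 to 25) but moves the median by only 5 (30 to 25), which is what robustness to outliers means in practice.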

:::question type="NAT" question="The variance of a set of 20 data points is found to be 9. If each data point is first multiplied by 2 and then increased by 5, what is the new standard deviation of the resulting dataset?" answer="6" hint="Recall the effect of a linear transformation $Y = aX + b$ on the standard deviation. Does the additive constant 'b' affect the spread of the data?" solution="Let the original dataset be represented by the random variable $X$ , and the new dataset by $Y$ .

Given Information:

- The original variance is $\text{Var}(X) = \sigma_x^2 = 9$ .
- The original standard deviation is the square root of the variance: $\sigma_x = \sqrt{9} = 3$ .

Linear Transformation:

- Each data point is transformed according to the rule $Y = 2X + 5$ .
- This is a linear transformation of the form $Y = aX + b$ , with $a = 2$ and $b = 5$ .

Effect on Standard Deviation:

- The properties of linear transformations on standard deviation state that $\sigma_y = |a|\sigma_x$ .
- The additive constant $b$ shifts the entire dataset but does not change its spread, so it has no effect on the standard deviation or variance.

Calculation:

- Substitute the values of $a$ and $\sigma_x$ :

$$\sigma_y = |2| \times 3 = 2 \times 3 = 6$$

- The new standard deviation is 6.
"
:::
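The result can be verified on a concrete dataset. The question does not give the 20 values, so the sketch below assumes a hypothetical dataset (ten values of -3 and ten of +3) chosen only because its population variance is exactly 9; any other dataset with variance 9 would scale the same way.

```python
import statistics

# Hypothetical 20-point dataset with population variance 9:
# mean is 0 and every squared deviation equals 9.
x = [-3] * 10 + [3] * 10

a, b = 2, 5
y = [a * xi + b for xi in x]  # apply Y = aX + b to every point

var_x = statistics.pvariance(x)  # original variance
sd_x = statistics.pstdev(x)      # original standard deviation
sd_y = statistics.pstdev(y)      # standard deviation after the transformation
mean_y = statistics.mean(y)      # a * mean(x) + b

print(var_x, sd_x, sd_y, mean_y)

# Only the scale factor a affects the spread; b only shifts the mean.
assert abs(sd_y - abs(a) * sd_x) < 1e-12
```

Changing `b` to any other constant leaves `sd_y` unchanged, while changing `a` rescales it by `abs(a)`.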

:::question type="MCQ" question="For the dataset {3, 9, 11, 15, 18, 21, 25, 30}, what is the value of the Interquartile Range (IQR)?" options=["10", "12", "13", "16.5"] answer="C" hint="First, sort the data. Then, find the median of the lower half (Q1) and the median of the upper half (Q3)." solution="

Sort the Dataset: The data is already sorted.

$$S = \{3, 9, 11, 15, 18, 21, 25, 30\}$$

The number of data points is $n = 8$ .

Find the Quartiles:

- First Quartile ( $Q_1$ ): $Q_1$ is the median of the lower half of the data, $\{3, 9, 11, 15\}$ :

$$Q_1 = \frac{9 + 11}{2} = 10$$

- Third Quartile ( $Q_3$ ): $Q_3$ is the median of the upper half of the data, $\{18, 21, 25, 30\}$ :

$$Q_3 = \frac{21 + 25}{2} = \frac{46}{2} = 23$$

Calculate the IQR: The Interquartile Range is the difference between the third and first quartiles.

$$\text{IQR} = Q_3 - Q_1 = 23 - 10 = 13$$

"
:::
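The median-of-halves procedure used in this solution can be sketched in Python as follows. The helper name `quartiles_median_of_halves` is hypothetical, and the handling of odd-length data (excluding the overall median from both halves) is one common convention assumed here, not something this chapter specifies.

```python
import statistics

def quartiles_median_of_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: for an odd number of points, the overall median is
    excluded from both halves. Other conventions exist.
    """
    s = sorted(values)
    if len(s) < 2:
        raise ValueError("need at least two data points")
    half = len(s) // 2
    lower = s[:half]    # smallest `half` values
    upper = s[-half:]   # largest `half` values
    return statistics.median(lower), statistics.median(upper)

data = [3, 9, 11, 15, 18, 21, 25, 30]
q1, q3 = quartiles_median_of_halves(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 10.0 23.0 13.0
```

Library routines such as `statistics.quantiles` use interpolation-based conventions and may return different quartile values for the same data, so it is worth confirming which convention a question expects before comparing answers.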

---

What's Next?

💡 Continue Your GATE Journey

Having completed Descriptive Statistics, you have established a firm foundation for understanding and summarizing data. This is a critical prerequisite for the more advanced topics in Engineering Mathematics and subject-specific papers.

Key connections:

Probability Theory: This chapter is the direct precursor to Probability. Descriptive statistics deals with observed data (samples), while probability theory provides the mathematical framework for modeling the processes that generate that data (populations). The concepts of mean and variance are central to defining probability distributions, such as the Normal, Poisson, and Binomial distributions.

Inferential Statistics: The next logical step is to use sample statistics such as $\bar{x}$ and $s$ to make inferences about the entire population (e.g., estimating the population mean $\mu$ ). This branch of statistics, which includes hypothesis testing and confidence intervals, builds directly on the descriptive measures covered here.

Linear Algebra and Numerical Methods: While the connection may seem less direct, many advanced statistical techniques, such as regression analysis (which describes relationships between variables), rely heavily on the principles of linear algebra to be solved efficiently, especially with large datasets.
