100% FREE Updated: Mar 2026 Probability and Statistics Descriptive Statistics

Data Interpretation and Summary Statistics

Comprehensive study notes on Data Interpretation and Summary Statistics for CMI Data Science preparation. This chapter covers key concepts, formulas, and examples needed for your exam.

Overview

Welcome to 'Data Interpretation and Summary Statistics', a foundational chapter for your Masters in Data Science journey at CMI. In the world of data science, the ability to transform raw, often overwhelming datasets into clear, actionable insights is paramount. This chapter will equip you with the essential tools and techniques to condense vast amounts of information into meaningful summaries, providing the first critical step towards understanding any dataset.

Mastering summary statistics and data interpretation is not just a theoretical exercise; it's a vital skill frequently tested in CMI examinations. You'll encounter scenarios requiring you to quickly assess data characteristics, identify patterns, detect anomalies, and draw robust conclusions from both numerical summaries and various data visualizations. A strong grasp of these concepts forms the bedrock for more advanced statistical modeling and machine learning topics, directly impacting your ability to solve complex data science problems effectively and efficiently.

Chapter Contents

| # | Topic | What You'll Learn |
|---|-------|-------------------|
| 1 | Summary Statistics | Quantify data characteristics using key metrics. |
| 2 | Data Interpretation | Extract insights from numerical and visual data. |

Learning Objectives

By the End of This Chapter

After studying this chapter, you will be able to:

  • Define, calculate, and interpret common measures of central tendency and dispersion.

  • Select appropriate summary statistics and graphical representations based on data type and distribution.

  • Critically interpret various data visualizations to identify trends, patterns, and outliers.

  • Formulate valid conclusions and communicate insights effectively from summarized and interpreted data.

Now let's begin with Summary Statistics...

Part 1: Summary Statistics

Introduction

Summary statistics are fundamental tools in data science, providing concise numerical and graphical descriptions of the main features of a dataset. They allow us to distill large volumes of data into understandable insights, revealing patterns, central tendencies, and variations. For the CMI exam, a strong grasp of summary statistics is crucial for interpreting data, making informed decisions, and understanding the foundational concepts of more advanced statistical analysis. This unit covers the key measures of central tendency, dispersion, and position, along with their calculation from various data types and their behavior under data modifications, which are frequently tested.
📖 Summary Statistics

Numerical or graphical values that condense the characteristics of a dataset, such as its central point, spread, and shape, into a few key figures. Examples include the mean, median, variance, and percentiles.

---

Key Concepts

1. Measures of Central Tendency

Measures of central tendency aim to find a single value that represents the center or typical value of a dataset.

1.1 Arithmetic Mean

The arithmetic mean, often simply called the mean, is the sum of all values divided by the number of values. It is the most common measure of central tendency.

📐 Arithmetic Mean

For a dataset $x_1, x_2, \dots, x_n$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

For grouped data with frequencies $f_i$ for values $x_i$:

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$$

Variables:

    • $\bar{x}$ = sample mean

    • $n$ = number of data points

    • $x_i$ = individual data point

    • $k$ = number of distinct values or classes

    • $f_i$ = frequency of $x_i$


Application: Use when the data is roughly symmetric and free of extreme values. The mean is sensitive to outliers.

Worked Example: Mean for Grouped Data

Problem: A survey recorded the number of online courses completed by students in a month.
| Courses Completed | Number of Students |
|-------------------|--------------------|
| 0 | 5 |
| 1 | 12 |
| 2 | 18 |
| 3 | 10 |
| 4 | 5 |
Calculate the mean number of courses completed.

Solution:

Step 1: Identify values ($x_i$) and frequencies ($f_i$) and calculate $x_i f_i$.

| $x_i$ | $f_i$ | $x_i f_i$ |
|-----|-----|---------|
| 0 | 5 | 0 |
| 1 | 12 | 12 |
| 2 | 18 | 36 |
| 3 | 10 | 30 |
| 4 | 5 | 20 |

Step 2: Sum $x_i f_i$ and $f_i$.

$$\sum x_i f_i = 0 + 12 + 36 + 30 + 20 = 98$$

$$\sum f_i = 5 + 12 + 18 + 10 + 5 = 50$$

Step 3: Apply the mean formula for grouped data.

$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{98}{50}$$

Step 4: Simplify.

$$\bar{x} = 1.96$$

Answer: $1.96$ courses
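
The grouped-data calculation above can be checked with a few lines of Python. This is a minimal sketch; the dictionary and the function name `grouped_mean` are illustrative choices, not part of the original notes.

```python
# Grouped-data mean: sum(x_i * f_i) / sum(f_i).
# Keys are the values x_i (courses completed), values are the frequencies f_i.
courses_to_students = {0: 5, 1: 12, 2: 18, 3: 10, 4: 5}


def grouped_mean(freq_table):
    """Return the mean of a frequency table {value: frequency}."""
    total_weighted = sum(x * f for x, f in freq_table.items())  # sum of x_i * f_i
    total_count = sum(freq_table.values())                      # sum of f_i
    return total_weighted / total_count


mean_courses = grouped_mean(courses_to_students)
print(round(mean_courses, 2))  # 1.96
```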

---

1.2 Median

The median is the middle value of a dataset when it is ordered from least to greatest. It is less affected by outliers than the mean.

📖 Median

The middle value in an ordered dataset. If $n$ is odd, it is the $\left(\frac{n+1}{2}\right)^{th}$ value. If $n$ is even, it is the average of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2}+1\right)^{th}$ values.
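
As a quick illustration of the odd/even rule and of the median's robustness, here is a small Python sketch. The function name and the sample values are made up for this example.

```python
import statistics


def median_by_position(values):
    """Median via the positional rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)-th value, 1-based
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of (n/2)-th and (n/2+1)-th


odd_sample = [7, 3, 9, 1, 5]
even_sample = [7, 3, 9, 1, 5, 100]  # same data plus one extreme value

median_odd = median_by_position(odd_sample)
median_even = median_by_position(even_sample)
mean_even = sum(even_sample) / len(even_sample)

# The extreme value 100 moves the mean far more than the median.
print(median_odd, median_even, round(mean_even, 2))
```

The standard-library function `statistics.median` follows the same rule and can be used to cross-check hand calculations.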

---

1.3 Mode

The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), multiple modes (multimodal), or no mode if all values appear with the same frequency.

📖 Mode

The value(s) that occur with the highest frequency in a dataset.
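
The three cases (one mode, several modes, no mode) can be sketched in Python as follows. The helper `modes` is hypothetical and follows the convention stated above, returning an empty list when every value is equally frequent. Note that the standard library's `statistics.multimode` uses a different convention and returns all values in that case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if all values are equally frequent.

    Assumes a non-empty input.
    """
    counts = Counter(values)
    highest = max(counts.values())
    # "No mode" convention from the text: every distinct value has the same frequency.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


unimodal = modes([1, 2, 2, 3])       # one mode
bimodal = modes([1, 2, 2, 3, 3])     # two modes
no_mode = modes([1, 2, 3])           # all frequencies equal
print(unimodal, bimodal, no_mode)
```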

---

2. Measures of Dispersion

Measures of dispersion quantify the spread or variability of data points around the central tendency.

2.1 Range

The range is the difference between the maximum and minimum values in a dataset. It is simple to compute but highly sensitive to outliers, since it depends only on the two extreme values.

📖 Range
$$\text{Range} = X_{max} - X_{min}$$

where $X_{max}$ is the maximum value and $X_{min}$ is the minimum value in the dataset.

---

2.2 Variance and Standard Deviation

Variance measures the average of the squared differences from the mean, providing a measure of how much data points deviate from the mean. The standard deviation is the square root of the variance, expressed in the same units as the data, making it more interpretable.

📐 Sample Variance

For a sample $x_1, x_2, \dots, x_n$ with mean $\bar{x}$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Alternative Formula for Calculation:

$$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \right)$$

Variables:

    • $s^2$ = sample variance

    • $n$ = number of data points

    • $x_i$ = individual data point

    • $\bar{x}$ = sample mean


Application: Widely used to quantify the spread of data. The $n-1$ in the denominator provides an unbiased estimate of the population variance.

📐 Sample Standard Deviation

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Variables:

    • $s$ = sample standard deviation


Application: Provides a measure of spread in the original units of the data, making it easier to interpret than variance.
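
The definitional formula and the alternative (computational) formula give the same result. The sketch below checks this on a small made-up sample; the function names are illustrative.

```python
import math
import statistics


def sample_variance(values):
    """Definitional form: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Computational form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total ** 2 / n) / (n - 1)


data = [4, 8, 6, 5, 3, 7]  # illustrative sample, not from the notes

var_definition = sample_variance(data)
var_shortcut = sample_variance_shortcut(data)
std_dev = math.sqrt(var_definition)

print(var_definition, var_shortcut, round(std_dev, 4))
```

The standard library offers `statistics.variance` and `statistics.stdev` (both with the $n-1$ denominator) and `statistics.pvariance` for the population version with denominator $n$.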

---

3. Measures of Position

Measures of position indicate the relative standing of a data value within the dataset.

3.1 Percentiles

Percentiles divide a dataset into 100 equal parts. The $j^{th}$ percentile ($P_j$) is the value below which $j$ percent of the data falls.

📖 Percentile ($u^*$)

For an ordered discrete dataset $x_{(1)}, x_{(2)}, \dots, x_{(n)}$:

  • Calculate $t = \frac{j \cdot n}{100}$.

  • Let $k$ be an integer such that $k \le t < k+1$.

  • Let $s = t - k$.

  • Then

$$u^* = x_{(k)} + s \cdot (x_{(k+1)} - x_{(k)})$$

Note: If $k=n$, $x_{(n+1)}$ is defined as $x_{(n)}$ to handle edge cases.

Must Remember

The median is the $50^{th}$ percentile ($P_{50}$). Quartiles are specific percentiles:

    • $Q_1 = P_{25}$ (First Quartile)

    • $Q_2 = P_{50}$ (Second Quartile, Median)

    • $Q_3 = P_{75}$ (Third Quartile)

Worked Example: Percentile Calculation

Problem: Consider the following ordered dataset of student scores: $40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90$. Calculate the $30^{th}$ percentile using the given formula.

Solution:

Step 1: Identify $n$ and $j$.
$n = 11$ (number of data points)
$j = 30$ (for the $30^{th}$ percentile)

Step 2: Calculate $t$.

$$t = \frac{j \cdot n}{100} = \frac{30 \cdot 11}{100} = \frac{330}{100} = 3.3$$

Step 3: Determine $k$ and $s$.
Since $k \le t < k+1$ and $t = 3.3$, we have $k = 3$.

$$s = t - k = 3.3 - 3 = 0.3$$

Step 4: Identify $x_{(k)}$ and $x_{(k+1)}$.
$x_{(k)} = x_{(3)} = 50$ (the $3^{rd}$ value in the ordered dataset)
$x_{(k+1)} = x_{(4)} = 55$ (the $4^{th}$ value in the ordered dataset)

Step 5: Apply the percentile formula.

$$u^* = x_{(k)} + s \cdot (x_{(k+1)} - x_{(k)})$$

$$u^* = 50 + 0.3 \cdot (55 - 50)$$

$$u^* = 50 + 0.3 \cdot 5$$

$$u^* = 50 + 1.5$$

$$u^* = 51.5$$

Answer: The $30^{th}$ percentile is $51.5$.
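
The interpolation steps above translate directly into code. The following is a sketch of this chapter's percentile definition only; other textbooks and libraries use different conventions and may return different values. The function name is hypothetical, and raising an error when $t < 1$ is a choice made for this sketch, since the definition does not say what $x_{(0)}$ should be.

```python
import math


def percentile_interpolated(values, j):
    """j-th percentile with t = j*n/100, k = floor(t), s = t - k."""
    ordered = sorted(values)
    n = len(ordered)
    t = j * n / 100
    k = math.floor(t)
    s = t - k
    if k < 1:
        # Not covered by the definition in the notes; this sketch simply refuses.
        raise ValueError("t < 1: x_(0) is undefined in this formula")
    lower = ordered[k - 1]              # x_(k), converting 1-based to 0-based index
    upper = ordered[min(k + 1, n) - 1]  # x_(k+1), with x_(n+1) treated as x_(n)
    return lower + s * (upper - lower)


scores = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
p30 = percentile_interpolated(scores, 30)
print(round(p30, 2))  # 51.5
```

When $t$ is a whole number, $s = 0$ and the formula returns $x_{(k)}$ itself with no interpolation.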

---

4. Impact of Data Modifications

Understanding how summary statistics change when data points are added, removed, or modified is critical.

Effect of Adding/Removing a Data Point on Mean and Variance

When a data point is added or removed, the mean and variance of the dataset will change.

    • Mean: Removing a value $x_{removed}$ from a dataset of size $n$ with mean $\bar{x}_{old}$ gives the new mean:

    $$\bar{x}_{new} = \frac{n \cdot \bar{x}_{old} - x_{removed}}{n-1}$$

    If $x_{removed} > \bar{x}_{old}$, the new mean will be lower. If $x_{removed} < \bar{x}_{old}$, the new mean will be higher.
    • Variance: The change in variance is more complex. The sum of squared deviations $\sum (x_i - \bar{x})^2$ changes, and the denominator ($n-1$) also changes.

- If the removed value $x_{removed}$ is an outlier (far from the mean), its removal will likely decrease the variance.
- If the removed value $x_{removed}$ is close to the mean, its removal tends to increase the sample variance: the sum of squared deviations barely changes while the denominator shrinks.
- For example, $x_{removed} = -0.9$ with $\bar{x} = 1.1$ lies well below the mean. Removing it pulls the mean upwards and, if it was an extreme low value, likely decreases the overall spread.
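
The direction of the variance change can be made precise. The short derivation below is a sketch; the symbols $d$ and $S$ are introduced here for illustration only, and it assumes $n \ge 3$ so that the new sample variance is defined.

$$
d = x_{removed} - \bar{x}_{old}, \qquad S = \sum_{i=1}^{n} (x_i - \bar{x}_{old})^2 = (n-1)\, s^2_{old}
$$

$$
S_{new} = S - \frac{n}{n-1}\, d^2, \qquad s^2_{new} = \frac{S_{new}}{n-2}
$$

$$
s^2_{new} < s^2_{old} \iff d^2 > \frac{S}{n} = \frac{n-1}{n}\, s^2_{old}
$$

So removing a point lowers the sample variance exactly when its squared distance from the old mean exceeds $S/n$. For instance, with $n = 10$, $S = 250$ and $d = -10$, we have $d^2 = 100 > 25$, so the variance falls from $\frac{250}{9} \approx 27.78$ to $\frac{1}{8}\left(250 - \frac{10}{9} \cdot 100\right) = \frac{1250}{72} \approx 17.36$.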

---

5. Rates and Time-Series Statistics

These concepts are essential for analyzing changes over time and making predictions.

5.1 Percentage Change

Percentage change quantifies the relative change between an old value and a new value.

📐 Percentage Change
$$\text{Percentage Change} = \frac{\text{New Value} - \text{Old Value}}{\text{Old Value}} \times 100\%$$

Variables:

    • New Value = Value after change

    • Old Value = Value before change


Application: Used to express relative increase or decrease. A negative result indicates a decrease.

Worked Example: Overall Percentage Decrease

Problem: Company A's revenue decreased from 500 million USD to 400 million USD. Company B's revenue decreased from 300 million USD to 200 million USD. Calculate the overall percentage decrease in revenue across both companies.

Solution:

Step 1: Calculate the total old revenue.

$$\text{Total Old Revenue} = 500 + 300 = 800 \text{ million USD}$$

Step 2: Calculate the total new revenue.

$$\text{Total New Revenue} = 400 + 200 = 600 \text{ million USD}$$

Step 3: Apply the percentage change formula.

$$\text{Overall Percentage Change} = \frac{\text{Total New Revenue} - \text{Total Old Revenue}}{\text{Total Old Revenue}} \times 100\%$$

$$\text{Overall Percentage Change} = \frac{600 - 800}{800} \times 100\%$$

$$\text{Overall Percentage Change} = \frac{-200}{800} \times 100\%$$

$$\text{Overall Percentage Change} = -0.25 \times 100\%$$

$$\text{Overall Percentage Change} = -25\%$$

The negative sign indicates a decrease.

Answer: $\boxed{25\% \text{ decrease}}$
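
The same calculation as a minimal Python sketch. The function and variable names are illustrative; the figures are those of the worked example, in million USD.

```python
def percentage_change(old_value, new_value):
    """Relative change from old_value to new_value, in percent (negative means a decrease)."""
    return (new_value - old_value) / old_value * 100


old_revenue = {"A": 500, "B": 300}
new_revenue = {"A": 400, "B": 200}

# Overall change: apply the formula to the totals, not to each company separately.
overall_change = percentage_change(sum(old_revenue.values()), sum(new_revenue.values()))
print(overall_change)  # -25.0
```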

---

5.2 Growth Rate

The annual growth rate measures the percentage increase of a specific variable over a year.

📐 Annual Growth Rate
$$\text{Annual Growth Rate} = \frac{\text{Current Year Value} - \text{Previous Year Value}}{\text{Previous Year Value}} \times 100\%$$

Variables:

    • Current Year Value = Value in the current year

    • Previous Year Value = Value in the previous year


Application: Used in time series analysis to track the rate of change of a variable.

---

5.3 Moving Averages

A moving average is a series of averages of different subsets of the full data set. A 3-year moving average, for example, averages data points over three consecutive years, then shifts one year forward and repeats. It helps smooth out short-term fluctuations and highlight longer-term trends.

📖 Moving Average

An average of a subset of data points over a specified period (e.g., 3-year, 5-year). It is calculated by taking the average of the data points for the first kk periods, then moving the window one period forward and calculating the average for the next kk periods, and so on.

Worked Example: 3-Year Moving Average of Growth Rate

Problem: Given the annual values: Year 1: 100, Year 2: 110, Year 3: 120, Year 4: 130, Year 5: 140.
Calculate the 3-year moving average of the annual growth rates.

Solution:

Step 1: Calculate annual growth rates.
Year 2 Growth Rate:

$$\frac{110-100}{100} \times 100\% = 10\%$$

Year 3 Growth Rate:

$$\frac{120-110}{110} \times 100\% \approx 9.09\%$$

Year 4 Growth Rate:

$$\frac{130-120}{120} \times 100\% \approx 8.33\%$$

Year 5 Growth Rate:

$$\frac{140-130}{130} \times 100\% \approx 7.69\%$$

Step 2: Calculate the 3-year moving averages of these growth rates.
The first 3-year window for growth rates covers Years 2, 3, and 4.
Moving Average 1 (for Years 2-4):

$$\frac{10\% + 9.09\% + 8.33\%}{3} = \frac{27.42\%}{3} \approx 9.14\%$$

The second 3-year window for growth rates covers Years 3, 4, and 5.
Moving Average 2 (for Years 3-5):

$$\frac{9.09\% + 8.33\% + 7.69\%}{3} = \frac{25.11\%}{3} \approx 8.37\%$$

Answer: The 3-year moving averages of annual growth rates are approximately $9.14\%$ and $8.37\%$.
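
Both steps of this example (growth rates, then a sliding window over them) can be sketched in Python. The helper names `growth_rates` and `moving_average` are illustrative, not taken from any library.

```python
def growth_rates(values):
    """Period-over-period growth in percent; the result has one fewer entry than values."""
    return [(curr - prev) / prev * 100 for prev, curr in zip(values, values[1:])]


def moving_average(series, window):
    """Averages of every run of `window` consecutive entries, shifting by one each time."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


annual_values = [100, 110, 120, 130, 140]
rates = growth_rates(annual_values)   # growth for Years 2..5
smoothed = moving_average(rates, 3)   # windows: Years 2-4 and Years 3-5

print([round(r, 2) for r in rates])     # [10.0, 9.09, 8.33, 7.69]
print([round(m, 2) for m in smoothed])  # [9.14, 8.37]
```

Five annual values give four growth rates, and four growth rates give only two 3-year windows, which is why the answer has two moving averages.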

---

Problem-Solving Strategies

💡 CMI Strategy
    • Read Carefully for Definitions: CMI questions sometimes provide specific definitions (e.g., for percentiles). Always use the definition provided in the question.
    • Organize Data: For complex calculations involving multiple categories or time points (like percentage change across companies, or moving averages), create tables to organize the data and intermediate calculations.
    • Check Units: Ensure consistency in units, especially for financial or physical measurements.
    • Understand Impact of Outliers: Remember that the mean is sensitive to outliers, while the median is robust. This can be crucial when comparing mean and median or analyzing data modifications.
    • Step-by-Step Derivations: For questions involving changes to mean/variance, write out the formulas for $\sum x_i$ and $\sum x_i^2$ for the original dataset, then adjust them for the new dataset before recalculating.
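
The last strategy, carrying $n$, $\sum x_i$ and $\sum x_i^2$ and adjusting them, can be sketched as follows. The helper names and the small dataset are assumptions made for this illustration.

```python
import statistics


def variance_from_sums(n, total, total_sq):
    """Sample variance from n, sum of x, and sum of x^2 (computational formula)."""
    return (total_sq - total ** 2 / n) / (n - 1)


def remove_point(n, total, total_sq, x):
    """Updated (n, sum, sum of squares) after removing the value x."""
    return n - 1, total - x, total_sq - x * x


def add_point(n, total, total_sq, x):
    """Updated (n, sum, sum of squares) after adding the value x."""
    return n + 1, total + x, total_sq + x * x


# Illustrative data: [2, 4, 6, 8, 10] has n = 5, sum = 30, sum of squares = 220.
n, total, total_sq = 5, 30, 220
variance_before = variance_from_sums(n, total, total_sq)

# Remove the largest value (10) by adjusting the sums, without revisiting the data.
n2, total2, total_sq2 = remove_point(n, total, total_sq, 10)
mean_after = total2 / n2
variance_after = variance_from_sums(n2, total2, total_sq2)

print(variance_before, mean_after, round(variance_after, 4))
```

This mirrors how such exam questions are usually posed: only the summary values are given, so the individual data points are never needed.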

---

Common Mistakes

⚠️ Avoid These Errors
    • Confusing Sample vs. Population Variance: Using $n$ instead of $n-1$ in the denominator for sample variance.
✅ Always use $n-1$ for sample variance unless explicitly asked for the population variance.
    • Incorrect Percentile Calculation: Not ordering the data first, or misapplying the interpolation formula.
✅ Always sort data in ascending order. Carefully follow the given percentile formula step-by-step, especially the $k$ and $s$ calculations.
    • Simple Average for Percentage Change: Averaging individual percentage changes instead of calculating overall change from total initial and total final values.
✅ For overall percentage change across multiple entities, sum the initial values and sum the final values, then apply the percentage change formula to these totals.
    • Misinterpreting Mean and Median Relationship: Assuming mean > median always means positive skew. While generally true, small datasets or specific distributions can behave differently.
✅ Understand that for positively skewed data, typically Mean > Median > Mode, and for negatively skewed data, typically Mean < Median < Mode.
    • Ignoring the effect of removed points on variance: Assuming that removing any data point always decreases variance.
✅ Removing a value far from the mean generally decreases variance. Removing a value close to the mean tends to increase the sample variance, because the sum of squared deviations barely changes while the denominator shrinks. Always consider how the sum of squared deviations and the denominator change.
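
The percentage-change pitfall in this list is easy to demonstrate numerically. The sketch below uses the two-company figures from the earlier worked example (500 to 400 and 300 to 200, in million USD); the variable names are illustrative.

```python
def percentage_change(old_value, new_value):
    """Relative change in percent; negative means a decrease."""
    return (new_value - old_value) / old_value * 100


old_values = [500, 300]
new_values = [400, 200]

individual = [percentage_change(o, n) for o, n in zip(old_values, new_values)]

# Pitfall: a plain average of the individual changes ignores company size.
simple_average = sum(individual) / len(individual)

# Correct: apply the formula to the totals.
overall = percentage_change(sum(old_values), sum(new_values))

# Weighting each change by its old value recovers the overall figure.
weighted_average = sum(o * c for o, c in zip(old_values, individual)) / sum(old_values)

print(round(simple_average, 2), round(overall, 2), round(weighted_average, 2))
```

The simple average comes out near $-26.67\%$, while the overall change is $-25\%$. The two agree only when the entities start from equal values or change by the same percentage.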

---

Practice Questions

:::question type="NAT" question="A dataset contains $n=10$ observations. The sum of the observations is $\sum x_i = 150$, and the sum of their squares is $\sum x_i^2 = 2500$. If an observation $x_j = 5$ is removed from the dataset, what is the new sample variance of the remaining $n-1$ observations? (Round to two decimal places)" answer="17.36" hint="First calculate the original mean and variance. Then adjust the sum of observations and sum of squares for the removed point. Finally, calculate the new variance." solution="Step 1: Record the original sums of $x_i$ and $x_i^2$.
Given: $\sum x_i = 150$, $\sum x_i^2 = 2500$, $n = 10$.

Step 2: Remove the observation $x_j = 5$.
New sum of observations: $\sum x_i' = 150 - 5 = 145$.
New sum of squares: $\sum x_i^{2'} = 2500 - 5^2 = 2500 - 25 = 2475$.
New number of observations: $n' = 10 - 1 = 9$.

Step 3: Calculate the new sample mean $\bar{x}'$.

$$\bar{x}' = \frac{\sum x_i'}{n'} = \frac{145}{9} \approx 16.111$$

Step 4: Calculate the new sample variance $s^{2'}$ using the computational formula.

$$s^{2'} = \frac{1}{n'-1} \left( \sum x_i^{2'} - \frac{(\sum x_i')^2}{n'} \right)$$

$$s^{2'} = \frac{1}{9-1} \left( 2475 - \frac{(145)^2}{9} \right)$$

$$s^{2'} = \frac{1}{8} \left( 2475 - \frac{21025}{9} \right)$$

$$s^{2'} = \frac{1}{8} \left( \frac{2475 \cdot 9 - 21025}{9} \right)$$

$$s^{2'} = \frac{1}{8} \left( \frac{22275 - 21025}{9} \right)$$

$$s^{2'} = \frac{1}{8} \left( \frac{1250}{9} \right)$$

$$s^{2'} = \frac{1250}{72} = \frac{625}{36} \approx 17.3611$$

Rounding to two decimal places, the new sample variance is $17.36$.
Answer: $\boxed{17.36}$
"
:::

:::question type="MCQ" question="The following data represents the number of daily active users (in thousands) for a new social media platform over 10 days, sorted in ascending order: $10, 12, 15, 18, 20, 22, 25, 28, 30, 35$. Using the percentile formula $u^* = x_{(k)} + s \cdot (x_{(k+1)} - x_{(k)})$ where $t = \frac{j \cdot n}{100}$, $k \le t < k+1$, and $s = t - k$, what is the $70^{th}$ percentile?" options=["$26.5$ thousand users","$28$ thousand users","$27.5$ thousand users","$25$ thousand users"] answer="$25$ thousand users" hint="First calculate $t$, then identify $k$ and $s$, and finally apply the given percentile formula." solution="Step 1: Identify $n$ and $j$.
$n = 10$ (number of data points)
$j = 70$ (for the $70^{th}$ percentile)

Step 2: Calculate $t$.

$$t = \frac{j \cdot n}{100} = \frac{70 \cdot 10}{100} = \frac{700}{100} = 7$$

Step 3: Determine $k$ and $s$.
The formula states $k \le t < k+1$. Since $t=7$, we have $k=7$.

$$s = t - k = 7 - 7 = 0$$

Step 4: Identify $x_{(k)}$ and $x_{(k+1)}$.
The ordered dataset is: $10, 12, 15, 18, 20, 22, 25, 28, 30, 35$.
$x_{(k)} = x_{(7)} = 25$ (the $7^{th}$ value in the ordered dataset)
$x_{(k+1)} = x_{(8)} = 28$ (the $8^{th}$ value in the ordered dataset)

Step 5: Apply the percentile formula.

$$u^* = x_{(k)} + s \cdot (x_{(k+1)} - x_{(k)})$$

$$u^* = 25 + 0 \cdot (28 - 25)$$

$$u^* = 25 + 0$$

$$u^* = 25$$

Following the given formula strictly, the $70^{th}$ percentile is $25$ thousand users.
Answer: $\boxed{25 \text{ thousand users}}$
"
:::

:::question type="MSQ" question="A company's quarterly profits (in million USD) for the past 5 quarters are: Q1: 10, Q2: 12, Q3: 15, Q4: 11, Q5: 18. Which of the following statements are TRUE regarding the 3-quarter moving average of these profits and the impact of an error?" options=["The 3-quarter moving average for Q1-Q3 is $12.33$ million USD.","If Q5 was mistakenly recorded as $8$ instead of $18$, the median profit would decrease.","The 3-quarter moving average for Q3-Q5 is $14.67$ million USD.","If Q1 was mistakenly recorded as $20$ instead of $10$, the mean profit would increase by $2$ million USD."] answer="A,B,C,D" hint="Calculate moving averages and consider the impact of data changes on mean and median." solution="Let the profits be $P = [10, 12, 15, 11, 18]$.

Option A: The 3-quarter moving average for Q1-Q3 is $\frac{10+12+15}{3} = \frac{37}{3} \approx 12.33$ million USD.
This statement is TRUE.

Option B: If Q5 was mistakenly recorded as $8$ instead of $18$.
Original profits (ordered): $10, 11, 12, 15, 18$. Median = $12$.
New profits with Q5 = 8: $10, 12, 15, 11, 8$.
Ordered new profits: $8, 10, 11, 12, 15$. New median = $11$.
Since $11 < 12$, the median profit would decrease.
This statement is TRUE.

Option C: The 3-quarter moving average for Q3-Q5 is $\frac{15+11+18}{3} = \frac{44}{3} \approx 14.67$ million USD.
This statement is TRUE.

Option D: If Q1 was mistakenly recorded as $20$ instead of $10$.
Original mean: $\frac{10+12+15+11+18}{5} = \frac{66}{5} = 13.2$ million USD.
New Q1: $20$. Other values unchanged.
New mean: $\frac{20+12+15+11+18}{5} = \frac{76}{5} = 15.2$ million USD.
Increase in mean profit = $15.2 - 13.2 = 2$ million USD.
This statement is TRUE.

All options are correct."
:::

:::question type="SUB" question="A retail chain has two stores, Store X and Store Y.
Store X's monthly sales decreased from $120$ thousand USD to $90$ thousand USD.
Store Y's monthly sales decreased from $180$ thousand USD to $135$ thousand USD.
Calculate the overall percentage decrease in sales across both stores combined for the month." answer="25%" hint="First find the total original sales and total new sales for both stores combined. Then apply the percentage change formula." solution="Step 1: Calculate total original sales for both stores.

$$\text{Total Original Sales} = \text{Store X Original Sales} + \text{Store Y Original Sales}$$

$$\text{Total Original Sales} = 120{,}000 + 180{,}000 = 300{,}000 \text{ USD}$$

Step 2: Calculate total new sales for both stores.

$$\text{Total New Sales} = \text{Store X New Sales} + \text{Store Y New Sales}$$

$$\text{Total New Sales} = 90{,}000 + 135{,}000 = 225{,}000 \text{ USD}$$

Step 3: Apply the percentage change formula.

$$\text{Overall Percentage Change} = \frac{\text{Total New Sales} - \text{Total Original Sales}}{\text{Total Original Sales}} \times 100\%$$

$$\text{Overall Percentage Change} = \frac{225{,}000 - 300{,}000}{300{,}000} \times 100\%$$

$$\text{Overall Percentage Change} = \frac{-75{,}000}{300{,}000} \times 100\%$$

$$\text{Overall Percentage Change} = -0.25 \times 100\%$$

$$\text{Overall Percentage Change} = -25\%$$

The negative sign indicates a decrease, so the overall percentage decrease is $25\%$.
Answer: $\boxed{25\%}$
"
:::

:::question type="MCQ" question="A dataset of 8 values has a mean of $15$ and a variance of $20$. If a new data point with value $25$ is added to the dataset, what can be concluded about the new mean ($\bar{x}_{new}$) and new variance ($s^2_{new}$)? (Assume the sample variance formula with denominator $n-1$.)" options=["$\bar{x}_{new} < 15$ and $s^2_{new} < 20$","$\bar{x}_{new} > 15$ and $s^2_{new} < 20$","$\bar{x}_{new} > 15$ and $s^2_{new} > 20$","$\bar{x}_{new} < 15$ and $s^2_{new} > 20$"] answer="$\bar{x}_{new} > 15$ and $s^2_{new} > 20$" hint="Calculate the original $\sum x_i$ and $\sum x_i^2$. Then update these sums with the new data point and recalculate the mean and variance." solution="Step 1: Calculate original sum of observations and sum of squares.
Original $n=8$, $\bar{x}_{old}=15$, $s^2_{old}=20$.
Original sum of observations: $\sum x_i = n \cdot \bar{x}_{old} = 8 \cdot 15 = 120$.
Using the computational formula for variance: $s^2 = \frac{1}{n-1} \left( \sum x_i^2 - n \bar{x}^2 \right)$.
Rearranging for $\sum x_i^2$:

$$\sum x_i^2 = (n-1)s^2 + n\bar{x}^2$$

$$\sum x_i^2 = (8-1) \cdot 20 + 8 \cdot 15^2$$

$$\sum x_i^2 = 7 \cdot 20 + 8 \cdot 225$$

$$\sum x_i^2 = 140 + 1800 = 1940$$

Step 2: Add the new data point $x_{new}=25$.
New $n_{new} = 8+1 = 9$.
New sum of observations: $\sum x_{i,new} = 120 + 25 = 145$.
New sum of squares: $\sum x_{i,new}^2 = 1940 + 25^2 = 1940 + 625 = 2565$.

Step 3: Calculate the new mean.

$$\bar{x}_{new} = \frac{\sum x_{i,new}}{n_{new}} = \frac{145}{9} \approx 16.11$$

Since $16.11 > 15$, the new mean is greater than the old mean.

Step 4: Calculate the new variance.

$$s^2_{new} = \frac{1}{n_{new}-1} \left( \sum x_{i,new}^2 - \frac{(\sum x_{i,new})^2}{n_{new}} \right)$$

$$s^2_{new} = \frac{1}{9-1} \left( 2565 - \frac{(145)^2}{9} \right)$$

$$s^2_{new} = \frac{1}{8} \left( 2565 - \frac{21025}{9} \right)$$

$$s^2_{new} = \frac{1}{8} \left( \frac{2565 \cdot 9 - 21025}{9} \right)$$

$$s^2_{new} = \frac{1}{8} \left( \frac{23085 - 21025}{9} \right)$$

$$s^2_{new} = \frac{1}{8} \left( \frac{2060}{9} \right)$$

$$s^2_{new} = \frac{2060}{72} \approx 28.61$$

Since $28.61 > 20$, the new variance is greater than the old variance.

Therefore, $\bar{x}_{new} > 15$ and $s^2_{new} > 20$.
Answer: $\boxed{\bar{x}_{new} > 15 \text{ and } s^2_{new} > 20}$
"
:::

:::question type="NAT" question="A company's annual revenue (in million USD) for 5 years is: Y1: 50, Y2: 55, Y3: 60, Y4: 66, Y5: 72. Calculate the average of all available 3-year moving averages of the annual growth rate (as a percentage, rounded to two decimal places)." answer="9.55" hint="First calculate the annual growth rate for each year from Y2 to Y5. Then calculate the 3-year moving averages of these growth rates. Finally, average those moving averages." solution="Step 1: Calculate annual growth rates.
Growth Rate (Y2): $\frac{55-50}{50} \times 100\% = \frac{5}{50} \times 100\% = 10\%$
Growth Rate (Y3): $\frac{60-55}{55} \times 100\% = \frac{5}{55} \times 100\% \approx 9.0909\%$
Growth Rate (Y4): $\frac{66-60}{60} \times 100\% = \frac{6}{60} \times 100\% = 10\%$
Growth Rate (Y5): $\frac{72-66}{66} \times 100\% = \frac{6}{66} \times 100\% \approx 9.0909\%$

Step 2: Calculate 3-year moving averages of growth rates.
The growth rates are for Y2, Y3, Y4, Y5.
Moving Average 1 (Y2-Y4): $\frac{10\% + 9.0909\% + 10\%}{3} = \frac{29.0909\%}{3} \approx 9.697\%$
Moving Average 2 (Y3-Y5): $\frac{9.0909\% + 10\% + 9.0909\%}{3} = \frac{28.1818\%}{3} \approx 9.394\%$

Step 3: Calculate the average of all available 3-year moving averages.

$$\text{Average of Moving Averages} = \frac{9.697\% + 9.394\%}{2}$$

$$\text{Average of Moving Averages} = \frac{19.091\%}{2} \approx 9.5455\%$$

Using fractions for precision:
Growth Rate (Y2): $\frac{5}{50} = \frac{1}{10}$
Growth Rate (Y3): $\frac{5}{55} = \frac{1}{11}$
Growth Rate (Y4): $\frac{6}{60} = \frac{1}{10}$
Growth Rate (Y5): $\frac{6}{66} = \frac{1}{11}$

MA1 (Y2-Y4): $\frac{1}{3} \left( \frac{1}{10} + \frac{1}{11} + \frac{1}{10} \right) = \frac{1}{3} \left( \frac{11+10+11}{110} \right) = \frac{1}{3} \left( \frac{32}{110} \right) = \frac{32}{330} = \frac{16}{165}$
MA2 (Y3-Y5): $\frac{1}{3} \left( \frac{1}{11} + \frac{1}{10} + \frac{1}{11} \right) = \frac{1}{3} \left( \frac{10+11+10}{110} \right) = \frac{1}{3} \left( \frac{31}{110} \right) = \frac{31}{330}$

Average of MAs: $\frac{1}{2} \left( \frac{16}{165} + \frac{31}{330} \right) = \frac{1}{2} \left( \frac{32}{330} + \frac{31}{330} \right) = \frac{1}{2} \left( \frac{63}{330} \right) = \frac{63}{660} = \frac{21}{220}$
As a percentage: $\frac{21}{220} \times 100\% = \frac{2100}{220}\% = \frac{105}{11}\% \approx 9.5455\%$
Rounding to two decimal places, the average of all available 3-year moving averages of the annual growth rate is $9.55\%$.
Answer: $\boxed{9.55}$
"
:::

---

Summary

Key Takeaways for CMI

  • Measures of Central Tendency: Understand mean, median, and mode, their calculation (especially for grouped data), and their sensitivity to outliers. The median is robust, while the mean is sensitive.

  • Measures of Dispersion: Know how to calculate variance and standard deviation using the correct formulas (sample vs. population), and interpret their meaning regarding data spread.

  • Measures of Position: Master the calculation of percentiles using the provided interpolation formula, and recognize that the median is $P_{50}$.

  • Impact of Data Changes: Be able to quantify how adding or removing data points affects the mean and variance, and understand the general direction of these changes.

  • Time Series Analysis Basics: Calculate percentage change, annual growth rates, and moving averages to analyze trends and make simple forecasts.

---

What's Next?

💡 Continue Learning

This topic connects to:

    • Probability Distributions: Summary statistics are used to describe parameters of distributions (e.g., mean and variance of a normal distribution).

    • Hypothesis Testing: Many tests rely on sample means and variances to infer about population parameters.

    • Regression Analysis: Descriptive statistics are crucial for initial data exploration and understanding variable relationships before modeling.

    • Data Visualization: Summary statistics often inform the choice and interpretation of plots like box plots (which show quartiles and median) and histograms (which show distribution shape).


Master these connections for comprehensive CMI preparation!

---

💡 Moving Forward

Now that you understand Summary Statistics, let's explore Data Interpretation which builds on these concepts.

---

Part 2: Data Interpretation

Introduction

Data Interpretation is a critical skill for a Masters in Data Science, especially in competitive examinations like CMI. It involves the ability to analyze and derive meaningful insights from various forms of data presentations such as tables, charts, and graphs. This topic assesses not only your quantitative aptitude but also your logical reasoning and attention to detail.

In CMI, Data Interpretation questions often present real-world scenarios, requiring you to extract, process, and synthesize information from multiple data sources to answer specific questions. Mastering this unit is essential for accurately and efficiently solving complex problems under exam conditions.

📖 Data Interpretation

Data Interpretation is the process of reviewing data using defined procedures, understanding what it means, and drawing conclusions from it. It involves transforming raw data into actionable information by employing analytical and statistical tools.

---

Key Concepts

1. Reading and Interpreting Tabular Data

Tables are structured arrays of data, organized into rows and columns, providing precise numerical information. They are fundamental for presenting detailed datasets.

Key aspects:
* Rows and Columns: Understand what each row and column represents.
* Headers: Pay close attention to column and row headers for context.
* Units: Always note the units of measurement (e.g., Rupees Crores, Lakhs of Rupees, percentage).
* Totals and Subtotals: Identify if totals or subtotals are provided, or if they need to be calculated.

Worked Example:

Problem:
A company's quarterly sales data (in thousands of units) for three products (P1, P2, P3) is given below.

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|-----|-----|-----|-----|
| P1 | 150 | 180 | 200 | 170 |
| P2 | 120 | 130 | 110 | 140 |
| P3 | 80 | 90 | 100 | 110 |

Calculate the total sales of Product P2 for the entire year.

Solution:

Step 1: Identify the relevant row for Product P2.

The sales for Product P2 are given in the second row.

$$\text{P2 Sales (Q1)} = 120$$
$$\text{P2 Sales (Q2)} = 130$$
$$\text{P2 Sales (Q3)} = 110$$
$$\text{P2 Sales (Q4)} = 140$$

Step 2: Sum the quarterly sales for Product P2.

$$\text{Total P2 Sales} = 120 + 130 + 110 + 140 = 500$$

Answer: $\boxed{500 \text{ thousand units}}$
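The same table lookup can be sketched in code. This is a minimal illustration, assuming the table is stored as a nested dictionary; the names `sales`, `row_total`, and `column_total` are hypothetical and chosen only for this example.

```python
# Quarterly sales (thousands of units) copied from the worked example.
sales = {
    "P1": {"Q1": 150, "Q2": 180, "Q3": 200, "Q4": 170},
    "P2": {"Q1": 120, "Q2": 130, "Q3": 110, "Q4": 140},
    "P3": {"Q1": 80, "Q2": 90, "Q3": 100, "Q4": 110},
}


def row_total(table, product):
    """Sum one product's sales across all quarters (a row total)."""
    return sum(table[product].values())


def column_total(table, quarter):
    """Sum one quarter's sales across all products (a column total)."""
    return sum(row[quarter] for row in table.values())


p2_total = row_total(sales, "P2")
q1_total = column_total(sales, "Q1")
print(p2_total)  # 500 (thousand units)
print(q1_total)  # 350 (thousand units)
```

Row totals answer "how did one product do over the year?", while column totals answer "how did all products do in one quarter?". Checking which one a question asks for is the main step.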

---

2. Interpreting Bar Charts

Bar charts use rectangular bars of varying heights or lengths to represent data, making comparisons between different categories easy.

Types of Bar Charts:
* Single Bar Chart: Displays one data series for various categories.
* Grouped Bar Chart: Compares multiple data series for each category, with bars grouped together.
* Stacked Bar Chart: Shows components of a whole for each category, with bars stacked on top of each other. The total height of the bar represents the sum of the components.

Key aspects:
* Axes: Understand what the X-axis (categories) and Y-axis (values/quantities) represent.
* Scale: Note the increments and range of the value axis.
* Labels: Read labels carefully for each bar or group of bars.
* Legend: For grouped or stacked bar charts, the legend is crucial to identify which bar/segment corresponds to which data series.

Worked Example (Grouped Bar Chart):

Problem:
A grouped bar chart shows the number of male and female employees in different departments (A, B, C).




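[Figure: grouped bar chart. X-axis: Department (Dept A, Dept B, Dept C). Y-axis: Number of Employees (0 to 40). Legend: Male, Female. Bar labels: Dept A: Male 25, Female 20; Dept B: Male 15, Female 30; Dept C: Male 35, Female 10.]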

What is the total number of employees in Department B?

Solution:

Step 1: Locate Department B on the X-axis.

Step 2: Identify the bars corresponding to Department B and read their values from the Y-axis (or value labels).

$$\text{Male employees in Dept B} = 15$$
$$\text{Female employees in Dept B} = 30$$

Step 3: Sum the values for Department B.

$$\text{Total employees in Dept B} = 15 + 30 = 45$$

Answer: $\boxed{45}$ employees

---

3. Interpreting Pie Charts

Pie charts represent parts of a whole, showing how a total quantity is divided among different categories. Each slice's size is proportional to the percentage it represents.

Key aspects:
* Total Value: The sum of all segments is $100\%$.
* Percentages/Degrees: Values are usually given as percentages. If degrees are given, remember that $360^\circ$ represents $100\%$.
* Labels: Each slice is labeled with its category and usually its percentage.
* Context: A pie chart alone doesn't give absolute values; often, it's combined with other data (e.g., a total value) to find exact quantities.

Worked Example:

Problem:
A pie chart shows the market share of different smartphone brands. If Brand X has a $30\%$ market share and the total market for smartphones is $500$ million units, how many units did Brand X sell?

Solution:

Step 1: Identify the total market size and Brand X's market share.

$$\text{Total Market} = 500 \text{ million units}$$
$$\text{Brand X Market Share} = 30\%$$

Step 2: Calculate the number of units sold by Brand X.

$$\text{Units sold by Brand X} = \text{Total Market} \times \text{Brand X Market Share (as a decimal)}$$
$$\text{Units sold by Brand X} = 500 \times 0.30 = 150$$

Answer: $\boxed{150}$ million units
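The two conversions used with pie charts (share to absolute quantity, and degrees to percentage) can be sketched as follows. The function names are hypothetical, and the $90^\circ$ slice is an illustrative value that does not come from the example above.

```python
def share_to_quantity(total, percent):
    """Convert a percentage share of a known total into an absolute quantity."""
    return total * percent / 100


def degrees_to_percent(degrees):
    """Convert a slice's central angle to a percentage (360 degrees = 100%)."""
    return degrees / 360 * 100


# Brand X from the worked example: 30% of 500 million units.
brand_x_units = share_to_quantity(500, 30)
print(brand_x_units)           # 150.0 (million units)

# Illustrative only: a 90-degree slice is a quarter of the pie.
print(degrees_to_percent(90))  # 25.0
```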

---

4. Working with Combined Data Displays

CMI often presents questions that require synthesizing information from two or more different data displays (e.g., a table and a bar chart, or a pie chart and a bar chart). This tests the ability to connect different pieces of information.

Key aspects:
* Identify Common Elements: Look for common categories or metrics that link the different charts.
* Sequential Information Flow: Often, one chart provides a total or percentage breakdown, and another provides detail for a specific segment of that total.
* Step-by-Step Calculation: Break down complex problems into smaller, manageable steps, moving between charts as needed.

Worked Example:

Problem:
A pie chart shows the distribution of a company's total budget (₹$1000$ Crore) across departments: Marketing ($20\%$), R&D ($30\%$), Operations ($40\%$), and Admin ($10\%$). A bar chart then shows the actual expenditure of the Marketing department across four quarters (Q1: ₹$30$ Crore, Q2: ₹$40$ Crore, Q3: ₹$50$ Crore, Q4: ₹$60$ Crore). What percentage of the total company budget was spent by the Marketing department in Q1?

Solution:

Step 1: Calculate the total budget allocated to the Marketing department from the pie chart.

$$\text{Total Company Budget} = \text{₹}1000 \text{ Crore}$$
$$\text{Marketing Budget Share} = 20\%$$
$$\text{Marketing Budget} = 1000 \times 0.20 = \text{₹}200 \text{ Crore}$$

Step 2: Identify the Marketing department's expenditure in Q1 from the bar chart.

$$\text{Marketing Expenditure in Q1} = \text{₹}30 \text{ Crore}$$

Step 3: Calculate the Q1 Marketing expenditure as a percentage of the total company budget.

$$\text{Percentage} = \frac{\text{Marketing Expenditure in Q1}}{\text{Total Company Budget}} \times 100\% = \frac{30}{1000} \times 100\% = 3\%$$

Answer: $\boxed{3\%}$
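A short sketch of the combined-display example shows why the choice of base matters. The variable names are hypothetical; the numbers are those of the worked example. Note that the same Q1 spend gives a different percentage depending on whether the whole is the company budget or only the Marketing allocation.

```python
# Pie chart: budget shares (in %) of a total of 1000 crore rupees.
total_budget = 1000
budget_share_percent = {"Marketing": 20, "R&D": 30, "Operations": 40, "Admin": 10}

# Bar chart: Marketing's actual quarterly expenditure (crore rupees).
marketing_spend = {"Q1": 30, "Q2": 40, "Q3": 50, "Q4": 60}


def percent_of(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100 * part / whole


marketing_budget = total_budget * budget_share_percent["Marketing"] / 100
q1_vs_company = percent_of(marketing_spend["Q1"], total_budget)
q1_vs_marketing = percent_of(marketing_spend["Q1"], marketing_budget)

print(marketing_budget)  # 200.0
print(q1_vs_company)     # 3.0  (the quantity the question asks for)
print(q1_vs_marketing)   # 15.0 (a different base, so a different answer)
```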

---

5. Calculations: Percentages, Ratios, Averages, Rates of Change

These are the core mathematical operations applied to extracted data.

a. Percentage Calculations

📐 Percentage of a Total

$$\text{Percentage} = \frac{\text{Part}}{\text{Whole}} \times 100\%$$

Variables:
    • $\text{Part}$ = The specific value or quantity
    • $\text{Whole}$ = The total value or quantity

When to use: To find what proportion a part is of a total.

📐 Percentage Increase/Decrease

$$\text{Percentage Change} = \frac{|\text{New Value} - \text{Old Value}|}{\text{Old Value}} \times 100\%$$

Variables:
    • $\text{New Value}$ = The value after change
    • $\text{Old Value}$ = The initial value

When to use: To quantify the relative change between two values. The absolute value gives the size of the change; the sign of $\text{New Value} - \text{Old Value}$ tells you whether it is an increase or a decrease.

b. Ratios and Proportions

📖 Ratio

A ratio is a comparison of two quantities of the same unit, expressed as $a:b$ or $\frac{a}{b}$.

📖 Proportion

A proportion is a statement that two ratios are equal, e.g., $\frac{a}{b} = \frac{c}{d}$.

Application: Often used to distribute a total quantity based on given ratios or to infer values in one category based on known values in another, assuming proportionality.
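Distributing a total according to a ratio can be sketched as below. The helper name `split_by_ratio` and the figures (a total of 1200 split 2:3:1) are hypothetical, chosen only to illustrate the idea.

```python
def split_by_ratio(total, ratio):
    """Split `total` into parts proportional to the terms of `ratio`."""
    ratio_sum = sum(ratio)
    if ratio_sum == 0:
        raise ValueError("ratio terms must not sum to zero")
    return [total * term / ratio_sum for term in ratio]


# Illustrative only: divide 1200 in the ratio 2 : 3 : 1.
parts = split_by_ratio(1200, [2, 3, 1])
print(parts)       # [400.0, 600.0, 200.0]
print(sum(parts))  # 1200.0, the parts always add back to the total
```

Each part is the total multiplied by that term's fraction of the ratio sum, which is the same as setting up the proportion $\frac{\text{part}}{\text{total}} = \frac{\text{term}}{\text{sum of terms}}$.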

c. Averages

📐 Arithmetic Mean (Simple Average)

$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Variables:
    • $x_i$ = individual data points
    • $n$ = number of data points

When to use: To find a central value for a set of numbers.

📐 Weighted Average

$$\text{Weighted Average} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Variables:
    • $x_i$ = individual data points
    • $w_i$ = weights corresponding to each data point

When to use: When different data points contribute differently to the overall average.

Example: Calculating overall outage percentage where different servers have different usage times and individual outage rates.
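A minimal sketch of the server example follows. The outage rates and usage hours are hypothetical numbers invented for illustration; only the weighted-average formula comes from the text.

```python
def weighted_average(values, weights):
    """Weighted average: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: outage rate (%) and usage time (hours) for two servers.
outage_rates = [2.0, 6.0]
usage_hours = [100, 300]

overall_rate = weighted_average(outage_rates, usage_hours)
simple_mean = sum(outage_rates) / len(outage_rates)

print(overall_rate)  # 5.0, pulled towards the heavily used server
print(simple_mean)   # 4.0, which ignores usage time
```

The gap between the two results is the reason a simple average of percentages is usually wrong when the underlying group sizes differ.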

d. Rate of Change

This is essentially percentage change over time or across categories.

Worked Example (Percentage Increase):

Problem:
Sales of a product increased from $150$ units in January to $180$ units in February. What is the percentage increase in sales?

Solution:

Step 1: Identify the old value and the new value.

$$\text{Old Value} = 150$$
$$\text{New Value} = 180$$

Step 2: Apply the percentage increase formula.

$$\begin{aligned}\text{Percentage Increase} & = \frac{\text{New Value} - \text{Old Value}}{\text{Old Value}} \times 100\% \\ & = \frac{180 - 150}{150} \times 100\% \\ & = \frac{30}{150} \times 100\% \\ & = \frac{1}{5} \times 100\% \\ & = 20\%\end{aligned}$$

Answer: $\boxed{20\%}$
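The calculation above can be sketched as a small function. `percent_change` is a hypothetical name; it returns a signed value, so a negative result means a decrease. The reverse case (180 back to 150) is an added illustration showing that the base changes the answer.

```python
def percent_change(old, new):
    """Signed percentage change from `old` to `new`, using `old` as the base."""
    if old == 0:
        raise ValueError("percentage change is undefined for an old value of 0")
    return 100 * (new - old) / old


increase = percent_change(150, 180)
decrease = percent_change(180, 150)

print(increase)            # 20.0
print(round(decrease, 2))  # -16.67, not -20: the base is now 180
```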

---

6. Time-Based Data Analysis

This involves interpreting data that changes over time, often presented in line graphs or bar charts with a time axis.

a. Simple Interest

📐 Simple Interest

$$I = P \times R \times T$$

Variables:
    • $I$ = Simple Interest
    • $P$ = Principal amount
    • $R$ = Annual interest rate (as a decimal)
    • $T$ = Time in years
When to use: To calculate interest for loans or investments where interest is only on the principal amount.

Application: In CMI, you might be given interest rates over different years and need to calculate total interest paid for fixed-rate vs. variable-rate loans over multiple periods (as seen in PYQ 6).
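A fixed-rate versus variable-rate comparison can be sketched as follows. The principal and all rates here are hypothetical values made up for illustration, not figures from any past question. The variable-rate helper assumes each listed rate applies for exactly one year on the original principal.

```python
def simple_interest(principal, annual_rate, years):
    """Simple interest I = P * R * T, with the rate given as a decimal."""
    return principal * annual_rate * years


def variable_rate_interest(principal, yearly_rates):
    """Total simple interest when a different rate applies in each year."""
    return sum(simple_interest(principal, rate, 1) for rate in yearly_rates)


# Hypothetical loan: principal of 100,000 over three years.
principal = 100_000
fixed_total = simple_interest(principal, 0.08, 3)
variable_total = variable_rate_interest(principal, [0.07, 0.08, 0.10])

print(round(fixed_total, 2))     # 24000.0
print(round(variable_total, 2))  # 25000.0
```

Because simple interest is always charged on the principal, the variable-rate total is just the principal times the sum of the yearly rates.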

b. Time Zones

Understanding time zones is crucial when dealing with schedules or events spanning different geographical locations.

Key concepts:
* Local Time: The time at a specific location.
* Time Difference: The fixed difference in hours/minutes between two time zones.
* Calculating Actual Travel Time: To find the true duration of a journey across time zones, you must account for the time difference.
    * If traveling from West to East (the destination clock is ahead): Actual Travel Time = Arrival Local Time - Departure Local Time - Time Difference.
    * If traveling from East to West (the destination clock is behind): Actual Travel Time = Arrival Local Time - Departure Local Time + Time Difference.
    * Alternatively, convert both departure and arrival times to a single reference time zone before calculating duration.

Example (PYQ 20 concept): If a train departs City A at 08:00 local time and arrives at City B at 10:00 local time, and City B is 1 hour ahead of City A, the actual travel time is:
* Departure in City B time: 08:00 + 1 hour = 09:00
* Actual travel time: 10:00 (arrival) - 09:00 (adjusted departure) = 1 hour.
* Check: the two local clock readings differ by 2 hours, which is the 1-hour journey plus the 1-hour time zone difference.
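The "convert to a common reference" approach can be sketched with Python's standard `datetime` module. The calendar date and the choice of offsets (City A at +0 hours, City B at +1 hour) are assumptions made for illustration; only the relative 1-hour difference comes from the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets: only the 1-hour gap between the cities matters here.
city_a = timezone(timedelta(hours=0))
city_b = timezone(timedelta(hours=1))

# Arbitrary date; the local clock times are those of the example.
departure = datetime(2026, 1, 1, 8, 0, tzinfo=city_a)
arrival = datetime(2026, 1, 1, 10, 0, tzinfo=city_b)

# Subtracting timezone-aware datetimes compares the underlying instants,
# so the time zone difference is handled automatically.
travel_time = arrival - departure
print(travel_time)  # 1:00:00

# The naive clock difference overstates the journey by the zone offset.
clock_difference = timedelta(hours=10) - timedelta(hours=8)
print(clock_difference)  # 2:00:00
```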

---

7. Logical Deduction in Data

Some problems require more than direct calculation; they involve logical reasoning, filling in missing information based on given constraints, or determining maximum/minimum possible values.

Key aspects:
* Constraints: Carefully read all conditions and rules provided in the problem description.
* Trial and Error / Systematic Approach: For problems with missing data, try to deduce values that satisfy all conditions.
* Optimization: When asked for maximum or minimum values, consider extreme scenarios within the given constraints.

Example (PYQ 18 concept): If ratings must be integers between 1 and 5, and no two parameters can have the same rating in four or more parameters, this imposes strict rules on how missing values can be filled. To maximize an average, you'd assign the highest possible ratings (5) to unknown parameters, ensuring all constraints are met.

---

Problem-Solving Strategies

💡 CMI Strategy

  • Understand the Question First: Before diving into data, read the question thoroughly to know what specific information you need to extract.

  • Identify Relevant Data: Pinpoint which chart(s), tables, rows, or columns contain the necessary data. Ignore irrelevant information.

  • Note Units and Scale: Always check the units (e.g., millions, lakhs, percentage points) and the scale of the axes. A common mistake is misinterpreting scales.

  • Break Down Complex Problems: For multi-step questions, break them into smaller, manageable calculations.

  • Estimate Before Calculating: For MCQs, sometimes a quick estimation can eliminate options or guide your precise calculation.

  • Use Annotations: Mark up charts or tables (mentally or on scratch paper) with relevant values to avoid re-reading.

  • Be Mindful of "Percentage Point" vs. "Percentage": A change from 10% to 12% is a 2 percentage point increase, but a 20% relative increase ($(12-10)/10 \times 100\% = 20\%$).

  • Proportionality Assumption: If not explicitly stated, do not assume distributions are uniform or proportional across categories unless there's a clear indication (like "same proportion across states").

  • Time Zone Conversion: When dealing with time-based data across different locations, always convert times to a common reference time zone to calculate actual durations.

---

Common Mistakes

⚠️ Avoid These Errors
    • Misreading Axes/Labels: Interpreting a bar's height against the wrong scale or misidentifying a category.
Correct: Always double-check axis labels, units, and legends before extracting any value.
    • Confusing Absolute and Relative Values: Mixing up raw numbers with percentages or ratios.
Correct: Clearly distinguish between counts, amounts, and their proportional representations.
    • Incorrect Percentage Calculations: Using the wrong base for percentage increase/decrease or calculating percentage points instead of percentage change.
Correct: Always use the 'Old Value' as the denominator for percentage change. Understand the difference between $\frac{\text{Change}}{\text{Original}} \times 100\%$ (percentage change) and $\text{New Percentage} - \text{Old Percentage}$ (percentage point change).
    • Ignoring Constraints/Conditions: Overlooking specific rules or conditions provided in the problem description, especially in logical deduction questions.
Correct: Underline or highlight all stated conditions.
    • Calculation Errors: Simple arithmetic mistakes due to haste.
Correct: Take your time with calculations, especially sums and multiplications involving large numbers or decimals. Use approximation for quick checks.
    • Assuming Proportionality: Assuming that if one segment (e.g., grey cars) is distributed in a certain way across cities, other segments (e.g., red cars) follow the exact same distribution, unless explicitly stated.
Correct: Only assume proportionality if the question directly states it or provides data that implies it.
    • Time Zone Miscalculation: Incorrectly adding or subtracting time differences when calculating travel durations.
Correct: Convert all times to a single reference time zone (e.g., UTC or one of the city's local times) before calculating durations.

---

Practice Questions

:::question type="NAT" question="A company's sales data for Product A over four quarters is given in the table below (in thousands of units).

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|-----|-----|-----|-----|
| A | 220 | 250 | 200 | 280 |
| B | 180 | 200 | 190 | 210 |
| C | 100 | 120 | 130 | 150 |

What was the percentage increase in sales of Product A from Q3 to Q4? (Round to one decimal place if necessary)" answer="40.0" hint="Calculate the difference between Q4 and Q3 sales for Product A, then divide by Q3 sales and multiply by 100." solution="Step 1: Identify sales of Product A in Q3 and Q4.
$$\text{Sales}_{\text{A, Q3}} = 200 \text{ thousand units}$$

$$\text{Sales}_{\text{A, Q4}} = 280 \text{ thousand units}$$

Step 2: Calculate the percentage increase.

$$\begin{aligned}\text{Percentage Increase} & = \frac{\text{Sales}_{\text{A, Q4}} - \text{Sales}_{\text{A, Q3}}}{\text{Sales}_{\text{A, Q3}}} \times 100\% \\
& = \frac{280 - 200}{200} \times 100\% \\
& = \frac{80}{200} \times 100\% \\
& = 0.4 \times 100\% \\
& = 40\%\end{aligned}$$

Answer: $\boxed{40\%}$"
:::

:::question type="MCQ" question="The following pie chart shows the distribution of students by their chosen major in a university.


[Pie chart: Student Major Distribution. CS 35%, Eng 25%, Bus 20%, Arts 10%, Sci 10%.]


If there are 4000 students in total, how many students are majoring in Business or Arts?" options=["800","1000","1200","1400"] answer="1200" hint="First, find the combined percentage for Business and Arts. Then, calculate that percentage of the total number of students." solution="Step 1: Identify the percentages for Business and Arts majors.

$$\text{Business Major Percentage} = 20\%$$

$$\text{Arts Major Percentage} = 10\%$$

Step 2: Calculate the combined percentage for Business and Arts.

$$\text{Combined Percentage} = 20\% + 10\% = 30\%$$

Step 3: Calculate the number of students majoring in Business or Arts.

$$\text{Total Students} = 4000$$

$$\text{Number of Students (Business or Arts)} = 4000 \times 0.30 = 1200$$

Answer: $\boxed{1200}$"
:::

:::question type="SUB" question="A company's IT department has three servers: S1, S2, and S3. Their uptime (percentage of total operational time) and the number of incidents reported per server are given below:

| Server | Uptime (%) | Incidents Reported |
|--------|------------|--------------------|
| S1 | 98% | 120 |
| S2 | 95% | 150 |
| S3 | 99% | 80 |

If Server S1 was operational for 5000 hours in total, and downtime hours are assumed to be proportional to the number of incidents reported, calculate the total number of hours Server S2 was down (non-operational)." answer="125.0" hint="First, find S1's downtime hours from its uptime percentage. Then scale that figure by the ratio of incidents reported for S2 to incidents reported for S1." solution="Step 1: Calculate S1's downtime hours.
$$\text{Total Operational Time for S1} = 5000 \text{ hours}$$

$$\text{S1 Uptime} = 98\%$$

$$\text{S1 Downtime Percentage} = 100\% - 98\% = 2\%$$

$$\text{S1 Downtime Hours} = 5000 \times 0.02 = 100 \text{ hours}$$

Step 2: Assume the number of incidents reported is proportional to the downtime hours for each server.

$$\frac{\text{S1 Downtime Hours}}{\text{Incidents Reported for S1}} = \frac{\text{S2 Downtime Hours}}{\text{Incidents Reported for S2}}$$

$$\frac{100}{120} = \frac{\text{S2 Downtime Hours}}{150}$$

Step 3: Solve for S2 Downtime Hours.

$$\begin{aligned}\text{S2 Downtime Hours} & = \frac{100}{120} \times 150 \\
& = \frac{5}{6} \times 150 \\
& = 5 \times 25 \\
& = 125 \text{ hours}\end{aligned}$$

Answer: $\boxed{125.0}$"
:::

---

Chapter Summary

📖 Data Interpretation and Summary Statistics - Key Takeaways

Here are the most important points from this chapter that students must remember for CMI:

  • Understand Data Types and Scales: Differentiate between qualitative (nominal, ordinal) and quantitative (interval, ratio, discrete, continuous) data. This dictates which summary statistics and visualizations are appropriate.

  • Master Measures of Central Tendency: Know how to calculate and interpret the Mean, Median, and Mode. Understand their properties, especially how outliers affect the mean versus the median, and when each measure is most representative (e.g., median for skewed data, mean for symmetric data).

  • Grasp Measures of Dispersion: Comprehend the importance of Range, Variance, Standard Deviation, and Interquartile Range (IQR) in quantifying data spread. A smaller standard deviation or IQR indicates more consistent data.

  • Interpret Data Visualizations: Be proficient in interpreting common charts like Histograms, Box Plots, Bar Charts, and Pie Charts. Extract information about data distribution (shape, skewness, modality), central tendency, spread, and potential outliers from these visuals.

  • Recognize Skewness and Kurtosis: Qualitatively identify skewness (asymmetry) from histograms or the relationship between mean and median (e.g., Mean > Median for right-skewed). Understand that kurtosis describes the "tailedness" of a distribution relative to a normal distribution.

  • Percentiles and Quartiles: Understand that percentiles divide data into 100 equal parts and quartiles divide data into four equal parts. Know how to calculate and interpret $Q_1$, $Q_2$ (Median), $Q_3$, and the IQR, which is a robust measure of spread.

  • Context is Key: Always consider the context of the data and the purpose of the analysis when choosing and interpreting summary statistics. No single statistic tells the whole story.

---

Chapter Review Questions

:::question type="MCQ" question="A researcher collected data on the monthly income (in thousands of INR) of 100 households in a particular locality. The distribution of incomes was found to be highly right-skewed. Which of the following statements is most likely true regarding the relationship between the mean, median, and mode of this income distribution?" options=["Mean < Median < Mode","Mean = Median = Mode","Mean > Median > Mode","The relationship cannot be determined without specific values"] answer="C" hint="Think about how outliers (high income values in this case) pull the mean in a skewed distribution." solution="For a distribution that is right-skewed (or positively skewed), the tail of the distribution extends to the right. This means there are a few unusually high values that pull the mean towards the right (higher values). The mode will be at the peak of the distribution (most frequent value), and the median will be between the mode and the mean.

Therefore, for a right-skewed distribution, the relationship is typically:

$$\text{Mode} < \text{Median} < \text{Mean}$$

Option C, Mean > Median > Mode, correctly represents this relationship.

Answer: \boxed{C}"
:::

:::question type="NAT" question="Consider the dataset: $X = \{5, 8, 10, 12, 15\}$. Calculate the population variance ($\sigma^2$)." answer="11.6" hint="First, calculate the mean of the dataset. Then, find the squared difference of each value from the mean, sum them up, and divide by the number of observations." solution="To calculate the population variance ($\sigma^2$) for the dataset $X = \{5, 8, 10, 12, 15\}$:

Step 1: Calculate the mean ($\mu$).

$$\mu = \frac{5 + 8 + 10 + 12 + 15}{5} = \frac{50}{5} = 10$$

Step 2: Calculate the squared difference of each data point from the mean.

* $(5 - 10)^2 = (-5)^2 = 25$
* $(8 - 10)^2 = (-2)^2 = 4$
* $(10 - 10)^2 = (0)^2 = 0$
* $(12 - 10)^2 = (2)^2 = 4$
* $(15 - 10)^2 = (5)^2 = 25$

Step 3: Sum these squared differences.

$$\sum (x_i - \mu)^2 = 25 + 4 + 0 + 4 + 25 = 58$$

Step 4: Divide by the number of observations ($N$).

$$\begin{aligned} \sigma^2 & = \frac{\sum (x_i - \mu)^2}{N} \\
& = \frac{58}{5} \\
& = 11.6\end{aligned}$$

Answer: $\boxed{11.6}$"
:::
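The population variance calculation above can be checked with Python's standard `statistics` module. This sketch also prints the sample variance (divisor $N - 1$) for contrast, which is an added illustration and not part of the question.

```python
import statistics

data = [5, 8, 10, 12, 15]

# Step-by-step version mirroring the hand calculation.
mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
population_variance = sum(squared_diffs) / len(data)

print(mean)                 # 10.0
print(sum(squared_diffs))   # 58.0
print(population_variance)  # 11.6

# Library versions: pvariance divides by N, variance divides by N - 1.
print(statistics.pvariance(data))
print(statistics.variance(data))
```

Mixing up the two divisors is a common source of wrong answers, so it helps to confirm which one a question asks for before computing.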

:::question type="MCQ" question="Two companies, A and B, produce light bulbs. A sample of 100 bulbs from each company was tested for their lifespan (in hours). The summary statistics are given below:

| Statistic | Company A | Company B |
| :--------------- | :-------- | :-------- |
| Mean Lifespan | 1200 hrs | 1250 hrs |
| Median Lifespan | 1190 hrs | 1200 hrs |
| Standard Deviation | 50 hrs | 150 hrs |
| Interquartile Range | 70 hrs | 200 hrs |

Based on these statistics, which of the following conclusions is most appropriate?" options=["Company A's bulbs are, on average, more durable than Company B's bulbs.","Company B's bulbs have a more consistent lifespan than Company A's bulbs.","Company A's bulbs show less variability in lifespan compared to Company B's bulbs.","Both companies have a symmetric distribution of bulb lifespans." ] answer="C" hint="Focus on measures of central tendency for 'average durability' and measures of dispersion for 'consistency' or 'variability'." solution="Let's analyze each option:

* Company A's bulbs are, on average, more durable than Company B's bulbs.
    * Company A's Mean Lifespan = 1200 hrs.
    * Company B's Mean Lifespan = 1250 hrs.
    * Company B has a higher mean lifespan, suggesting its bulbs are, on average, more durable. So, this option is incorrect.

* Company B's bulbs have a more consistent lifespan than Company A's bulbs.
    * Consistency is measured by dispersion. Lower standard deviation and IQR indicate higher consistency.
    * Company A: Standard Deviation = 50 hrs, IQR = 70 hrs.
    * Company B: Standard Deviation = 150 hrs, IQR = 200 hrs.
    * Company A has significantly lower standard deviation and IQR, meaning its bulbs are more consistent. So, this option is incorrect.

* Company A's bulbs show less variability in lifespan compared to Company B's bulbs.
    * Variability is the opposite of consistency, measured by dispersion.
    * Company A's standard deviation (50 hrs) is much lower than Company B's (150 hrs).
    * Company A's IQR (70 hrs) is much lower than Company B's (200 hrs).
    * Both measures strongly indicate that Company A's bulbs have less variability. So, this option is correct.

* Both companies have a symmetric distribution of bulb lifespans.
    * For Company A: Mean (1200) is slightly greater than Median (1190), suggesting a slight right-skew.
    * For Company B: Mean (1250) is significantly greater than Median (1200), suggesting a more pronounced right-skew.
    * Neither distribution appears perfectly symmetric (where Mean $\approx$ Median). So, this option is incorrect.

Answer: $\boxed{C}$"
:::

:::question type="NAT" question="A dataset has 11 observations: $\{10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35\}$. Calculate the Interquartile Range (IQR)." answer="15" hint="First, sort the data. Then find the median ($Q_2$), followed by the median of the lower half ($Q_1$) and the median of the upper half ($Q_3$). Finally, calculate $Q_3 - Q_1$." solution="To calculate the Interquartile Range (IQR), we first need to find the first quartile ($Q_1$) and the third quartile ($Q_3$).

Step 1: Sort the data. The given dataset is already sorted:

$$X = \{10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35\}$$

There are $N = 11$ observations.

Step 2: Find the median ($Q_2$).

The median is the $(N+1)/2$-th observation, that is, the $(11+1)/2 = 6$-th observation.
$Q_2 = 22$.

Step 3: Find the first quartile ($Q_1$).

$Q_1$ is the median of the lower half of the data (excluding the median if $N$ is odd).
Lower half: $\{10, 12, 15, 18, 20\}$
The median of these 5 observations is the $(5+1)/2 = 3$-rd observation.
$Q_1 = 15$.

Step 4: Find the third quartile ($Q_3$).

$Q_3$ is the median of the upper half of the data (excluding the median if $N$ is odd).
Upper half: $\{25, 28, 30, 32, 35\}$
The median of these 5 observations is the $(5+1)/2 = 3$-rd observation.
$Q_3 = 30$.

Step 5: Calculate the Interquartile Range (IQR).

$$\begin{aligned} \text{IQR} & = Q_3 - Q_1 \\
& = 30 - 15 \\
& = 15\end{aligned}$$

Answer: $\boxed{15}$"
:::
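The median-of-halves method used in the solution above can be sketched as follows. `quartiles_median_of_halves` is a hypothetical helper written for this illustration. Be aware that software packages use several different quartile conventions, so a library function may return a different IQR for the same data; the last line prints one library convention for comparison.

```python
import statistics


def quartiles_median_of_halves(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    When the number of observations is odd, the overall median is
    excluded from both halves, as in the worked solution.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)


data = [10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35]
q1, q2, q3 = quartiles_median_of_halves(data)
iqr = q3 - q1

print(q1, q2, q3)  # 15 22 30
print(iqr)         # 15

# One of the conventions offered by the standard library, for comparison.
print(statistics.quantiles(data, n=4, method="exclusive"))
```

In an exam, use whichever quartile convention the question or the course notes specify.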

---

What's Next?

💡 Continue Your CMI Journey

You've mastered Data Interpretation and Summary Statistics! This chapter provides fundamental tools for understanding and describing datasets, which are indispensable for higher-level quantitative analysis.

Key connections:
* Building on Previous Learning: The concepts of data types, ordering, and basic arithmetic from earlier foundational mathematics chapters are directly applied here. Understanding functions and basic algebra is crucial for calculating summary statistics.
* Foundation for Future Chapters: This chapter is a cornerstone for several upcoming topics. It directly prepares you for:
    * Probability Theory: Understanding data distributions and summary statistics is essential for defining random variables and understanding their probability distributions (e.g., mean and variance of a random variable).
    * Inferential Statistics: When you learn about sampling distributions, confidence intervals, and hypothesis testing, you'll be constantly applying the concepts of means, standard deviations, and data variability to draw conclusions about populations from samples.
    * Regression Analysis and Econometrics: These advanced topics rely heavily on descriptive statistics to characterize variables, understand relationships, and interpret model outputs. Visualizing data and understanding its spread are critical initial steps in any regression analysis.

Keep practicing these core concepts, as they will be integrated into almost every subsequent quantitative chapter!

