Elementary Statistics and Probability
Overview
In the quantitative analysis of systems and processes, we are often confronted with large sets of numerical data. The ability to distill this raw information into a comprehensible and meaningful form is a fundamental skill for any engineer or data scientist. This chapter introduces the foundational principles of elementary statistics and probability, which together provide the mathematical framework for describing data, quantifying uncertainty, and making informed inferences. A strong command of these topics is indispensable, as they not only appear as direct questions in the GATE examination but also underpin more advanced concepts in data science and artificial intelligence.
We shall begin our study with descriptive statistics, focusing first on measures of central tendency. These metrics, such as the mean, median, and mode, allow us to identify a central or typical value around which a dataset is distributed. Subsequently, we will explore measures of dispersion, including variance and standard deviation, which quantify the extent to which data points deviate from the center. Understanding both centrality and spread is critical for a complete characterization of any dataset. The chapter culminates with an introduction to the theory of probability, the branch of mathematics concerned with the analysis of random phenomena. This final section equips us with the tools to model and reason about uncertainty, a concept central to modern engineering and data analysis.
---
Chapter Contents
| # | Topic | What You'll Learn |
|---|-------|-------------------|
| 1 | Measures of Central Tendency | Calculating the central or typical value. |
| 2 | Measures of Dispersion | Quantifying the spread or variability of data. |
| 3 | Probability | Understanding the likelihood of random events. |
---
Learning Objectives
After completing this chapter, you will be able to:
- Calculate and interpret the mean, median, and mode for various types of datasets.
- Determine the range, variance ($\sigma^2$), and standard deviation ($\sigma$) to assess data variability.
- Apply the fundamental axioms and rules of probability to solve problems involving random experiments.
- Distinguish between appropriate uses of different statistical measures to analyze and compare datasets.
---
We now turn our attention to Measures of Central Tendency...
Part 1: Measures of Central Tendency
Introduction
In the study of statistics, our primary objective is often to distill a large, complex dataset into a few representative numerical values that summarize its essential characteristics. Measures of central tendency, also known as measures of location, serve this precise purpose. They provide a single value that attempts to describe the center of a distribution of data. This central value acts as a typical or representative score for the entire group, offering a concise summary of the dataset's overall magnitude.
Understanding these measures—the mean, median, and mode—is fundamental to quantitative analysis. For the GATE examination, a firm grasp of their definitions, properties, and appropriate applications is not merely advantageous but essential. We will explore the calculation of these measures, delve into their distinct properties, and critically examine the conditions under which one measure is more appropriate than another. This chapter will equip you with the foundational knowledge required to analyze datasets and solve problems related to data distribution with precision and confidence.
A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. It indicates where most values in a distribution fall and is also referred to as the central location of a distribution. The three most common measures are the mean, median, and mode.
---
Key Concepts
We shall now proceed to a formal examination of the principal measures of central tendency. Each possesses unique characteristics that render it suitable for specific types of data and analytical objectives.
1. The Arithmetic Mean
The arithmetic mean, often simply called the mean or average, is the most widely used measure of central tendency. It is calculated by summing all the values in a dataset and dividing by the number of values.
For a set of $n$ observations $x_1, x_2, \ldots, x_n$, the arithmetic mean, denoted by $\bar{x}$, is given by:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Variables:
- $\bar{x}$ = The arithmetic mean
- $\sum_{i=1}^{n} x_i$ = The sum of all observations
- $n$ = The total number of observations
When to use: For quantitative data that is not significantly skewed by outliers. It is sensitive to every value in the dataset.
A crucial property of the mean is that the sum of all observations can be directly derived from it. This relationship, $\sum_{i=1}^{n} x_i = n \times \bar{x}$, is fundamental to solving a common class of problems in competitive examinations, particularly those involving the correction of data entry errors.
Worked Example: Correcting an Incorrect Mean
Problem: The mean score of a class of 40 students in a test was calculated to be 75. Later, it was discovered that the score of one student was misread as 56 instead of the correct score of 96. Find the correct mean score.
Solution:
Step 1: Calculate the incorrect sum of scores using the incorrect mean.
We are given $n = 40$ and the incorrect mean $\bar{x}_{\text{incorrect}} = 75$.
$$\text{Incorrect Sum} = n \times \bar{x}_{\text{incorrect}} = 40 \times 75 = 3000$$
Step 2: Adjust the sum by removing the incorrect value and adding the correct value.
The incorrect value was 56, and the correct value is 96.
$$\text{Correct Sum} = 3000 - 56 + 96 = 3040$$
Step 3: Calculate the correct mean using the correct sum.
The number of students, $n$, remains 40.
$$\bar{x}_{\text{correct}} = \frac{3040}{40} = 76$$
Answer: The correct mean score of the class is \boxed{76}.
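The adjustment above can be sketched in a few lines of Python. This is an illustrative helper (the function name `corrected_mean` is our own, not from the text), built directly on the relationship that the sum equals $n$ times the mean:

```python
def corrected_mean(n, wrong_mean, wrong_value, correct_value):
    """Correct a mean after one misread observation, using sum = n * mean."""
    wrong_sum = n * wrong_mean                       # sum implied by the incorrect mean
    correct_sum = wrong_sum - wrong_value + correct_value
    return correct_sum / n

# Values from the worked example: 40 students, mean computed as 75,
# one score misread as 56 instead of 96.
print(corrected_mean(40, 75, 56, 96))  # → 76.0
```

Working with the sum rather than individual scores is exactly the exam strategy discussed later in this part.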
---
2. The Median
The median is the middle value of a dataset that has been arranged in ascending or descending order. It represents the positional center of the data, effectively dividing the dataset into two equal halves. Unlike the mean, the median is not affected by extreme values or outliers, making it a more robust measure for skewed distributions.
The calculation of the median depends on whether the number of observations, , is odd or even.
1. For an odd number of observations ($n$ is odd):
First, arrange the data in ascending order. The median is the value of the middle observation:
$$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ observation}$$
2. For an even number of observations ($n$ is even):
First, arrange the data in ascending order. The median is the arithmetic mean of the two middle observations:
$$\text{Median} = \frac{1}{2}\left[\left(\frac{n}{2}\right)^{\text{th}} \text{observation} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{observation}\right]$$
Variables:
- $n$ = The total number of observations
When to use: For ordinal data or for quantitative data with significant outliers or skewness.
By its very definition, the median is the value that separates the higher half from the lower half of a data sample. Consequently, at most 50% of the observations are greater than the median, and at most 50% are less than the median. This is a fundamental property frequently tested in conceptual questions. Note that if multiple data points are equal to the median, the percentage can be less than 50%.
Worked Example: Finding the Median
Problem: Find the median of the following dataset: .
Solution:
Step 1: Arrange the data in ascending order.
Step 2: Determine the number of observations, $n$.
Here, $n = 7$, which is an odd number.
Step 3: Apply the formula for an odd number of observations to find the position of the median.
$$\text{Position} = \frac{n+1}{2} = \frac{7+1}{2} = 4$$
Step 4: Identify the value at this position in the sorted data.
The 4th observation in the sorted list is 28.
Answer: The median of the dataset is \boxed{28}.
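A quick Python sketch of the same procedure, using a hypothetical 7-value dataset of our own (the original data is not shown above). `statistics.median` sorts internally, mirroring Steps 1–4:

```python
import statistics

data = [31, 12, 28, 45, 19, 26, 56]   # hypothetical sample, n = 7 (odd)
position = (len(data) + 1) // 2        # position of the median in the sorted list
print(position)                        # → 4
print(statistics.median(data))         # → 28
```

For an even-length list, `statistics.median` returns the mean of the two middle values, matching the second formula above.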
---
3. The Mode
The mode is the value that appears most frequently in a dataset. A dataset may have one mode (unimodal), two modes (bimodal), more than two modes (multimodal), or no mode at all if all values occur with the same frequency.
The mode is the observation with the highest frequency in the dataset. It is the only measure of central tendency that can be used for categorical (nominal) data.
Worked Example: Identifying the Mode
Problem: Find the mode of the dataset: $\{5, 6, 8, 7, 8, 5, 9, 8, 7, 9, 8\}$.
Solution:
Step 1: Tally the frequency of each unique value in the dataset.
- 5 appears 2 times
- 6 appears 1 time
- 7 appears 2 times
- 8 appears 4 times
- 9 appears 2 times
Step 2: Identify the value with the highest frequency.
The value 8 occurs most frequently (4 times).
Answer: The mode of the dataset is \boxed{8}. This distribution is unimodal.
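The frequency tally above can be reproduced with Python's `collections.Counter`. A minimal sketch (the list order is illustrative; the multiset matches the tallies in the worked example):

```python
from collections import Counter

data = [5, 5, 6, 7, 7, 8, 8, 8, 8, 9, 9]
counts = Counter(data)                 # frequency of each unique value
mode, freq = counts.most_common(1)[0]  # (value, count) with the highest count
print(mode, freq)  # → 8 4
```

Because the mode only depends on frequencies, this works for categorical data too, e.g. `Counter(["red", "blue", "red"])`.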
---
4. Relationship between Mean, Median, and Mode
The relative positions of the mean, median, and mode are determined by the skewness of the distribution:
- Symmetric distribution: $\text{Mean} = \text{Median} = \text{Mode}$
- Positively skewed (long right tail): $\text{Mode} < \text{Median} < \text{Mean}$
- Negatively skewed (long left tail): $\text{Mean} < \text{Median} < \text{Mode}$
For moderately skewed distributions, an empirical relationship is often cited:
$$\text{Mode} \approx 3 \times \text{Median} - 2 \times \text{Mean}$$
---
Problem-Solving Strategies
Problems involving an incorrect mean are common. The most efficient method is to work with the sum of observations, as it avoids dealing with individual data points.
- Calculate the Incorrect Sum using the given incorrect mean and $n$: $\text{Incorrect Sum} = n \times \text{Incorrect Mean}$.
- Calculate the Correct Sum: $\text{Correct Sum} = \text{Incorrect Sum} - \text{Wrong Value} + \text{Correct Value}$.
- Calculate the Correct Mean: $\text{Correct Mean} = \dfrac{\text{Correct Sum}}{n}$.
This structured approach minimizes calculation errors under exam pressure.
When a question asks for a statement that must be true, focus on the definitions.
- The mean's value depends on every single data point. Changing one value changes the mean.
- The median's value only depends on the middle one or two points. Its defining property is that it splits the data into two halves. A statement based on this property is robust and likely to be the correct answer, as it holds true for any distribution.
- The relationship between mean and median ($\text{Mean} > \text{Median}$ or $\text{Mean} < \text{Median}$) is not fixed; it depends entirely on the skewness of the data, which is usually unknown.
---
Common Mistakes
- ❌ Forgetting to sort data for the median: Calculating the median requires the data to be in ascending or descending order. Finding the middle value of an unsorted list will almost always yield an incorrect answer.
- ❌ Confusing the median's position with its value: The formula $\frac{n+1}{2}$ gives the position of the median in the sorted list, not the median itself.
- ❌ Assuming Mean = Median: Students sometimes incorrectly assume a symmetric distribution. Unless stated otherwise, you cannot assume any fixed relationship between the mean and the median.
---
Practice Questions
:::question type="MCQ" question="The mean age of a committee of 10 members is 48 years. A member, aged 62, retires and is replaced by a new member aged 34. What is the new mean age of the committee?" options=["45.2 years","45.8 years","46.0 years","46.2 years"] answer="45.2 years" hint="Use the concept of sums. Find the initial total age, adjust for the change, and then calculate the new mean." solution="
Step 1: Calculate the initial total age of the committee.
The number of members is $n = 10$ and the mean age is $48$ years.
$$\text{Initial Total Age} = 10 \times 48 = 480 \text{ years}$$
Step 2: Adjust the total age after the replacement.
A member aged 62 leaves, and a member aged 34 joins.
$$\text{New Total Age} = 480 - 62 + 34 = 452 \text{ years}$$
Step 3: Calculate the new mean age.
The number of members, $n$, remains 10.
$$\text{New Mean Age} = \frac{452}{10} = 45.2 \text{ years}$$
Result: The new mean age of the committee is 45.2 years.
Answer: \boxed{45.2 \text{ years}}
"
:::
:::question type="NAT" question="The mean of 5 numbers is 30. If one number is excluded, their mean becomes 28. What is the excluded number?" answer="38" hint="Calculate the sum of the original 5 numbers and the sum of the remaining 4 numbers. The difference between these sums is the excluded number." solution="
Step 1: Calculate the sum of the original 5 numbers.
Let the original sum be $S_5$. We are given $n = 5$ and mean $= 30$.
$$S_5 = 5 \times 30 = 150$$
Step 2: Calculate the sum of the remaining 4 numbers after one is excluded.
Let the new sum be $S_4$. Now $n = 4$ and mean $= 28$.
$$S_4 = 4 \times 28 = 112$$
Step 3: Find the excluded number by taking the difference of the sums.
$$\text{Excluded Number} = S_5 - S_4 = 150 - 112 = 38$$
Result: The excluded number is 38.
Answer: \boxed{38}
"
:::
:::question type="MSQ" question="Consider the dataset $\{2, 4, 5, 6, 6, 7\}$. Which of the following statements is/are correct?" options=["The mean of the dataset is 5.","The median of the dataset is 5.5.","The mode of the dataset is 6.","The dataset is bimodal."] answer="The mean of the dataset is 5.,The median of the dataset is 5.5.,The mode of the dataset is 6." hint="Calculate the mean, median, and mode separately and compare them with the given options." solution="
Let us analyze each measure for the dataset $\{2, 4, 5, 6, 6, 7\}$.
1. Calculation of the Mean:
The number of observations is $n = 6$.
$$\bar{x} = \frac{2 + 4 + 5 + 6 + 6 + 7}{6} = \frac{30}{6} = 5$$
So, the statement "The mean of the dataset is 5" is correct.
2. Calculation of the Median:
First, sort the data: $2, 4, 5, 6, 6, 7$.
The number of observations is $n = 6$ (even). The median is the average of the $3^{\text{rd}}$ and $4^{\text{th}}$ terms.
The 3rd term is 5 and the 4th term is 6.
$$\text{Median} = \frac{5 + 6}{2} = 5.5$$
So, the statement "The median of the dataset is 5.5" is correct.
3. Calculation of the Mode:
We examine the frequency of each number:
- 2: 1 time
- 4: 1 time
- 5: 1 time
- 6: 2 times
- 7: 1 time
So, the mode is 6. The statement "The mode of the dataset is 6" is correct.
Since there is only one mode, the dataset is unimodal, not bimodal. Thus, the statement "The dataset is bimodal" is incorrect.
Result: The correct options are A, B, and C.
Answer: \boxed{\text{The mean of the dataset is 5.,The median of the dataset is 5.5.,The mode of the dataset is 6.}}
"
:::
:::question type="MCQ" question="For a particular dataset of employee salaries, the mean salary is Rs. 60,000 and the median salary is Rs. 45,000. Which statement most accurately describes the distribution of salaries?" options=["The distribution is symmetric.","The distribution is negatively skewed.","The distribution is positively skewed.","There is not enough information to determine skewness."] answer="The distribution is positively skewed." hint="Compare the values of the mean and the median. The mean is pulled in the direction of the long tail (outliers)." solution="
Step 1: Identify the given measures of central tendency.
Mean = Rs. 60,000
Median = Rs. 45,000
Step 2: Compare the mean and the median.
We observe that Mean > Median.
Step 3: Relate this comparison to the concept of skewness.
In a skewed distribution, the mean is pulled towards the long tail of extreme values.
- If Mean > Median, the tail is on the right side (higher values), which corresponds to a positive skew. This is common in salary data, where a few high-earning individuals pull the mean upwards.
- If Mean < Median, the tail is on the left side, corresponding to a negative skew.
- If Mean $\approx$ Median, the distribution is approximately symmetric.
Since $\text{Mean} > \text{Median}$, the distribution is positively skewed.
Result: The distribution is positively skewed.
Answer: \boxed{\text{The distribution is positively skewed.}}
"
:::
---
Summary
- Mean (Arithmetic Average): It is calculated as $\bar{x} = \frac{\sum x_i}{n}$. Its most critical property for problem-solving is that the sum of observations is $\sum x_i = n\bar{x}$. The mean is sensitive to outliers.
- Median (Positional Average): It is the middle value of a sorted dataset. Its defining characteristic is that it divides the data into two equal halves. It is robust to outliers and is the preferred measure for skewed distributions.
- Mode (Most Frequent Value): It is the value with the highest frequency. It can be used for both numerical and categorical data.
- Mean, Median, and Skewness: The relationship between the mean and median indicates the skewness of the data. For positively skewed data, $\text{Mean} > \text{Median}$. For negatively skewed data, $\text{Mean} < \text{Median}$. For symmetric data, $\text{Mean} = \text{Median}$.
---
What's Next?
A thorough understanding of central tendency is the first step in describing a dataset. To gain a complete picture, we must also understand how the data is spread out.
- Measures of Dispersion (Variance, Standard Deviation): After locating the center of the data, the next logical step is to quantify its spread or variability. Measures like variance and standard deviation describe how tightly the data points cluster around the mean.
- Skewness and Kurtosis: These topics formally quantify the concepts of asymmetry and the "peakedness" of a distribution, building directly upon the relationships between mean, median, and mode that we have discussed here.
---
Now that you understand Measures of Central Tendency, let's explore Measures of Dispersion which builds on these concepts.
---
Part 2: Measures of Dispersion
Introduction
While measures of central tendency, such as the mean or median, provide a single value to represent the center of a dataset, they offer an incomplete picture of the data's characteristics. Consider two datasets with the same mean; one might have values clustered tightly around this mean, while the other might have values spread far and wide. To capture this "spread" or "variability," we employ measures of dispersion. These statistical tools are indispensable for understanding the heterogeneity and consistency within a set of observations.
Measures of dispersion quantify the extent to which data points in a distribution deviate from the average value. A small dispersion value indicates that the data points tend to be close to the mean (or another measure of center), implying high consistency. Conversely, a large dispersion value signifies that the data points are scattered over a wider range of values. In the context of data analysis, understanding dispersion is as crucial as knowing the central tendency, as it provides critical insights into the reliability and distribution of the data.
---
Key Concepts
1. Range
The simplest measure of dispersion is the range. It is defined as the difference between the maximum and minimum values in a dataset. While easy to compute, its reliance on only two extreme values makes it highly sensitive to outliers and often unrepresentative of the overall data spread.
$$R = X_{\max} - X_{\min}$$
Variables:
- $R$ = Range
- $X_{\max}$ = The maximum value in the dataset
- $X_{\min}$ = The minimum value in the dataset
When to use: For a quick, preliminary assessment of data spread, especially when the dataset is small and not expected to have extreme outliers.
---
2. Variance and Standard Deviation
Variance and standard deviation are the most common and statistically robust measures of dispersion. They quantify the average degree to which each point differs from the mean.
The variance is the average of the squared differences from the mean. A distinction is made between the population variance (when all members of a population are measured) and the sample variance (when a subset of the population is measured).
The standard deviation is simply the positive square root of the variance. Its primary advantage is that it is expressed in the same units as the original data, making it more interpretable than the variance.
For a population of $N$ observations with mean $\mu$:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \qquad \sigma = \sqrt{\sigma^2}$$
Variables:
- $\sigma^2$ = Population variance
- $\sigma$ = Population standard deviation
- $x_i$ = Each value in the population
- $\mu$ = The population mean
- $N$ = The total number of observations in the population
For a sample of $n$ observations with mean $\bar{x}$:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad s = \sqrt{s^2}$$
Variables:
- $s^2$ = Sample variance
- $s$ = Sample standard deviation
- $x_i$ = Each value in the sample
- $\bar{x}$ = The sample mean
- $n$ = The number of observations in the sample
When to use: Use sample formulas when dealing with a subset of data. The denominator $n - 1$ (Bessel's correction) provides an unbiased estimate of the population variance. For GATE, problems will typically involve a sample of data unless explicitly stated otherwise.
Worked Example:
Problem: Find the sample standard deviation of the following dataset: .
Solution:
Step 1: Calculate the sample mean ($\bar{x}$).
Step 2: Calculate the squared differences from the mean, $(x_i - \bar{x})^2$, and sum them.
Step 3: Calculate the sample variance ($s^2$) by dividing this sum by $n - 1$.
Step 4: Calculate the sample standard deviation ($s$) by taking the square root of the variance.
Answer: The sample standard deviation is the square root of the sample variance obtained in Step 3.
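These steps can be checked against Python's `statistics` module. A minimal sketch with a hypothetical sample of our own (the example's original data is not shown): `statistics.variance` and `statistics.stdev` use the $n-1$ denominator, while `pvariance`/`pstdev` use $N$.

```python
import statistics

data = [4, 8, 6, 5, 3, 7]                      # hypothetical sample
n = len(data)
mean = sum(data) / n                           # Step 1: sample mean
sq_dev = sum((x - mean) ** 2 for x in data)    # Step 2: sum of squared deviations
var_s = sq_dev / (n - 1)                       # Step 3: Bessel's correction
std_s = var_s ** 0.5                           # Step 4: square root

print(round(var_s, 4) == round(statistics.variance(data), 4))  # → True
print(round(statistics.stdev(data), 4))                        # → 1.8708
```

Swapping `statistics.variance` for `statistics.pvariance` reproduces the population formulas instead.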
---
3. Coefficient of Variation (CV)
The standard deviation is an absolute measure of dispersion. To compare the variability of two or more datasets with different means or different units of measurement, we require a relative measure. The Coefficient of Variation serves this purpose.
The Coefficient of Variation is a dimensionless measure of relative variability, expressed as the ratio of the standard deviation to the mean. It is often presented as a percentage:
$$\text{CV} = \frac{\sigma}{\mu} \times 100\% \;\; \text{(population)} \qquad \text{CV} = \frac{s}{\bar{x}} \times 100\% \;\; \text{(sample)}$$
Variables:
- $\sigma$ or $s$ = Standard deviation
- $\mu$ or $\bar{x}$ = Mean
When to use: To compare the consistency or variability of two different datasets. A lower CV implies greater consistency. For instance, comparing the variability in the price of rice (in Rupees) and the weight of apples (in grams).
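A short Python sketch of a CV comparison, with hypothetical numbers of our own choosing. The point is that the dataset with the larger standard deviation is not necessarily the less consistent one once the means differ:

```python
def coefficient_of_variation(std_dev, mean):
    """Relative variability as a percentage: (std_dev / mean) * 100."""
    return std_dev / mean * 100

# Hypothetical: series A has the larger absolute spread but the larger mean too.
cv_a = coefficient_of_variation(4, 80)   # → 5.0 %
cv_b = coefficient_of_variation(3, 20)   # → 15.0 %
print(cv_a < cv_b)  # → True: series A is relatively more consistent
```

Note the CV is meaningful only for ratio-scale data with a nonzero (and conventionally positive) mean.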
---
Problem-Solving Strategies
Calculating deviations from the mean can be tedious, especially if the mean is not an integer. An algebraically equivalent formula, often called the computational or shortcut formula, simplifies the calculation.
For a sample, the variance can be calculated as:
$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$$
This form avoids calculating each individual deviation, reducing computational steps and potential rounding errors.
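The algebraic equivalence of the two forms is easy to verify numerically. A small Python check on an arbitrary dataset of our own:

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]   # arbitrary illustrative sample
n = len(data)
mean = sum(data) / n

# Definitional form: average squared deviation from the mean (n - 1 denominator).
definitional = sum((x - mean) ** 2 for x in data) / (n - 1)
# Shortcut (computational) form: sum of squares minus n * mean^2.
shortcut = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

print(abs(definitional - shortcut) < 1e-12)  # → True
```

In exact arithmetic the two are identical; in floating point the shortcut form can lose precision when the mean is large relative to the spread, so library implementations typically prefer the definitional form.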
---
Common Mistakes
- Confusing Population and Sample Formulas: ❌ Using $n$ in the denominator when calculating variance from a sample. This underestimates the population variance. ✅ Always use $n - 1$ for sample variance unless the problem explicitly concerns an entire population.
- Forgetting the Square Root: ❌ Reporting the variance when the standard deviation is asked. The units will be squared (e.g., $\text{cm}^2$ instead of $\text{cm}$), which is a clear indicator of this error. ✅ Always take the final square root of the variance to find the standard deviation.
- Comparing Standard Deviations Directly: ❌ Concluding that a dataset with a standard deviation of 10 is more variable than one with a standard deviation of 5, without considering their means. ✅ Use the Coefficient of Variation (CV) to compare relative variability between datasets with different scales or units.
---
Practice Questions
:::question type="NAT" question="The scores of a student in 5 tests were 12, 15, 13, 16, 14. Calculate the population variance of these scores." answer="2" hint="First, calculate the mean of the dataset. Then, find the sum of the squared differences from the mean and divide by the total number of observations, N." solution="
Step 1: Calculate the population mean ($\mu$).
$$\mu = \frac{12 + 15 + 13 + 16 + 14}{5} = \frac{70}{5} = 14$$
Step 2: Calculate the sum of squared differences from the mean.
$$(12-14)^2 + (15-14)^2 + (13-14)^2 + (16-14)^2 + (14-14)^2 = 4 + 1 + 1 + 4 + 0 = 10$$
Step 3: Calculate the population variance ($\sigma^2$) by dividing by $N = 5$.
$$\sigma^2 = \frac{10}{5} = 2$$
Result:
The population variance is \boxed{2}.
"
:::
:::question type="MCQ" question="Two batsmen, A and B, have the following scores in a series of matches:
Batsman A: Mean = 50 runs, Standard Deviation = 10 runs
Batsman B: Mean = 30 runs, Standard Deviation = 9 runs
Who is the more consistent batsman?" options=["Batsman A","Batsman B","Both are equally consistent","Cannot be determined"] answer="Batsman A" hint="Consistency is determined by the relative measure of dispersion, the Coefficient of Variation. A lower CV indicates higher consistency." solution="
Step 1: Calculate the Coefficient of Variation (CV) for Batsman A.
$$\text{CV}_A = \frac{10}{50} \times 100\% = 20\%$$
Step 2: Calculate the Coefficient of Variation (CV) for Batsman B.
$$\text{CV}_B = \frac{9}{30} \times 100\% = 30\%$$
Step 3: Compare the CVs.
Since $\text{CV}_A < \text{CV}_B$, Batsman A has lower relative variability and is therefore more consistent.
Result:
Batsman A is the more consistent batsman.
"
:::
:::question type="MSQ" question="Let a dataset have a sample variance of $s^2$. If every observation in the dataset is increased by a constant $5$, and then every resulting observation is multiplied by a constant $2$, which of the following statements about the new dataset are true?" options=["The new mean is 2 times the old mean plus 10.","The new variance is 4 times the old variance.","The new standard deviation is 2 times the old standard deviation.","The new variance is 2 times the old variance plus 5."] answer="The new mean is 2 times the old mean plus 10.,The new variance is 4 times the old variance.,The new standard deviation is 2 times the old standard deviation." hint="Analyze the effect of shifting (adding a constant) and scaling (multiplying by a constant) on measures of central tendency and dispersion. Adding a constant shifts the mean but does not change the spread (variance/SD). Multiplying by a constant scales both the mean and the standard deviation." solution="
Let the original data points be $x_i$. The original mean is $\bar{x}$ and variance is $s^2$.
The new data points are $y_i = 2(x_i + 5) = 2x_i + 10$.
1. Effect on Mean:
The new mean is:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} (2x_i + 10) = 2\bar{x} + 10$$
So, the statement "The new mean is 2 times the old mean plus 10" is correct.
2. Effect on Variance:
The new variance is based on the deviation $y_i - \bar{y} = 2(x_i - \bar{x})$.
The new variance is:
$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left[2(x_i - \bar{x})\right]^2 = 4s^2$$
So, the statement "The new variance is 4 times the old variance" is correct. The statement "The new variance is 2 times the old variance plus 5" is incorrect, because adding a constant does not change the variance at all.
3. Effect on Standard Deviation:
The new standard deviation is the square root of the new variance:
$$s_y = \sqrt{4s^2} = 2s$$
So, the statement "The new standard deviation is 2 times the old standard deviation" is correct.
"
:::
---
Summary
- Purpose of Dispersion: Measures of dispersion (Range, Variance, Standard Deviation) quantify the spread or variability in a dataset, providing a necessary complement to measures of central tendency.
- Variance vs. Standard Deviation: Variance is the average of squared deviations from the mean. Standard Deviation is its square root, returning the measure of spread to the original units of the data, which aids in interpretation.
- Sample vs. Population: For GATE problems, assume you are working with a sample unless specified otherwise. Critically, use the denominator $n - 1$ when calculating sample variance ($s^2$).
- Relative Comparison: When comparing the variability of datasets with different means or units, always use the Coefficient of Variation (CV). A lower CV signifies greater consistency.
---
What's Next?
This topic connects to:
- Probability Distributions: The standard deviation ($\sigma$) is a key parameter that defines the shape of many important distributions, such as the Normal Distribution. A larger $\sigma$ results in a wider, flatter curve.
- Inferential Statistics: Sample variance ($s^2$) is a fundamental statistic used to estimate the unknown population variance ($\sigma^2$) and is crucial in hypothesis testing and constructing confidence intervals.
Mastering these connections is essential for a comprehensive understanding of data analysis for the GATE examination.
---
Now that you understand Measures of Dispersion, let's explore Probability which builds on these concepts.
---
Part 3: Probability
Introduction
Probability theory provides the mathematical framework for quantifying uncertainty. In the context of engineering and data analysis, it is the bedrock upon which statistical inference, machine learning models, and risk assessment are built. An understanding of probability allows us to model random phenomena, make predictions in the face of incomplete information, and draw robust conclusions from data. For the GATE examination, a firm grasp of foundational probability concepts is not merely advantageous; it is essential for tackling a wide range of quantitative problems.
We begin our study by formalizing the intuitive notions of chance and likelihood. The principles we establish will enable the analysis of discrete and continuous random events, from simple coin tosses to complex system failures. This chapter focuses on the core axioms, rules of combination (such as addition and multiplication), and the pivotal concepts of conditional probability and independence, which together form the toolkit for solving sophisticated problems.
Let $S$ be a finite sample space of all possible outcomes of a random experiment. Let $E$ be an event, which is a subset of the sample space $S$. The probability of the event $E$, denoted as $P(E)$, is a real number satisfying the following axioms:
- $P(E) \geq 0$ for any event $E$.
- $P(S) = 1$, where $S$ is the certain event (the entire sample space).
- For any two mutually exclusive events $A$ and $B$ (i.e., $A \cap B = \emptyset$), the probability of their union is $P(A \cup B) = P(A) + P(B)$.
For an experiment with equally likely outcomes, the probability of an event $E$ is given by the ratio of the number of favorable outcomes to the total number of possible outcomes:
$$P(E) = \frac{n(E)}{n(S)} = \frac{\text{Number of favorable outcomes}}{\text{Total number of possible outcomes}}$$
---
Key Concepts
1. Complementary Events
Often, calculating the probability of an event occurring is complex, whereas calculating the probability of it not occurring is straightforward. This is the principle of complementary events. The complement of an event $A$, denoted $A'$ or $A^c$, consists of all outcomes in the sample space that are not in $A$.
The relationship between an event and its complement is fundamental and provides a powerful problem-solving technique, especially for questions involving phrases like "at least one":
$$P(A) + P(A') = 1 \quad \Longleftrightarrow \quad P(A) = 1 - P(A')$$
Variables:
- $P(A)$ = Probability of event A occurring.
- $P(A')$ = Probability of event A not occurring.
When to use: This rule is exceptionally useful when calculating the probability of "at least one" success, "at least two" occurrences, etc. It is often simpler to calculate the probability of the complement (e.g., "zero successes" or "all distinct items") and subtract it from 1.
Worked Example:
Problem: Four distinct letters are chosen from the set {A, B, C, D, E, F, G} and arranged in a row. What is the probability that at least two of the chosen letters are vowels? (Vowels are A, E).
Solution:
Let $E$ be the event that at least two chosen letters are vowels. The complement, $E'$, is the event that either zero vowels or exactly one vowel is chosen. It is simpler to calculate $P(E')$.
The set contains 2 vowels {A, E} and 5 consonants {B, C, D, F, G}. We are choosing 4 letters. Since every arrangement of a chosen set is equally likely, the probability depends only on which letters are chosen, so we can work with combinations.
Step 1: Calculate the probability of choosing zero vowels.
This means all 4 letters must be consonants. The number of ways to choose 4 consonants from 5 is $\binom{5}{4} = 5$. The total number of ways to choose 4 letters from 7 is $\binom{7}{4} = 35$.
$$P(\text{0 vowels}) = \frac{5}{35} = \frac{1}{7}$$
Step 2: Calculate the probability of choosing exactly one vowel.
This means choosing 1 vowel from 2, and 3 consonants from 5. The number of ways is $\binom{2}{1}\binom{5}{3} = 2 \times 10 = 20$.
$$P(\text{1 vowel}) = \frac{20}{35} = \frac{4}{7}$$
Step 3: Calculate the probability of the complementary event $E'$.
Since the events "0 vowels" and "1 vowel" are mutually exclusive, we can add their probabilities.
$$P(E') = \frac{1}{7} + \frac{4}{7} = \frac{5}{7}$$
Step 4: Use the rule of complements to find $P(E)$.
$$P(E) = 1 - P(E') = 1 - \frac{5}{7} = \frac{2}{7}$$
Answer: The probability that at least two of the chosen letters are vowels is \boxed{\frac{2}{7}}.
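The counting argument above can be verified by brute force: enumerate every 4-letter selection from the set and count those containing at least two vowels. A small Python check using `itertools.combinations` and exact fractions:

```python
from itertools import combinations
from fractions import Fraction

letters = "ABCDEFG"
vowels = {"A", "E"}

picks = list(combinations(letters, 4))                 # all C(7,4) = 35 selections
favorable = sum(1 for p in picks
                if sum(c in vowels for c in p) >= 2)   # at least two vowels

print(Fraction(favorable, len(picks)))  # → 2/7
```

With only two vowels available, "at least two vowels" forces both A and E to be chosen, giving $\binom{5}{2} = 10$ favorable selections out of 35.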
---
2. Independent Events and the Multiplication Rule
Two events are considered independent if the occurrence of one does not affect the probability of the occurrence of the other. This concept is central to problems involving multiple trials, such as rolling a die several times or drawing items from different populations.
Two events, $A$ and $B$, are independent if and only if the probability of their intersection (both events occurring) is the product of their individual probabilities:
$$P(A \cap B) = P(A) \times P(B)$$
This rule extends to any number of mutually independent events: $P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2)\cdots P(A_n)$.
Worked Example:
Problem: A system has two components, C1 and C2, that operate independently. The probability of C1 failing is 0.1, and the probability of C2 failing is 0.2. The system functions if at least one of the components is operational. What is the probability that the system functions?
Solution:
Let $F_1$ be the event that C1 fails, and $F_2$ be the event that C2 fails. We are given $P(F_1) = 0.1$ and $P(F_2) = 0.2$.
The system fails only if both components fail. Let $F$ be the event that the system fails.
Since the components operate independently, we can use the multiplication rule.
Step 1: Calculate the probability that both components fail.
$$P(F) = P(F_1 \cap F_2) = P(F_1) \times P(F_2) = 0.1 \times 0.2 = 0.02$$
Step 2: The event that the system functions, $F'$, is the complement of the event that the system fails.
$$P(F') = 1 - P(F) = 1 - 0.02 = 0.98$$
Answer: The probability that the system functions is \boxed{0.98}.
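The same reliability computation in Python, as a sketch of the "multiply failure probabilities, then take the complement" pattern for independent parallel components:

```python
# Independent failure probabilities from the worked example.
p_fail_1, p_fail_2 = 0.1, 0.2

p_both_fail = p_fail_1 * p_fail_2      # multiplication rule (independence)
p_system_works = 1 - p_both_fail       # complement: at least one works

print(round(p_system_works, 2))  # → 0.98
```

The same pattern generalizes to $k$ independent components in parallel: the system reliability is $1 - \prod_i P(F_i)$.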
---
3. Conditional Probability and the Law of Total Probability
Conditional probability addresses the likelihood of an event occurring given that another event has already occurred. This leads to one of the most powerful theorems in probability, the Law of Total Probability, which allows us to find the probability of an event by considering different scenarios or cases.
The conditional probability of $A$ given $B$ is defined as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{provided } P(B) > 0$$
Variables:
- $P(A \mid B)$ = Probability of event A occurring, given that event B has occurred.
- $P(A \cap B)$ = Probability of both A and B occurring.
- $P(B)$ = Probability of event B occurring.
The Law of Total Probability is applied when the sample space can be partitioned into a set of mutually exclusive and exhaustive events, $B_1, B_2, \ldots, B_n$. We can then find the probability of another event $A$ by summing its conditional probabilities over the partition:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$
Variables:
- $A$ is an event.
- $B_1, B_2, \ldots, B_n$ is a partition of the sample space (mutually exclusive and exhaustive).
When to use: Use this law when the probability of an event depends on which of several preceding events has occurred. This is common in multi-stage experiments.
Worked Example:
Problem: A factory has two machines, M1 and M2. M1 produces 60% of the daily output, and M2 produces 40%. The defect rate for M1 is 3%, and for M2 is 5%. If an item is selected at random from the day's production, what is the probability that it is defective?
Solution:
Let $D$ be the event that the selected item is defective.
Let $M_1$ be the event that the item was produced by machine M1.
Let $M_2$ be the event that the item was produced by machine M2.
The events $M_1$ and $M_2$ form a partition of the sample space.
Step 1: Identify the given probabilities.
$P(M_1) = 0.60$ and $P(M_2) = 0.40$.
The defect rates are conditional probabilities:
Probability of a defect, given it's from M1: $P(D \mid M_1) = 0.03$
Probability of a defect, given it's from M2: $P(D \mid M_2) = 0.05$
Step 2: Apply the Law of Total Probability.
$$P(D) = P(D \mid M_1)\,P(M_1) + P(D \mid M_2)\,P(M_2)$$
Step 3: Substitute the values and compute.
$$P(D) = (0.03)(0.60) + (0.05)(0.40) = 0.018 + 0.020 = 0.038$$
Answer: The probability that a randomly selected item is defective is $0.038$.
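The same weighted sum can be checked in Python; encoding each machine as a (share, defect-rate) pair is just one illustrative representation:

```python
# Each machine contributes (P(machine), P(defective | machine)).
machines = [
    (0.60, 0.03),  # M1: 60% of output, 3% defect rate
    (0.40, 0.05),  # M2: 40% of output, 5% defect rate
]

# Law of Total Probability: P(D) = sum of P(D | M_i) * P(M_i) over the partition.
p_defective = sum(share * rate for share, rate in machines)
print(round(p_defective, 3))  # 0.038
```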
---
4. Binomial Probability Distribution
The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials. A Bernoulli trial is a random experiment with exactly two possible outcomes: "success" and "failure".
The probability of exactly $k$ successes in $n$ trials is
$$P(X = k) = \binom{n}{k} p^k q^{n-k}$$
Variables:
- $n$ = Total number of independent trials.
- $k$ = The exact number of successes desired.
- $p$ = The probability of success on a single trial.
- $q$ = The probability of failure on a single trial (often denoted as $1 - p$).
- $\binom{n}{k}$ is the binomial coefficient, representing the number of ways to choose $k$ successes from $n$ trials.
When to use: This formula is applicable when an experiment satisfies all the following conditions:
- The number of trials, $n$, is fixed.
- Each trial is independent of the others.
- Each trial has only two possible outcomes (success/failure).
- The probability of success, $p$, is constant for each trial.
Worked Example:
Problem: A student takes a 5-question multiple-choice quiz. Each question has 4 options, only one of which is correct. The student guesses randomly on every question. What is the probability that the student gets exactly 3 questions correct?
Solution:
This is a binomial experiment.
- Number of trials, $n = 5$.
- Number of successes, $k = 3$.
- Probability of success (guessing correctly), $p = \frac{1}{4}$.
- Probability of failure (guessing incorrectly), $q = 1 - \frac{1}{4} = \frac{3}{4}$.
Step 1: State the binomial probability formula.
$$P(X = k) = \binom{n}{k} p^k q^{n-k}$$
Step 2: Substitute the given values into the formula.
$$P(X = 3) = \binom{5}{3} \left(\frac{1}{4}\right)^3 \left(\frac{3}{4}\right)^2$$
Step 3: Calculate the binomial coefficient.
$$\binom{5}{3} = \frac{5!}{3!\,2!} = 10$$
Step 4: Compute the final probability.
$$P(X = 3) = 10 \times \frac{1}{64} \times \frac{9}{16} = \frac{90}{1024} = \frac{45}{512} \approx 0.088$$
Answer: The probability of getting exactly 3 questions correct is $\frac{45}{512} \approx 0.088$.
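The arithmetic can be reproduced with Python's standard library; `binomial_pmf` is an illustrative helper we define here, not a library function:

```python
from math import comb

def binomial_pmf(n: int, k: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 5 questions, exactly 3 correct guesses, success probability 1/4 per question.
prob = binomial_pmf(5, 3, 1 / 4)
print(round(prob, 4))  # 0.0879, i.e. 45/512
```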
---
Problem-Solving Strategies
When a question asks for the probability of "at least one" or "at least two" of something, your first instinct should be to consider the complement.
- Identify the event $E$ (e.g., "at least two identical items").
- Define the complement $E^c$ (e.g., "all items are distinct" or "zero or one identical item").
- Calculate $P(E^c)$, which is usually much simpler.
- The final answer is $P(E) = 1 - P(E^c)$. This almost always saves significant calculation time and reduces errors.
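As a further illustration of the complement trick, consider the probability of rolling at least one six in four rolls of a fair die (the numbers here are ours, not from the text):

```python
# P(at least one six in 4 rolls) = 1 - P(no six in any roll).
p_no_six_one_roll = 5 / 6
p_no_six_four_rolls = p_no_six_one_roll ** 4   # the four rolls are independent
p_at_least_one_six = 1 - p_no_six_four_rolls
print(round(p_at_least_one_six, 4))  # 0.5177
```

Enumerating the favorable outcomes directly would mean counting sequences with one, two, three, or four sixes; the complement collapses all of that into a single product.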
For multi-stage problems where the outcome of the second stage depends on the first, use the Law of Total Probability.
- Identify the final event of interest (e.g., "drawing an orange ball on the second draw").
- Identify the mutually exclusive outcomes of the first stage that affect the second stage (e.g., "drew a green ball first" vs. "drew an orange ball first").
- Calculate the probability of the final event conditioned on each of the first-stage outcomes.
- Combine these using the Law of Total Probability: $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$. A simple tree diagram can help visualize the paths.
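A minimal sketch of this tree-diagram bookkeeping, using an assumed bag of 3 green and 2 orange balls (the counts are illustrative, not from the text):

```python
# Assumed setup: 3 green and 2 orange balls, two draws without replacement.
# Question: what is P(orange on the second draw)?
greens, oranges = 3, 2
total = greens + oranges

# First-stage branches partition the sample space.
p_green_first = greens / total                       # 3/5
p_orange_first = oranges / total                     # 2/5

# Second-stage probabilities, conditioned on what the first draw removed.
p_orange_given_green = oranges / (total - 1)         # 2/4
p_orange_given_orange = (oranges - 1) / (total - 1)  # 1/4

# Law of Total Probability: weight each branch by its first-stage probability.
p_orange_second = (p_orange_given_green * p_green_first
                   + p_orange_given_orange * p_orange_first)
print(round(p_orange_second, 4))  # 0.4
```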
---
Common Mistakes
- ❌ Confusing Independence and Mutual Exclusivity: Independent events can occur together ($P(A \cap B) = P(A)\,P(B)$, which is positive whenever $P(A)$ and $P(B)$ are), while mutually exclusive events cannot ($P(A \cap B) = 0$). For example, rolling a '1' and rolling a '2' on a single die roll are mutually exclusive, not independent. Rolling a '1' on a red die and a '2' on a blue die are independent.
- ❌ Forgetting the Binomial Coefficient: When calculating binomial probability, a common error is to compute only $p^k q^{n-k}$. This ignores the fact that the $k$ successes can occur in any of $\binom{n}{k}$ arrangements across the $n$ trials.
- ❌ Incorrect Sample Space in 'With/Without Replacement' problems: The size of the sample space (and the number of favorable items) changes after each draw in problems without replacement.
---
Practice Questions
:::question type="MCQ" question="An unbiased coin is tossed 6 times. What is the probability of getting at least one head and at least one tail?" options=["1/64","1/32","31/32","63/64"] answer="C" hint="Consider the complementary events. The only outcomes that do not satisfy the condition are all heads or all tails." solution="
Step 1: Define the event and its complement.
Let $E$ be the event of getting at least one head and at least one tail.
The total number of outcomes is $2^6 = 64$.
The complement event $E^c$ is that we get either all heads (HHHHHH) or all tails (TTTTTT).
Step 2: Calculate the probability of the complementary event.
There is only 1 way to get all heads, so $P(\text{all heads}) = \frac{1}{64}$.
There is only 1 way to get all tails, so $P(\text{all tails}) = \frac{1}{64}$.
These two events are mutually exclusive, so $P(E^c) = \frac{1}{64} + \frac{1}{64} = \frac{2}{64} = \frac{1}{32}$.
Step 3: Calculate the probability of the original event $E$.
$$P(E) = 1 - P(E^c) = 1 - \frac{1}{32} = \frac{31}{32}$$
Result:
The probability is $\frac{31}{32}$.
"
:::
:::question type="NAT" question="A box contains 4 red, 3 green, and 5 blue balls. Three balls are drawn in succession without replacement. The probability that the first is red, the second is green, and the third is blue is $\frac{k}{440}$. Calculate the value of $k$." answer="20" hint="This is a sequence of dependent events. Calculate the probability of each step and multiply them." solution="
Step 1: Define the events and initial state.
Let $R$, $G$, and $B$ be the events of drawing a red, green, and blue ball in the 1st, 2nd, and 3rd draws respectively.
Total balls = $4 + 3 + 5 = 12$.
Step 2: Calculate the probability of the first event.
$$P(R) = \frac{4}{12} = \frac{1}{3}$$
Step 3: Calculate the conditional probability of the second event.
After drawing one red ball, there are 11 balls left, of which 3 are green.
$$P(G \mid R) = \frac{3}{11}$$
Step 4: Calculate the conditional probability of the third event.
After drawing one red and one green ball, there are 10 balls left, of which 5 are blue.
$$P(B \mid R \cap G) = \frac{5}{10} = \frac{1}{2}$$
Step 5: Calculate the probability of the sequence, $P(R \cap G \cap B)$.
$$P(R \cap G \cap B) = P(R)\,P(G \mid R)\,P(B \mid R \cap G) = \frac{1}{3} \times \frac{3}{11} \times \frac{1}{2} = \frac{1}{22}$$
Step 6: Calculate the final value.
Since $\frac{1}{22} = \frac{20}{440}$, the required value is 20.
Result: The value is 20.
"
:::
:::question type="MSQ" question="A fair six-sided die is rolled twice. Let A be the event that the first roll is a prime number (2, 3, 5). Let B be the event that the sum of the two rolls is 7. Let C be the event that the second roll is an even number. Which of the following statements is/are true?" options=["A and B are independent events.","A and C are independent events.","B and C are independent events.","A and B are mutually exclusive events."] answer="A and B are independent events.,A and C are independent events.,B and C are independent events." hint="Calculate the probabilities P(A), P(B), P(C), and the probabilities of their intersections (A∩B, A∩C, B∩C). Check the condition for independence: P(X∩Y) = P(X)P(Y)." solution="
Total outcomes in the sample space = $6 \times 6 = 36$.
1. Analyze Probabilities of Individual Events:
- Event A (first roll is prime {2, 3, 5}): $P(A) = \frac{3 \times 6}{36} = \frac{1}{2}$.
- Event B (sum is 7): Favorable pairs are {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}. So, $P(B) = \frac{6}{36} = \frac{1}{6}$.
- Event C (second roll is even {2, 4, 6}): $P(C) = \frac{6 \times 3}{36} = \frac{1}{2}$.
2. Check Independence for Pair (A, B):
- Event A ∩ B (first roll is prime AND sum is 7): The pairs are {(2,5), (3,4), (5,2)}. There are 3 such outcomes. $P(A \cap B) = \frac{3}{36} = \frac{1}{12}$.
- Check: $P(A)\,P(B) = \frac{1}{2} \times \frac{1}{6} = \frac{1}{12}$.
- Since $P(A \cap B) = P(A)\,P(B)$, events A and B are independent. Option A is correct.
3. Check Independence for Pair (A, C):
- Event A ∩ C (first is prime AND second is even): The number of favorable outcomes is $3 \times 3 = 9$. So, $P(A \cap C) = \frac{9}{36} = \frac{1}{4}$.
- Check: $P(A)\,P(C) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$.
- Since $P(A \cap C) = P(A)\,P(C)$, events A and C are independent. Option B is correct.
4. Check Independence for Pair (B, C):
- Event B ∩ C (sum is 7 AND second roll is even): The pairs are {(1,6), (3,4), (5,2)}. There are 3 such outcomes. So, $P(B \cap C) = \frac{3}{36} = \frac{1}{12}$.
- Check: $P(B)\,P(C) = \frac{1}{6} \times \frac{1}{2} = \frac{1}{12}$.
- Since $P(B \cap C) = P(B)\,P(C)$, events B and C are independent. Option C is correct.
5. Check Mutual Exclusivity for (A, B):
- For A and B to be mutually exclusive, their intersection must be empty. However, we found $P(A \cap B) = \frac{1}{12} \neq 0$, so it is not empty. Thus, they are not mutually exclusive. Option D is incorrect.
"
:::
---
Chapter Summary
Having explored the fundamental concepts of elementary statistics and probability, we can distill our discussion into several essential principles critical for the GATE examination.
- Measures of Central Tendency: We have established that the Mean ($\bar{x}$), Median, and Mode are single values that attempt to describe a set of data by identifying its central position. The arithmetic mean is the sum of all values divided by their count, the median is the middle value in an ordered set, and the mode is the most frequently occurring value. The choice of measure is crucial; the median, for instance, is less sensitive to outliers than the mean.
- Measures of Dispersion: To understand the spread or variability of data, we introduced the concepts of Variance ($\sigma^2$) and Standard Deviation ($\sigma$). Variance represents the average of the squared differences from the Mean. The Standard Deviation, its square root, provides a measure of dispersion in the same units as the data, making it more interpretable.
- Fundamental Axioms of Probability: The probability of an event $A$, denoted $P(A)$, is a value between 0 and 1, inclusive. $P(A) = 0$ signifies an impossible event, while $P(A) = 1$ signifies a certain event. For any sample space $S$, the sum of probabilities of all possible elementary outcomes is 1.
- Rules of Probability: Our study has shown that the probability of the union of two events, $A$ and $B$, is given by the Addition Rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. For mutually exclusive events, where $P(A \cap B) = 0$, this simplifies to $P(A \cup B) = P(A) + P(B)$.
- Independence and Conditional Probability: Two events are independent if the occurrence of one does not affect the probability of the other, such that $P(A \cap B) = P(A)\,P(B)$. In contrast, conditional probability, $P(A \mid B)$, measures the probability of event $A$ occurring given that event $B$ has already occurred. It is defined as $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$.
- Bayes' Theorem: We concluded with a powerful tool for updating beliefs in light of new evidence. Bayes' Theorem relates the conditional and marginal probabilities of two random events, providing a formal method to calculate a posterior probability: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$.
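The summary formulas for mean, variance, and standard deviation can be sanity-checked numerically with Python's `statistics` module; the dataset below is an arbitrary example:

```python
import statistics

# Arbitrary sample dataset for a numeric check of the summary formulas.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # arithmetic mean: sum / count
var = statistics.pvariance(data)  # population variance: mean squared deviation
std = statistics.pstdev(data)     # population standard deviation: sqrt(variance)

print(mean, var, std)  # mean = 5, variance = 4, std = 2.0
```

Note the use of `pvariance`/`pstdev` (population, divide by $n$) rather than `variance`/`stdev` (sample, divide by $n-1$), matching the population formulas used in this chapter.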
---
Chapter Review Questions
:::question type="MCQ" question="A dataset of student marks is given by the set . A mark is selected at random from this set. What is the probability that the selected mark is greater than the arithmetic mean of the set?" options=["1/4","1/2","3/4","1"] answer="B" hint="First, calculate the arithmetic mean of the dataset. Then, determine the number of data points that are strictly greater than this mean and express this as a fraction of the total number of data points." solution="
Step 1: Calculate the arithmetic mean ($\bar{x}$) of the dataset.
The dataset is .
The number of data points is $n = 4$.
The sum of the marks is:
The arithmetic mean is:
Step 2: Identify the marks in the set that are greater than the mean.
We need to find the marks $x$ such that $x > \bar{x}$.
The marks satisfying this condition are . There are 2 such marks.
Step 3: Calculate the probability.
The number of favorable outcomes (marks greater than the mean) is 2.
The total number of possible outcomes (total marks in the set) is 4.
The probability is given by:
$$P = \frac{2}{4} = \frac{1}{2}$$
Therefore, the correct option is B.
"
:::
:::question type="NAT" question="The scores of five students in a test are 4, 6, 8, 10, and 12. Calculate the population variance of these scores." answer="8" hint="First, find the mean of the data. Then, for each data point, find the square of its difference from the mean. The variance is the average of these squared differences." solution="
Step 1: Calculate the mean ($\bar{x}$) of the scores.
The scores are $4, 6, 8, 10, 12$. The number of scores, $n$, is 5.
$$\bar{x} = \frac{4 + 6 + 8 + 10 + 12}{5} = \frac{40}{5} = 8$$
Step 2: Calculate the squared difference from the mean for each score.
The formula for population variance ($\sigma^2$) is $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
We calculate $(x_i - \bar{x})^2$ for each score $x_i$:
$$(4-8)^2 = 16,\quad (6-8)^2 = 4,\quad (8-8)^2 = 0,\quad (10-8)^2 = 4,\quad (12-8)^2 = 16$$
Step 3: Sum the squared differences and calculate the variance.
The sum of the squared differences is:
$$16 + 4 + 0 + 4 + 16 = 40$$
Now, we divide by the number of scores, $n = 5$:
$$\sigma^2 = \frac{40}{5} = 8$$
The population variance is 8.
"
:::
:::question type="MCQ" question="A dataset has a mean of 50 and a standard deviation of 10. If 5 is subtracted from every data point in the set, what will be the new mean and new standard deviation?" options=["New Mean = 50, New Standard Deviation = 5","New Mean = 45, New Standard Deviation = 10","New Mean = 45, New Standard Deviation = 5","New Mean = 50, New Standard Deviation = 10"] answer="B" hint="Consider how measures of central tendency (a measure of location) and measures of dispersion (a measure of spread) are affected by a uniform shift in all data points." solution="
Let the original dataset be $x_1, x_2, \ldots, x_n$.
The original mean is $\bar{x} = 50$.
The original standard deviation is $\sigma = 10$.
A constant $c = 5$ is subtracted from every data point. The new dataset is $x_1 - 5, x_2 - 5, \ldots, x_n - 5$.
Step 1: Calculate the new mean ($\bar{y}$).
The new mean is the average of the new data points:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n}(x_i - 5) = \frac{1}{n}\sum_{i=1}^{n} x_i - 5 = \bar{x} - 5$$
Since $\bar{x} = 50$, we have:
$$\bar{y} = 50 - 5 = 45$$
Adding or subtracting a constant from every data point shifts the mean by that same constant.
Step 2: Calculate the new standard deviation ($\sigma_y$).
The new standard deviation is calculated based on the dispersion of the new data points around the new mean:
$$\sigma_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left((x_i - 5) - \bar{y}\right)^2}$$
Substitute $\bar{y} = \bar{x} - 5$:
$$\sigma_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left((x_i - 5) - (\bar{x} - 5)\right)^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
This is the formula for the original standard deviation, $\sigma$, so $\sigma_y = \sigma = 10$.
Subtracting a constant from every data point shifts the entire distribution but does not change its spread or dispersion. The distances between the points remain the same, so the standard deviation is unchanged.
Thus, the new mean is 45 and the new standard deviation is 10.
"
:::
:::question type="NAT" question="Two machines, M1 and M2, produce bolts. M1 produces 60% of the bolts and M2 produces the remaining 40%. 5% of bolts from M1 are defective, while 2% from M2 are defective. If a bolt is chosen at random and found to be defective, what is the probability that it was produced by machine M1? (Answer up to two decimal places)." answer="0.79" hint="This is an application of Bayes' theorem. Let D be the event that a bolt is defective. You need to find P(M1|D)." solution="
Let $M_1$ be the event that a bolt is produced by machine M1.
Let $M_2$ be the event that a bolt is produced by machine M2.
Let $D$ be the event that a chosen bolt is defective.
From the problem statement, we have the following probabilities:
- Probability that a bolt is from M1: $P(M_1) = 0.60$
- Probability that a bolt is from M2: $P(M_2) = 0.40$
- Probability of a defective bolt given it's from M1: $P(D \mid M_1) = 0.05$
- Probability of a defective bolt given it's from M2: $P(D \mid M_2) = 0.02$
We want to find the probability that a bolt was produced by M1 given that it is defective, which is $P(M_1 \mid D)$.
Using Bayes' Theorem:
$$P(M_1 \mid D) = \frac{P(D \mid M_1)\,P(M_1)}{P(D)}$$
First, we must calculate the total probability of a bolt being defective, $P(D)$, using the Law of Total Probability:
$$P(D) = P(D \mid M_1)\,P(M_1) + P(D \mid M_2)\,P(M_2) = (0.05)(0.60) + (0.02)(0.40) = 0.030 + 0.008 = 0.038$$
Now, we can substitute this value back into the Bayes' Theorem formula:
$$P(M_1 \mid D) = \frac{(0.05)(0.60)}{0.038} = \frac{0.030}{0.038} \approx 0.789$$
Rounding to two decimal places, the probability is 0.79.
"
:::
---
What's Next?
Having completed Elementary Statistics and Probability, you have established a firm foundation for several advanced and related chapters in the GATE syllabus. The principles of central tendency, dispersion, and probability are not isolated topics; they are the building blocks for more complex quantitative reasoning.
Key connections:
- Building on Basic Aptitude: This chapter formalizes the intuitive data analysis skills from general numerical ability. The methods we have covered provide a structured approach to the data interpretation problems often seen in the General Aptitude section.
- Foundation for Probability Distributions: The fundamental rules of probability are an absolute prerequisite for the next major topic in this domain: Probability Distributions. Concepts like random events and probability calculations are essential to understanding discrete distributions (such as Binomial and Poisson) and continuous distributions (such as Normal/Gaussian), which are frequently tested.
- Link to Linear Algebra and Calculus: In higher-level engineering mathematics and data science applications, statistical concepts are deeply intertwined with other mathematical fields. The mean and variance of datasets are central to machine learning algorithms that rely on linear algebra. Furthermore, the concept of a continuous random variable is defined using probability density functions, which require an understanding of integration from Calculus to find probabilities over intervals.