Hypothesis Testing and Confidence Interval Estimation

Janani Nagarajan
6 min read · Jan 27, 2022

As discussed in the previous post, hypothesis testing and confidence interval estimation are among the most powerful tools for drawing inferences from data. To recall a bit from the previous discussion, inferential statistics uses data from a random sample to generalize about a parameter of the entire population.

Hypothesis testing, together with confidence interval estimation, is a form of statistical inference that helps us judge how reliably a finding observed in the sample can be extrapolated to the larger population.

Before getting into the details, let's get the terminology in place. The population mean is represented by µ (pronounced mu) and the sample mean by x̄ (pronounced x bar). Similarly, the population standard deviation is represented by σ (pronounced sigma) and the sample standard deviation by s.

Standard deviation is a measure of how dispersed the data is around the mean. If it is low, the data points sit close to the mean; if it is high, the data points are widely spread out.
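To make the idea of spread concrete, here is a small sketch using Python's standard library. The two samples are made-up numbers chosen only for illustration: they share the same mean but have very different standard deviations.

```python
import statistics

# Two hypothetical samples with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
spread = [10, 30, 50, 70, 90]

def describe(sample):
    """Return the sample mean (x-bar) and the sample standard deviation (s)."""
    return statistics.mean(sample), statistics.stdev(sample)

tight_mean, tight_s = describe(tight)
spread_mean, spread_s = describe(spread)

print(f"tight:  mean={tight_mean}, s={tight_s:.2f}")
print(f"spread: mean={spread_mean}, s={spread_s:.2f}")
```

Note that `statistics.stdev` computes the sample standard deviation s (dividing by n − 1), while `statistics.pstdev` would give the population version σ.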

Whenever we deal with inferential statistics, we are typically interested in estimating either the population mean or the population proportion from the random sample we have collected.

So when we work with random samples and extrapolate to the population, there are two scenarios to consider in this post.

i) Population mean is unknown

ii) Population proportion is unknown.

Let’s consider two examples to understand the difference in the above considerations.

Scenario where the population mean is unknown: Suppose we want to estimate the average salary of business graduates from top universities in India for the 2021 batch. We can't collect data from the entire population, so this is a case for inferential statistics: we rely on a random sample and use it to estimate the population mean.

Scenario where the population proportion is unknown: A common example is predicting the share of votes a particular candidate will receive in an election. Here we are estimating an unknown population proportion.

Recalling a few statistical distributions will help here. Most continuous data that we meet in real life, i.e. quantities such as height, weight, pressure and volume, can take infinitely many values between any two fixed values. Such data is described by a continuous distribution.

The normal distribution, the standard normal distribution, and the t-distribution are the continuous distributions that we are going to use here.

Normal distribution: A bell-shaped continuous distribution that is symmetrical about the mean. The total area under the curve, i.e. the total probability, is 1.

Standard normal distribution: A normal distribution whose bell-shaped curve is symmetric about a mean of 0 and has a standard deviation of 1. In hypothesis testing and confidence interval estimation, a test statistic that follows this distribution is called the Z-statistic.

T-distribution: Like the standard normal distribution, it is symmetric about a mean of 0. Its shape depends on a single parameter called the degrees of freedom, a constraint imposed by the data that is linked to the size of the data set. The larger the data set, the greater the degrees of freedom, and as they increase the t-distribution gets closer and closer to the standard normal distribution. A test statistic that follows this distribution is called the T-statistic.
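The properties of the standard normal curve mentioned above (symmetry about 0 and areas read as probabilities) can be checked with a short sketch. It uses `NormalDist` from Python's standard library; the cut-off values 1 and 1.96 are chosen only for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

# By symmetry, half of the area lies to the left of the mean.
left_half = std_normal.cdf(0)

# Area within 1.96 standard deviations of the mean (roughly 95%).
central_area = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Symmetry again: the area below -1 equals the area above +1.
lower_tail = std_normal.cdf(-1)
upper_tail = 1 - std_normal.cdf(1)

print(f"area left of 0      : {left_half:.3f}")
print(f"area within +/-1.96 : {central_area:.3f}")
print(f"tails beyond +/-1   : {lower_tail:.4f} and {upper_tail:.4f}")
```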

The Central Limit Theorem states that, regardless of the shape of the population distribution, the means of random samples drawn from it are approximately normally distributed, provided the sample size is large enough. This is why we use either the Z-statistic or the T-statistic in the hypothesis testing process.

Let's take an example now.
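A quick simulation can illustrate the Central Limit Theorem. This is only a sketch under assumptions of my own choosing: the population is exponential with mean 1 (deliberately skewed), each sample has 50 observations, and 2000 samples are drawn with a fixed seed.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def draw_sample(n):
    """Draw n values from a hypothetical skewed population (exponential, mean 1)."""
    return [random.expovariate(1.0) for _ in range(n)]

sample_size = 50
num_samples = 2000

# Collect the mean of many independent random samples.
sample_means = [statistics.mean(draw_sample(sample_size)) for _ in range(num_samples)]

mean_of_means = statistics.mean(sample_means)
spread_of_means = statistics.stdev(sample_means)

# Theory: the sample means centre on the population mean (1.0) with a
# spread of about sigma / sqrt(n) = 1 / sqrt(50).
expected_spread = 1.0 / sample_size ** 0.5

print(f"mean of sample means   : {mean_of_means:.3f}")
print(f"spread of sample means : {spread_of_means:.3f} (theory {expected_spread:.3f})")
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the underlying population is far from symmetric.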

Have you ever wondered how the vote shares of different political parties are predicted with a reasonable degree of accuracy? Let's take the case of an election in which candidate A and candidate B are competing. Out of curiosity, we want to predict the result beforehand. But the potential voters number in the lakhs (hundreds of thousands), and it is practically impossible to reach each and every one of them to record their vote. So let's take a random sample of 1000 voters from this population and conduct a survey. Assume the results are as follows.

Let's say 700 voted for A and 300 voted for B. Does this mean exactly 70% of the voters in the actual election will vote for A? No, right? The sample gives only an estimate, and a different sample of 1000 voters would give a slightly different number.

This is where confidence intervals and hypothesis testing come into the picture. Let's take the claim that at least 75% of voters will vote for A on election day and test it.

First, let's construct a confidence interval, which tells us how sure we are about the population proportion we are estimating.

The confidence interval for the unknown population proportion p:

p̂ − z(α/2) · √(p̂(1 − p̂)/n) ≤ p ≤ p̂ + z(α/2) · √(p̂(1 − p̂)/n)

Here p is the population proportion we want to estimate, i.e. the proportion of all voters who will vote for A on election day, p̂ is the sample proportion, n is the sample size, and z(α/2) is the standard normal critical value for the chosen confidence level. The expressions on either side of the inequality are the sample proportion minus and plus the margin of error. Example: if we estimate that A will receive 68% of the votes with a margin of error of 5%, the confidence interval says the actual share is expected to lie between 63% and 73%.

Here α (alpha) is the probability that lies outside the confidence interval. If we want a confidence level of 95%, i.e. results assured with a probability of 0.95, then α = 1 − 0.95 = 0.05. For a confidence level of 90%, α = 1 − 0.90 = 0.10.

In our case, with a sample proportion of 0.70, a sample of 1000 voters and a 95% confidence level, the margin of error is 2.8%, i.e. approximately 3%.
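The relationship between the confidence level, α and the critical z-value can be sketched as follows. The helper name `alpha_and_critical_z` is hypothetical, and the sketch assumes a two-sided interval, so α is split equally between the two tails.

```python
from statistics import NormalDist

def alpha_and_critical_z(confidence_level):
    """Return (alpha, critical z) for a two-sided confidence interval."""
    alpha = 1 - confidence_level
    # Leave alpha / 2 in each tail of the standard normal curve.
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return alpha, z_critical

for level in (0.90, 0.95):
    alpha, z = alpha_and_critical_z(level)
    print(f"confidence={level:.2f}  alpha={alpha:.2f}  z={z:.3f}")
```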
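The margin of error for the survey can be reproduced with a few lines of Python. This is a minimal sketch that assumes the usual normal-approximation interval for a proportion and a 95% confidence level; the function name `proportion_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence_level=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z_critical * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

# Survey from the example: 700 of 1000 sampled voters chose candidate A.
lower, upper, margin = proportion_confidence_interval(700, 1000)
print(f"margin of error = {margin:.3f}")
print(f"95% interval    = [{lower:.3f}, {upper:.3f}]")
```

With these inputs the interval runs from roughly 67% to 73%, which is the sample proportion of 70% minus and plus the margin of error.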

Let’s get into the hypothesis test now. Construct a null hypothesis and an alternate hypothesis.

The null hypothesis is the assumption we start from in order to conduct the test. If the sample data does not contradict it, we retain this assumption (strictly speaking, we fail to reject it); otherwise we reject it in favour of the alternate hypothesis.

H₀: p ≥ 0.75 (Null hypothesis: on the final day, 75% or more of the population will vote for A)

H₁: p < 0.75 (Alternate hypothesis: on the final day, less than 75% of the population will vote for A)

Calculating the Z-statistic:

z = (p̂ − p) / √(p(1 − p)/n)

Here p̂ (p hat) is the sample proportion (0.70), p is the population proportion assumed under the null hypothesis (0.75), and n is the sample size (1000). On calculating, we get z = −3.65 (approx.).
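The z value quoted above can be verified with a short calculation. The sketch assumes the standard one-proportion z-statistic, with the standard error computed from the hypothesised proportion; the function name `one_proportion_z` is hypothetical.

```python
from math import sqrt

def one_proportion_z(p_hat, p_null, n):
    """z-statistic for a sample proportion against a hypothesised proportion."""
    standard_error = sqrt(p_null * (1 - p_null) / n)
    return (p_hat - p_null) / standard_error

# Sample proportion 700/1000, hypothesised proportion 0.75, sample size 1000.
z = one_proportion_z(p_hat=700 / 1000, p_null=0.75, n=1000)
print(f"z = {z:.2f}")
```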

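Since this is a one-tailed (left-tailed) test at α = 0.05, the critical value is z = −1.645. Our z of −3.65 falls below it, so we reject the null hypothesis.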

Since we have rejected the null hypothesis in favour of the alternate hypothesis, we conclude that less than 75% of the entire population will vote for A on the final day. We draw this conclusion at the 95% confidence level, and the sample estimate of 70% carries a margin of error of about 3%.

This is one real-life example of using hypothesis testing in inferential statistics. Thanks for reading! :)
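The whole decision can be sketched end to end in Python. The sketch assumes a left-tailed test at α = 0.05 (matching the 95% confidence level used here) and also reports a p-value, which the discussion above does not use but which leads to the same decision.

```python
from math import sqrt
from statistics import NormalDist

n, votes_for_a, p_null, alpha = 1000, 700, 0.75, 0.05

p_hat = votes_for_a / n
z = (p_hat - p_null) / sqrt(p_null * (1 - p_null) / n)

# Left-tailed test: reject H0 when z falls below the alpha-quantile.
z_critical = NormalDist().inv_cdf(alpha)  # about -1.645
p_value = NormalDist().cdf(z)             # area to the left of z

reject_null = z < z_critical
print(f"z = {z:.2f}, critical value = {z_critical:.3f}, p-value = {p_value:.5f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```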
