A confidence interval (CI) is an interval of the form (a, b) constructed from the data. The purpose of a CI is to cover an unknown population parameter with “high probability”. Here, a is called the lower confidence bound (or LCB) whereas b is called the upper confidence bound (or UCB), which are both functions of the data.
Whenever we estimate a parameter, we report it in the form:
Value = p̂ ± margin of error (k)
Here, p̂ − k = a and p̂ + k = b.
For example: suppose we sampled 600 people and asked them whether they prefer vanilla or chocolate ice cream, and we want to know what proportion of our sample prefers vanilla over chocolate. We got the following result:
No. of people who like vanilla: 400
No. of people who like chocolate: 200
So,
What proportion of people prefer vanilla over chocolate? The answer is simply 400/600 ≈ 0.67.
This value of 0.67 is our p̂.
But why do we need to calculate the confidence interval if we already know the exact value of the parameter?
Well, as mentioned earlier, we have sampled only 600 people, so are we 100% sure 0.67 will be the exact answer to our quoted question?
NO.
This is where confidence intervals come into the picture. Using confidence intervals we can write an answer like:
I am 95% (or 99%) confident that the proportion of people who prefer vanilla over chocolate lies in the range (p̂ − k, p̂ + k).
But how do we calculate its confidence interval?
Let's do the math!
Import a few necessary modules.
Make up some data.
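The original code isn't reproduced here, so below is a minimal sketch of these two steps; the variable names (n, n_vanilla, p_hat) are mine, and the data are just the vanilla/chocolate counts from above.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st

# Made-up survey data: 600 people sampled, 400 of whom prefer vanilla
n = 600            # sample size
n_vanilla = 400    # people who prefer vanilla
p_hat = n_vanilla / n
print(p_hat)       # 0.666... ≈ 0.67
```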
To write a value with its margin of error, we use the formula:
MOE = z* · √(p̂(1 − p̂)/n)
where z* is a multiplier that depends on the confidence level and n is the size of the sample.
As the confidence level increases, the value of z* also increases, thus widening the interval. The z* values for the common confidence levels are:
90% → 1.645
95% → 1.96
99% → 2.576
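These multipliers come from the standard normal distribution; as a side note (not from the original post), they can be reproduced with scipy's percent-point function:

```python
from scipy.stats import norm

# z* is the standard-normal quantile that leaves (1 - level)/2 in each tail
for level in (0.90, 0.95, 0.99):
    z_star = norm.ppf(1 - (1 - level) / 2)
    print(f"{level:.0%} -> z* = {z_star:.3f}")
# 90% -> z* = 1.645, 95% -> z* = 1.960, 99% -> z* = 2.576
```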
Now let's calculate the 95% confidence interval of our above example:
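A sketch of that calculation, continuing with the variables assumed above:

```python
z_95 = 1.96
moe_95 = z_95 * np.sqrt(p_hat * (1 - p_hat) / n)   # margin of error at 95%
lcb_95, ucb_95 = p_hat - moe_95, p_hat + moe_95
print(round(lcb_95, 4), round(ucb_95, 4))          # roughly 0.629 and 0.704
```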
Here, we can see that the 95% confidence interval is (0.6289, 0.7044).
The difference between the bounds = 0.7044 − 0.6289 = 0.0755.
Now, let's repeat the process for a 99% confidence interval.
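Same calculation, now with the 99% multiplier:

```python
z_99 = 2.576
moe_99 = z_99 * np.sqrt(p_hat * (1 - p_hat) / n)   # margin of error at 99%
lcb_99, ucb_99 = p_hat - moe_99, p_hat + moe_99
print(round(lcb_99, 4), round(ucb_99, 4))          # roughly 0.617 and 0.716
```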
Here, we can see that the 99% confidence interval is (0.617, 0.7163).
The difference between the bounds = 0.7163 − 0.617 = 0.0993.
Here, we see that the 99% confidence interval is wider than the 95% confidence interval, which shows that the width of the confidence interval increases as the confidence level increases (read that line again).
Doesn't that make sense? The range of values (and the margin of error) increases as we increase the confidence level, i.e., if we want to be more confident that the interval captures the true value, we have to accept a larger margin of error.
Let's visualize this:
Let's assume a variable x containing 1000 values.
Let's also plot its histogram.
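The original data behind x isn't shown; for the sketch below I assume 1000 draws from a normal distribution so that the bell-shaped rule discussed next applies:

```python
np.random.seed(42)                                 # reproducibility
x = np.random.normal(loc=50, scale=10, size=1000)  # assumed: 1000 normally distributed values

plt.hist(x, bins=30, edgecolor="black")
plt.xlabel("x")
plt.ylabel("Frequency")
plt.title("Histogram of x")
plt.show()
```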
By just looking at the graph, we can probably tell an interval where most of the values in x lie, and also comment on its mean, median & skewness (try it!).
But just how much do we mean by 'most of'? 60%? 50% or above? 90%?
Well, let's see,
A rule of thumb for bell-shaped graphs states that:
68% of the values in x would lie between μ−σ and μ+σ,
95% of the values in x would lie between μ−2σ and μ+2σ,
99.7% of the values in x would lie between μ−3σ and μ+3σ.
This is called the empirical rule or the 68-95-99.7 rule.
Let's confirm this:
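One way to check it on the x assumed above is to count what fraction of the values falls within 1, 2 and 3 standard deviations of the mean:

```python
mu, sigma = x.mean(), x.std()

for k in (1, 2, 3):
    within = np.mean((x >= mu - k * sigma) & (x <= mu + k * sigma))
    print(f"Within {k} standard deviation(s): {within:.1%}")
# Should come out close to 68%, 95% and 99.7% for roughly normal data
```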
The theory checks out!
Now let's look at the histogram again to see how the range covered by the interval grows as the confidence level increases.
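One way to see it (my own sketch, reusing mu and sigma from above) is to shade the ±1σ, ±2σ and ±3σ bands on the histogram; each wider band covers a larger share of the data, just like a wider confidence interval.

```python
plt.hist(x, bins=30, edgecolor="black")
for k, color in zip((1, 2, 3), ("red", "orange", "green")):
    plt.axvspan(mu - k * sigma, mu + k * sigma, alpha=0.15, color=color,
                label=f"mu ± {k}·sigma")
plt.legend()
plt.title("Wider intervals cover more of the data")
plt.show()
```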
Simple, right?
MOE = z · √(p(1 − p)/n)
What if we try to maximize this function to widen our confidence interval range even more?
Let's see what we can do:
The margin of error function depends on z, p & n. The values of z & n are fixed (by the chosen confidence level and the sample size), so we can only play with the value of p.
Now for this function to take its maximum value, the value of p(1-p) should be maximized.
Let f = p(1 − p) = p − p², then
df/dp = 1 − 2p
To maximize f, set df/dp = 0:
1 − 2p = 0
p = 1/2 = 0.5
(The second derivative is −2 < 0, so this is indeed a maximum.)
So, the MOE function reaches its maximum value when p=0.5
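As a quick numerical sanity check (not part of the original post), p(1 − p) does peak at p = 0.5:

```python
p_grid = np.linspace(0, 1, 1001)
f = p_grid * (1 - p_grid)
print(p_grid[f.argmax()], f.max())   # 0.5 0.25
```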
Replacing p = 0.5 in the MOE function, we get:
MOE = z · √(0.25/n) = z/(2√n)
This new MOE function widens the confidence interval as much as possible for every confidence level, which keeps us on the safe side. Also, apart from the multiplier z, it depends only on our sample size n, which is convenient.
The confidence interval calculated using this new formula is called the conservative confidence interval.
Let's try that on our vanilla-chocolate problem.
What proportion of people prefer vanilla over chocolate?
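A sketch of the conservative 95% interval for the same data, reusing n and p_hat from above:

```python
z_95 = 1.96
moe_conservative = z_95 / (2 * np.sqrt(n))          # no dependence on p_hat
lcb_c, ucb_c = p_hat - moe_conservative, p_hat + moe_conservative
print(round(lcb_c, 4), round(ucb_c, 4))             # roughly 0.6267 and 0.7067
print(round(ucb_c - lcb_c, 4))                      # about 0.08, wider than the 0.0755 from before
```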
We can see that, for the same confidence level, the conservative confidence interval is wider than the regular confidence interval calculated previously.
Now, instead of calculating CIs manually, let's use Python's scipy to calculate confidence intervals.
Let's create the dataset for our vanilla-chocolate sampling where 400 people prefer vanilla & 200 people prefer chocolate.
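One way to encode that sample, assuming 1 stands for vanilla and 0 for chocolate:

```python
data = np.array([1] * 400 + [0] * 200)   # 1 = vanilla, 0 = chocolate
print(data.mean())                       # 0.666... = p_hat
```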
Now let's write a function to calculate CI using scipy.
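The original function isn't shown, so here is a sketch built on scipy.stats.norm.interval, which returns a normal-approximation interval given the point estimate and its standard error; the function name proportion_ci is my own.

```python
import scipy.stats as st

def proportion_ci(data, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = data.mean()
    se = np.sqrt(p * (1 - p) / len(data))
    return st.norm.interval(confidence, loc=p, scale=se)

print(proportion_ci(data, 0.95))   # roughly (0.629, 0.704), same as the manual calculation
print(proportion_ci(data, 0.99))   # roughly (0.617, 0.716)
```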
We got the same result as previously calculated.