What Is Variance?
The mean describes the central tendency of the data. However, it is a single value and tells us nothing about how spread out the data is. To understand our data better, we look at how far the values lie from their mean or, in statistical terms, what the variance of our set of values is. The technical definition of variance is the average of the squared differences from the mean. In other words, variance measures the spread of the numbers in our dataset.
Let’s See How We Can Calculate Variance
In a sandwich shop, nine employees prepared the following numbers of sandwiches in one hour:
Employee | Sandwiches prepared in 1 hour |
Employee A | 21 |
Employee B | 15 |
Employee C | 18 |
Employee D | 18 |
Employee E | 17 |
Employee F | 20 |
Employee G | 22 |
Employee H | 21 |
Employee I | 23 |
To see the average performance of these employees and how spread out the results are, we will calculate the mean and the variance.
Step 1: Calculate the mean for all employees.
Step 2: Subtract the mean from each value to get the deviation of that value from the mean.
Step 3: Square each deviation. If we do not square the deviations, the negative and positive values cancel out and always add up to zero, which gives us no useful information. Here we are trying to understand the spread, or distance, from the mean, so whether a deviation is positive or negative does not matter. Hence, we square each deviation to make it non-negative.
Step 4: Take the average of the squared deviations to find the variance of the dataset.
All of these steps can be seen in the table below:
Employee | Sandwiches prepared in 1 hour | Deviation from Mean (Value – Mean) (Step 2) | Square of deviation (Value – Mean)^{2} (Step 3) |
Employee A | 21 | 1.56 | 2.42 |
Employee B | 15 | -4.44 | 19.75 |
Employee C | 18 | -1.44 | 2.09 |
Employee D | 18 | -1.44 | 2.09 |
Employee E | 17 | -2.44 | 5.98 |
Employee F | 20 | 0.56 | 0.31 |
Employee G | 22 | 2.56 | 6.53 |
Employee H | 21 | 1.56 | 2.42 |
Employee I | 23 | 3.56 | 12.64 |
Mean (Step 1) | 19.44 | ||
Variance = Average of (Deviation)^{2} (Step 4) | 6.02 |
In the above example, we can observe that on average each employee prepares about 19 sandwiches in one hour, with a variance of approximately 6 sandwiches squared. This is the variance of a continuous (numeric) variable.
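The four steps can also be traced in a few lines of Python. This is only a minimal sketch that uses the nine values from the table; the variable names are chosen here for illustration and are not part of the original example.

```python
# Sandwiches prepared in one hour by employees A to I (from the table above).
sandwiches = [21, 15, 18, 18, 17, 20, 22, 21, 23]

# Step 1: mean of all values.
mean = sum(sandwiches) / len(sandwiches)

# Step 2: deviation of each value from the mean (value - mean).
deviations = [x - mean for x in sandwiches]

# Step 3: square each deviation so negatives and positives cannot cancel.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: average the squared deviations to get the variance.
variance = sum(squared_deviations) / len(squared_deviations)

print(f"mean = {mean:.2f}")          # mean = 19.44
print(f"variance = {variance:.2f}")  # variance = 6.02
```

Summing `deviations` without squaring them gives a value that is zero up to floating-point rounding, which is why Step 3 is needed.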
In Mathematical Terms, The Formula For The Variance Of A Continuous Variable Is:
σ^{2} = Σ ( X_{i} - X̄ )^{2} / N
Where σ^{2} = variance
X̄ = mean of all values
( X_{i} - X̄ )^{2} = (Individual Value – Mean)^{2}
Σ = summation over all data points
N = Total number of data points in our dataset
Now What If We Have A Discrete Variable? How Will We Calculate The Variance Of A Discrete Variable?
It can be understood with the help of an example. Let us take the example of patients who have come for cancer testing, where each test result is either positive or negative.
Cancer | Number of patients |
Positive | 10 |
Negative | 90 |
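In the above example:
- Step 1: Calculate the total number of patients (N). For us, it is 100 patients.
- Step 2: Calculate the probability that a patient tests positive (p) = 10/100 = 0.1
- Step 3: Calculate the probability that a patient tests negative (q) = 90/100 = 0.9
- Step 4: Multiply N*p*q = 100*0.1*0.9 = 9
The variance for the dataset of cancer patients is 9. Strictly speaking, this is the variance of the number of positive results among N patients when each patient independently tests positive with probability p (a binomial count), so the formula applies to variables with exactly two outcomes.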
In Mathematical Terms, The Formula For The Variance Of A Discrete Variable With Two Outcomes Is:
σ^{2} = N*p*q
Where σ^{2} = variance
N = Total number of data points in our dataset
p = Probability of the first outcome
q = Probability of the second outcome (q = 1 - p)
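The calculation for the cancer-testing table can be sketched in Python as follows. The sketch assumes that each patient is an independent positive/negative trial with the same probability, which is what the N*p*q formula relies on. The 0/1 coding of the results and all variable names are assumptions made here for illustration.

```python
# Counts from the table above.
positive = 10
negative = 90

n = positive + negative   # Step 1: total number of patients
p = positive / n          # Step 2: probability of a positive result
q = negative / n          # Step 3: probability of a negative result (q = 1 - p)

# Step 4: variance of the number of positive results among n patients.
variance_of_count = n * p * q

# For comparison: code each result as 1 (positive) or 0 (negative) and
# average the squared deviations from the mean, as in the earlier steps.
outcomes = [1] * positive + [0] * negative
mean = sum(outcomes) / n
variance_per_patient = sum((x - mean) ** 2 for x in outcomes) / n

print(round(variance_of_count, 2))     # 9.0
print(round(variance_per_patient, 2))  # 0.09
```

The averaged squared deviation of the 0/1 results equals p*q, and multiplying it by N gives the value of 9 obtained in Step 4.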
Variance Of A Sample
In statistics, most of the time we calculate the variance of a sample. Let’s assume we want to understand the variance of the heights of men in France. It would take a lot of time, money, manpower, etc. to collect the height of every man in France. Hence, it is not always practical to measure every value in our population. In such cases, we collect the heights of men in a sample that represents the population, and the variance is then calculated from this sample. Because the deviations are measured from the sample mean rather than the true population mean, dividing the sum of the squared deviations by the number of data points (n) tends to underestimate the population variance. To correct this bias, we divide by the number of data points minus 1 (n – 1) instead. As the sample size increases, the difference between dividing by n and dividing by n – 1 becomes minuscule.
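A small simulation can make the underestimation visible. The sketch below is only an illustration: the population is a made-up list of 10,000 heights generated with a fixed random seed (not real data for France), the sample size of 5 and the 20,000 repetitions are arbitrary choices, and the helper `variance` is a hypothetical function written for this example.

```python
import random

def variance(values, ddof=0):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - ddof)

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 made-up heights in cm (not real data).
population = [random.gauss(175, 7) for _ in range(10_000)]
population_variance = variance(population)

sample_size = 5
trials = 20_000
biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = random.choices(population, k=sample_size)  # draw with replacement
    biased_total += variance(sample, ddof=0)    # divide by n
    unbiased_total += variance(sample, ddof=1)  # divide by n - 1

biased_average = biased_total / trials
unbiased_average = unbiased_total / trials

print(f"population variance:     {population_variance:.1f}")
print(f"average dividing by n:   {biased_average:.1f}")
print(f"average dividing by n-1: {unbiased_average:.1f}")
```

Averaged over many small samples, dividing by n should land noticeably below the population variance, while dividing by n – 1 should land close to it.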
The Variance Of A Sample Is Defined By A Slightly Different Formula:
s^{2} = Σ ( x_{i} - x̄ )^{2} / ( n - 1 )
Where s^{2} = sample variance
x̄ = sample mean
( x_{i} - x̄ )^{2} = (Individual Value – Sample Mean)^{2}
Σ = summation over all data points in the sample
n - 1 = Total number of data points in our sample minus one
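To compare the two formulas on the same numbers, the sandwich data can be treated once as a whole population and once as a sample. This sketch uses Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. Treating the nine employees as a sample of a larger workforce is an assumption made only for this illustration.

```python
import statistics

sandwiches = [21, 15, 18, 18, 17, 20, 22, 21, 23]

# Population formula: divide the sum of squared deviations by n.
population_variance = statistics.pvariance(sandwiches)

# Sample formula: divide the sum of squared deviations by n - 1.
sample_variance = statistics.variance(sandwiches)

print(f"population variance = {population_variance:.2f}")  # 6.02
print(f"sample variance     = {sample_variance:.2f}")      # 6.78
```

With only nine values the two results differ visibly; with a much larger sample the gap would shrink.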
The Limitations Of Variance
In our examples, we found a variance of about 6 sandwiches squared in the sandwich shop data and a variance of 9 in the cancer dataset. But how should we interpret these numbers?
- Since the mean and the variance have different units (sandwiches versus sandwiches squared), it is difficult to read the variance alongside the mean. We have to calculate the Standard Deviation to read the spread of the data in the same units as the mean.
- Values that lie far from the mean (outliers) produce very large squared deviations, which can inflate the variance and skew our picture of the data.
- It is not easy to judge whether the variance we arrived at is large or small without good domain knowledge of the data we are considering.
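The first two limitations can be illustrated with the sandwich data. In this sketch the extra value of 60 is a hypothetical outlier added purely for demonstration and is not part of the original table. The functions `pvariance` and `pstdev` come from Python's standard `statistics` module.

```python
import statistics

sandwiches = [21, 15, 18, 18, 17, 20, 22, 21, 23]

# Hypothetical outlier, not in the original data: one employee preparing 60.
with_outlier = sandwiches + [60]

var_original = statistics.pvariance(sandwiches)
var_outlier = statistics.pvariance(with_outlier)

print(f"variance without outlier: {var_original:.2f}")  # 6.02
print(f"variance with outlier:    {var_outlier:.2f}")   # 153.45

# The square root of the variance is back in the original unit (sandwiches).
std_original = statistics.pstdev(sandwiches)
print(f"standard deviation:       {std_original:.2f}")  # 2.45
```

A single extreme value raises the variance from about 6 to more than 150, and the standard deviation of about 2.45 sandwiches is easier to read next to a mean of about 19 sandwiches.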
Some Properties Of Variance Are:
- Variance is a non-negative number, as we always square our deviation values (see Step 3 in how to calculate variance).
- If the variance of a dataset is zero, then all the data points in our dataset have the same value.
Employee | Sandwiches prepared in one hour | Deviation from Mean | Square of deviation |
Employee A | 21 | 0.00 | 0.00 |
Employee B | 21 | 0.00 | 0.00 |
Employee C | 21 | 0.00 | 0.00 |
Employee D | 21 | 0.00 | 0.00 |
Employee E | 21 | 0.00 | 0.00 |
Employee F | 21 | 0.00 | 0.00 |
Employee G | 21 | 0.00 | 0.00 |
Employee H | 21 | 0.00 | 0.00 |
Employee I | 21 | 0.00 | 0.00 |
Mean | 21.00 | ||
Variance | 0.00 |
- If we add a constant value to every entry in our dataset, the variance remains unchanged.
Employee | Sandwiches prepared in one hour (Constant value of 5 added to all entries) | Deviation from Mean | Square of deviation |
Employee A | 26 | 1.56 | 2.42 |
Employee B | 20 | -4.44 | 19.75 |
Employee C | 23 | -1.44 | 2.09 |
Employee D | 23 | -1.44 | 2.09 |
Employee E | 22 | -2.44 | 5.98 |
Employee F | 25 | 0.56 | 0.31 |
Employee G | 27 | 2.56 | 6.53 |
Employee H | 26 | 1.56 | 2.42 |
Employee I | 28 | 3.56 | 12.64 |
Mean | 24.44 | ||
Variance | 6.02 |
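The properties shown in the two tables above can be checked quickly in Python. This is a minimal sketch that uses `statistics.pvariance` from the standard library; the list names are chosen here for illustration.

```python
import math
import statistics

sandwiches = [21, 15, 18, 18, 17, 20, 22, 21, 23]

# Property: if every value is the same, the variance is zero.
identical = [21] * 9
variance_identical = statistics.pvariance(identical)
print(f"{variance_identical:.2f}")  # 0.00

# Property: adding a constant (here 5) to every entry shifts the mean
# but leaves the variance unchanged.
shifted = [x + 5 for x in sandwiches]
variance_original = statistics.pvariance(sandwiches)
variance_shifted = statistics.pvariance(shifted)
print(f"{statistics.mean(shifted):.2f}")                # 24.44
print(math.isclose(variance_original, variance_shifted))  # True
```

The shift moves every value and the mean by the same amount, so each deviation from the mean, and therefore the variance, stays the same.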
There Are A Number Of Statistical Concepts Where Variance Is Used
- Descriptive statistics
- Statistical inferences
- Hypothesis testing
- Goodness of fit
- Monte Carlo sampling
Some Of The Applications Of Variance Can Be Seen In The Following Fields:
- Statistical surveys: Variance can help us understand how flat or informative our data is. If our data has high variance, we might be able to draw some interesting conclusions.
- Accounting: In accounting, variance is used to understand whether there has been any deviation between what was estimated and what actually happened with respect to cost, revenue, etc. The lower the variance, the closer the company is to what was estimated during the budgeting period.
- Drug testing: During clinical testing of a new drug, scientists would like to see minimal variance in the test results, which would suggest that the drug works consistently across patients for the intended disease.
The variance can also be thought of as the covariance of a random variable with itself. Variance is extensively used in probability theory, where more general conclusions need to be drawn from a smaller sample. This is because variance gives us an idea of how the data is distributed around the mean, and from this distribution we can work out where to expect an unknown data point. Though variance gives us a rough idea of the spread of our data, Standard Deviation is easier to interpret because it is expressed in the same units as the data. In our next post, we will look at what Standard Deviation is and why we say that it is more concrete than variance.