The standard deviation is a measure used to quantify the amount of variation or spread of a set of data values around its mean. A low standard deviation indicates that the data points tend to be close to the mean of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. The standard deviation is the square root of the variance and is normally reported along with the mean.
To Understand Standard Deviation Better, Let’s Go Through An Example:
In a class, there are 50 students. For an experiment, 25 students were asked to study maths for one hour daily and the other 25 for four hours.
The results of the test, when plotted on a 2D scatter plot, are as follows:
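import matplotlib.pyplot as plt
# `results` is the DataFrame holding the hours studied and marks received
results.plot(kind='scatter', x='No. of hours studied', y='Marks received')
plt.xlabel('No. of hours studied daily')
plt.ylabel('Marks received')
plt.title('Maths Score and hours spent in studying daily')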
Once we add the mean scores for both sets of students, we get the following result:
One = results['Marks received'][results['No. of hours studied']== 1].mean()
Four = results['Marks received'][results['No. of hours studied']== 4].mean()
Here we can see that the mean score of students who studied for one hour is 29.04 and the mean score of students who studied for four hours is 40.20. There is a clear difference between the two sets: the students who studied for four hours received higher scores than those who studied for one hour. This is the obvious deduction any logical person would make.
However, things get interesting when we also look at the standard deviation for these sets.
StdOne = results['Marks received'][results['No. of hours studied']== 1].std(ddof=0)
StdFour = results['Marks received'][results['No. of hours studied']== 4].std(ddof=0)
Here, we can see that the spread of the data is very different for the two sets. Although students who studied for one hour received lower maths scores, their scores are very close to each other, as their standard deviation is very low. However, as the number of hours studied increases from one to four, both the scores and the standard deviation increase. This might imply that scores do not rise uniformly with the number of hours studied. Some students increased their scores substantially, whereas others scored lower than the highest score received by a student who studied for one hour.
A researcher looking at this data might start thinking about the possible reasons for such a scenario. It might be that some students do not grasp the topic even after studying for longer hours, or that concentration and focus vary more between students as the number of hours increases. One thing to note here is that as the number of hours studied increases, scores do not increase accordingly for every student. So the researcher might conclude that studying for a greater number of hours alone does not guarantee good scores; other factors are also involved and need to be addressed. However, as the mean is going up, the approach is heading in the right direction.
Having only the mean would not reveal this insight; the standard deviation helps us better understand the context in which the mean sits.
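The comparison above can also be produced in a single summary table. The following is a minimal sketch, assuming a DataFrame with the same two column names as the article; the marks are made-up illustrative values, not the article's dataset, and the `summary` name is hypothetical.

```python
import pandas as pd

# Hypothetical marks, invented for illustration only (not the article's data).
results = pd.DataFrame({
    "No. of hours studied": [1, 1, 1, 1, 4, 4, 4, 4],
    "Marks received": [28, 29, 30, 29, 25, 38, 45, 52],
})

# Group the marks by the number of hours studied.
grouped = results.groupby("No. of hours studied")["Marks received"]

# Mean and population standard deviation (ddof=0) for each group, side by side.
summary = pd.DataFrame({
    "mean": grouped.mean(),
    "std": grouped.std(ddof=0),
})
print(summary)
```

With data shaped like this, the four-hour group shows both a higher mean and a much larger standard deviation, which is the pattern discussed above.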
Points To Note For Python Users Calculating Standard Deviation
One very important point to note when calculating the standard deviation in Python using a pandas DataFrame is that the std() method calculates the standard deviation of a sample by default (ddof=1). If we want the standard deviation of the population, we need to set the delta degrees of freedom to 0 by passing ddof=0, as in the code above. pandas does this because in the real world it is rarely easy to obtain data for an entire population, so the sample standard deviation is the usual choice.
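The difference between the two settings is easy to check on a small series. This is a minimal sketch using illustrative values (not the article's data); it also shows that NumPy's `np.std` uses `ddof=0` by default, the opposite of pandas.

```python
import numpy as np
import pandas as pd

# Illustrative values, chosen so the population standard deviation is exactly 2.
scores = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = scores.std()               # pandas default ddof=1: divides by N - 1
population_std = scores.std(ddof=0)     # ddof=0: divides by N
numpy_std = np.std(scores.to_numpy())   # NumPy default ddof=0: divides by N

print(sample_std, population_std, numpy_std)
```

The sample value is always slightly larger than the population value for the same data, and the gap shrinks as the number of data points grows.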
We Can Calculate The Same Standard Deviation Using Excel Too, Using The Following Steps:
Step 1: Calculate the mean for scores of students.
Step 2: Subtract the mean from each value to get the deviation of that value from the mean.
Step 3: Square each deviation. If we do not square the deviations, the negative and positive numbers will add up to zero, giving us no useful information. Here, we are trying to understand the spread, or distance from the mean, so whether a number is positive or negative is of no importance. Hence, we square the values to make all of them positive.
Step 4: Take the average of the squared deviations (divide their sum by the number of values) to find the population variance.
Step 5: Take the square root of the variance to get the population standard deviation. Taking the square root brings the value back to the same unit as the data, which makes it easier to interpret.
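The five steps above can be sketched in plain Python as well. The values are illustrative rather than the students' scores, and the variable names are hypothetical; the last line cross-checks the result against `statistics.pstdev` from the standard library.

```python
import math
import statistics

# Illustrative values, not the article's dataset.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(scores) / len(scores)             # Step 1: mean of the values
deviations = [x - mean for x in scores]      # Step 2: deviation of each value from the mean
squared = [d ** 2 for d in deviations]       # Step 3: square each deviation
variance = sum(squared) / len(squared)       # Step 4: average of squared deviations (population variance)
std_dev = math.sqrt(variance)                # Step 5: square root gives the population standard deviation

print(mean, variance, std_dev)

# Cross-check against the standard library's population standard deviation.
assert math.isclose(std_dev, statistics.pstdev(scores))
```

Note that the raw deviations in Step 2 sum to zero, which is why Step 3 squares them before averaging.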
In Mathematics, The Formula For Standard Deviation Is As Follows:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$
Where $\sigma$ = Standard Deviation
$\sqrt{\;}$ = Square root of the entire expression
$(X_i - \bar{X})^2$ = (Individual Value - Mean) squared
$\sum$ = Summation of the expression that follows it
$N$ = Total number of data points in our dataset
Advantages Of Standard Deviation:
- Unlike variance, it is expressed in the same units as the mean. Variance is in squared units and hence not easily interpretable. Because the standard deviation is in the same units as the mean, we can easily understand the spread of the data.
- It is also used to estimate confidence intervals in statistical conclusions. Using the standard deviation and the empirical rule (described in the next section), we can easily estimate intervals for a normal distribution.
- It helps us to understand the central value in better context. For example, the average score received by students who studied for four hours is 40.2. However, some of those students received scores as low as 29, which is lower than the scores received by some students who studied for one hour.
- It helps us to understand risk. In cases where high precision is needed, such as clinical trials or building a rocket to send to space, the margin for error is very small. When the test results come out, we would like to see a very low standard deviation in order to conclude that the test was successful or carried minimal risk.
An Interesting Property Of Standard Deviation Is When We Have A Normal Or Gaussian Distribution
When our data follows a normal distribution, it follows the empirical rule of 68 – 95 – 99.7. This rule states that about 68% of the data lies within 1 standard deviation of the mean, about 95% within 2 standard deviations of the mean, and about 99.7% within 3 standard deviations of the mean. This rule is also called the Three Sigma Rule.
To understand this point, we can take the example of the temperature of a city in a particular season. Suppose the mean daily temperature of the city is 29 degrees centigrade and the temperatures follow a normal distribution with a standard deviation of 2 degrees.
Then as per the empirical rule:
- 68% of days will have the average daily temperature between 27 – 31 degrees.
- 95% of days will have the average daily temperature between 25 – 33 degrees.
- 99.7% of days will have the average daily temperature between 23 – 35 degrees.
This rule is very useful when we have to build a predictive analysis model, more so when we do not have the entire data at hand. It gives us a reasonably close estimate of how the actual data will look.
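The temperature intervals above can be reproduced, and the 68 – 95 – 99.7 percentages checked, with a short sketch. It assumes the temperatures are exactly normally distributed with the mean and standard deviation from the example, and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the `coverage` dictionary is a hypothetical name used for illustration.

```python
from statistics import NormalDist

# Mean and standard deviation taken from the temperature example.
mean, sd = 29.0, 2.0
temps = NormalDist(mu=mean, sigma=sd)

coverage = {}
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    # Share of the distribution that falls between low and high.
    coverage[k] = temps.cdf(high) - temps.cdf(low)
    print(f"within {k} sd: {low:.0f} to {high:.0f} degrees, {coverage[k]:.1%} of days")
```

The computed shares are slightly more precise than the rounded figures in the rule, which is why the rule is best read as an approximation.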