Histogram – Read, Understand and Build
The histogram is a univariate analysis tool, which means it is used to visualize one variable in our dataset. It displays the values of the feature under consideration, grouped into intervals, on the horizontal (x) axis and the frequency of each interval on the vertical (y) axis. The histogram is used to understand the distribution of a variable in our dataset and was first introduced by Karl Pearson.
If, for example, we would like to know how many people are suffering from heart disease in a particular area, a histogram is very useful. It not only gives us the count for the age range we are looking at, but also lets us compare the counts across age intervals of similar width in our population.
import matplotlib.pyplot as plt
# People suffering from heart disease
# (Sufferers is assumed to be a DataFrame with an 'Age' column)
plt.hist(Sufferers['Age'])
plt.xlabel('Age of people suffering from Heart Disease')
As we can see in the image above, a lot of people above the age of 45 are suffering from heart disease, but heart disease is no longer an uncommon occurrence among the younger generation either.
To understand the histogram further, let us do an exercise on the Iris dataset.
- Let’s first plot a 1D scatter plot for the three species using the feature petal length. In the 1D scatter plot below, we can see the endpoints of our feature, but it does not tell us how many data points we have in each interval, so the information it gives is incomplete.
import numpy as np
# 1D scatter plot: draw every petal length as a marker ('o') on the line y = 0
setosa, = plt.plot(iris_setosa['petal_length'], np.zeros_like(iris_setosa['petal_length']), 'o')
virginica, = plt.plot(iris_virginica['petal_length'], np.zeros_like(iris_virginica['petal_length']), 'o')
versicolor, = plt.plot(iris_versicolor['petal_length'], np.zeros_like(iris_versicolor['petal_length']), 'o')
plt.xlabel('Petal Length')
plt.legend([setosa, virginica, versicolor], ["Setosa", "Virginica", "Versicolor"])
- Now a histogram helps us to improve on these endpoints. Let’s take the Setosa species from our figure above. Here our start and end points are 1 and 2 respectively. We divide this range into 5 intervals: 1) 1.0 to 1.2, 2) 1.2 to 1.4, 3) 1.4 to 1.6, 4) 1.6 to 1.8 and 5) 1.8 to 2.0, and plot these on the x-axis.
- On the y-axis, we plot the count of data points that fall in each of these intervals. Once we have the count for all the intervals, we draw a bar for each interval whose height is that count.
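import seaborn as sns
# height replaces the older size argument (seaborn >= 0.9);
# distplot is deprecated since seaborn 0.11 in favour of histplot/displot
sns.FacetGrid(iris_setosa, hue='species', height=5) \
   .map(sns.distplot, 'petal_length', kde=False) \
   .add_legend();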
- The end result of this exercise, as seen above, is a histogram.
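The counting step in the exercise above can be sketched in plain Python. This is only an illustration: the petal lengths are made-up values rather than the real Setosa measurements, and `count_per_interval` is a hypothetical helper, not a matplotlib or seaborn function. It uses the five intervals listed above, with each interval including its left edge and the last one also including 2.0.

```python
from bisect import bisect_right

# Hypothetical petal lengths in cm (made-up values, not real Iris data).
petal_lengths = [1.0, 1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7, 1.9]

# The five intervals from the text: 1.0-1.2, 1.2-1.4, ..., 1.8-2.0.
edges = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]

def count_per_interval(values, edges):
    """Count how many values fall between each pair of neighbouring edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # ignore values outside the overall range
        # bisect_right locates the interval whose left edge is <= v;
        # min(...) places a value equal to the final edge in the last interval.
        index = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[index] += 1
    return counts

counts = count_per_interval(petal_lengths, edges)

# Draw one text "bar" per interval; its length is the count for that interval.
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.1f} to {hi:.1f} | {'#' * c}")
```

Each row of `#` characters plays the role of one bar in the histogram.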
A histogram generally contains:
- The intervals of the variable or feature on the x-axis
- The count, or number of data points that fall in each interval, on the y-axis
- Bins – The bin size decides how many bins we will have, and vice versa. It is the most important parameter of a histogram, and we should always try out a few different values to select the best one for our data. We can never know what patterns are hidden under the bars!
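To see why the choice of bins matters, here is a small sketch that counts the same data with 2, 5 and 10 bins. The values are made up (hypothetical lengths in millimetres, chosen as integers so the bin arithmetic is exact), and `equal_width_counts` is an illustrative helper rather than a library function.

```python
# Hypothetical petal lengths in millimetres (made-up values, not real Iris data).
# Integers keep the bin arithmetic exact.
petal_lengths_mm = [10, 13, 14, 14, 15, 15, 15, 16, 17, 19]

def equal_width_counts(values, low, high, n_bins):
    """Count values in n_bins equal-width bins spanning [low, high]."""
    counts = [0] * n_bins
    for v in values:
        if not low <= v <= high:
            continue  # skip values outside the plotted range
        # Integer division maps v to its bin; v == high goes into the last bin.
        index = min((v - low) * n_bins // (high - low), n_bins - 1)
        counts[index] += 1
    return counts

# Same data, three different numbers of bins.
counts_by_bins = {n: equal_width_counts(petal_lengths_mm, 10, 20, n) for n in (2, 5, 10)}

for n_bins, counts in counts_by_bins.items():
    # '#' marks one data point, '.' marks an empty bin.
    print(f"{n_bins:>2} bins:", " ".join("#" * c if c else "." for c in counts))
```

With 2 bins the data looks almost flat, with 5 bins a clear peak appears, and with 10 bins gaps show up between the bars. The data is the same in all three cases.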
A histogram makes far more sense than a 1D scatter plot because it helps us answer questions such as:
- How many points do we have in a specific interval or range?
- Where is the data highly concentrated?
- How is the data distributed?
Python gives us a lot of liberty to customize our histogram. We can:
- Change the number of bins, and with it the size of each bin
plt.hist(iris_setosa['petal_length'], bins=7)
plt.xlabel('Petal Length')
- Build a cumulative histogram. Here each bin gives the count in that bin plus the counts in all bins for smaller values, so the last bin gives the total number of data points
# Building a cumulative histogram
plt.hist(iris_setosa['petal_length'], cumulative=True)
plt.xlabel('Petal Length')
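The cumulative counts can also be worked out by hand as a running sum over ordinary bin counts. A minimal sketch, assuming made-up counts for five bins (not values read from the Iris data):

```python
from itertools import accumulate

# Hypothetical per-bin counts for five intervals (made-up numbers).
bin_counts = [1, 1, 5, 2, 1]

# Each cumulative value is the count in that bin plus all bins to its left.
cumulative_counts = list(accumulate(bin_counts))

for count, running_total in zip(bin_counts, cumulative_counts):
    print(f"bin count = {count}, cumulative = {running_total}")

# The last cumulative value equals the total number of data points.
total_points = sum(bin_counts)
```

The bars of a cumulative histogram can therefore never decrease from left to right.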
- Use different types of bar visualization in matplotlib and seaborn, such as the default bar, barstacked, step and stepfilled (the values of the histtype parameter)
# Different type of bars - step
plt.hist(iris_setosa['petal_length'], histtype='step')
plt.xlabel('Petal Length')
- Add a density plot, which is a smoothed, continuous version of a histogram. Seaborn makes it easy to draw a kernel density estimate (KDE) on top of our histogram
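# Histogram with Kernel Density Plot
# height replaces the older size argument (seaborn >= 0.9);
# distplot is deprecated since seaborn 0.11 in favour of histplot/displot
sns.FacetGrid(iris_setosa, hue='species', height=5) \
   .map(sns.distplot, 'petal_length', kde=True) \
   .add_legend();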
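To get a feeling for what the smooth curve represents, here is a from-scratch sketch of a Gaussian kernel density estimate: a small bell curve is placed on every data point and the curves are averaged. This is not seaborn's implementation. The data is made up, `gaussian_kde_sketch` is a hypothetical helper, and the bandwidth of 0.1 is an arbitrary assumption, whereas seaborn picks a bandwidth automatically.

```python
import math

# Hypothetical petal lengths in cm (made-up values, not real Iris data).
petal_lengths = [1.0, 1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7, 1.9]

def gaussian_kde_sketch(values, bandwidth):
    """Return a density function built from one Gaussian bump per data point."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum the contribution of every data point at position x.
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density

# Bandwidth 0.1 is an arbitrary choice for this illustration.
density = gaussian_kde_sketch(petal_lengths, bandwidth=0.1)

# Evaluate the curve on a grid from 0 to 3 cm and approximate the area under it.
step = 0.01
grid = [i * step for i in range(301)]
area = sum(density(x) * step for x in grid)

print(f"density at 1.0 cm: {density(1.0):.3f}")
print(f"density at 1.5 cm: {density(1.5):.3f}")
print(f"approximate area under the curve: {area:.3f}")
```

The curve is highest where the data points crowd together, and the area under it is approximately 1, which is what makes it a density rather than a count.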
A histogram, when used to its full potential, becomes an important tool in our Exploratory Data Analysis for analyzing univariate data.
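# height replaces the older size argument (seaborn >= 0.9);
# distplot is deprecated since seaborn 0.11 in favour of histplot/displot
sns.FacetGrid(iris, hue='species', height=5) \
   .map(sns.distplot, 'petal_length', kde=False) \
   .add_legend();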
Here, we can see all three species of the Iris flower, and the histogram helps us understand that the petal length of Setosa is clearly smaller than that of Versicolor and Virginica. Similarly, you can try looking at the other features of the Iris flower and check whether you can spot a trend, or you can check it out in my ipython notebook.