Histogram Read Understand Build

A histogram is a univariate analysis tool, which means it is used to visualize a single variable in our dataset. The horizontal axis (x-axis) shows the feature under consideration, divided into intervals, and the vertical axis (y-axis) shows the frequency of values in each interval. The histogram is used to understand the distribution of a variable in our dataset and was first introduced by Karl Pearson.

Suppose, for example, we would like to know how many people in a particular area are suffering from heart disease, broken down by age. A histogram is very useful here. It not only gives us the count for the age range we are interested in, but also lets us compare the counts across equal-width age intervals in our population.

Age of people suffering from Heart Disease

As we can see in the image above, many people above the age of 45 are suffering from heart disease, but heart disease is no longer an uncommon occurrence among the younger generation either.

To Further Understand Histogram Let Us Do An Exercise On The Iris Dataset:

1. Let’s first plot a 1D scatter plot for the three species using the petal length feature. In the 1D scatter plot below, we can see the endpoints (the smallest and largest values) of the feature for each species, but the plot does not tell us how many data points fall in each interval, so the information it gives is incomplete.
1D Scatter Plot

2. The histogram helps us improve on these endpoints. Let’s take the Setosa species from the figure above. Here the start and end points are 1 and 2 respectively. We divide this range into 5 equal intervals: 1) 1.0 to 1.2, 2) 1.2 to 1.4, 3) 1.4 to 1.6, 4) 1.6 to 1.8 and 5) 1.8 to 2.0, and plot them on the x-axis.

Histogram Intervals

3. On the y-axis we plot the count of data points that fall in each interval. Once we have marked the count for every interval, we draw a bar over each interval whose height equals its count.

Simple Histogram

4. The end result of this exercise, as seen above, is a histogram.
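The four steps of the exercise can be sketched in a few lines of plain Python. This is only a minimal illustration: the petal lengths are made-up values (not the real Iris measurements), `histogram_counts` is a hypothetical helper name, and the lengths are stored in millimetres as integers so that values sitting exactly on an interval boundary are not misplaced by floating-point rounding.

```python
# Illustrative petal lengths in millimetres.
# These are made-up values, NOT the real Iris measurements.
petal_lengths_mm = [10, 13, 14, 14, 15, 15, 15, 16, 17, 19]


def histogram_counts(values, start, end, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals.

    Assumes (end - start) is an exact multiple of n_bins.
    Each interval includes its left edge; the last one also includes `end`.
    """
    width = (end - start) // n_bins
    counts = [0] * n_bins
    for v in values:
        if not start <= v <= end:
            continue  # ignore values outside the plotted range
        index = min((v - start) // width, n_bins - 1)
        counts[index] += 1
    return counts


# Step 2: split the range 1.0 cm to 2.0 cm (10 mm to 20 mm) into 5 intervals.
# Step 3: count the points in each interval.
counts = histogram_counts(petal_lengths_mm, 10, 20, 5)  # [1, 1, 5, 2, 1]

# Step 4: draw one text "bar" per interval.
for i, count in enumerate(counts):
    low = 1.0 + 0.2 * i
    print(f"{low:.1f} to {low + 0.2:.1f}: {'#' * count} ({count})")
```

The counts always add up to the number of data points inside the range, which is a quick sanity check for any histogram.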

The Histogram Will Contain:

  • The variable or feature intervals on the x-axis
  • The count, i.e. the number of data points that fall in each interval, on the y-axis
  • Bins – The bin size decides how many bins we will have, and vice versa. It is the most important parameter of a histogram, and we should always try out a few different bin sizes to select the best one for our data. We can never know what patterns may be hidden under the bars!
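To see why the bin size matters, here is a small sketch using NumPy's `numpy.histogram`, which returns the counts and the bin edges without drawing anything. The data is synthetic (two clusters generated with a fixed seed, an assumption made purely for illustration), and the bin counts 3, 10 and 30 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): two clusters, 150 points in total.
data = np.concatenate([
    rng.normal(loc=1.5, scale=0.2, size=50),
    rng.normal(loc=4.5, scale=0.5, size=100),
])

results = {}
for n_bins in (3, 10, 30):
    # counts[i] is the number of points between edges[i] and edges[i + 1]
    counts, edges = np.histogram(data, bins=n_bins)
    results[n_bins] = (counts, edges)
    print(f"{n_bins:>2} bins, width {edges[1] - edges[0]:.2f}: {counts.tolist()}")
```

With very few bins the two clusters may blur together, while with many bins individual bars become noisy. Whatever the bin count, the counts still sum to the number of data points.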

Histogram Makes Far More Sense Than 1D Scatter Plot Because It Helps Us Get Answers To Questions Such As:

  • How many points do we have in a specific interval?
  • Where is the data highly concentrated?
  • How is the data distributed?

Some Of The Advantages Of Histogram Are:

  • A histogram makes it easier to identify patterns, such as how frequently values occur in the dataset, that are difficult to interpret in tabular form.
  • It helps to visualize the distribution of the data.
  • Huge datasets can be easily visualized using a histogram.
  • It also helps us get an understanding of the skewness of the data.

Python Gives Us A Lot Of Liberty To Customise Our Histogram Where We Can:

  • Change the size of the bins
Histogram with different bin size
  • Build a cumulative histogram. Here each bar gives the count in that bin plus the counts in all bins for smaller values, so the last bar gives the total number of data points.
Cumulative histogram
  • Use different types of bar visualization in matplotlib, such as the default bar, barstacked, step and stepfilled
histogram with steps
  • Add a density plot, which is a smoothed, continuous version of a histogram. Seaborn easily allows us to plot a kernel density estimate on top of our histogram
Histogram with Kernel Density Plot
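The matplotlib options listed above can be sketched as follows. This is a minimal example under stated assumptions: the data is synthetic (150 normally distributed values, not the Iris measurements), the bin counts are arbitrary, and the `Agg` backend is selected only so that the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption), not the real Iris measurements.
data = rng.normal(loc=4.0, scale=1.0, size=150)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Change the number (and therefore the size) of the bins.
axes[0].hist(data, bins=20)
axes[0].set_title("bins=20")

# 2. Cumulative histogram: each bar adds the counts of all bins to its left.
cum_counts, cum_edges, _ = axes[1].hist(data, bins=10, cumulative=True)
axes[1].set_title("cumulative=True")

# 3. A different bar style; histtype accepts
#    'bar' (default), 'barstacked', 'step' and 'stepfilled'.
axes[2].hist(data, bins=10, histtype="step")
axes[2].set_title("histtype='step'")

fig.tight_layout()
# To view the result, call fig.savefig("histograms.png"),
# or plt.show() when using an interactive backend.
```

The last value of `cum_counts` equals the number of data points, which matches the description of the cumulative histogram. For the kernel density overlay, seaborn (version 0.11 or later) provides `seaborn.histplot(data, kde=True)`; it is left out of the sketch to keep the dependencies to matplotlib and NumPy.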

A histogram, when used to its full potential, becomes an important tool for visualizing and analyzing univariate data.

Histogram with all species comparison

Here we can see all three species of the Iris flower, and the histogram helps us see that the petal length of Setosa is clearly smaller than that of Versicolor and Virginica. Similarly, you can look at the other features of the Iris flower and check whether they follow a similar trend, or you can check it out in my IPython notebook.
