Probability Density Function
A probability density function (PDF) is a tool for univariate analysis, that is, the analysis of a single variable, so it is very helpful when we need to dig deeper into one particular feature. A histogram is simply a plot of the data we have collected. It can give you an idea of what the probability distribution of your measurement looks like, but it cannot give you an accurate figure for how often a measurement falls a given distance from the mean, especially far from the mean, where few or no observations land.
For example, suppose you are trying to find the probability of a measurement lying more than five or six standard deviations from the mean. If the data are roughly normal, you would expect only about one such measurement in 1.7 million at five standard deviations, and about one in 500 million at six. In most practical cases, collecting that much data is impossible or uneconomical. Hence a standard PDF curve is fitted to the histogram, and once you have a good fit you can estimate the probability of a measurement falling at any distance from the mean without taking a huge number of samples. The estimate is only as good as the fit, particularly in the tails.
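The tail argument can be checked with a few lines of Python. This is a minimal sketch that assumes the measurements are normally distributed; the `measurements` list is simulated stand-in data and `two_sided_tail` is a helper name introduced here for illustration. The "fit" is the simplest one possible: estimating the mean and standard deviation from the sample.

```python
import math
import random
from statistics import NormalDist

def two_sided_tail(k):
    """P(|X - mean| > k standard deviations) for a normal distribution."""
    return math.erfc(k / math.sqrt(2))

# Simulated stand-in for real measurements (an assumption, not real data).
random.seed(0)
measurements = [random.gauss(50, 2) for _ in range(1000)]

# Fit a normal curve by estimating its mean and standard deviation.
fitted = NormalDist.from_samples(measurements)

for k in (3, 5, 6):
    p = two_sided_tail(k)
    low = fitted.mean - k * fitted.stdev
    high = fitted.mean + k * fitted.stdev
    print(f"{k} sigma (outside {low:.2f} to {high:.2f}): "
          f"probability {p:.3g}, about 1 in {1 / p:,.0f} measurements")
```

With only 1,000 measurements the histogram would almost certainly be empty beyond five standard deviations, yet the fitted curve still gives a number for that region.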
A probability density function (PDF) defines the probability distribution of a continuous random variable, whereas a probability mass function (PMF) does the same for a discrete random variable.
To understand the difference between these two concepts, and what a PDF actually is, let's look at the following examples:
A probability mass function (PMF) is used for a discrete random variable, i.e. one whose possible values can be counted, such as integers. If we want the probability distribution of the number of apples each tree produces, we use a PMF, because the number of apples on any given tree is always a whole number such as 412 or 347 and never a fraction.
A probability density function (PDF) is used for a continuous random variable, i.e. one that can take any value in a range, including fractional values. When we want the probability distribution of the heights of all students in a class, we use a PDF, because height is measured on a continuous scale. Student A might be 105.234 cm tall whereas Student B might be 104.982 cm. The difference may be negligible to the naked eye, but measured precisely enough, no two students will have exactly the same height.
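The contrast between the two can be sketched in code. The distributions below are hypothetical examples chosen only for illustration (a binomial count and a narrow normal distribution), and `binomial_pmf` is a helper written here, not part of any library.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial count. X is discrete, so this is a probability."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical discrete variable: number of successes in 10 trials.
n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print("sum of PMF values:", sum(pmf))   # probabilities of all outcomes add up to 1
print("largest PMF value:", max(pmf))   # each value is itself a probability

# Hypothetical continuous variable: a narrow normal distribution.
narrow = NormalDist(mu=0.5, sigma=0.1)
print("PDF at the mean:", narrow.pdf(0.5))  # a density, so it may exceed 1
```

A PMF value answers "how likely is exactly this value?", while a PDF value only becomes a probability once it is integrated over an interval.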
How do we read a PDF?
When a PDF is plotted, probabilities are read off as areas under the curve rather than as heights. The total area under the curve over the whole range of the variable equals 1, because the variable must take some value in that range. If we take any two values in the range, the area under the curve between them is the probability that the variable falls in that interval. The PDF is thus used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on one exact value.
For example, suppose our classroom has 100 students. Since height can take an infinite number of possible values, the probability of a student having any one exact height is 0, which is not a very useful number for us. In such cases we use a PDF and compute the probability for an interval instead. If we want the probability that a student's height is between 103 cm and 107 cm, we might get, say, 0.48 or 48%. This is useful to someone who wants to draw a conclusion about students in that height range.
As seen above, a continuous random variable (in our case, the height of a student) takes on an uncountably infinite number of possible values. For a discrete random variable X that takes on a finite or countably infinite number of possible values, we can determine P(X = x) for every possible value of X, and this is the probability mass function (PMF). For a continuous random variable, the probability that X takes on any particular value x is 0, so finding P(X = x) is not going to work. Instead, we need to find the probability that X falls in some interval (a, b), that is, P(a < X < b). We do that using a probability density function (PDF).
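The height example can be sketched as follows. The normal model with a mean of 105 cm and a standard deviation of 3 cm is an assumption made purely for illustration, and `interval_probability` is a helper name introduced here.

```python
from statistics import NormalDist

# Assumed model, for illustration only: heights ~ Normal(mean 105 cm, sd 3 cm).
heights = NormalDist(mu=105, sigma=3)

def interval_probability(dist, a, b):
    """P(a < X < b), computed as the difference of the CDF at the endpoints."""
    return dist.cdf(b) - dist.cdf(a)

# Probability that a height falls between 103 cm and 107 cm.
print(interval_probability(heights, 103, 107))

# As the interval around 105 cm shrinks, the probability shrinks towards 0.
for width in (1.0, 0.1, 0.001):
    p = interval_probability(heights, 105 - width / 2, 105 + width / 2)
    print(f"width {width} cm: {p:.6f}")

# A single exact value is a zero-width interval, so its probability is 0.
print(interval_probability(heights, 105, 105))
```

Under this assumed model the 103 to 107 cm interval comes out close to one half, in the same range as the 48% used in the example above.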
To understand the concept further, let's take the example of the Abalone dataset.
The Abalone dataset contains the physical measurements of abalones, which are large, edible sea snails. For our case, we will use the Length variable and build a probability density function for that column.
The image below is a simple histogram showing the density of abalone length.
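import seaborn as sns
sns.FacetGrid(abalone, height=5) \
    .map(sns.distplot, 'Length', bins=10) \
    .add_legend();  # abalone: DataFrame with a 'Length' column; `size` became `height` in seaborn 0.9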
When we reduce the bin width of this histogram (here by increasing the number of bins from 10 to 20), we get a finer-grained version that follows the shape of the distribution more closely.
sns.FacetGrid(abalone, height=5) \
    .map(sns.distplot, 'Length', bins=20) \
    .add_legend();
When we reduce the bin width further and add a smooth line to the histogram (a kernel density estimate, drawn when kde=True), we can see where the probability density function comes from and why it is described as a smoothed form of a histogram.
sns.set_style("whitegrid")
sns.FacetGrid(abalone, height=5) \
    .map(sns.distplot, 'Length', bins=30, kde=True) \
    .add_legend();
If we look at the image above, we can see that a probability density function is essentially a density histogram with very narrow bins, drawn as a smooth curve. Such a curve is denoted f(x) and is called a (continuous) probability density function (PDF). A density histogram is defined so that the area of each rectangle equals the relative frequency of the corresponding class, and the area of the entire histogram equals 1. This suggests that finding the probability that a continuous random variable X falls in some interval of values means finding the area under the curve f(x) between the endpoints of the interval. In the figure below, the shaded area is the probability that the length of an abalone is between 0.4 and 0.6.
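The area property of a density histogram can be checked numerically. This is a sketch on synthetic data: `lengths` is a made-up stand-in for the abalone Length column (an assumption, since the real data is not loaded here), and the bins are fixed to 20 equal intervals on 0 to 1 so that the bars covering 0.4 to 0.6 can be picked out by index.

```python
import numpy as np

# Synthetic stand-in for abalone['Length'] (assumption: not the real data).
rng = np.random.default_rng(0)
lengths = rng.beta(5, 5, size=4000)  # values strictly between 0 and 1

# 20 equal bins on [0, 1], so the bin edges are multiples of 0.05.
counts, edges = np.histogram(lengths, bins=20, range=(0.0, 1.0))
density, _ = np.histogram(lengths, bins=20, range=(0.0, 1.0), density=True)
widths = np.diff(edges)

# Area of each bar = height (density) * width = relative frequency of that bin.
bar_areas = density * widths
total_area = bar_areas.sum()

# Bars 8 to 11 cover the interval from 0.4 to 0.6.
area_04_06 = bar_areas[8:12].sum()
fraction_04_06 = counts[8:12].sum() / len(lengths)

print("total area:", total_area)
print("area between 0.4 and 0.6:", area_04_06)
print("fraction of samples between 0.4 and 0.6:", fraction_04_06)
```

The area of the bars between 0.4 and 0.6 matches the fraction of samples in that range, which is the histogram counterpart of the shaded area under the smooth curve.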
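For a probability density function f(x), the probability of an interval is defined by the following formula:
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
- [a, b] = interval in which x lies.
- P(a ≤ X ≤ b) = probability that some value x lies within this interval.
- dx = the infinitesimally small width of each slice of the interval over which f(x) is integrated; adding up the slices from a to b gives the area under the curve.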
The probability density function has the following properties:
- Since the continuous random variable is defined over a continuous range of values (called the domain of the variable), the density function is defined over that whole range, and it is never negative there.
- The area bounded by the curve of the density function and the x-axis is equal to 1, when computed over the domain of the variable.
- The probability that a random variable assumes a value between a and b is equal to the area under the density function bounded by a and b.
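These properties can be verified numerically for a concrete density. The sketch below assumes a normal distribution with mean 0.5 and standard deviation 0.1 as an example, and uses a simple midpoint rule (`integrate`, a helper written here) instead of a library integrator.

```python
from statistics import NormalDist

def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    dx = (b - a) / steps
    return sum(f(a + (i + 0.5) * dx) for i in range(steps)) * dx

# Assumed example distribution, for illustration only.
dist = NormalDist(mu=0.5, sigma=0.1)

# Area over (practically) the whole domain: ten standard deviations each side.
total = integrate(dist.pdf, dist.mean - 10 * dist.stdev, dist.mean + 10 * dist.stdev)

# Probability of an interval: area under the density between a and b.
a, b = 0.4, 0.6
area = integrate(dist.pdf, a, b)
via_cdf = dist.cdf(b) - dist.cdf(a)

print("total area:", total)
print("area between a and b:", area)
print("same probability from the CDF:", via_cdf)
```

The integral of the density between a and b agrees with the difference of the cumulative probabilities at b and a, which is the formula for P(a ≤ X ≤ b) in numerical form.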
The distribution of a continuous random variable can be characterized through its probability density function (PDF). The cumulative distribution function (CDF) is derived from the PDF by accumulating probability from the lowest values upward, so that it reaches one at the top of the range. In other words, the CDF is the integral of the PDF. We will dig deeper into what the CDF is and how it is useful in a different post. You can find the code for the Abalone dataset on my GitHub profile.