Cumulative Distribution Function

## Cumulative Distribution Function – Formula, Properties, Differences

The cumulative distribution function (CDF), sometimes loosely called the cumulative density function, gives the probability that a random variable X takes a value less than or equal to x, that is, F(x) = P(X ≤ x). As the name suggests, "cumulative" means that the CDF adds up the probabilities of all the values up to and including x.

## To Understand The Concept Of CDF We Can Take The Example Of A Deck Of Cards.

We have 52 cards in total, and the probability of drawing a card with the number 6 is Pr(6) = 4/52, or about 0.08. However, if we want the probability of drawing a card with any number less than or equal to 6, we use the CDF. The CDF calculates the probability of drawing a card with the number 2, then 3, and so on up to 6, and adds these probabilities together.

In our case, this is Pr(2) + Pr(3) + Pr(4) + Pr(5) + Pr(6) = 4/52 + 4/52 + 4/52 + 4/52 + 4/52 = 20/52, or about 0.38.
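The card arithmetic above can be reproduced with exact fractions. This is only a minimal sketch: the names `pr` and `cdf` are made up for illustration, and, as in the example, only the numbered ranks from 2 upwards are summed.

```python
from fractions import Fraction

CARDS_IN_DECK = 52
CARDS_PER_RANK = 4  # one card of each rank per suit


def pr(rank):
    """Probability of drawing a card of one given rank."""
    return Fraction(CARDS_PER_RANK, CARDS_IN_DECK)


def cdf(rank):
    """P(card number <= rank), adding up the ranks 2, 3, ..., rank."""
    return sum((pr(r) for r in range(2, rank + 1)), Fraction(0))


p_six = pr(6)
p_at_most_six = cdf(6)

print(p_six, round(float(p_six), 2))
print(p_at_most_six, round(float(p_at_most_six), 2))
```

Using `Fraction` keeps 4/52 and 20/52 exact, so the rounding to 0.08 and 0.38 happens only when the result is displayed.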

## Difference Between Calculating The Distribution Function For Discrete And Continuous Random Variables:

• Discrete random variable: To calculate the CDF F(x), we can simply create two columns, one holding the values and the other holding their corresponding probabilities. To calculate the CDF at any value, we add up the probabilities of all the values up to and including that value.
• Continuous random variable: Here the CDF is especially helpful, because a continuous random variable can take any value in a range rather than one of a countable set of values. What we observe is never exact: we always round our numbers to a certain number of decimal places, such as one (0.1) or two (0.11), depending on our use case. We cannot build two columns of values and probabilities, because the probability of observing any one specific value is always zero, and adding up zeros only gives zero, which would mislead us. Hence, we use the cumulative distribution function to find the probability of the variable taking a value less than or equal to x.
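The two-column approach for a discrete random variable can be sketched as follows. The fair six-sided die is an assumed example that does not appear in the original post, and `discrete_cdf` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import accumulate

# Column 1: the values; column 2: their probabilities (assumed fair die).
values = [1, 2, 3, 4, 5, 6]
probabilities = [Fraction(1, 6)] * 6

# Running total of the probability column: the CDF at each listed value.
cumulative = list(accumulate(probabilities))


def discrete_cdf(x):
    """F(x): add the probabilities of all table values that are <= x."""
    total = Fraction(0)
    for value, probability in zip(values, probabilities):
        if value <= x:
            total += probability
    return total


for value, running_total in zip(values, cumulative):
    print(value, running_total)
```

Because the CDF is defined for every x and not only for the listed values, `discrete_cdf(3.5)` returns the same result as `discrete_cdf(3)`.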

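## The Formulas For The Cumulative Distribution Function Are:

• Continuous random variable: $F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt$
• Discrete random variable: $F_X(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i)$

Here, F(x) accumulates the probabilities of all the values less than or equal to x.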

## Below Is The Visual Representation Of CDF and PDF Of The Iris Setosa Flower:

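```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: the post does not show how `iris` is loaded; seaborn's bundled
# copy of the dataset uses the same column names as the code below.
iris = sns.load_dataset('iris')
iris_setosa = iris.loc[iris['species'] == 'setosa']

# Histogram of petal_length, normalised so that the bin values sum to 1
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10, density=True)
pdf = counts / counts.sum()

# Compute the CDF as the running total of the binned PDF
cdf = np.cumsum(pdf)

plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.xlabel('Petal Length')
plt.title('CDF and PDF for Iris Setosa Flower')
plt.show()
```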
• The horizontal axis shows the variable under consideration, here the ‘Petal Length’. The horizontal axis is the variable factor in a CDF.
• The vertical axis shows the probability, which must fall between zero and one. This range is the same for every CDF, and the value of the CDF never decreases as we move from left to right along the horizontal axis.

## If We Look At The Image Above, The CDF Here Can Be Used To Answer Questions Such As:

• What is the probability that the length of the petal will be less than 1.8?
• What is the probability that the length of the petal will be greater than 1.8?
• What is the probability that the length of the petal will be between 1.6 to 1.8?
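Each of the three questions above reduces to one or two evaluations of the CDF. The sketch below uses an empirical CDF over a small made-up sample; the numbers are hypothetical and are not the Iris measurements, and `ecdf` is an illustrative helper name.

```python
from bisect import bisect_right

# Hypothetical petal lengths, for illustration only.
sample = sorted([1.2, 1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 1.8, 1.9])


def ecdf(x):
    """Empirical CDF: the fraction of sample values that are <= x."""
    return bisect_right(sample, x) / len(sample)


# For a continuous variable, "less than" and "less than or equal to"
# give the same probability, so F(1.8) answers the first question.
p_less_than_1_8 = ecdf(1.8)
# P(X > 1.8) = 1 - F(1.8)
p_greater_than_1_8 = 1 - ecdf(1.8)
# P(1.6 < X <= 1.8) = F(1.8) - F(1.6)
p_between = ecdf(1.8) - ecdf(1.6)

print(p_less_than_1_8, p_greater_than_1_8, p_between)
```

On a plot, the same answers are obtained by reading the height of the CDF curve at 1.8 and at 1.6.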

## A CDF Can Be A Great Tool For Data Visualization When It Is Used Effectively, Such As By Adding More Than One CDF To A Plot:

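```python
# Assumes numpy (np), matplotlib.pyplot (plt) and the `iris` DataFrame are
# already available, as in the Setosa plot earlier in the post.
iris_virginica = iris.loc[iris['species'] == 'virginica']
iris_versicolor = iris.loc[iris['species'] == 'versicolor']

# Petal_length - Virginica
counts, bin_edges = np.histogram(iris_virginica['petal_length'], bins=10, density=True)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label='Virginica')

# Petal_length - Versicolor
counts, bin_edges = np.histogram(iris_versicolor['petal_length'], bins=10, density=True)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label='Versicolor')

plt.xlabel('Petal Length')
plt.legend()
plt.show()
```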

In the plot above we have two CDFs, one for the Virginica and one for the Versicolor species of the Iris flower. Looking at the CDFs of their petal lengths, we can conclude that around 12% of Virginica flowers have a petal length of less than 5, while 97% of Versicolor flowers have a petal length of less than 5. This helps us check the error rate of a simple if-else model that distinguishes the Iris species using only one variable. If we take 5.0 as the petal length threshold, the figure tells us that the error rate is 3% for Versicolor (quite good) but 12% for Virginica (a little on the higher side). Shifting the threshold slightly to the left lowers the Virginica error while raising the Versicolor error, so we can look for a threshold that balances the two before going ahead with our model.
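The threshold reasoning above can be sketched in a few lines. The two samples below are made-up numbers, not the real Iris measurements, and `error_rates` is a hypothetical helper; the rule being tested is "predict Versicolor when the petal length is below the threshold, otherwise Virginica".

```python
from bisect import bisect_left

# Hypothetical petal lengths, for illustration only.
versicolor = sorted([3.5, 3.9, 4.0, 4.2, 4.3, 4.5, 4.6, 4.7, 4.9, 5.1])
virginica = sorted([4.8, 5.0, 5.1, 5.3, 5.5, 5.6, 5.8, 6.0, 6.1, 6.7])


def error_rates(threshold):
    """Return (versicolor_error, virginica_error) for a given threshold."""
    # Number of flowers of each species strictly below the threshold.
    versicolor_below = bisect_left(versicolor, threshold)
    virginica_below = bisect_left(virginica, threshold)
    # Versicolor at or above the threshold is misread as Virginica.
    versicolor_error = (len(versicolor) - versicolor_below) / len(versicolor)
    # Virginica below the threshold is misread as Versicolor.
    virginica_error = virginica_below / len(virginica)
    return versicolor_error, virginica_error


for threshold in (4.75, 5.0, 5.25):
    print(threshold, error_rates(threshold))
```

Moving the threshold to the left reduces one error and increases the other, so the choice of threshold is a trade-off rather than a free improvement.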

The CDF of a continuous random variable X can be expressed as the integral of its probability density function $f_X$. To understand this point, let’s look at an example.

## Below You Will Find The Probability Density Function For The Abalone Dataset:

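```python
# Assumes seaborn (sns), matplotlib.pyplot (plt) and an `abalone` DataFrame
# with a 'Length' column are already loaded.
sns.set_style("whitegrid")
# `height` replaces the older `size` argument of FacetGrid, and kdeplot draws
# the same density curve as the deprecated distplot(hist=False, kde=True).
sns.FacetGrid(abalone, height=5) \
    .map(sns.kdeplot, 'Length') \
    .add_legend()
plt.show()
```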

The CDF for the above PDF will be:

```python
counts, bin_edges = np.histogram(abalone['Length'], bins=25, density=True)
pdf = counts / counts.sum()

# Compute CDF
cdf = np.cumsum(pdf)

plt.plot(bin_edges[1:], cdf)
plt.show()
```

Here we can see that the PDF and the CDF are inter-related: integrating the PDF gives us the CDF, and differentiating the CDF gives us the PDF.
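The link between the two functions can be checked numerically. The sketch below assumes an exponential distribution with a made-up rate of 1.5, chosen only because its PDF and CDF have simple closed forms; it is not related to the abalone data, and the helper names are illustrative.

```python
import math

RATE = 1.5  # hypothetical rate parameter of an exponential distribution


def pdf(t):
    """Closed-form exponential PDF."""
    return RATE * math.exp(-RATE * t) if t >= 0 else 0.0


def cdf(x):
    """Closed-form exponential CDF."""
    return 1.0 - math.exp(-RATE * x) if x >= 0 else 0.0


def integrate_pdf(x, steps=10_000):
    """Trapezoidal estimate of the integral of the PDF from 0 to x."""
    h = x / steps
    total = 0.5 * (pdf(0.0) + pdf(x))
    for i in range(1, steps):
        total += pdf(i * h)
    return total * h


def differentiate_cdf(x, h=1e-5):
    """Central-difference estimate of the slope of the CDF at x."""
    return (cdf(x + h) - cdf(x - h)) / (2 * h)


x = 2.0
cdf_from_integral = integrate_pdf(x)
pdf_from_derivative = differentiate_cdf(x)

print(cdf_from_integral, cdf(x))
print(pdf_from_derivative, pdf(x))
```

The PDF is zero for negative values here, so integrating from 0 is the same as integrating from negative infinity.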

The shape of our CDF depends on the mean and the standard deviation: the mean determines where the curve is centred, while the standard deviation determines how steep it is. If the standard deviation is low, i.e. the spread is very narrow, the curve is steep; if the standard deviation is high, i.e. the distribution is wider, the curve is flatter.
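To make the effect of the standard deviation concrete, the sketch below compares normal distributions using Python's `statistics.NormalDist`. The choice of a normal distribution and the particular means and standard deviations are assumptions made purely for illustration.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0.0, sigma=0.5)   # small spread
wide = NormalDist(mu=0.0, sigma=2.0)     # large spread
shifted = NormalDist(mu=3.0, sigma=0.5)  # same spread as `narrow`, new mean

# The slope of a CDF at any point is the PDF at that point, so the
# steepness at the centre of the curve is pdf(mean).
slope_narrow = narrow.pdf(0.0)
slope_wide = wide.pdf(0.0)

# How much probability the CDF gains between -1 and 1.
rise_narrow = narrow.cdf(1.0) - narrow.cdf(-1.0)
rise_wide = wide.cdf(1.0) - wide.cdf(-1.0)

# Changing the mean moves the curve sideways without changing its shape.
centre_narrow = narrow.cdf(0.0)
centre_shifted = shifted.cdf(3.0)

print(slope_narrow, slope_wide)
print(rise_narrow, rise_wide)
print(centre_narrow, centre_shifted)
```

The narrow distribution climbs most of the way from 0 to 1 within a short interval around its mean, while the wide one needs a much longer interval for the same rise.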

## Properties Of Cumulative Distribution Function:

• F is non-decreasing and right-continuous, which makes it a càdlàg function.
• F(x) goes to 1 as x tends to positive infinity.
• F(x) goes to 0 as x tends to negative infinity.
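These properties can be checked on a small empirical CDF. The sample values below are made up for illustration, and `ecdf` is a hypothetical helper; the check at 3.5 shows what right-continuity means at a jump.

```python
from bisect import bisect_right

# Hypothetical observations, for illustration only.
sample = sorted([2.0, 3.5, 3.5, 5.0, 7.25])


def ecdf(x):
    """Empirical CDF: the fraction of sample values that are <= x."""
    return bisect_right(sample, x) / len(sample)


# Non-decreasing: evaluate on a grid from -1.0 to 10.0 in steps of 0.25.
grid = [i * 0.25 for i in range(-4, 41)]
values = [ecdf(x) for x in grid]
is_non_decreasing = all(a <= b for a, b in zip(values, values[1:]))

# Limits far to the left and far to the right.
far_left = ecdf(-1e9)
far_right = ecdf(1e9)

# Right-continuity at the jump at 3.5: the value at the jump already
# includes the jump, so it matches the value just to the right, while
# the value just to the left is lower.
eps = 1e-9
at_jump = ecdf(3.5)
just_right = ecdf(3.5 + eps)
just_left = ecdf(3.5 - eps)

print(is_non_decreasing, far_left, far_right)
print(just_left, at_jump, just_right)
```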

## Some Of The Areas Where CDF Is Used Are:

• CDFs are used in rendering in computer graphics.
• CDFs are used in computing critical values, p-values and the power of statistical tests.
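As an illustration of the statistical use, the sketch below computes a two-sided p-value and a critical value from the standard normal CDF and its inverse. The test statistic `z = 2.1` and the significance level are hypothetical values chosen for the example, and a z-test is assumed only because it is the simplest case.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

z = 2.1       # hypothetical observed test statistic
alpha = 0.05  # hypothetical significance level

# Two-sided p-value: the probability of a statistic at least this extreme,
# taken from both tails of the distribution via the CDF.
p_value = 2 * (1 - std_normal.cdf(abs(z)))

# Critical value: the point where the CDF reaches 1 - alpha / 2,
# found with the inverse CDF.
critical_value = std_normal.inv_cdf(1 - alpha / 2)

reject_null = abs(z) > critical_value

print(round(p_value, 4), round(critical_value, 2), reject_null)
```

Comparing `p_value` with `alpha` and comparing `abs(z)` with `critical_value` are two views of the same decision, since both come from the same CDF.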

## Some Of The Functions Derived From The Cumulative Distribution Function Are:

We will go over these functions in more detail in our coming posts. The Python notebook used for this post is available on my GitHub profile.