Pair Plots

A simple 2D scatter plot is used to understand the relationship or pattern between two variables or dimensions in a dataset, and a 3D plot can be used for three. But what do we do when the dataset has more than three dimensions or features, given that we humans cannot visualize more than three dimensions? One solution to this problem is the pair plot, one of the most effective starting tools for exploring such data.

Pair plots are used to plot features when we have more than three dimensions. As the name suggests, we form pairs of features and plot every pair.

For example, let’s say we have four features in our dataset: Name, Place, Animal and Thing. In that case, we will have 4C2 = 6 unique plots. The pairs in this case will be i. (Name, Place); ii. (Name, Animal); iii. (Name, Thing); iv. (Place, Animal); v. (Place, Thing) and vi. (Animal, Thing).

So, instead of trying to visualise four dimensions at once, which is not possible, we look at six 2D plots arranged in a matrix and use them to understand the 4-dimensional data.
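The pairing in the example above can be reproduced in a few lines. This is only a minimal sketch using Python's standard library; the feature names are the ones from the example, and the variable names are chosen here for illustration.

```python
from itertools import combinations

# The four features from the example above
features = ["Name", "Place", "Animal", "Thing"]

# Every unordered pair of distinct features: one scatter plot per pair
pairs = list(combinations(features, 2))

for number, pair in enumerate(pairs, start=1):
    print(number, pair)

print("Number of unique plots:", len(pairs))
```

The pairs come out in the same order as the list in the example, from (Name, Place) to (Animal, Thing).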

Some Basic Points To Remember Whenever We Are Looking At Pair Plots:

  1. We get a matrix of plots with an equal number of plots on either side of the diagonal, and the two sides are mirror images of each other. So we only need to look at the plots either below or above the diagonal, unless we use pair grids to show something different on each side.
  2. We should always make sure we understand the legend.
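The first point can be illustrated with a small sketch that sorts the cells of the plot matrix into the diagonal and the two triangles. It assumes the usual layout in which the row sets the y-axis feature and the column sets the x-axis feature; the function name classify_cells is made up for this illustration.

```python
def classify_cells(features):
    """Group the cells of a pair-plot matrix into diagonal, upper and lower parts."""
    layout = {"diagonal": [], "upper": [], "lower": []}
    for row, y_feature in enumerate(features):
        for col, x_feature in enumerate(features):
            cell = (y_feature, x_feature)
            if row == col:
                layout["diagonal"].append(cell)   # one feature against itself
            elif col > row:
                layout["upper"].append(cell)      # above the diagonal
            else:
                layout["lower"].append(cell)      # below the diagonal
    return layout


layout = classify_cells(["Name", "Place", "Animal", "Thing"])
for part, cells in layout.items():
    print(part, len(cells))

# Swapping the axes of every upper cell gives exactly the lower cells
mirrored_upper = {(x, y) for (y, x) in layout["upper"]}
print(mirrored_upper == set(layout["lower"]))
```

For four features this gives 4 diagonal cells and 6 cells in each triangle, and the final check confirms that the two triangles hold the same pairs with the axes swapped.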

It is remarkable that a single line of code gives us the entire plot. Instead of writing code for each 2D scatter plot individually, we write one line and get all our pair plots.

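#Importing the plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

#Creating a simple pair plot (knowledge is the pandas DataFrame holding our dataset)
sns.pairplot(knowledge)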
Simple Pair Plot

We can make these charts more informative by colouring the points by class label.

#Giving the pair plot class labels
sns.pairplot(knowledge, hue = ' UNS')
Simple Pair plot with class labels

As Seen Above The Pair Plots Can Be Divided Into Three Parts:

  • The diagonal, which shows a histogram for each feature. The histogram lets us see the distribution of a single variable.
  • The upper triangle and the lower triangle, which show scatter plots. The scatter plots show us the relationship between pairs of features. The upper and lower triangles are mirror images of each other.

However, seaborn gives us some excellent tools to customise our pair plots in the form of pair grids. Here, we can choose what we want to see in each of the three parts of the plot. For example, in the figures below we choose different types of plots for the upper triangle, the diagonal and the lower triangle.

# Create a pair plot coloured by the knowledge level of the user, with a density plot
# on the diagonal and formatted scatter points.
# height sets the size of each facet in inches (it was called size before seaborn 0.9)
sns.pairplot(knowledge, hue = ' UNS', diag_kind = 'kde', palette = 'hls', plot_kws = {'alpha': 0.9, 's': 80, 'edgecolor': 'k'}, height = 4)
Pair grid
# Create an instance of the PairGrid class.
grid = sns.PairGrid(data = knowledge, height = 4)
# Map a scatter plot to the upper triangle
grid = grid.map_upper(plt.scatter, color = 'darkred')
# Map a histogram to the diagonal
grid = grid.map_diag(plt.hist, bins = 10, color = 'darkred', edgecolor = 'k')
# Map a density plot to the lower triangle
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')
Pair grid

With these customisations we get an elementary visualisation of the data that helps us find patterns and relationships. This gives us a great start in understanding our project.

The advantage of a pair plot is that it helps us identify the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps us build simple classification models by drawing straight lines that linearly separate the classes in our dataset.

However, pair plots have a major disadvantage.

Let’s say we have 10 or 100 features in our dataset instead of 4. In that case we would have 10C2 = 45 or 100C2 = 4,950 plots, far too many to go through and make sense of.
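How quickly the number of plots grows can be checked directly. This is a small sketch with the standard library; the feature counts are just the ones discussed in this article.

```python
from math import comb

# Number of unique feature pairs (nC2) for different numbers of features
plot_counts = {n: comb(n, 2) for n in (4, 10, 100, 1000)}

for n_features, n_plots in plot_counts.items():
    print(f"{n_features} features -> {n_plots} unique pair plots")
```

The count is n(n-1)/2, so it grows roughly with the square of the number of features: 6 plots for 4 features, but 4,950 for 100 and 499,500 for 1000.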

So, pair plots are easy to understand when the number of features or dimensions is low, say 4, 5 or even 6, as we can quickly go through all the plots and spot any trends. But when the number of dimensions is very high, say 10 or 100 or 1000, pair plots are not of much help unless we are very sure which 5-6 features we want to visualize. In such cases, we use dimensionality reduction techniques like Principal Component Analysis (PCA) or t-SNE, which help us visualize the data. We will try to understand those concepts too over time. The IPython notebook for this article can be found on my GitHub profile.
