Pair Plots
Do You Have The Right Tool For Exploring Your Data?

Why do we need Pair Plots?

A simple 2D scatter plot shows the relationship or pattern between two variables (features) in our dataset, and a 3D plot does the same for three. But what do we do when the dataset has more than three features, given that we humans cannot visualize more than three dimensions? One solution is the pair plot, one of the most effective starting tools in our arsenal for exploratory data analysis (EDA).

Pair plots are used when we have more than three dimensions. As the name suggests, we take the features two at a time and draw one plot for every pair.

For example, let’s say we have four features in our dataset: Name, Place, Animal and Thing. In that case, we will have 4C2 = 6 unique plots. The pairs are i. (Name, Place); ii. (Name, Animal); iii. (Name, Thing); iv. (Place, Animal); v. (Place, Thing) and vi. (Animal, Thing).

So, instead of trying to visualize four dimensions, which is not possible, we look at six 2D plots arranged in a matrix and use them to understand the 4-dimensional data.
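
The pairing in the four-feature example can be enumerated in a few lines. This is only a minimal sketch using Python's standard itertools.combinations; the feature names come from the example above, and the variable names are just illustrative.

```python
from itertools import combinations
from math import comb

# Feature names from the example above
features = ["Name", "Place", "Animal", "Thing"]

# Every unordered pair of distinct features gets one 2D plot
pairs = list(combinations(features, 2))

for first, second in pairs:
    print(f"({first}, {second})")

# The number of pairs matches 4C2
print(len(pairs), comb(len(features), 2))
```

The order in which the pairs come out is the same as the order listed in the example, from (Name, Place) to (Animal, Thing).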

Some basic points to remember whenever we are looking at pair plots:

  1. We get a matrix of plots with an equal number of plots on either side of the diagonal, and the two sides are mirror images of each other. So we only need to look at the plots below the diagonal or the ones above it, unless we use pair grids to draw something different in each triangle.
  2. We should always try to understand the legend.

It is simply amazing that just one line of code gives us the entire plot.

Instead of writing code for each 2D scatter plot individually, we write a single line and get all our pair plots.

# Creating a simple pair plot (knowledge is the DataFrame holding our dataset)
import seaborn as sns
sns.pairplot(knowledge)
A Simple Pair Plot

We can make these charts more informative by coloring the points by class label.

#Giving the pair plot class labels
sns.pairplot(knowledge, hue = ' UNS')
Pair plots with class labels

As seen above, a pair plot can be divided into three parts:

  • The diagonal, which shows a histogram. It lets us see the distribution of a single variable. (Newer seaborn versions draw a density curve here by default when hue is set.)
  • The upper triangle and the lower triangle, which show scatter plots. They show the relationship between each pair of features, and the two triangles are mirror images of each other.

Seaborn also gives us some great tools for customizing pair plots, in the form of pair grids (the PairGrid class). With these, we can choose what we want to see in each of the three parts of the plot.

For example, in the figures below we pick different types of plots for the upper triangle, the diagonal and the lower triangle.

# Create a pair plot colored by the user's knowledge level, with a density plot on
# the diagonal and formatted scatter plots elsewhere.
# Note: seaborn versions before 0.9 call the height argument size.
sns.pairplot(knowledge, hue = ' UNS', diag_kind = 'kde', palette='hls',
             plot_kws = {'alpha': 0.9, 's': 80, 'edgecolor': 'k'},
             height = 4)
Pair grid with density plot on the diagonal
# plt.scatter and plt.hist come from matplotlib
import matplotlib.pyplot as plt
# Create an instance of the PairGrid class (height was called size before seaborn 0.9).
grid = sns.PairGrid(data = knowledge, height = 4)
# Map a scatter plot to the upper triangle
grid = grid.map_upper(plt.scatter, color = 'darkred')
# Map a histogram to the diagonal
grid = grid.map_diag(plt.hist, bins = 10, color = 'darkred',
                     edgecolor = 'k')
# Map a density plot to the lower triangle
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')
Pair grid with three different types of plots

With the help of these customizations, we get an elementary visualization of the data that helps us find patterns and relationships in it, which is a great start for understanding a project. If you would like to see some more customization, you can check my GitHub profile for this dataset.

The advantage of a pair plot is that it helps us find the set of features that best explains the relationship between two variables or that forms the most clearly separated clusters. It also helps us build simple classification models by drawing straight lines that separate the classes in our dataset.

However, pair plots have a major disadvantage:

Let’s say we have 10 or 100 features in our dataset instead of 4. In such a case we would have 10C2 = 45 or 100C2 = 4,950 plots, which is far too many to go through and make sense of.
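
To put numbers on this, the count of unique pairs for n features is nC2 = n(n-1)/2. Here is a quick sketch using the standard library's math.comb; the helper name unique_plots is just an illustrative choice, and the feature counts are the ones discussed in this post.

```python
from math import comb

def unique_plots(n_features):
    """Number of unique 2D plots (feature pairs) for n_features features."""
    return comb(n_features, 2)

for n in (4, 10, 100, 1000):
    print(n, "features ->", unique_plots(n), "unique plots")
```

The count grows roughly with the square of the number of features, which is why the matrix stops being readable so quickly.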

So, pair plots are easy to read when the number of features is low, say 4, 5 or even 6, because we can quickly go through all the plots and spot any trends. But when the number of dimensions is very high, say 10, 100 or 1000, pair plots are not of much help unless we already know which 5 or 6 features we want to visualize. In such cases, we use dimensionality reduction techniques like Principal Component Analysis (PCA) or t-SNE to help us visualize the data. We will try to understand those concepts too over time.

