A 2D scatter plot is one of the simplest and most useful plotting tools in Exploratory Data Analysis (EDA). It takes two variables from our dataset, one for each axis, and plots every data point on a chart. The position of a point depends on its pair of values: one value gives its position along the horizontal axis and the other along the vertical axis. It is called a scatter plot because we are scattering all the points from our dataset across the chart.
A 2D scatter plot primarily consists of three components:
- X axis
- Y axis
- Scale of the plot
However, we can also add a different
It is important to have all three basic components in order to understand a plot properly. After looking at the X and Y axes, take care to understand the scale too, because the origin of the plot might not be (0,0). If we use different colours for classification, we should add a legend to show which colour represents which label.
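As a minimal sketch of these components, the snippet below draws a basic scatter plot with Matplotlib, labels both axes, and sets the axis limits explicitly. The data is synthetic and the names `feature_x` and `feature_y` are hypothetical placeholders, not columns from any real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, illustrative data only (assumption: not taken from a real dataset).
rng = np.random.default_rng(seed=0)
x = rng.uniform(50, 100, size=40)
y = rng.uniform(200, 300, size=40)

fig, ax = plt.subplots()
ax.scatter(x, y)

# Label both axes so the reader knows what each dimension means.
ax.set_xlabel("feature_x (hypothetical units)")
ax.set_ylabel("feature_y (hypothetical units)")
ax.set_title("Basic 2D scatter plot")

# By default Matplotlib fits the axis limits to the data, so the corner of
# the plot would not be (0, 0) here. Setting the limits explicitly makes
# the scale unambiguous.
ax.set_xlim(0, 110)
ax.set_ylim(0, 320)

fig.savefig("basic_scatter.png")
```

Removing the two `set_xlim` / `set_ylim` calls and comparing the result is a quick way to see how much the perceived spread of the points depends on the scale.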
Scatter plots are used to understand the relationships between variables. For example, we can check whether an increase or decrease in the value of one variable goes along with an increase or decrease in another.
Take the example of the Maths results for a particular class. If we plot the number of hours each student has studied (on average or cumulatively) on the Y-axis and the marks they received on the X-axis, the plot helps us see the trend and how the scores shift from the minimum to the maximum. Of course, correlation does not imply causation, but with good domain knowledge
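The study-hours example can be sketched roughly as follows. The numbers are synthetic: marks are generated as a noisy increasing function of hours purely to illustrate a positive trend, so they should not be read as real results. The Pearson correlation coefficient from `np.corrcoef` is added as one possible way to put a number on the trend.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (assumption): 30 students, marks rise with hours plus noise.
rng = np.random.default_rng(seed=42)
hours = rng.uniform(1, 10, size=30)
marks = np.clip(30 + 6 * hours + rng.normal(0, 5, size=30), 0, 100)

# Pearson correlation coefficient between marks and hours studied.
r = np.corrcoef(marks, hours)[0, 1]

fig, ax = plt.subplots()
ax.scatter(marks, hours)  # marks on the X-axis, hours on the Y-axis, as in the text
ax.set_xlabel("Marks in Maths")
ax.set_ylabel("Hours studied")
ax.set_title(f"Hours studied vs. marks (r = {r:.2f})")

fig.savefig("hours_vs_marks.png")
```

A value of `r` close to 1 only says that the two variables move together in this sample; it does not say that one causes the other.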
A simple 2D scatter plot helps us understand the extremes and the concentration of the points for the two variables under consideration. However, if we have a class label and would like to understand how it relates to these two variables, we can colour-code the points to see whether the class labels or dependent variable follow a particular pattern. In such a case, we should add a legend to show which colour represents which label. A widely used example is the Iris dataset, where we can not only see that the ‘Setosa’ species is clustered separately when we
We can also go further and add one more dimension to the plot: size. Apart from plotting our data points and/or colour-coding them, we can vary the size of the points. For example, let’s assume we are studying the Nobel Prizes won by developed countries. We can put the country index on the x-axis and the number of Nobel Prizes won on the y-axis. We can then map population to the size of each point and visualise the population of the various countries alongside how many Nobel Prizes they have won. If we take out the overperforming United States, we get a better picture of the remaining countries. The United States not only has a huge population but has also received almost three times as many prizes as the United Kingdom, which comes in second place.
Just for fun, I have also added one more dimension to the Iris dataset to see how the ‘petal width’ size helps us differentiate the species.
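One way to sketch the colour and size encodings together is shown below. It assumes scikit-learn is installed and uses its bundled copy of the Iris dataset via `load_iris`. The choice of sepal length and sepal width for the two axes and the scaling factor of 60 for the marker size are assumptions made for illustration; the original notebook may use different ones.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Column order in scikit-learn's copy of the dataset (all in cm):
# sepal length, sepal width, petal length, petal width.
sepal_length, sepal_width, petal_width = X[:, 0], X[:, 1], X[:, 3]

fig, ax = plt.subplots()
for class_index, species in enumerate(iris.target_names):
    mask = y == class_index
    ax.scatter(
        sepal_length[mask],
        sepal_width[mask],
        s=petal_width[mask] * 60,  # marker area scaled by petal width (60 is arbitrary)
        alpha=0.6,                 # transparency so overlapping points stay visible
        label=species,             # one legend entry, and one colour, per species
    )

ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.legend(title="Species")

fig.savefig("iris_scatter.png")
```

Calling `ax.scatter` once per species lets Matplotlib assign a different default colour to each class and gives the legend one entry per label.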
Scatter plots are a great visualisation tool and an integral part of EDA. If you have any queries or feedback, you can get in touch with me at firstname.lastname@example.org. The IPython notebook for the above example can be found on my GitHub profile.