Understanding Data Through Exploratory Data Analysis


In the world of data, it is unwise to jump straight into the problem at hand as soon as we receive the data. Some groundwork has to come first. It is like taking part in a race: we first get to know the car we are driving, become comfortable handling it, and then study the circuit we will be racing on. Being clear about these two things before the start helps us give our best in the race. Similarly, once we get the data we should not rush toward the end result. We should first get comfortable with the data, get to know it, draw some interesting insights from it, and remove the roadblocks present in it before moving on to the result.

This first, foundational phase, in which we get to know and understand the data, is called Exploratory Data Analysis (EDA). When we look at the countless data points in our dataset, it is normally very difficult to make sense of them. EDA helps us make sense of the data.

It Gives Some Very Interesting Insights Out Of Data Such As:

  • Listing the outliers and anomalies in our data
  • Identifying the most important variables
  • Understanding the relationship between variables
  • Checking for errors such as missing values or incorrect entries
  • Knowing the data types in the dataset, whether continuous, discrete, or categorical
  • Understanding how the data is distributed
  • Testing a hypothesis or checking assumptions related to a specific model
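
A few of the checks above can be sketched in code. The example below is a minimal illustration, not a prescribed workflow: the records, the column names (age, height_cm, city) and the helper functions are all hypothetical, and it uses only the Python standard library so it runs anywhere (a dataframe library would be the more usual choice in practice). Outliers are flagged with the common 1.5 × IQR rule, which is only one of several possible definitions.

```python
import statistics

# Hypothetical records for illustration; None marks a missing entry.
rows = [
    {"age": 26, "height_cm": 170.0, "city": "Pune"},
    {"age": 31, "height_cm": 165.5, "city": "Delhi"},
    {"age": None, "height_cm": 172.3, "city": "Pune"},
    {"age": 29, "height_cm": 168.0, "city": None},
    {"age": 34, "height_cm": 171.1, "city": "Mumbai"},
    {"age": 28, "height_cm": 250.0, "city": "Delhi"},  # suspicious entry
]


def column(rows, name):
    """Collect one variable across all records."""
    return [row[name] for row in rows]


def missing_count(values):
    """Number of missing (None) entries in a column."""
    return sum(v is None for v in values)


def kind(values):
    """Rough type guess: ints are discrete, floats continuous, the rest categorical."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, int) for v in present):
        return "discrete"
    if all(isinstance(v, (int, float)) for v in present):
        return "continuous"
    return "categorical"


def iqr_outliers(values):
    """Values lying more than 1.5 * IQR beyond the first or third quartile."""
    present = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in present if v < low or v > high]


for name in ("age", "height_cm", "city"):
    values = column(rows, name)
    print(name, kind(values), "missing:", missing_count(values))

print("height outliers:", iqr_outliers(column(rows, "height_cm")))
```

A flagged value is a prompt to go back and look at the record, not proof that the entry is wrong.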

Exploratory data analysis (EDA) is very different from classical statistics. Its main purpose is not fitting models, estimating parameters, or formally testing hypotheses, but finding information in data and generating ideas.

For example, imagine an application that helps novice musicians learn different notes: it uses the phone's microphone to detect whether they are playing the right note and instructs them on where to improve. Once they have learned the notes, it also helps them learn popular compositions using the vast amount of data fed into the algorithm. The client wanted to understand what these new musicians like most and how to build better engagement with their audience. On doing EDA, we uncovered some interesting insights that might make our client rethink their strategy of targeting only novice musicians. New musicians were using the app, but it had become so popular mainly because of the engagement and referrals of intermediate-level musicians.

Exploratory Data Analysis benefits stakeholders by confirming whether the questions they are asking are the right ones, not just by checking whether the data is technically correct. It is all about getting to know and understand your data before making any assumptions about it or taking any steps toward data mining. It also adds value to a business by helping data scientists check that the results they have produced are correctly interpreted and apply to the required business context.

Even though there are many algorithms in Machine Learning, EDA is considered one of the most critical steps for understanding the data and driving the business. EDA also helps us identify, intuitively, whether the data we are using contains logical errors. Take an example where we have individuals whose age is above 25 years, along with a measure of each individual's strength. EDA can be used here to check the relationship between these two variables. We would expect a negative relationship, in which strength goes down as age goes up. If a 2D plot of these variables instead shows strength going up with age, we should investigate why the data shows such unusual behaviour, so that we avoid carrying the mistake into our Machine Learning models. If we skip EDA, many such problems go unnoticed and can negatively affect the end result. Performing this step well gives an organisation the necessary confidence in its data, which eventually allows it to start deploying powerful machine learning algorithms. Ignoring this crucial step can lead us to build models on a very shaky foundation.
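
The age and strength check described above can be sketched numerically as well as visually. The snippet below is only an illustration under stated assumptions: the ages and strength scores are made-up values, the function names are hypothetical, and the Pearson correlation coefficient is used as one simple way to summarise the direction of a relationship (it only captures linear association).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def direction(r):
    """Label the sign of a correlation coefficient."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"


# Made-up values for individuals older than 25 (illustration only).
ages = [26, 30, 35, 40, 45, 50, 55, 60]
strength = [92, 90, 85, 83, 78, 74, 70, 63]

r = pearson(ages, strength)
print("relationship:", direction(r))

# Domain knowledge says strength should fall with age, so a positive
# sign here would be a reason to go back and inspect the data.
if direction(r) != "negative":
    print("Unexpected relationship: investigate before modelling.")
```

The number does not replace the plot; a scatter plot of the same two lists would also reveal curvature or isolated bad records that a single coefficient hides.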

John Tukey championed and promoted EDA, which made way for many new developments. His advocacy in the 1970s encouraged the development of statistical computing packages, notably S, on which the R programming language is based. There is an arsenal of tools from statistics, mathematics, and other fields that we can use for exploratory data analysis. EDA normally uses methods such as univariate visualization and summary, bivariate and multivariate visualization and assessment of relationships between variables, dimensionality reduction to find the features that have a significant impact on the data, and clustering of data into smaller groups for pattern recognition. It is almost always a good idea to perform univariate EDA on each component of a multivariate analysis before performing the multivariate EDA.
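
As a small illustration of looking at each variable on its own first, the sketch below computes a five-number summary (minimum, first quartile, median, third quartile, maximum) per variable. The data values and the function name are hypothetical, and the "inclusive" quartile method is just one of several conventions; other tools may report slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for one variable."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, median, q3, ordered[-1])


# Made-up values for illustration only.
data = {
    "age": [26, 30, 35, 40, 45, 50, 55, 60],
    "strength": [92, 90, 85, 83, 78, 74, 70, 63],
}

# Univariate step: summarise every variable separately before
# looking at how the variables relate to each other.
summaries = {name: five_number_summary(values) for name, values in data.items()}

for name, summary in summaries.items():
    print(name, summary)
```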

EDA gives us many insights into the data, provided we use it well. To do that we need good domain knowledge, and where ours falls short we should work closely with someone who can fill in the gaps. Because EDA relies heavily on visualization, not only the person working on the data but everyone with a vested interest in the project can easily understand it.

Some Tools Which We Use For Exploratory Data Analysis Are:

  • 2D Scatter plot: A 2D scatter plot, as the name suggests, takes the data points in our data set and plots them on a chart with an X-axis and a Y-axis.
    Figure: 2D scatter plot of the Iris flower dataset, with color coding
  • Pair plots: A pair plot is a collection of 2D scatter plots that helps us understand the features when there are more than 2 variables.
    Figure: Simple pair plot with class labels
  • Histograms: A histogram shows the frequency of a variable's values in a bar-type chart.
    Figure: Simple histogram
    Figure: PDF of the Abalone dataset
  • Cumulative Distribution Function: The cumulative distribution function gives the probability that the variable takes a value less than or equal to the value under consideration.
    Figure: CDF of the Abalone dataset
  • Other plotting tools such as 3D Scatter plot, Bi-plots, Contour plots, Box plots, Violin plots, Probability plots, Lag plots, Block plots, Youden plots, Stem and leaf plot, PhenoPlot, Scatterplot Smoothing, Diamond Plot, Sunflower Plot
  • Dimensionality reduction techniques such as Multidimensional scaling, Principal component analysis, Multilinear PCA, Nonlinear dimensionality reduction
  • Projection methods such as grand tour, guided tour, and manual tour
  • Quantitative techniques such as Median polish, Trimean, Ordination
  • Data traces
  • Multi-vari chart
  • Run chart
  • Pareto chart
  • Bubble Chart
  • Parallel coordinates
  • Odds ratio
  • Targeted projection pursuit
  • Chernoff faces
  • Linear Regression
  • K-Means clustering
  • Rootogram
  • Resistant Time Series Smoothing
  • Resistant Curve Fitting
  • Wind Rose
  • Heat Map
  • Population Pyramid
  • Trend Analysis
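
To make the histogram and cumulative distribution function entries in the list above concrete, here is a minimal sketch of the numbers behind those two plots. The ring counts are made-up values (not taken from the Abalone dataset), the function names are hypothetical, and the bin start and width are arbitrary choices; a plotting library would normally compute and draw these for us.

```python
import bisect
from collections import Counter

# Made-up integer measurements for illustration only.
rings = [7, 9, 10, 8, 11, 9, 10, 12, 9, 15]


def histogram(values, start, width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = Counter(start + ((v - start) // width) * width for v in values)
    return dict(sorted(counts.items()))


def ecdf(values):
    """Return F where F(x) is the fraction of values less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right gives the number of sorted values that are <= x
        return bisect.bisect_right(ordered, x) / n

    return F


bins = histogram(rings, start=6, width=2)
for lower_edge, count in bins.items():
    # A crude text histogram: one '#' per observation in the bin.
    print(f"[{lower_edge}, {lower_edge + 2}): {'#' * count}")

F = ecdf(rings)
print("share of values <= 9:", F(9))
```

The histogram answers "how often does each range occur", while the empirical CDF answers "what fraction of the data is at or below this value"; both are computed from the same sorted data.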

Over time, we will look into each of these tools and how to use them effectively on our data, and we will keep adding new tools to our arsenal.
