Exploratory data analysis (EDA) builds a solid understanding of the data, of the problems within it, and of the process that produced it. It is a methodical way of discovering what the data means. A well-executed EDA is expected to address the questions behind a given business decision, much as we expect an executive to fulfil the responsibilities of a specific role. Because data science involves building predictive models, the model needs to be given the most informative aspects of the data. Much like a good recipe, EDA makes the right patterns and trends available so that the model can be trained to produce the desired results. Using an appropriate EDA tool on appropriate data therefore helps achieve the desired outcome.
In our last blog, Exploratory Data Analysis in Machine Learning, we discussed the basics of exploratory data analysis. In this blog we explain the different Exploratory Data Analysis (EDA) techniques.
Looking for a Data Science Career? Explore Here!
Types of Exploratory Data Analysis – EDA Techniques
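Exploratory data analysis techniques can be classified in two ways: by the number of variables examined (univariate, bivariate, or multivariate) and by whether the method is graphical or non-graphical. Combining the two gives the four types described below, with bivariate analysis covered under the multivariate headings.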
1. Univariate Non Graphical
This is the most basic type of data analysis because it examines only one variable at a time. The main objective of univariate non-graphical EDA is to understand the sample distribution and the underlying data well enough to draw conclusions about the population. The analysis also includes outlier detection. The following are some characteristics of a population distribution:
Central tendency: The central tendency, or location, of a distribution has to do with its typical or middle values. The mean, the median, and occasionally the mode are the commonly used measures of central tendency, with the mean being the most prevalent. The median may be preferred when the distribution is skewed or when outliers are a concern.
Spread: Spread is a measure of how far from the centre we should look to find the data values. The variance and the standard deviation are two helpful measures of spread. The variance is the mean of the squared deviations from the centre, and the standard deviation is the square root of the variance.
Skewness and kurtosis: The distribution’s skewness and kurtosis are two more helpful univariate descriptors. Skewness is a measure of asymmetry, and kurtosis is a measure of peakedness relative to a normal distribution.
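The descriptors above can be computed in a few lines. The following is a minimal sketch using only the Python standard library; the function name `describe` and the sample values are illustrative assumptions, and skewness and kurtosis are computed here with the simple moment-based formulas (other definitions with small-sample corrections exist).

```python
import math
import statistics


def describe(values):
    """Return basic univariate summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    variance = statistics.variance(values)  # sample variance (n - 1 denominator)
    std_dev = math.sqrt(variance)           # standard deviation = root of variance

    # Central moments (n denominator) used for moment-based skewness/kurtosis.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5               # 0 for a symmetric sample
    excess_kurtosis = m4 / m2 ** 2 - 3.0    # 0 for a normal distribution

    return {
        "mean": mean,
        "median": median,
        "variance": variance,
        "std_dev": std_dev,
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
    }


# Hypothetical sample with one large value to show the effect of an outlier.
sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]
summary = describe(sample)
```

In this made-up sample the single large value pulls the mean well above the median and produces a positive skewness, which is the situation in which the median tends to be the safer measure of centre.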
2. Univariate Graphical
The Auto MPG dataset, which is available on the UCI repository, is the basis for the graphics in this section. Typical examples of univariate graphics include:
- The stem-and-leaf plot is a very simple but effective EDA technique for condensing and presenting quantitative data. It displays every value in the data set unaltered, with each observation split into a stem (the leading digits) and a leaf (the remaining or trailing digits). Today it has mostly been replaced by the histogram.
- Histograms, or bar charts, are used to show both grouped and ungrouped data. The values of the variable are placed on the x-axis, and the frequency, or number of observations, is plotted on the y-axis. A histogram is the most basic type of graph: a bar plot in which each bar represents the frequency (count) or proportion (count divided by the total count) of cases for a range of values. With histograms you can quickly learn about characteristics of your data such as its central tendency, spread, and outliers.
- The boxplot is another extremely helpful univariate graphical method. Boxplots are very good at presenting information about central tendency, showing robust measures of location and spread, and giving information about symmetry and outliers, but they can be misleading about multimodality. One of the simplest applications of boxplots is the side-by-side boxplot, which compares several groups at once.
- The last, and most complex, univariate graphical EDA technique is the quantile-normal plot, also known as the QN plot or, more commonly, the QQ plot. It is used to assess how closely a given sample follows a particular theoretical distribution. It helps detect non-normality and diagnose skewness and kurtosis.
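The numbers behind these plots can be computed before anything is drawn. The sketch below uses only the Python standard library; the helper names, the sample values, the bin width, the "inclusive" quartile method, and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not the only reasonable choices.

```python
import statistics
from statistics import NormalDist


def histogram_counts(values, bin_width):
    """Count observations per bin; each key is the left edge of a bin."""
    counts = {}
    for x in values:
        left_edge = (x // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))


def boxplot_summary(values):
    """Quartiles plus outliers flagged with the common 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < low_fence or x > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


def qq_points(values):
    """Pair each sorted observation with a standard normal quantile."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))


# Hypothetical sample (not taken from the Auto MPG dataset).
sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]
bins = histogram_counts(sample, bin_width=5)
box = boxplot_summary(sample)
points = qq_points(sample)
```

Plotting `points` as a scatter gives a QQ plot: points that fall close to a straight line suggest an approximately normal sample, while a point far off the line, such as the largest value in this sample, hints at skewness or an outlier.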
3. Multivariate Non Graphical
The purpose of a multivariate non-graphical EDA technique is usually to show the relationship between two or more variables, either through cross-tabulation or through statistics.
Cross-tabulation is an extension of tabulation that is particularly helpful for categorical data. It is the preferred approach when two variables are involved. To build one, create a two-way table whose column headings correspond to the levels of one variable and whose row headings correspond to the levels of the other variable. Next, fill each cell with the count of subjects who share that pair of levels. For one categorical variable and one quantitative variable, we can compute statistics for the quantitative variable separately for each level of the categorical variable. The statistics are then compared across the levels of the categorical variable. Comparing the means is an informal version of one-way ANOVA, while comparing the medians is a more robust informal version of it.
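As a minimal sketch of both ideas, the code below builds a two-way table of counts and then computes per-level means and medians using only the Python standard library. The records, field layout, and function names are made up for illustration and are not taken from any real dataset.

```python
import statistics
from collections import Counter, defaultdict

# Hypothetical records: (origin, cylinders, mpg).
records = [
    ("usa", 8, 15.0),
    ("usa", 8, 14.0),
    ("usa", 4, 26.0),
    ("japan", 4, 31.0),
    ("japan", 4, 33.0),
    ("europe", 4, 28.0),
]


def cross_tabulate(rows, row_index, col_index):
    """Count how many rows share each (row level, column level) pair."""
    return Counter((row[row_index], row[col_index]) for row in rows)


def group_statistics(rows, group_index, value_index):
    """Summarize a quantitative field separately for each categorical level."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_index]].append(row[value_index])
    return {
        level: {
            "n": len(values),
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
        }
        for level, values in groups.items()
    }


table = cross_tabulate(records, row_index=0, col_index=1)
by_origin = group_statistics(records, group_index=0, value_index=2)

# Print the two-way table: one row per origin, one column per cylinder count.
origins = sorted({r[0] for r in records})
cylinders = sorted({r[1] for r in records})
print("origin", *cylinders, sep="\t")
for origin in origins:
    print(origin, *(table[(origin, c)] for c in cylinders), sep="\t")
```

Comparing the `mean` or `median` entries of `by_origin` across levels is the informal comparison described above; it does not replace a formal test.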
4. Multivariate Graphical
Multivariate graphical EDA uses graphics to show the relationships between two or more sets of data. A commonly used one is the grouped barplot, in which each group represents one level of one of the variables and each bar within a group represents a level of the other variable. Multivariate graphics can take various forms, including:
Scatterplot: The primary graphical EDA tool for two quantitative variables is the scatterplot, which places one variable on the x-axis and the other on the y-axis and draws one point for each case in your dataset.
Run chart: A line graph of data plotted over time.
Heat map: A graphical display of data in which values are shown by colour.
Multivariate chart: A graphical representation of the relationships between factors and a response.
Bubble chart: A data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
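A heat map is often used to show the pairwise correlations between quantitative variables. The sketch below computes a Pearson correlation matrix in plain Python and, optionally, draws it with matplotlib. The column names and values are hypothetical, the function names are illustrative, and matplotlib is assumed to be installed only if `plot_heat_map` is actually called.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Square matrix of pairwise correlations, in the dict's key order."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


def plot_heat_map(columns):
    """Draw the correlation matrix as a heat map (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so the helpers above work without it

    names = list(columns)
    fig, ax = plt.subplots()
    image = ax.imshow(correlation_matrix(columns), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    fig.colorbar(image, ax=ax)  # colour scale from -1 to 1
    return fig


# Hypothetical values, chosen so that mpg falls linearly as weight rises.
data = {
    "weight": [2000, 2500, 3000, 3500, 4000],
    "mpg": [34.0, 29.0, 24.0, 19.0, 14.0],
    "horsepower": [70, 90, 105, 140, 170],
}
matrix = correlation_matrix(data)
```

Calling `plot_heat_map(data)` and then `matplotlib.pyplot.show()` would display the matrix with colour encoding the strength and sign of each correlation; a scatterplot of any two columns gives the more detailed view of the same relationship.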
Wrapping Up
In this article we have discussed the classification of exploratory data analysis techniques. A key point about exploratory data analysis is that it is subjective: it summarizes a dataset’s attributes and qualities, and which of them matter depends on the project. Data scientists can therefore choose among the plots presented in this article to analyze the data before applying machine learning algorithms. Since exploratory data analysis is data-dependent by nature, it is better described as a strategy than as a specific procedure. EDA uses visuals such as graphs and plots to reveal hidden insights in the data.
EDA can be carried out using both graphical and non-graphical statistical methods.
Compared to multivariate analysis, univariate analysis is easier. Any EDA’s success will be based on the quantity and quality of data, the tools and visualizations selected, and the expert interpretation of the data by a data scientist.