How to start the EDA method in a few steps

Heitor Hermanson
Published in Analytics Vidhya
5 min read · May 24, 2021

Imagine that you have had your COVID-19 vaccine and you want to visit the beautiful city of Rio de Janeiro. Before you travel, you have to settle a few things: which neighborhood you want to stay in, how much the places cost, how many nights you will stay, and so on.

To do that, you can turn to one of the apps that help you find cheap places to stay and look for the best deal. But suppose the results the app gives you do not match what you have in mind, and you want to dig deeper to plan the best trip: low cost, with the best view or place to stay.

To make that happen, you can go to the data that these apps publish on their websites. For this example, I am going to use the data available on the Airbnb site for the city of Rio de Janeiro. With the data in hand, we must investigate the data set and look for the options that best fit the initial goal. We can do that through Exploratory Data Analysis (EDA), which will help us answer these questions and analyze the dataset better.

Exploratory Data Analysis is a method that brings clarity to a dataset and makes the data more reliable for the job. It works by describing the raw dataset and analyzing it with statistical tools, and through that we learn how good the dataset really is.

As mentioned above, EDA can show us the quality of the dataset. It does so by checking a few points about the data: missing (NaN) values, outliers, the data type of each variable, the size of the dataset, and more.

Now I am going to show and explain some parts of this investigative work. To start, we need to know which variables we can use, and for that we must know the data types, as shown in the figure below:
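As a rough sketch of how this first check could look in pandas, assuming the listings have been loaded into a DataFrame. The rows below are made-up stand-ins, not the real Airbnb data, and the `neighbourhood` column is a hypothetical name used only for illustration:

```python
import pandas as pd

# Hypothetical toy rows standing in for the Airbnb listings file.
df = pd.DataFrame({
    "neighbourhood": ["Copacabana", "Ipanema", "Leblon", "Centro"],
    "price": [250.0, 480.0, 900.0, 120.0],
    "minimum_night": [2, 3, 1, 5],
})

n_rows, n_cols = df.shape   # size of the dataset
column_types = df.dtypes    # one dtype per column

print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
```

A numeric column that shows up as text here is usually the first hint that something needs cleaning before any statistics are computed.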

The figure above is part of the descriptive work, because one of the first things we want to know is the type of each variable. Once we know the types, we can check that there is no problem in the data. The figure above shows no problem, but even so, this is a step we must do because it helps us check the quality of the dataset. With this done, we look for the missing (NaN) values in our dataset and how much of each variable they represent, as shown in the next figure.
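One possible way to measure the share of missing values per column with pandas is sketched below. The data is invented for the example, and `reviews_per_month` is a hypothetical column name:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some gaps.
df = pd.DataFrame({
    "price": [250.0, 480.0, np.nan, 120.0],
    "reviews_per_month": [np.nan, np.nan, 0.5, np.nan],
    "minimum_night": [2, 3, 1, 5],
})

# isnull() marks each missing cell as True; the mean of those booleans
# is the fraction of missing values in each column.
missing_share = df.isnull().mean().sort_values(ascending=False)

print(missing_share)
```

Sorting from the largest share down puts the most incomplete columns at the top, which makes it easier to decide what to keep.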

As we can see in the figure above, we counted the missing (NaN) values for each variable, and that helps us decide which columns we can take for the project and which are better left out. After these two steps, we can start to look at the data itself and take some statistical measures, like the mean, the mode and the standard deviation, to understand how the data is distributed. If it makes sense, we can use one of these measures to fill in the missing values, which does less damage to the analysis than working with missing points.
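Here is a minimal sketch of these measures in pandas, again on invented numbers. Filling with the mean is only one of the options mentioned above, chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value.
df = pd.DataFrame({"price": [100.0, 200.0, np.nan, 300.0, 200.0]})

summary = df["price"].describe()         # count, mean, std, min, quartiles, max
mean_price = df["price"].mean()          # NaN values are skipped by default
mode_price = df["price"].mode().iloc[0]  # most frequent value

# Replace the missing price with the mean of the known prices.
filled = df["price"].fillna(mean_price)

print(summary)
print(filled)
```

Whether the mean, the mode or another measure is the right filler depends on the column, so it is worth looking at the distribution first.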

Looking at the figure above, we can notice a few points. The max of the column price is far above the 75th percentile, so we may have an outlier here. Another thing we can see is in the column minimum_night: the max is 1000, which looks like wrong data, because a year has at most 365 or 366 days. We should avoid using values like this.
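These two checks could be sketched as follows. The 1.5 × IQR fence is a common rule of thumb that I am assuming here, not something taken from the original project, and the rows are made up:

```python
import pandas as pd

# Hypothetical rows: one extreme price and one impossible minimum stay.
df = pd.DataFrame({
    "price": [100.0, 150.0, 200.0, 250.0, 300.0, 20000.0],
    "minimum_night": [1, 2, 3, 2, 1000, 4],
})

# Interquartile-range fence for the price column (assumed rule of thumb).
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

price_outlier = df["price"] > upper_fence
too_long_stay = df["minimum_night"] > 365   # longer than a year

# Keep only the rows that pass both checks.
clean = df[~price_outlier & ~too_long_stay]

print(upper_fence)
print(clean)
```

Flagging the rows first, instead of dropping them right away, makes it possible to inspect what would be removed.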

With these steps done, if you want to check other points in the data set, you can plot histograms, as in the figure below.

Here we can see histograms for all the columns, and through them decide which variables are OK to use. The histogram for the variable price, for example, has an issue: the outlier, which distorts the real meaning of the data. The same thing happens with the variable minimum_night, but in a different proportion.
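To see why a single outlier distorts a histogram, here is a small sketch that computes the bin counts with NumPy instead of drawing them. The prices and the cut-off of 1000 are assumptions made up for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value.
prices = pd.Series([100.0, 150.0, 200.0, 250.0, 300.0, 20000.0], name="price")

# With the outlier, almost every listing falls into the first bin.
counts_all, _ = np.histogram(prices, bins=5)

# Without it (assumed cut-off of 1000), the bins spread over the real range.
counts_trimmed, _ = np.histogram(prices[prices < 1000], bins=5)

print(counts_all)
print(counts_trimmed)
```

With matplotlib installed, `prices.hist(bins=5)` would draw the same counts as a chart.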

You can certainly dig much deeper into your data set and refine the analysis. This is the beauty of EDA: we can use all these statistical tools to improve the analysis and, through that, build a much more accurate business plan.

This article is just a first look at how to examine your data and take some important notes. The images I used in this article are from my first data science project; feel free to check it out here.
