If you are a beginner in Python, here is a simple way to do an ETL with BeautifulSoup and other libraries.

Heitor Hermanson
5 min read · Jun 20, 2021


In this article, I want to show a task I was asked to solve for a job opportunity: one of the tasks was to write a script showing how to extract data from a website. Disclaimer: this is not the exact same task, but rather an exercise inspired by the same subject.

With that clear, here I am going to show the code I used to do the extraction, and how I fixed some problems that can happen when you extract data and save it as a CSV file.

Image by mcmurryjulie from Pixabay

The work was to extract tables from a website. These tables show the milk quote prices, and that was the main purpose of the work, but to do it I first needed to learn how to do web scraping. I did some research and found there are several ways to do that; here I am going to show how to do it using Python, specifically the BeautifulSoup library.

Image by Couleur from Pixabay

The image below shows the script I used to extract the tables from the website. A few points I want to explain here: first, the website I used shows the latest milk quote price for each state in each month, and second, I extracted the most recent table to save as a CSV file.

Image by Author: Script

The process of the script was:

1. Import the libraries: Requests for executing HTTP requests, BeautifulSoup for extracting data from HTML and XML files, and Pandas to store, clean, and save the data in a DataFrame.

2. Write a logic function to show whether the request was successful or not; if the status code was 200, it means the page was found and the request was executed.

3. Use BeautifulSoup to extract and save the tables. Here we can use the find() or find_all() methods to get the tables: with one you specify a single table, and the other gets all the tables on the website.

4. Create the variable table_str to convert the tables into strings. I do that because BeautifulSoup returns the tables as objects, and to read the tables as a DataFrame I have to convert them into strings.

5. Use Pandas to read the HTML string, and I wrote df[0] because now I specify the most recent table for the work.
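The steps above can be sketched in a minimal, self-contained way. Since the original site isn't named here, the HTML below is an inline stand-in for what a requests call would return (the real script would fetch it with requests.get and check for status code 200, as in step 2):

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# In the real script, the HTML would come from a request, e.g.:
#   response = requests.get(url)
#   if response.status_code == 200:
#       html = response.text
# Here we use a small inline sample so the sketch is self-contained.
html = """
<html><body>
<table>
  <tr><th>Estado</th><th>Preço (R$/Litro)</th></tr>
  <tr><td>SP</td><td>20550.0</td></tr>
  <tr><td>MG</td><td>19800.0</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every <table> as a BeautifulSoup Tag object
tables = soup.find_all("table")

# Convert the Tag to a string so pandas can parse it;
# read_html returns a *list* of DataFrames, one per table
table_str = str(tables[0])
dfs = pd.read_html(StringIO(table_str))
df = dfs[0]  # the table we want

print(df)
```

This already shows the object-to-string conversion from step 4 and why the result is indexed with [0]: read_html always gives back a list, even when the page has a single table.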

Image by Author: Data extracted in table form

Now we have a table with the last quote for the last month, but there are two things to notice. The first is that the price-per-liter column is not right: on the website that column showed something like R$/L 2.05, but here we have R$/L 20550.0. The second is about the data frame: the problem here is how BeautifulSoup returns the tables. It returns them as objects, not as a list or a DataFrame, so I have to convert them into strings, and with that I get a list of DataFrames, one for each table.

To solve these problems, I found several ways, and the one I liked was to convert the data to a CSV or Excel file and treat it there, because if you are a beginner in this subject you can learn how to fix this type of issue in your DataFrame.

And as you can see, I wrote df[0] in the script because I wanted to select this table, for the simple reason that it was the table I needed for the work. Instead of using the find_all() method I could have used find() and selected just the table I needed, but for teaching and learning purposes, I chose find_all().
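The difference between the two methods can be shown with a tiny example (the two-table HTML string is just an illustration):

```python
from bs4 import BeautifulSoup

html = (
    "<table id='a'><tr><td>1</td></tr></table>"
    "<table id='b'><tr><td>2</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("table")           # only the first match (a Tag, or None)
all_tables = soup.find_all("table")  # every match, as a list

print(first["id"])       # a
print(len(all_tables))   # 2
```

So find() is convenient when you already know which table you want, while find_all() lets you inspect everything the page has, which is useful while you are still exploring.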

Image by Author: CSV file

The figure above shows how to save the DataFrame as a CSV file and what happens when I read the file back. As we can see, there are now three things to fix in this table: first, the new “Unnamed: 0” column; second, row 10; and third, the quote prices. With that clear, I am now going to show how to treat these types of errors.

• The Unnamed column is just the index column, and we can remove it by passing index_col=0 when reading the file;

• To remove row number 10, you only need to call df.drop([10]) and it is done;

• I used the .apply() method on the “Preço (R$/Litro)” column with a function that divides the values by 10000 to repair them.
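Putting the three fixes together, a sketch could look like the following. The inline CSV is a made-up sample that reproduces the three issues (the extra index column, a junk row at label 10, and prices multiplied by 10000); the real file would come from df.to_csv():

```python
from io import StringIO

import pandas as pd

# Stand-in for the saved CSV, mimicking the three issues described above.
csv_data = """,Estado,Preço (R$/Litro)
0,SP,20550.0
1,MG,19800.0
10,Fonte,
"""

# index_col=0 uses the first (unnamed) column as the index,
# so no "Unnamed: 0" column appears in the DataFrame
df = pd.read_csv(StringIO(csv_data), index_col=0)

# Drop the junk row whose index label is 10
df = df.drop([10])

# Repair the prices: divide each value by 10000
df["Preço (R$/Litro)"] = df["Preço (R$/Litro)"].apply(lambda x: x / 10000)

print(df)
```

After these three lines of cleaning, the price column is back to values like 2.055 instead of 20550.0.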
Image by Author: DataFrame

With all that done, we now have tables in CSV format ready to be used for any type of work you may have. I hope this experience, which I have put together in this article, can be of some help, and I hope you enjoyed it.
