Back to the data … thanks to web scraping!
Focus on the first fundamental step of data mining: data collection
Back to the future thanks to trustworthy, reliable data? “Great Scott!”
We all agree that we’re data-dependent. Every decision we make and every action that follows starts with data: first its analysis and interpretation, and finally forecasting.
That sounds simple, but … what data can we get? Where is it? How do we get it?
The purpose of this short article is to answer these questions!
Data
Any data mining process starts with the data, so the data is the crux of the matter.
Three cases can occur:
- a database with the data already exists
- we create a database ourselves
- we scrape the data from reliable sources
We’re going to focus on getting the data from a source…
Extract data from the web
Very often the data we are interested in is not available as files but only as content embedded in web pages.
In that case, the data has to be extracted from the page itself.
We can do this manually, by copying the data into a spreadsheet, or automatically, using special tools.
The automatic approach is called web scraping or web data extraction.
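To make the automatic approach concrete, here is a minimal sketch in Python that reads the rows of an HTML table using only the standard library. The sample page, the class name `TableExtractor` and the helper `extract_table` are made up for illustration; a real page would first have to be downloaded, and real-world HTML is often messier than this.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_table(html):
    """Return the table rows found in an HTML string (hypothetical helper)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page standing in for downloaded HTML.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

rows = extract_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

Each row comes back as a list of strings, which is easy to write to a spreadsheet or a database afterwards.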
Web Scraping
Web scraping is a form of copying in which specific data is gathered from the web and stored, typically in a central local database or spreadsheet, for later retrieval or analysis.
It is also a much-discussed technique today, because it is sometimes used for illegal purposes, including the undercutting of prices and the theft of copyrighted content.
The real problem, however, is not the technique, which simply automates what a person could do by reading and copying data manually, but the data itself.
The data controller and the site may not allow the data to be used.
It is good practice to always check the policy of the site you want to scrape before using its data.
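One part of a site's policy that can be checked automatically is its robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module on a made-up robots.txt; the bot name `my-bot` and the `example.com` URLs are placeholders. It is only a partial check: the terms of use and any rules on personal or copyrighted data still have to be read by a person.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real one lives at https://<site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# "my-bot" is a placeholder name for our scraper.
public_ok = parser.can_fetch("my-bot", "https://example.com/prices.html")
private_ok = parser.can_fetch("my-bot", "https://example.com/private/data.html")

print("public page allowed:", public_ok)
print("private page allowed:", private_ok)
```

For a live site, `set_url()` followed by `read()` would download the real file instead of parsing a string.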
Excel and Web Scraping
Excel itself supports web scraping, which shows that the technique is not illegal in itself.
It does so through two procedures:
- from the “Data/Web” menu … but this does not always work correctly
- from “Data/New Query”, available from Excel 2016
Data Extraction methods
An overview of possible methods for data extraction:
- Via Files (XML, CSV, JSON)
- Via Excel / Power BI
- Via API from source
- Via Source Code (VBA, Python and libraries, or directly from the DOM)
- Browser Plugin
- Via Dedicated Software
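As an illustration of the first method in the list, the sketch below loads the same made-up records from a CSV text and from a JSON text with Python's standard library. The field names and values are invented for the example, and the texts are inlined so that it runs without any files.

```python
import csv
import io
import json

# Made-up sample data, inlined instead of read from disk.
CSV_TEXT = "product,price\nWidget,9.99\nGadget,24.50\n"
JSON_TEXT = (
    '[{"product": "Widget", "price": "9.99"},'
    ' {"product": "Gadget", "price": "24.50"}]'
)


def rows_from_csv(text):
    """Return one dict per CSV line, keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def rows_from_json(text):
    """Return the list of records stored in a JSON array."""
    return json.loads(text)


csv_rows = rows_from_csv(CSV_TEXT)
json_rows = rows_from_json(JSON_TEXT)

print(csv_rows)
print(json_rows)
```

Both formats end up as the same list of dictionaries, so the later analysis steps do not need to know where the data came from.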
Web Scraping software
Here is our list of the best web scraping tools on the market right now:
- Octoparse
- ParseHub
- Scrapy
- DataMiner
- Dexi.io
Last but not least … nostopitWebTableExtractor, our free software for extracting data and tables from web pages and files.
If you want to try it now, you can download it here:
https://www.nostopit.com/software/nostopitwebtableextractor/
Any feedback is welcome!