Web Scraping - COVID-19 Data
Web scraping is the (generally automatic) process of collecting semi-structured data from the web, filtering and storing it, and then using it in another process.
Table of Contents
- Run Bot
- Contributing and Feedback
Create a web scraper bot to obtain data on confirmed cases and deaths of COVID-19, in order to analyze them.
- Run the Web Scraper with Selenium to obtain the historical data. This step runs only once.
- Run the Web Scraper with BeautifulSoup to obtain daily data (every x hours, every day).
- Export the historical daily data to a CSV file, to feed the dashboard in Power BI.
- Use the COVID-19 dashboard (built in Power BI) to analyze data and find insights.
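The parsing and export steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `web_scraper.py`: the sample markup, the table id `covid_table`, the figures, and the helper names `parse_table` and `rows_to_csv` are all assumptions made for the example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup with made-up figures; the real page layout is not
# described in this README.
SAMPLE_HTML = """
<table id="covid_table">
  <tr><th>country</th><th>total_cases</th><th>total_deaths</th></tr>
  <tr><td>CountryA</td><td>1,200</td><td>30</td></tr>
  <tr><td>CountryB</td><td>450</td><td>5</td></tr>
</table>
"""


def parse_table(html, table_id):
    """Return the table rows as a list of dicts keyed by header text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # The header row has no <td> cells, so it is skipped here.
        if cells and len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))
    return rows


def rows_to_csv(rows, out_file):
    """Write the rows as CSV to an already open text file object."""
    if not rows:
        return
    writer = csv.DictWriter(out_file, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


rows = parse_table(SAMPLE_HTML, "covid_table")
buffer = io.StringIO()  # stands in for a real file opened with newline=""
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; in practice the same function could receive a file opened with `open(path, "w", newline="")`.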
The data obtained through web scraping are:
| Variable | Description |
|---|---|
| total_cases | Total number of cases |
| total_deaths | Total number of deaths |
| total_recovered | Total number of people recovered |
| active_cases | Number of active cases |
| serious_critical | Number of critical cases |
| total_tests | Total number of tests |
| tot_cases_1m_pop | Number of cases per one million population |
| deaths_1m_pop | Number of deaths per one million population |
| tests_1m_pop | Number of tests per one million population |
| datestamp | Data timestamp with UTC-5 time zone |
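Scraped counters usually arrive as text, so they need cleaning before being stored alongside a `datestamp`. The sketch below shows one possible way to do that. The helper names `to_int` and `make_datestamp` are hypothetical, the input formats (thousands separators, a leading `+`) are assumptions, and a fixed UTC-5 offset from the standard library is used here instead of the `pytz` time zone the project imports.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset (assumption: no daylight saving time adjustment).
UTC_MINUS_5 = timezone(timedelta(hours=-5))


def to_int(text):
    """Convert text such as '1,234' or '+56' to int; None if not numeric."""
    cleaned = text.strip().replace(",", "").replace("+", "")
    return int(cleaned) if cleaned.isdigit() else None


def make_datestamp(now_utc=None):
    """Format a UTC datetime (default: now) as a UTC-5 timestamp string."""
    if now_utc is None:
        now_utc = datetime.now(timezone.utc)
    return now_utc.astimezone(UTC_MINUS_5).strftime("%Y-%m-%d %H:%M:%S")


print(to_int("1,234"))
print(to_int("N/A"))
print(make_datestamp())
```

Returning `None` for non-numeric cells lets the database layer store a NULL rather than a misleading zero.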
Furthermore, to carry out the complete data analysis and its respective visualization, other variables had to be derived, such as:
| Variable | Description | Formula |
|---|---|---|
| perc_deaths | Percentage of deaths | total_deaths * 100 / total_cases |
| perc_infection | Percentage of infections or contagions | total_cases * 100 / total_tests |
| new_total_cases | New daily cases | total_cases_today - total_cases_yest |
| new_total_deaths | New daily deaths | total_deaths_today - total_deaths_yest |
| new_active_cases | New daily active cases | active_cases_today - active_cases_yest |
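As an illustration, the formulas above could be computed from two consecutive daily records as in the sketch below. The project derives these values in SQL Server, so this Python version is only an equivalent outline; the function names and the sample figures are made up for the example.

```python
def percentage(part, whole):
    """Return part * 100 / whole, or None when the denominator is zero or missing."""
    if not whole:
        return None
    return part * 100 / whole


def derive_variables(today, yesterday):
    """Compute the derived variables from today's and yesterday's records."""
    return {
        "perc_deaths": percentage(today["total_deaths"], today["total_cases"]),
        "perc_infection": percentage(today["total_cases"], today["total_tests"]),
        "new_total_cases": today["total_cases"] - yesterday["total_cases"],
        "new_total_deaths": today["total_deaths"] - yesterday["total_deaths"],
        "new_active_cases": today["active_cases"] - yesterday["active_cases"],
    }


# Made-up figures, for illustration only.
yesterday = {"total_cases": 1000, "total_deaths": 20, "active_cases": 500, "total_tests": 10000}
today = {"total_cases": 1100, "total_deaths": 22, "active_cases": 540, "total_tests": 11000}

derived = derive_variables(today, yesterday)
print(derived)
```

Guarding the division matters early in an outbreak, when `total_cases` or `total_tests` can still be zero for some countries.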
The SQL Server scripts used to create the tables can be found here.
World COVID-19 data was collected over 253 days. The latest data reported by country can be seen at the following link.
Below are some final statistics for the data, updated through September 30 (UTC+0):
- PCA Data Analysis
- Curve Similarity Analysis - Cases
- Curve Similarity Analysis - Deaths
- Similarity of the Curve Slopes
The project was carried out with the latest version of Anaconda on Windows.
If the main Web Scraping libraries do not come with the selected Anaconda distribution, you can install them with the following commands:
    conda install -c anaconda pyodbc
    conda install -c anaconda beautifulsoup4
    conda install -c conda-forge selenium
The specific Python 3.7.x libraries used are:
    # Import custom libraries
    import util_lib as ul

    # Import util libraries
    import logging
    import pytz
    from pytz import timezone
    from datetime import datetime

    # Email libraries
    import smtplib
    import ssl
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    # Database libraries
    import pyodbc

    # Import Web Scraping libraries
    from urllib.request import urlopen
    from urllib.request import Request
    from urllib.error import HTTPError
    from urllib.error import URLError
    from bs4 import BeautifulSoup

    # Import Web Scraping 2 libraries
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
Note: To use the web scraper that fetches historical data, you may need to download the Chrome driver (ChromeDriver) that the Selenium library uses and put it in the driver folder.
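For the daily scraper, the `urllib` imports listed above could be combined along these lines. This is a hedged sketch rather than the project's code: the function name `fetch_html`, the `User-Agent` value, and the 10-second timeout are assumptions. The usage lines rely on a `data:` URL and a non-existent local file so the example runs without network access.

```python
import logging
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the body at `url` as text, or None if the request fails."""
    # Some sites reject the default urllib user agent, so one is set
    # explicitly (the value here is an arbitrary assumption).
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first.
        logging.error("HTTP error %s while fetching %s", err.code, url)
    except URLError as err:
        logging.error("Could not reach %s: %s", url, err.reason)
    return None


print(fetch_html("data:text/plain,hello"))
print(fetch_html("file:///no/such/file/covid19"))
```

Returning `None` on failure lets a scheduled run log the problem and exit cleanly instead of crashing, which suits a bot that is retried every few hours.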
There are several ways to run this web scraper bot on Windows:
- Type the following commands (below) at the Anaconda Prompt.
    cd "WebScraping_Covid19\code\"
    python web_scraper.py
- Type the following commands (below) at the Windows Command Prompt. Beforehand, the Anaconda Python paths must be added under Environment Variables -> User Variables.
    cd "WebScraping_Covid19\code\"
    conda activate base
    python web_scraper.py
- Directly run the batch file run-win.bat (found in the run/ folder).
To automate the process, a Task can be created in the Windows Task Scheduler that runs the web scraper bot every x hours.
- Create a new Task in Windows Task Scheduler.
- Set up a Trigger that runs the task daily, repeating every x hours.
- In the Action tab, select the .bat file and the folder from which the Task will be executed.
The COVID-19 dashboard was created to visually analyze the collected data.
Some useful links related to this project:
- Create and Populate Date Dimension
- Python Web Scraping Tutorials
- 10 Web Scraping Tools (Spanish)
- Run Anaconda Python in CMD
- Schedule a Batch file to run Automatically
- Get information about countries via a RESTful API
- Sending Emails With Python
Contributing and Feedback
Any kind of feedback/criticism would be greatly appreciated (algorithm design, documentation, improvement ideas, spelling mistakes, etc…).
- Created by Andrés Segura Tinoco
- Created on Apr 10, 2020
This project is licensed under the terms of the MIT license.