Web Scraping - COVID-19 Data

Web scraping is the (generally automatic) process of collecting semi-structured data from the web, filtering and storing it, and then using it in another process.

Motivation

Create a web scraper bot to obtain data on confirmed cases and deaths of COVID-19, in order to analyze them.

Process

  1. Run the Web Scraper with Selenium to obtain the historical data. This step runs only once.
  2. Run the Web Scraper with BeautifulSoup to obtain the daily data (every day, every x hours).
  3. Export the historical daily data in a CSV file, to feed the dashboard in Power BI.
  4. Use the COVID-19 dashboard (built in Power BI) to analyze data and find insights.
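
As a rough illustration of steps 2 and 3, the sketch below parses an HTML table with BeautifulSoup and serializes the rows as CSV. It is not the project's actual scraper: the sample markup, the table id `covid_table`, the reduced set of columns, and the helper names `to_int`, `parse_table` and `rows_to_csv` are all assumptions made for the example, and the figures are invented. A real run would download the page first (for instance with `urlopen`) instead of using an inline string.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page; the values are made up.
SAMPLE_HTML = """
<table id="covid_table">
  <thead>
    <tr><th>country</th><th>total_cases</th><th>total_deaths</th></tr>
  </thead>
  <tbody>
    <tr><td>Country A</td><td>1,200</td><td>30</td></tr>
    <tr><td>Country B</td><td>450</td><td></td></tr>
  </tbody>
</table>
"""


def to_int(text):
    """Convert a cell such as '1,200' to an int; empty cells become None."""
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned else None


def parse_table(html, table_id="covid_table"):
    """Return one dict per table row, keyed by the header cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        row = dict(zip(headers, cells))
        # Every column except the first (country) is treated as numeric.
        for key in headers[1:]:
            row[key] = to_int(row[key])
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; None values are written as empty fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(parse_table(SAMPLE_HTML)))
```

Writing to an in-memory buffer keeps the example self-contained; pointing `csv.DictWriter` at a file opened with `newline=""` would produce the file that feeds the dashboard.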

Data

The data obtained through web scraping are:

Variable           Description
country            Country name
total_cases        Total number of cases
total_deaths       Total number of deaths
total_recovered    Total number of people recovered
active_cases       Number of active cases
serious_critical   Number of critical cases
total_tests        Total number of tests
tot_cases_1m_pop   Number of cases per one million population
deaths_1m_pop      Number of deaths per one million population
tests_1m_pop       Number of tests per one million population
datestamp          Data timestamp with UTC-5 time zone
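
The datestamp column is described as a timestamp in the UTC-5 time zone. A minimal sketch of how such a value could be produced is shown here. It uses a fixed offset from the standard library rather than pytz, which the project imports, and both the helper name `make_datestamp` and the output format are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset (assumption: no daylight saving adjustment is needed).
UTC_MINUS_5 = timezone(timedelta(hours=-5), name="UTC-5")


def make_datestamp(now_utc=None):
    """Format a UTC instant as a UTC-5 timestamp string."""
    if now_utc is None:
        now_utc = datetime.now(timezone.utc)
    return now_utc.astimezone(UTC_MINUS_5).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    fixed = datetime(2020, 9, 30, 3, 0, 0, tzinfo=timezone.utc)
    print(make_datestamp(fixed))  # 2020-09-29 22:00:00
```

Note that a scrape taken early in the day in UTC falls on the previous calendar day in UTC-5, which matters when comparing daily figures across time zones.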

In addition, the following variables were derived to support the full data analysis and its visualization:

Variable           Description                              Definition
perc_deaths        Percentage of deaths                     total_deaths * 100 / total_cases
perc_infection     Percentage of infections or contagions   total_cases * 100 / total_tests
new_total_cases    New daily cases                          total_cases_today - total_cases_yest
new_total_deaths   New daily deaths                         total_deaths_today - total_deaths_yest
new_active_cases   New daily active cases                   active_cases_today - active_cases_yest
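
The definitions above can be sketched in Python as follows. This is only an illustration: the project creates its tables with SQL Server scripts, the helper names `percentage` and `derive` are hypothetical, and the sample figures are invented. Returning None when a denominator is missing or zero is also an assumption about how such rows should be handled.

```python
def percentage(numerator, denominator):
    """Return numerator * 100 / denominator, or None if it cannot be computed."""
    if numerator is None or not denominator:
        return None
    return numerator * 100 / denominator


def derive(today, yesterday):
    """Compute the derived variables for one country from two consecutive days."""
    return {
        "perc_deaths": percentage(today["total_deaths"], today["total_cases"]),
        "perc_infection": percentage(today["total_cases"], today["total_tests"]),
        "new_total_cases": today["total_cases"] - yesterday["total_cases"],
        "new_total_deaths": today["total_deaths"] - yesterday["total_deaths"],
        "new_active_cases": today["active_cases"] - yesterday["active_cases"],
    }


if __name__ == "__main__":
    # Invented figures for a single country on two consecutive days.
    yesterday = {"total_cases": 1000, "total_deaths": 20, "active_cases": 400, "total_tests": 8000}
    today = {"total_cases": 1200, "total_deaths": 30, "active_cases": 450, "total_tests": 10000}
    print(derive(today, yesterday))
```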

You can find the scripts with which the tables were created in SQL Server here.

World COVID-19 data was collected over 253 days. The latest data reported by country can be seen at the following link

Below are some final statistics for the data, updated through September 30, 2020 (UTC+0):

Variable             Value
Final Date           9/30/2020
Countries infected   213
Total Cases          34,134,840
Total Deaths         1,018,033
Active Cases         6,587,728
Total Tests          648,926,831

Analysis

  1. PCA Data Analysis
  2. Curve Similarity Analysis - Cases
  3. Curve Similarity Analysis - Deaths
  4. Similarity of the Curve Slopes

Dependencies

The project was carried out with the latest version of Anaconda on Windows.

If the main Web Scraping libraries do not come with the selected Anaconda distribution, you can install them with the following commands:

conda install -c anaconda pyodbc
conda install -c anaconda beautifulsoup4
conda install -c conda-forge selenium

The specific Python 3.7.x libraries used are:

# Import custom libraries
import util_lib as ul

# Import util libraries
import logging
import pytz
from pytz import timezone
from datetime import datetime

# Email libraries
import smtplib
import ssl
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Database libraries
import pyodbc

# Import Web Scraping libraries
from urllib.request import urlopen
from urllib.request import Request
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

# Import Web Scraping 2 libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Note: To use the web scraper that fetches historical data, you may need to download the Chrome driver (ChromeDriver) that the Selenium library uses and put it in the driver folder.

Run Bot

There are several ways to run this web scraper bot on Windows:

  1. Type the following commands at the Anaconda Prompt.
  cd "WebScraping_Covid19\code\"
  python web_scraper.py
  2. Type the following commands at the Windows Command Prompt. Beforehand, the Anaconda Python paths must be added to: Environment Variables -> User Variables.
  cd "WebScraping_Covid19\code\"
  conda activate base
  python web_scraper.py
  3. Run the batch file run-win.bat directly (found in the run/ folder).

Automate Execution

In order to automate the process, a Task can be created in the Windows Task Scheduler, to configure the execution of the web scraper bot every x hours.

task-sch-0-img

  1. Create a new Task in Windows Task Scheduler.

task-sch-1-img

  2. Set up a Trigger that runs the task every day, every x hours.

task-sch-2-img

  3. In the Action tab, select the .bat file and the folder from which the Task will be executed.

task-sch-3-img
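
The same schedule could also be registered from the command line instead of the Task Scheduler interface. The sketch below only builds the argument list for the Windows `schtasks` utility and prints it. The task name, the batch file path and the helper `build_schtasks_command` are hypothetical, and the sketch does not set the start folder from step 3, so it assumes the batch file changes to its own directory.

```python
def build_schtasks_command(task_name, bat_path, every_hours):
    """Build the schtasks arguments for a task that repeats every N hours."""
    if not 1 <= every_hours <= 23:
        raise ValueError("every_hours must be between 1 and 23 for an HOURLY schedule")
    return [
        "schtasks", "/Create",
        "/TN", task_name,          # task name shown in Task Scheduler
        "/TR", bat_path,           # program to run (the batch file)
        "/SC", "HOURLY",           # schedule type
        "/MO", str(every_hours),   # repeat interval in hours
        "/F",                      # overwrite the task if it already exists
    ]


if __name__ == "__main__":
    # Hypothetical task name and path; adjust both to the local checkout.
    command = build_schtasks_command(
        "Covid19Scraper", r"C:\path\to\WebScraping_Covid19\run\run-win.bat", 6
    )
    # On Windows, this list could be passed to subprocess.run(command, check=True).
    print(" ".join(command))
```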

DataViz

The COVID-19 dashboard below was created to visually analyze the collected data.

dataviz-img

Documentation

Below are some useful links relevant to this project:

Contributing and Feedback

Any kind of feedback or criticism would be greatly appreciated (algorithm design, documentation, improvement ideas, spelling mistakes, etc.).

Author

License

This project is licensed under the terms of the MIT license.