We often need datasets to perform data-related tasks like data analytics, data cleaning and machine learning. Most data science courses taught in college and online simply hand the dataset to you.
I wanted to explore the data-creation part so that I could move forward in my data science curriculum.
I could create a dataset using spreadsheet software, but that is boring and time-consuming. So I wanted to automate this task and also try out Puppeteer.
The site I am scraping is Naukri. It is a job aggregator that does not provide any public API for consumption. This makes it the perfect choice for web scraping.
This project was made as a part of the assessment for my Cloud Computing class.
- NodeJS Runtime
- npm (Node Package Manager)
- Chromium or Chrome Browser (you don't have to install it separately if you don't have it already; installing the Puppeteer package automatically downloads a compatible browser)
- Azure Account
If you want to try this on your computer, clone the repository and install the dependencies using npm or yarn

```
git clone https://github.com/VarunGuttikonda/WebScraper.git
cd WebScraper
npm install
```

The project uses two packages:

- `Puppeteer` - A NodeJS library to run Chrome in headless mode and automate it using the DevTools protocol
- `Objects-to-csv` - A library to convert JSON objects into CSV strings and vice versa

```
npm install puppeteer objects-to-csv
```

The following concepts are used throughout the project (a short sketch illustrating them follows the list):

- `async`/`await` from JavaScript
- `ElementHandle` and `JSHandle` from Puppeteer
- Connection Strings and Bindings from Azure Functions
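Here is a minimal, self-contained sketch (not code from this repository) of how these pieces interact: Puppeteer launches headless Chrome, `async`/`await` drives the navigation, `ElementHandle`s are used to read text from the page, and `objects-to-csv` writes the result to disk. The URL and selectors are placeholders.

```js
const puppeteer = require('puppeteer');
const ObjectsToCsv = require('objects-to-csv');

(async () => {
  // Launch headless Chrome and open a new page
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // page.$$ returns an array of ElementHandle objects for the matched nodes
  const headings = await page.$$('h1');
  const rows = [];
  for (const handle of headings) {
    // evaluate runs the callback in the browser context against this element
    const text = await handle.evaluate(el => el.textContent.trim());
    rows.push({ heading: text });
  }

  // Convert the array of plain objects into a CSV file on disk
  await new ObjectsToCsv(rows).toDisk('./output.csv');
  await browser.close();
})();
```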
Azure Functions is a Functions-as-a-Service offering from Microsoft Azure. It provides serverless compute for workloads that can be written as a single function.
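In the JavaScript programming model, that single function is just an exported async function that receives a `context` object plus its trigger data. A minimal sketch of a timer-triggered function might look like this (the binding names `myTimer` and `outputBlob` are illustrative, not taken from this project):

```js
// index.js - the whole unit of compute is this one exported function
module.exports = async function (context, myTimer) {
  // context.log writes to the Functions runtime log stream
  context.log('Function triggered at', new Date().toISOString());

  // Assigning to context.bindings hands data to an output binding
  // declared in function.json (here, a binding named outputBlob)
  context.bindings.outputBlob = JSON.stringify({ ranAt: Date.now() });
};
```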
The following files serve as the configuration of this project:
- `function.json` - Defines the properties of a function, like the names of the bindings, the types of the bindings and their connection strings, along with their direction, i.e. whether a binding is input or output. Must be defined for each function separately (a sample follows this list)
- `host.json` - Describes the properties of the host to which the function will be deployed. Details like packages to be installed and extensions to use are a part of this file
- `local.settings.json` - Contains the metadata used by Azure Functions Core Tools to assist the developer in testing the app locally
- `package.json` - Contains the metadata of the project like name, author, GitHub link, packages used etc.
- `.gitignore` - Has a list of file names that the VCS (Git) shouldn't be tracking
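For reference, a `function.json` for a timer-triggered function with a blob output binding could look roughly like this. The schedule, binding names and the `AzureWebJobsStorage` connection setting are illustrative and not necessarily what this project uses:

```json
{
  "bindings": [
    {
      "name": "myTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 0 6 * * *"
    },
    {
      "name": "outputBlob",
      "type": "blob",
      "direction": "out",
      "path": "scrapes/jobs-{DateTime}.csv",
      "connection": "AzureWebJobsStorage"
    }
  ]
}
```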
- `scrape.js` - Exports the main scrape function, named `scrape`. This function takes care of creating the `Browser` and `Page` objects and then scrapes all the jobs on the site (a rough sketch of how the modules fit together follows this list)
- `constants.js` - Contains all the configuration, like the HTML selectors and the config for the `Browser` object
- `utils.js` - Has utilities for error handling and printing to the console
- `scrapeUtils.js` - Contains the code for navigating, clicking and scraping the website that is used in the `scrape` function
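The sketch below only illustrates how this kind of module split typically looks; it is not the actual contents of these files, and the selector values and helper shapes are invented for the example.

```js
// Illustrative sketch only: roughly how constants.js and scrape.js divide the work
const puppeteer = require('puppeteer');

// Configuration of this kind lives in constants.js (the values here are made up)
const SELECTORS = { jobCard: 'article.jobTuple', title: 'a.title' };
const BROWSER_CONFIG = { headless: true, args: ['--no-sandbox'] };

// A function along these lines is what scrape.js exports as `scrape`
async function scrape(url) {
  const browser = await puppeteer.launch(BROWSER_CONFIG);
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Collect one object per job card; navigation/clicking helpers would sit in scrapeUtils.js
  const jobs = await page.$$eval(
    SELECTORS.jobCard,
    (cards, titleSel) =>
      cards.map(card => {
        const link = card.querySelector(titleSel);
        return { title: link ? link.textContent.trim() : null };
      }),
    SELECTORS.title
  );

  await browser.close();
  return jobs;
}

module.exports = { scrape };
```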
This application was deployed to Azure Functions. If you want to deploy it to any other cloud platform, please use `scrape.js`, `constants.js`, `scrapeUtils.js` and `utils.js` as the base files. These export the scraping functionality.
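For example, outside Azure Functions you could drive the exported scraper from a plain Node entry point. The sketch below assumes `scrape` takes a URL and returns an array of job objects, which may differ from the actual signature:

```js
// run.js (hypothetical): reuse the scraping modules outside Azure Functions
const ObjectsToCsv = require('objects-to-csv');
const { scrape } = require('./scrape');

(async () => {
  // The URL argument and the return shape are assumptions for this example
  const jobs = await scrape('https://www.naukri.com/software-engineer-jobs');
  await new ObjectsToCsv(jobs).toDisk('./jobs.csv');
  console.log(`Wrote ${jobs.length} jobs to jobs.csv`);
})();
```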