bartosz-bear/web_scraping

Collection of web scraping scripts, spiders and crawlers

The collection consists of two parts:

  1. Scrapy spiders and crawlers built with Scrapy and Splash
  2. An asynchronous web scraping script using BeautifulSoup in combination with asyncio and aiohttp

Scrapy spiders and crawlers

There are 7 Scrapy spiders and crawlers organized into 4 Scrapy projects, all living in the scrapy-splash folder.

  1. best_movies crawler, scraping the live www.imdb.com. The crawler visits the 'Top 250 best movies of all time' ranking, which is split into 5 pages of 50 movies each, then visits each movie's URL to scrape detailed data about it.

  2. books crawler, scraping 1000 books from http://quotes.toscrape.com/. Deals with pagination: 10 pages with 100 books each (see the pagination sketch after this list).

  3. writers spider, also scraping from http://quotes.toscrape.com/: 83 quotes from 10 different pages. Deals with pagination and with dynamically generated JavaScript content using Splash.

  4. special_offers spider, scraping 500 products from an archived version of the now-defunct electronics e-commerce store www.tinydeal.com.

  5. fancy_glasses spider, scraping 250 products from the live eyewear e-commerce store https://www.glassesshop.com/.

  6. countries spider, scraping 4200 population data points for countries around the world between 1995 and 2020 from the live statistics site https://www.worldometers.info.

  7. debt_to_gdp spider, scraping the current debt-to-GDP ratio for 173 countries, also from https://www.worldometers.info.
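For context, here is a minimal sketch of how a Scrapy spider handles pagination of the kind these crawlers deal with. The spider name and CSS selectors are illustrative, not the repository's actual code:

import scrapy

class PaginationSketchSpider(scrapy.Spider):
    # Illustrative only; the repository's spiders differ in names and selectors.
    name = 'pagination_sketch'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one item per entry on the current page (hypothetical selectors).
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # Follow the "next" link until the pagination runs out.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Inside a Scrapy project you would run a spider like this with scrapy crawl pagination_sketch; the recursive response.follow call is what walks the crawler through all pages.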

Asynchronous web scraping script using BeautifulSoup with asyncio and aiohttp

This is version 2.0 of a web scraping script that takes a course category as input, scrapes all available courses in that category from Coursera.org, and saves them as an Excel file on your local drive. If you would rather see the output in the console instead of downloading the file, comment out line 170: df.to_excel('courses_final.xlsx', index=False, engine='xlsxwriter').
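For reference, that part of the script looks roughly like this (a sketch: the console alternative shown in the comment is illustrative, not necessarily what the script itself does):

# Near line 170 of async_coursera_scraper.py (illustrative):
df.to_excel('courses_final.xlsx', index=False, engine='xlsxwriter')
# To inspect the results in the console instead, comment out the
# line above and print the DataFrame, e.g.: print(df)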

Version 2.0 is 8 times faster on average than version 1.0. The performance improvement was achieved by replacing synchronous HTTP requests with asynchronous ones.
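A minimal sketch of the underlying pattern, assuming a list of page URLs (the placeholder URL and function names are illustrative; the actual script builds Coursera URLs and differs in detail):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    # One GET request; many of these run concurrently on the event loop.
    async with session.get(url) as response:
        return await response.text()

async def scrape_all(urls):
    # Reuse a single session (and its connection pool) for every request.
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
    # Parsing is still synchronous; only the network I/O overlaps.
    return [BeautifulSoup(html, 'lxml') for html in pages]

urls = ['http://quotes.toscrape.com/']  # placeholder URL for the sketch
soups = asyncio.run(scrape_all(urls))
print(soups[0].title.get_text())

Because asyncio.gather fires all requests at once and awaits them together, total runtime approaches that of the slowest single request rather than the sum of all of them, which is where the speedup over synchronous requests comes from.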

Example Output

How to run this script locally?

Create an isolated virtual environment

conda create -n scraper python=3.11

conda activate scraper

conda install -c conda-forge requests aiohttp beautifulsoup4 pandas lxml xlsxwriter

Open async_coursera_scraper.py in your favorite text editor

Uncomment the last line, #scrap('math-and-logic')

You can leave math-and-logic as the argument or replace it with one of the following categories: ['data-science', 'business', 'personal-development', 'language-learning', 'math-and-logic', 'physical-science-and-engineering']
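After uncommenting, the end of the script should look roughly like this (scrap is the script's own entry function; the category shown is interchangeable with any from the list above):

scrap('math-and-logic')  # or e.g. scrap('data-science')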

Run the script

python async_coursera_scraper.py

Voilà! After about 15 seconds you should see an Excel file with the scraping results on your hard drive.
