The collection consists of two parts:
- Scrapy spiders and crawlers built with `scrapy` and `splash`
- An asynchronous web scraping script using `BeautifulSoup` in combination with `asyncio` and `aiohttp`
There are 7 Scrapy spiders and crawlers organized in 4 Scrapy projects, all living in the `scrapy-splash` folder.
- `best_movies` crawler, scraping from the live www.imdb.com. The crawler visits the 'Top 250 best movies of all time' ranking, which is split into 5 pages of 50 movies each, then visits each movie's URL to scrape detailed data about it (see the pagination sketch after this list).
- `books` crawler, scraping 1000 books from http://quotes.toscrape.com/. Deals with pagination: 10 pages with 100 books each.
- `writers` spider, also scraping from http://quotes.toscrape.com/: 83 quotes from 10 different pages. Deals with pagination and JavaScript-generated (dynamic) content using `splash` (see the Splash sketch below).
- `special_offers` spider, scraping 500 products from an archived version of the currently defunct electronics e-commerce store www.tinydeal.com.
- `fancy_glasses` spider, scraping 250 products from the live eyewear e-commerce store https://www.glassesshop.com/.
- `countries` spider, scraping 4200 population data points for countries around the world between 1995 and 2020 from the live economics page https://www.worldometers.info.
- `debt_to_gdp` spider, scraping the current Debt-to-GDP ratio for 173 countries, also from https://www.worldometers.info.
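Most of the crawlers above share the same pattern: paginate through a listing, then follow each item's link to a detail page. A minimal sketch of that pattern (the URL, selectors, and field names are hypothetical; the real spiders live in the `scrapy-splash` folder):

```python
import scrapy


class MoviesSketchSpider(scrapy.Spider):
    """Sketch of the paginate-then-follow pattern used by the crawlers above."""
    name = "movies_sketch"
    # Hypothetical start URL; the real crawler targets the IMDb Top 250 ranking.
    start_urls = ["https://example.com/ranking?page=1"]

    def parse(self, response):
        # Follow each item on the listing page to its detail page.
        for href in response.css("td.title a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

        # Deal with pagination: follow the 'next' link until it disappears.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # Scrape the detailed data from the item's own page.
        yield {
            "title": response.css("h1::text").get(),
            "rating": response.css("span.rating::text").get(),
        }
```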
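For JavaScript-rendered pages like the one `writers` targets, a plain `scrapy.Request` only sees the initial HTML, so the request is routed through Splash, which renders the page first. A minimal sketch, assuming a Splash instance is running and wired up in `settings.py` (the selectors are hypothetical, not the spider's actual code):

```python
import scrapy
from scrapy_splash import SplashRequest


class WritersSketchSpider(scrapy.Spider):
    name = "writers_sketch"

    def start_requests(self):
        # SplashRequest sends the request through Splash, which executes
        # the page's JavaScript before returning the rendered HTML.
        yield SplashRequest(
            "http://quotes.toscrape.com/js/",
            callback=self.parse,
            args={"wait": 1},  # give the JS a moment to render
        )

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "author": quote.css("small.author::text").get(),
                "text": quote.css("span.text::text").get(),
            }
```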
This is version 2.0 of a web scraping script that takes a course category as input, scrapes all available courses in that category from Coursera.org, and saves them as an Excel file on your local drive. If you don't want to download the file and would rather see the output in the console, comment out line 170: `df.to_excel('courses_final.xlsx', index=False, engine='xlsxwriter')`.
Version 2.0 is on average 8 times faster than version 1.0. The performance improvement was achieved by replacing synchronous HTTP requests with asynchronous ones.
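The speedup comes from issuing all page requests concurrently instead of one at a time. A minimal sketch of the pattern (the URLs and selector are hypothetical placeholders, not the script's actual code):

```python
import asyncio

import aiohttp
import pandas as pd
from bs4 import BeautifulSoup


async def fetch(session, url):
    # Each request yields control while waiting on the network,
    # so many pages download concurrently in a single thread.
    async with session.get(url) as response:
        return await response.text()


async def scrape(urls):
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
    rows = []
    for html in pages:
        soup = BeautifulSoup(html, "lxml")
        # Hypothetical selector; the real script parses Coursera course cards.
        for course in soup.select("h2.course-name"):
            rows.append({"course": course.get_text(strip=True)})
    return pd.DataFrame(rows)


if __name__ == "__main__":
    # Hypothetical listing-page URLs for one category.
    urls = [f"https://example.com/courses?page={n}" for n in range(1, 6)]
    df = asyncio.run(scrape(urls))
    df.to_excel("courses_final.xlsx", index=False, engine="xlsxwriter")
```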
```bash
conda create -n scraper python=3.11
conda activate scraper
conda install -c conda-forge requests aiohttp beautifulsoup4 pandas lxml xlsxwriter
```
Uncomment the last line: `# scrap('math-and-logic')`.
You can leave `math-and-logic` as the argument, or replace it with one of the following categories:
`['data-science', 'business', 'personal-development', 'language-learning', 'math-and-logic', 'physical-science-and-engineering']`
```bash
python async_coursera_scraper.py
```
Voilà! After about 15 seconds, you should see an Excel file with the scraping results on your hard drive.
