For a project I need to collect the names and addresses of all dental laboratories in Taiwan. Unfortunately, the Ministry of Health and Welfare doesn't provide the data in a structured format (CSV, JSON, etc.) for download. The data is only available as website tables showing just 10 records per page, and there are about 1000 pages... so I have to crawl it, having no faster or simpler option (or maybe I should contact the government officials, which would be faster?? 🤔...).
Choosing a Web Crawler
Scrapy is a highly structured framework: to use it, I'd need to follow its conventions, writing Python classes and middlewares, then run the crawl from the CLI. That well-defined structure does provide a clear separation of concerns. It's good for building a serious crawler, especially in a team where each member is responsible for a specific part of the project. However, it's a bit overkill for a one-man, simple crawling project like this one, so maybe next time.
Playwright, as a pytest plugin in Python, is more test-oriented and depends on pytest and its CLI. That adds another layer of complication, and it's not well suited to Jupyter's exploratory programming style. Therefore 🙅♀️
Another essential benefit of using Jupyter is that I can easily "deploy" the program online with Binder, so users can test it immediately by simply pressing , and tell me whether it fits their needs. This fast feedback loop significantly accelerates the development cycle 👍
So, the final stack for this simple crawler is: Python + Selenium + Jupyter -> Binder 🎉
- run online with
- or install and run locally
- clone the repo
- install dependencies
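For the local route, a minimal setup might look like the following. The package names are assumptions based on the stack described in this post (`pytesseract` is the usual Python wrapper around Tesseract), and the repository URL is a placeholder:

```shell
# Assumed dependencies for the Selenium + Jupyter + OCR stack described above.
# The repo URL is a placeholder; substitute the actual repository.
git clone https://example.com/your/repo.git
cd repo

# Python-side packages
pip install selenium pillow pytesseract jupyter

# Tesseract itself is a system binary, e.g. on Debian/Ubuntu:
# sudo apt-get install -y tesseract-ocr

jupyter notebook
```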
- Jupyter: exploratory programming in Python
    - efficiently try out any CSS selector / XPath combination
- Selenium: browser automation
    - first use normal (headed) mode to get visual feedback while exploring XPaths
    - then use headless mode in production
- Tesseract: Google's OCR engine for recognizing the CAPTCHA
- PIL (Pillow): create an image object from the CAPTCHA binary retrieved by Selenium
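Putting those pieces together, a minimal sketch of the crawl helpers might look like this. The XPaths are hypothetical placeholders (the real ones come out of the exploration step above), and the third-party imports are kept inside the functions so the sketch stays importable in a notebook even before a browser driver is installed:

```python
import io

def solve_captcha(png_bytes):
    """OCR a CAPTCHA image from Selenium's element.screenshot_as_png bytes."""
    from PIL import Image   # Pillow: build an image object from the raw bytes
    import pytesseract      # Python wrapper around the Tesseract binary
    image = Image.open(io.BytesIO(png_bytes))
    return pytesseract.image_to_string(image).strip()

def make_driver(headless=True):
    """Headed mode for exploring XPaths, headless mode for production runs."""
    from selenium import webdriver
    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)

def scrape_table(driver, url):
    """Return the rows of one 10-record results page as lists of cell text."""
    from selenium.webdriver.common.by import By
    driver.get(url)
    rows = driver.find_elements(By.XPATH, "//table//tr")  # hypothetical XPath
    return [
        [cell.text for cell in row.find_elements(By.TAG_NAME, "td")]
        for row in rows
    ]
```

Calling `make_driver(headless=False)` in a notebook cell shows the browser while trying out selectors; flipping the flag back to `True` gives the production run.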