Web Scraping unstructured data
Complexity of JavaScript websites
- Sparkledriver
Tools
- Enlive
- Scoopi - a tool to extract and transform data from web pages
- demeter - fast, concurrent web scraper with headless JavaScript execution
- web-scraper - library with fairly good JavaScript support
- Web Page Summarizer - GUI for fetching web pages and summarizing them; demonstrates Enlive and Compojure
- Scraper - JavaFX web engine and WebKit
- Abrade - scrapes web sites, even ones that rely heavily on JavaScript; uses the Java HtmlUnit library under the hood
- Etaoin - Clojure implementation of the WebDriver protocol
Example projects
- parkrun-app - Enlive
- clj-scraper - Enlive, http-kit, core.async
- ldnpyvideo - Scraper (from pyvideo.org) and web site for London PyCon video meetup
- nba-scraper - scraping NBA boxscore data from ESPN
- Clojure web scraping with Enlive
References
- Practicalli: Web Scraping with Clojure - Hacker News
- Clojure Data Analysis Cookbook: Scraping data from tables in web pages
- ClojureVerse: First time webscraper, could you give any pointers?
- Web Scraping with Clojure - http-kit and Enlive
- How to Scrape Modern Websites Without Headless Browsers - Python
Hint::Be respectful of data sources
Avoid sending a high number of requests to websites with unstructured data, as they are unlikely to have much capacity to serve them. Consider downloading the content once and working from the local copy to minimise requests to the website.
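
One way to follow this hint is to keep a local copy of each page and only contact the website when no copy exists yet. The following is a minimal sketch using only `clojure.core` and `clojure.java.io`. The names `cache-file` and `fetch-cached`, the cache directory layout and the `delay-ms` pause are assumptions made for illustration; they do not come from any of the libraries listed above.

```clojure
(require '[clojure.java.io :as io])

(defn cache-file
  "Hypothetical helper: map a URL to a file inside cache-dir.
  The file name is derived from the hash of the URL string."
  [cache-dir url]
  (io/file cache-dir (str (format "%08x" (hash url)) ".html")))

(defn fetch-cached
  "Return the content of url as a string.
  The website is only contacted when no local copy exists; in that case
  the thread first pauses for delay-ms milliseconds, then the content is
  saved under cache-dir. Later calls read the local copy."
  [cache-dir url delay-ms]
  (let [file (cache-file cache-dir url)]
    (when-not (.exists file)
      (io/make-parents file)
      (Thread/sleep (long delay-ms)) ; pause before each real request
      (spit file (slurp url)))
    (slurp file)))

;; Example use (assumed URL and directory):
;; (fetch-cached "cache" "https://example.com/" 1000)
```

The returned string can then be passed to whichever parsing library is in use, so repeated experiments at the REPL work from the local copy rather than the live site. This sketch never expires cached pages; delete the cache directory to force a fresh download.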