I still have a daily job running a web scraper I first wrote with Scrapy back in 2017. I think I've had to update it 3 times over the years for changes to the site and web standards.
Good old government sites - rarely change!
Not a lawyer, but many terms of service prohibit interacting with their website in an automated fashion, as well as collecting their data. In my understanding, scraping a site with these terms already puts you in the wrong.
I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.
In almost all cases I view Web scraping as people trying to build businesses on top of other people's innovation and data. I know this isn't a popular opinion, so change my mind. At the same time, I'm one of those business owners who fights with Web scraping constantly, and my opinion is that those doing it to my platforms are doing so solely to steal data and build businesses on top of others' hard work.
As a lawyer whose primary focus is in web scraping, this article is in many ways misleading and inaccurate. While it is true that the Van Buren case is generally positive for web scraping, the overall legal landscape is still murky. The main battleground for web scraping legal issues is shifting from the CFAA to breach of contract and various state-law issues, including misappropriation, unjust enrichment, and trespass to chattels.
In my opinion, 2021 was a bad year for the law as it relates to web scraping. The Supreme Court remanded hiQ Labs, and many high-profile lower-court cases ended badly for web scrapers. It's a darker shade of gray than it was in 2020. It can be navigated, but it's tricky.
Cloudflare's blocks get in the way of many websites that are simply trying to get a "link preview" of the page, even if it is only a single request from a new IP. I wish they would offer some kind of alternative for the pages they serve instead of a captcha block.
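For context on how little a link preview needs: it usually comes down to one page fetch and a few Open Graph `<meta>` tags. Below is a minimal sketch using only the Python standard library, assuming the HTML has already been downloaded. `PreviewParser` and `build_preview` are hypothetical names made up for this illustration, and real preview services likely handle more cases (Twitter card tags, relative image URLs, encodings).

```python
from html.parser import HTMLParser


class PreviewParser(HTMLParser):
    """Collects Open Graph <meta> tags and the <title> text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}          # e.g. {"og:title": "..."}
        self.title = ""         # text inside <title>, used as a fallback
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content"):
                self.meta[prop] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_preview(html):
    """Return the fields a link preview typically shows (hypothetical helper)."""
    parser = PreviewParser()
    parser.feed(html)
    return {
        "title": parser.meta.get("og:title") or parser.title.strip(),
        "description": parser.meta.get("og:description", ""),
        "image": parser.meta.get("og:image", ""),
    }
```

The preview is built from a single response body, so the whole job is one request per link. That is the traffic pattern described above as getting caught by the captcha block.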
The web scraping ecosystem is growing, with more libraries, frameworks and products available than ever before to simplify our web scraping headaches so the future is looking bright.
My toolbox of choice for web scraping is either Nokogiri or Puppeteer.
Can someone sell me on Beautiful Soup or Scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?
With the right combination of proxies, user agents and browsers, you can scrape every website. Even those that seem unscrapable.
This outcome was great news for web scrapers, as it means that so long as a website has made its data public you are not in violation of the CFAA when you scrape the data, even if it is prohibited in some other way (T&Cs, robots.txt, etc).
Just because you can, doesn't mean you should. It would be better I think if there was a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.
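On the "should" side, the quoted passage mentions robots.txt, and honoring it takes very little code. Here is a minimal sketch with Python's standard `urllib.robotparser`, assuming the robots.txt text has already been fetched. `is_allowed` and the `my-bot` user agent are hypothetical names for illustration, and the check only reflects what the site operator asked for in robots.txt, not its T&Cs.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots.txt permits user_agent to fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: everything under /private/ is off limits to all bots.
robots = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(robots, "my-bot", "https://example.com/private/report"))
print(is_allowed(robots, "my-bot", "https://example.com/public"))
```

A scraper could call a check like this before every request and skip any URL the site has asked bots to leave alone.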
Separate from web scraping, there is the use of automation to perform normal allowable user actions on the site. That should be considered distinct from large-scale data extraction, no?
If anyone has anything else they think was missed or should be included then let me know!
For those in this thread with super-serious experience scraping and automating at scale who are looking for work (ethical!), please contact me directly.
I'm scraping about 30 sites for work at the moment, but have a few that are using Cloudflare which has been a b*tch to deal with. Tried numerous libraries and different proxy providers, but reliability is patchy. Previous fixes like https://github.com/Anorov/cloudflare-scrape don't seem to work anymore after Cloudflare updates, so I've switched to using a pretty optimised headless browser with good proxies instead.
Like most here, I am very good at web scraping and automated form fills. I keep trying to figure out a profitable side project or business idea to make out of it and keep coming up with nothing that works.
Any good ideas?
I have been interested in web scraping lately but never really dived too deep. Does anyone have more in-depth resources (GitHub projects, blogs, forums, etc) than the tutorials that are basically "install Beautiful Soup and get data from a tag"?
What does HN think of web scraping for the purpose of price comparison?
I’m asking this because I run a small side project to show prices across retailers for a very small niche. The users are very very happy. Even the vendors started contacting me to be listed on the comparison.
But I am unable to make a business out of it other than a few affiliate commissions.