
The State of Web Scraping 2022

  • 291 points
  • 13 days ago

  • @Ian_Kerins
  • Created a post



@NDizzle 13 days

Replying to @Ian_Kerins 🎙

I still have a daily job running a web scraper I first wrote with Scrapy back in 2017. I think I've had to update it 3 times over the years for changes to the site and web standards.

Good old government sites - rarely change!



@bobblywobbles 13 days

Replying to @Ian_Kerins 🎙

Not a lawyer, but many terms of service prohibit interacting with their website in an automated fashion, as well as collecting their data. In my understanding, scraping a site with these terms already puts you in the wrong.



@blantonl 13 days

Replying to @Ian_Kerins 🎙

I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

In almost all cases I view web scraping as people trying to build businesses on top of other people's innovation and data. I know this isn't a popular opinion, so change my mind, but at the same time, I'm one of those business owners who fights with web scraping constantly, and my opinion is that those doing it to my platforms are doing so solely to steal data and build businesses on top of others' hard work.



@KieranMac 13 days

Replying to @Ian_Kerins 🎙

As a lawyer whose primary focus is in web scraping, this article is in many ways misleading and inaccurate. While it is true that the Van Buren case is generally positive for web scraping, the overall legal landscape is still murky. The main battleground for web scraping legal issues is shifting from the CFAA to breach of contract and various state-law issues, including misappropriation, unjust enrichment, and trespass to chattels.

In my opinion, 2021 was a bad year for the law as it relates to web scraping. The Supreme Court remanded hiQ Labs, and many high-profile lower-court cases ended badly for web scrapers. It's a darker shade of gray than it was in 2020. It can be navigated, but it's tricky.



@cblconfederate 13 days

Replying to @Ian_Kerins 🎙

Cloudflare's blocks get in the way of many websites that are simply trying to get a "link preview" of the page, even if it is only a single request from a new IP. I wish they would offer some kind of alternative for the pages they serve instead of a captcha block.
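
For context, a "link preview" usually needs only a few tags from a single page fetch. Below is a minimal sketch of that extraction step using only the Python standard library. It assumes the page exposes Open Graph `<meta>` tags and falls back to `<title>`; the names `PreviewParser` and `link_preview` are made up for illustration, and fetching the HTML is left out.

```python
from html.parser import HTMLParser


class PreviewParser(HTMLParser):
    """Collects Open Graph <meta> tags and the <title> text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Attribute values can be None (e.g. <meta property>), so guard.
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content"):
                self.meta[prop] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def link_preview(html):
    """Return title, description and image for a preview card (hypothetical helper)."""
    parser = PreviewParser()
    parser.feed(html)
    return {
        "title": parser.meta.get("og:title") or parser.title.strip(),
        "description": parser.meta.get("og:description", ""),
        "image": parser.meta.get("og:image", ""),
    }
```

The point is that a preview is one request and a handful of tags, which is quite different from bulk extraction.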



@JJxFile 12 days

Replying to @Ian_Kerins 🎙

The web scraping ecosystem is growing, with more libraries, frameworks and products available than ever before to simplify our web scraping headaches, so the future is looking bright.



@fareesh 12 days

Replying to @Ian_Kerins 🎙

My toolbox of choice for web scraping is either Nokogiri or Puppeteer.

Can someone sell me on Beautiful Soup or Scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?



@mellosouls 13 days

Replying to @Ian_Kerins 🎙

With the right combination of proxies, user agents and browsers, you can scrape every website. Even those that seem unscrapable.

:

This outcome was great news for web scrapers, as it means that so long as a website has made its data public you are not in violation of the CFAA when you scrape the data even if it is prohibited in some other way (T&Cs, robots.txt, etc).

Just because you can doesn't mean you should. It would be better, I think, if there were a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.
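
On the practical side of "should", one small step is to check a site's robots.txt before fetching anything. Here is a rough sketch using Python's standard `urllib.robotparser`. The helper names (`build_policy`, `may_fetch`, `polite_delay`) and the one-second default delay are assumptions for illustration, and none of this says anything about a site's terms of service or the legal questions discussed elsewhere in the thread.

```python
from urllib.robotparser import RobotFileParser


def build_policy(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """True if the parsed robots.txt rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser, user_agent, default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else an assumed default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

For example, with a robots.txt containing `Disallow: /private/` under `User-agent: *`, `may_fetch(policy, "example-bot", "https://example.com/private/data")` returns `False`, so a well-behaved bot would skip that URL.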



@darepublic 12 days

Replying to @Ian_Kerins 🎙

Separate from web scraping, there is the use of automation to perform normal allowable user actions on the site. That should be considered distinct from large-scale data extraction, no?



@Ian_Kerins 13 days

Replying to @Ian_Kerins 🎙

If anyone has anything else they think was missed or should be included then let me know!



@ok_coo 13 days

Replying to @Ian_Kerins 🎙

Time for me to advocate again for people to use Common Crawl. Please don't slam people's websites; look for alternatives before scraping. There are probably other, better options: APIs, data set downloads, etc.

https://commoncrawl.org/
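
As a starting point, Common Crawl publishes a URL index that can be queried over HTTP, so you can find existing captures of a site instead of hitting it yourself. The sketch below only builds the query URL. The crawl label `CC-MAIN-2022-05` and the `limit` parameter are assumptions here, so check the list of available crawls on the index server before relying on them; `index_query_url` is a made-up helper name.

```python
from urllib.parse import urlencode

# Public Common Crawl index server.
INDEX_HOST = "https://index.commoncrawl.org"


def index_query_url(url_pattern, crawl_id="CC-MAIN-2022-05", limit=None):
    """Build an index query URL for captures matching url_pattern.

    crawl_id is an assumed example label; substitute a crawl that exists.
    """
    params = {"url": url_pattern, "output": "json"}
    if limit is not None:
        params["limit"] = limit
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"
```

Fetching the resulting URL should return one JSON record per line, each pointing at the archive file and byte range where the capture lives, so the original site never sees the traffic.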



@slvrspoon 12 days

Replying to @Ian_Kerins 🎙

For those in this thread with super-serious experience scraping and automating at scale who are looking for (ethical!) work, please contact me directly.



@joe_91 13 days

Replying to @Ian_Kerins 🎙

I'm scraping about 30 sites for work at the moment, but have a few that are using Cloudflare which has been a b*tch to deal with. Tried numerous libraries and different proxy providers, but reliability is patchy. Previous fixes like https://github.com/Anorov/cloudflare-scrape don't seem to work anymore after Cloudflare updates, so I've switched to using a pretty optimised headless browser with good proxies instead.



@newsbinator 13 days

Replying to @Ian_Kerins 🎙

Like most here, I am very good at web scraping and automated form fills. I keep trying to figure out a profitable side project or business idea to make out of it and keep coming up with nothing that works.

Any good ideas?



@coverj 12 days

Replying to @Ian_Kerins 🎙

I have been interested in web scraping lately but have never really dived too deep. Does anyone have more in-depth resources (GitHub projects, blogs, forums, etc.) than the tutorials that are basically "install Beautiful Soup and get data from a tag"?



@gmanis 12 days

Replying to @Ian_Kerins 🎙

What does HN think of web scraping for the purpose of price comparison?

I’m asking this because I run a small side project to show prices across retailers for a very small niche. The users are very, very happy. Even the vendors have started contacting me to be listed on the comparison.

But I am unable to make a business out of it other than a few affiliate commissions.


