22 comments

  • dang 2 hours ago
    Recent and related:

    AI companies cause most of traffic on forums - https://news.ycombinator.com/item?id=42549624 - Dec 2024 (438 comments)

  • ericholscher 4 hours ago
    This keeps happening -- we wrote about multiple AI bots that were hammering us over at Read the Docs for >10TB of traffic: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...

    They really are trying to burn all their goodwill to the ground with this stuff.

    • PaulHoule 4 hours ago
      In the early 2000s I was working at a place that Google wanted to crawl so badly that they gave us a hotline number to call if their crawler was giving us problems.

      We were told at that time that "robots.txt" enforcement was the one thing they had that wasn't fully distributed; it's a devilishly difficult thing to implement.

      It boggles my mind that people with the kind of budget some of these companies have are still struggling to implement crawling right 20 years later, though. It's nice those folks got a rebate.

      One of the reasons people are testy today is that you pay by the GB w/ cloud providers; about 10 years ago I kicked out the sinosphere crawlers like Baidu because they were generating something like 40% of the traffic on my site, crawling over and over again and not sending even a single referrer.

    • TuringNYC 4 hours ago
      Serious question - if robots.txt is not being honored, is there a risk of a class action from tens of thousands of small sites against both the companies doing the crawling and the individual directors/officers of these companies? Seems there would be some recourse if this is done at a large enough scale.
      • krapp 4 hours ago
        No. robots.txt is not in any way a legally binding contract, no one is obligated to care about it.
        • vasco 3 hours ago
          If I have a "no publicity" sign on my mailbox and you dump 500 lbs of flyers and magazines by my door every week for a month and cause me to lose money dealing with all the trash, I think I'd have reasonable grounds to sue even if there's no contract saying you need to respect my wish.

          At the end of the day the claim is that someone's action caused someone else undue financial burden in a way that is not easily prevented beforehand, so I wouldn't say it's a 100% clear case, but I'm also not sure a judge wouldn't entertain it.

          • oldpersonintx 1 hour ago
            [dead]
          • krapp 3 hours ago
            I don't think you can sue over what amounts to an implied gentleman's agreement that one side never even agreed to and win, but if you do, let us know.
            • boredatoms 3 hours ago
              You can sue whenever anyone harms you
              • krapp 3 hours ago
                I didn't say no one could sue, anyone can sue anyone for anything if they have the time and the money. I said I didn't think someone could sue over non-compliance with robots.txt and win.

                If it were possible, someone would have done it by now. It hasn't happened because robots.txt has absolutely no legal weight whatsoever. It's entirely voluntary, which means it's perfectly legal not to volunteer.

                But if you or anyone else wants to waste their time tilting at legal windmills, have fun ¯\_(ツ)_/¯.

                • vasco 2 hours ago
                  You don't even need to mention robots.txt; plenty of people have been sued for crawling and had to stop and pay damages. Just look up "crawling lawsuits".
                • macintux 2 hours ago
                  Your verbs, “sue” and “win”, are separated by ~16 words of flowery language. It’s not surprising that people gave up partway through and reacted to just the first verb.
                  • echoangle 2 hours ago
                    The „well everyone can sue anyone for anything" is a non-serious gotcha answer anyway. If someone asks „can I sue XY because of YZ", they always mean „and have a chance of winning". Just suing someone without any chance of winning isn’t very interesting.
                  • krapp 54 minutes ago
                    "flowery language?" It's just a regular sentence with regular words in it.
        • ericmcer 3 hours ago
          You can sue over literally anything; the parent commenter could sue you if they could demonstrate your reply damaged them in some way.
        • jdenning 3 hours ago
          We need a way to apply a click-through "user agreement" to crawlers
    • huntoa 3 hours ago
      Did I read it right that you pay $62.50/TB?
    • Uptrenda 3 hours ago
      Hey man, I wanted to say good job on Read the Docs. I use it for my Python project and find it an absolute pleasure to use. I write my stuff in reStructuredText, make lots of pretty diagrams (lol), and am slowly making my docs easier to use. Good stuff.

      Edit 1: I'm surprised by the bandwidth costs. I use Hetzner and OVH and the bandwidth is free, though you manage the bare-metal server yourself. Would Read the Docs ever consider switching to self-managed hosting to save costs on cloud hosting?

    • exe34 4 hours ago
      can you feed them gibberish?
      • blibble 4 hours ago
        here's a nice project to automate this: https://marcusb.org/hacks/quixotic.html

        couple of lines in your nginx/apache config and off you go

        my content-rich sites provide this "high quality" data to the parasites

      • Groxx 4 hours ago
        LLMs poisoned by https://git-man-page-generator.lokaltog.net/ -like content would be a hilarious end result, please do!
      • jcpham2 4 hours ago
        This would be my elegant solution, something like an endless recursion with a gzip bomb at the end if I can identify your crawler and it’s that abusive. Would it be possible to feed an abusing crawler nothing but my own locally-hosted LLM gibberish?

        But then again, if you're in the cloud, egress bandwidth is going to cost you for playing this game.

        Better to just deny the OpenAI crawler and send them an invoice for the money and time they’ve wasted. Interesting form of data warfare against competitors and non-competitors alike. The winner will have the longest runway.

        • actsasbuffoon 4 hours ago
          It wouldn’t even necessarily need to be a real GZip bomb. Just something containing a few hundred kb of seemingly new and unique text that’s highly compressible and keeps providing “links” to additional dynamically generated gibberish that can be crawled. The idea is to serve a vast amount of poisoned training data as cheaply as possible. Heck, maybe you could even make a plugin for NGINX to recognize abusive AI bots and do this. If enough people install it then you could provide some very strong disincentives.
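
          Roughly, the idea in code - a minimal sketch, assuming a plain WSGI app (the word list, URL scheme, and sizes are made up for illustration): every URL deterministically yields a page of highly compressible pseudo-text plus links to more generated pages, so the maze is endless but costs almost nothing to serve.

            # Sketch of a cheap, endless "maze" for misbehaving crawlers: pages are
            # generated deterministically from the URL, so no state is stored, and
            # the repetitive text compresses extremely well.
            import hashlib
            import random
            from wsgiref.simple_server import make_server

            WORDS = ["lorem", "ipsum", "quantum", "gerbil", "syntax", "marmalade"]

            def page(path):
                rng = random.Random(hashlib.sha256(path.encode()).digest())
                text = " ".join(rng.choice(WORDS) for _ in range(500))
                links = " ".join('<a href="/maze/%08x">more</a>' % rng.getrandbits(32)
                                 for _ in range(10))
                return "<html><body><p>%s</p>%s</body></html>" % (text, links)

            def app(environ, start_response):
                body = page(environ.get("PATH_INFO", "/")).encode()
                start_response("200 OK", [("Content-Type", "text/html")])
                return [body]

            if __name__ == "__main__":
                make_server("127.0.0.1", 8080, app).serve_forever()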
      • GaggiX 3 hours ago
        The dataset is curated, very likely with a previously trained model, so gibberish is not going to do anything.
        • exe34 3 hours ago
          how would a previously trained model know that Elon doesn't smoke old socks?
          • GaggiX 3 hours ago
            An easy way is to give the model the URL of the page so it can weigh the content based on the reputation of the source. Of course the model doesn't know about future events, but gibberish is gibberish, and that's quite easy to filter, even without knowing the source.
    • jcgrillo 3 hours ago
      [flagged]
      • jsheard 2 hours ago
        Judging by how often these scrapers keep pulling the same pages over and over again I think they're just hoping that more data will magically come into existence if they check enough times. Like those vuln scanners which ping your server for WordPress exploits constantly, just in case your not-WordPress site turned into a WordPress site since they last looked 5 minutes ago.
      • KTibow 2 hours ago
        I personally predict this won't be as bad as it sounds since training on synthetic data usually goes well (see Phi)
      • spacecadet 2 hours ago
        <3
        • jcgrillo 1 hour ago
          don't be hangin around no catfish elon
  • joelkoen 3 hours ago
    > “OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.

    The IP addresses in the screenshot are all owned by Cloudflare, meaning that their server logs are only recording the IPs of Cloudflare's reverse proxy, not the real client IPs.

    Also, the logs don't show any timestamps and there doesn't seem to be any mention of the request rate in the whole article.

    I'm not trying to defend OpenAI, but as someone who scrapes data I think it's unfair to throw around terms like "DDoS attack" without providing basic request rate metrics. This seems to be purely based on the use of multiple IPs, which was actually caused by their own server configuration and has nothing to do with OpenAI.

    • mvdtnz 3 hours ago
      Why should web store operators have to be sophisticated enough to use exactly the right technical language in order to have a legitimate grievance?

      How about this: these folks put up a website in order to serve customers, not for OpenAI to scoop up all their data for their own benefit. In my opinion data should only be made available to "AI" companies on an opt-in basis, but given today's reality OpenAI should at least be polite about how they harvest data.

  • jonas21 3 hours ago
    It's "robots.txt", not "robot.txt". I'm not just nitpicking -- it's a clear signal the journalist has no idea what they're talking about.

    That, and the fact that they're using a log file with the timestamps omitted as evidence of "how ruthlessly an OpenAI bot was accessing the site", makes the claims in the article a bit suspect.

    OpenAI isn't necessarily in the clear here, but this is a low-quality article that doesn't provide much signal either way.

    • peterldowns 2 hours ago
      Well said, I agree with you.
    • Thoreandan 2 hours ago
      Hear hear. Poor article going out the door for publication with zero editorial checking.
      • joelkoen 2 hours ago
        Haha yeah just noticed they call Bytespider "TokTok's crawler" too
  • griomnib 4 hours ago
    I’ve been a web developer for decades, as well as doing scraping, indexing, and analyzing millions of sites.

    Just follow the golden rule: don’t ever load any site more aggressively than you would want yours to be.

    This isn’t hard stuff, and these AI companies have grossly inefficient and obnoxious scrapers.

    As a site owner this pisses me off as a matter of decency on the web, but as an engineer doing distributed data collection I’m offended by how shitty and inefficient their crawlers are.

    • PaulHoule 4 hours ago
      I worked at one place where it probably cost us 100x more (in CPU) to serve content the way we were doing it as opposed to the way most people would do it. We could afford it because it was still cheap, but we deferred the cost reduction work for half a decade and went to war against webcrawlers instead. (hint: who introduced the robots.txt standard?)
    • mingabunga 3 hours ago
      We've had to block a lot of these bots as they slowed our technical forum to a crawl, but new ones appear every now and again. Amazon's was the worst.
      • griomnib 1 hour ago
        I really wonder if these dogshit scrapers are wholly built by LLMs. Nobody competent codes like this.
    • add-sub-mul-div 32 minutes ago
      These people think they're on the verge of the most important invention in modern history. Etiquette means nothing to them. They would probably consider an impediment to their work a harm to the human race.
  • spwa4 2 hours ago
    It's funny how history repeats. The web originally grew because it was a way to get "an API" into a company. You could get information without a phone call. Then, with forms and credit cards and eventually with actual APIs, you could get information and get companies to do stuff via an API. For a short while this was possible.

    Now everybody calls this abuse. And a lot of it is abuse, to be fair.

    Now that has been mostly blocked. Every website tries really hard to block bots (and mostly fails, because Google puts millions of dollars into its crawler while companies raise a stink over paying a single SWE), but we're still at the point where automated interactions with companies (through third-party services, for example) are not really possible. I cannot give my credit card info to a company and have it order my favorite foods to my home every day, for example.

    What AI promises, in a way, is to re-enable this. Because AI bots are unblockable (they're more human than humans as far as these tests are concerned). For companies, and for users. And that would be a way to... put APIs into people and companies again.

    Back to step 1.

    • afavour 2 hours ago
      I see it as different history repeating: VC capital inserting itself as the middleman between people and things they want. If all of our interactions with external web sites now go through ChatGPT that gives OpenAI a phenomenal amount of power. Just like Google did with search.
  • PaulHoule 4 hours ago
    The first time I heard this story it was '98 or so, and the perp was somebody in the overfunded CS department and the victim somebody in the underfunded math department on the other side of a short and fat pipe. (Probably running Apache httpd on an SGI workstation without enough RAM to even run Win '95.)

    In years of running webcrawlers I've had very little trouble, yet I've had more trouble in the last year than in the previous 25. (Wrote my first crawler in '99; funny how my crawlers have gotten simpler over time, not more complex.)

    In one case I found a site got terribly slow although I was hitting it at much less than 1 request per second. Careful observation showed the wheels were coming off the site and it had nothing to do with me.

    There's another site that I've probably crawled in its entirety at least ten times over the past twenty years. I have a crawl from two years ago; my plan was to feed it into a BERT-based system, not for training but to discover content that is like the content that I like. I thought I'd get a fresh copy w/ httrack (polite, respects robots.txt, ...) and they blocked both my home IP addresses in 10 minutes. (Granted, I don't think the past 2 years of this site were as good as what came before, so I will just load what I have into my semantic search & tagging system and use that instead.)

    I was angry about how unfair the Google Economy was in 2013, in line with what this blogger has been saying ever since:

    http://www.seobook.com/blog

    (I can say it's a strange way to market an expensive SEO community but...) and it drives me up the wall that people looking in the rear view mirror are getting upset about it now.

    Back in '98 I was excited about "personal webcrawlers" that could be your own web agent. On one hand, LLMs could give so much utility in terms of classification, extraction, clustering and otherwise drinking from that firehose, but the fear that somebody is stealing their precious creativity is going to close the door forever... and entrench a completely unfair Google Economy. It makes me sad.

    ----

    Oddly those stupid ReCAPTCHAs and Cloudflare CAPTCHAs torment me all the time as a human but I haven't once had them get in the way of a crawling project.

  • vzaliva 3 hours ago
    From the article:

    "As Tomchuk experienced, if a site isn’t properly using robot.txt, OpenAI and others take that to mean they can scrape to their hearts’ content."

    The takeaway: check your robots.txt.

    The question of how much load robots can reasonably generate when they are allowed is a separate matter.

    • krapp 3 hours ago
      Also probably consider blocking them with .htaccess or your server's equivalent, such as here: https://ethanmarcotte.com/wrote/blockin-bots/

      All this effort is futile because AI bots will simply send false user agents, but it's something.
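
      For what it's worth, the "server's equivalent" version is only a few lines at the app layer too - a minimal sketch as a WSGI middleware, assuming the bots identify themselves honestly (the user-agent list is illustrative, not exhaustive):

        # Sketch: refuse requests whose User-Agent matches known AI crawlers.
        # Only helps against bots that send an honest User-Agent header.
        BLOCKED_UA = ("gptbot", "claudebot", "ccbot", "bytespider")  # illustrative list

        def block_ai_bots(app):
            def middleware(environ, start_response):
                ua = environ.get("HTTP_USER_AGENT", "").lower()
                if any(bot in ua for bot in BLOCKED_UA):
                    start_response("403 Forbidden", [("Content-Type", "text/plain")])
                    return [b"Crawling not permitted.\n"]
                return app(environ, start_response)
            return middleware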

      • Sesse__ 3 hours ago
        I took my most-bothered page IPv6-only, and the AI bots vanished in the course of a couple of days :-) (Hardly any complaints from actual users yet. Not zero, though.)
  • tonetegeatinst 1 hour ago
    What options exist if you want to handle this traffic and you own your hardware on prem?

    It seems that any router or switch over 100G is extremely expensive, and often requires some paid-for OS.

    The pro move would be to not block these bots. Well, I guess block them if you truly can't handle their request throughput (would an ASN blacklist work?).

    Or, if you want to force them to slow down, send only a random percentage of responses (say, ignore 85% of the traffic they spam you with and reply to the rest at a super low rate, or purposely send bad data).

    Or perhaps reach out to your peering partners and talk about traffic-shaping these requests.

  • methou 2 hours ago
    I used to have problems with some Chinese crawlers. First I told them no with robots.txt, then I saw a swarm of non-bot user agents from cloud providers in China, so I blocked their ASNs. Then I saw another wave of IPs from Chinese ISPs, so eventually I had to block the entire country_code = cn and just show them a robots.txt.
  • Hilift 2 hours ago
    People who have published books on Amazon recently have noticed that fraudulent knockoff copies, with the title slightly changed, appear almost immediately. These are created by AI and are competing with humans. A person this happened to was recently interviewed about the experience on the BBC.
  • readyplayernull 2 hours ago
    > has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing.

    It's time to level up in this arms race. Let's stop delivering HTML documents and use animated rendering of information positioned in a scene, so that the user has to move elements around for it to be recognizable, like a full-site captcha. It doesn't need to be overly complex for the user, who can intuitively navigate even a 3D world, but it will take 1000x more processing for OpenAI. Feel free to come up with your own creative designs to make automation more difficult.

    • liamwire 2 hours ago
      Sounds entirely at odds with any accessibility requirements.
  • andrethegiant 3 hours ago
    I'm working on fixing this exact problem[1]. Crawlers are gonna keep crawling no matter what, so a solution to meet them where they are is to create a centralized platform that builds in an edge TTL cache, respects robots.txt and retry-after headers out of the box, etc. If there is a convenient and affordable solution that plays nicely with websites, the hope is that devs will gravitate towards the well-behaved solution.

    [1] https://crawlspace.dev
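
    For anyone rolling their own instead, the polite-by-default behavior isn't much code - a rough sketch using the standard library's robots.txt parser plus the requests package (the user agent string and delays are placeholders, and a real crawler would cache robots.txt per host):

      # Sketch of a "well-behaved" fetch: check robots.txt, honor Crawl-delay,
      # and back off when the server answers 429 with a Retry-After header.
      import time
      import urllib.robotparser
      from urllib.parse import urlparse

      import requests  # third-party: pip install requests

      UA = "ExampleCrawler/0.1 (+https://example.com/bot)"  # placeholder identity

      def polite_get(url):
          root = "{0.scheme}://{0.netloc}".format(urlparse(url))
          rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
          rp.read()
          if not rp.can_fetch(UA, url):
              return None  # robots.txt says no, so stop here
          time.sleep(rp.crawl_delay(UA) or 1)  # default to roughly 1 req/s
          resp = requests.get(url, headers={"User-Agent": UA}, timeout=30)
          if resp.status_code == 429:
              # Retry-After can also be an HTTP date; the numeric case is shown.
              time.sleep(int(resp.headers.get("Retry-After", "60")))
              resp = requests.get(url, headers={"User-Agent": UA}, timeout=30)
          return resp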

  • philipwhiuk 1 hour ago
    I had the same problem with my club's website.
  • atleastoptimal 4 hours ago
    Stuff like this will happen to all websites soon due to AI agents let loose on the web
  • OutOfHere 3 hours ago
    Sites should learn to use HTTP error 429 to slow down bots to a reasonable pace. If the bots are coming from a subnet, apply it to the subnet, not to the individual IP. No other action is needed.
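
    A rough sketch of what that could look like at the application layer - a fixed-window counter keyed on the /24 rather than the single IP (the window and limit are arbitrary, and a real setup would do this in the load balancer or a shared store):

      # Sketch: rate-limit by /24 subnet instead of individual IP, so rotating
      # addresses within the same block doesn't evade the limit. Callers should
      # answer 429 with Retry-After instead of serving the full response.
      import time
      import ipaddress
      from collections import defaultdict

      WINDOW = 60    # seconds per window (arbitrary)
      LIMIT = 300    # requests allowed per subnet per window (arbitrary)
      counters = defaultdict(lambda: [0.0, 0])   # subnet -> [window_start, count]

      def retry_after(ip):
          """Return None if the request is allowed, else seconds to wait."""
          subnet = str(ipaddress.ip_network(ip + "/24", strict=False))  # IPv4 shown
          now = time.time()
          window_start, count = counters[subnet]
          if now - window_start > WINDOW:
              counters[subnet] = [now, 1]
              return None
          if count >= LIMIT:
              return int(WINDOW - (now - window_start)) + 1
          counters[subnet][1] += 1
          return None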
    • Sesse__ 3 hours ago
      I've seen _plenty_ of user agents that respond to 429 by immediately trying again. Like, literally immediately; full hammer. I had to eventually automatically blackhole IP addresses that got 429 too often.
      • jcgrillo 3 hours ago
        It seems like it should be pretty cheap to detect violations of Retry-After on a 429 and just automatically blackhole that IP for idk 1hr.

        It could also be an interesting dataset for exposing the IPs those shady "anonymous scraping" comp intel companies use..
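
        Something like this, maybe - an in-memory, single-process sketch (timings and data structures are arbitrary): record when each IP was told to back off, and blackhole any IP that comes back before its Retry-After has elapsed.

          # Sketch: if an IP that was sent 429 + Retry-After returns before that
          # time has passed, drop it entirely for an hour.
          import time

          BLACKHOLE_SECS = 3600
          told_to_wait = {}   # ip -> timestamp before which it must not return
          blackholed = {}     # ip -> timestamp when the block expires

          def note_429(ip, retry_after_secs):
              told_to_wait[ip] = time.time() + retry_after_secs

          def is_blackholed(ip):
              now = time.time()
              if ip in blackholed:
                  if now < blackholed[ip]:
                      return True
                  del blackholed[ip]
              if now < told_to_wait.get(ip, 0):
                  # Came back before Retry-After elapsed: blackhole for an hour.
                  blackholed[ip] = now + BLACKHOLE_SECS
                  return True
              return False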

      • OutOfHere 2 hours ago
        That is just what a bot does by default. It will almost always give up after a few retries.

        The point of 429 is that you will not be using up your limited bandwidth sending the actual response, which will save you at least 99% of your bandwidth quota. It is not to find IPs to block, especially if the requestor gives up after a few requests.

        The IPs that you actually need to block are the ones that are actually DoSing you without stopping even after a few retries, and even then only temporarily.

  • ThrowawayTestr 4 hours ago
    Has anyone been successfully sued for excess hosting costs due to scraping?
    • neom 4 hours ago
      https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn makes it clear scraping publicly available data is generally not a CFAA violation. Certainly it would have to be a civil matter, but I doubt it would work (ianal)
      • ericholscher 4 hours ago
        We did get $7k out of one of the AI companies based on the massive bandwidth costs they caused us.

        https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...

        • neom 4 hours ago
          wow GOOD JOB!!! Were they relatively decent about it, is that why? I feel like normal businesses that are not super shady should be able to accept this kind of conversation and deal with the mistake and issue they caused for you.

          Good job pursuing it though, that's fantastic. (ps, big fan of your product, great work on that too!)

  • 1oooqooq 3 hours ago
    Is there a place with a list of the AWS servers these companies use, so they can be blocked?
  • more_corn 3 hours ago
    • 1oooqooq 3 hours ago
      they are probably hosting the bots
  • ldehaan 4 hours ago
    [dead]
  • peebee67 4 hours ago
    Greedy and relentless OpenAI's scraping may be, but that his web-based startup didn't have a rudimentary robots.txt in place seems inexcusably naive. Correctly configuring this file has been one of the most basic steps of web design for as long as anyone can remember, and missing it doesn't speak highly of the company's technical acumen.

    >“We’re in a business where the rights are kind of a serious issue, because we scan actual people,” he said. With laws like Europe’s GDPR, “they cannot just take a photo of anyone on the web and use it.”

    Yes, and protecting that data was your responsibility, Tomchuk. You dropped the ball and are now trying to blame the other players.

    • mystified5016 3 hours ago
      OpenAI will happily ignore robots.txt

      Or is that still my fault somehow?

      Maybe we should stop blaming people for "letting" themselves get destroyed and maybe put some blame on the people actively choosing to behave in a way that harms everyone else?

      But then again, they have so much money so we should all just bend over and take it, right?

      • peebee67 2 hours ago
        If they ignore a properly configured robots.txt and the licence also explicitly denies them use, then I'd guess they'd have a viable civil action to extract compensation. But that isn't the case here at all, and while there are reports of them doing so, they certainly claim to respect the convention.

        As for bending over: if you serve files, they request files, and you send them files, what exactly is the problem? That you didn't implement any kind of rate limiting? It's a web-based company, and these things are just the basics.

  • peterldowns 4 hours ago
    I have little sympathy for the company in this article. If you put your content on the web, and don't require authentication to access it, it's going to be crawled and scraped. Most of the time you're happy about this — you want search providers to index your content.

    It's one thing if a company ignores robots.txt and causes serious interference with the service, like Perplexity was, but the details here don't really add up: this company didn't have a robots.txt in place, and although the article mentions tens/hundreds of thousands of requests, they don't say anything about them being made unreasonably quickly.

    The default-public accessibility of information on the internet is a net-good for the technology ecosystem. Want to host things online? Learn how.

    EDIT: They're a very media-heavy website. Here's one of the product pages from their catalog: https://triplegangers.com/browse/scans/full-body/sara-liang-.... Each of the body-pose images is displayed at about 35x70px but is served as a 500x1000px image. It now seems like they have some Cloudflare caching in place, at least.

    I stand by my belief that unless we get some evidence that they were being scraped particularly aggressively, this is on them, and this is being blown out of proportion for publicity.

    • swatcoder 4 hours ago
      > I have little sympathy for the company in this article. If you put your content on the web, and don't require authentication to access it, it's going to be crawled and scraped. Most of the time you're happy about this — you want search providers to index your content.

      If I stock a Little Free Library at the end of my driveway, it's because I want people in the community to peruse and swap the books in a way that's intuitive to pretty much everyone who might encounter it.

      I shouldn't need to post a sign outside of it saying "Please don't just take all of these at once", and it'd be completely reasonable for me to feel frustrated if someone did misuse it -- regardless of whether the sign was posted or not.

    • dghlsakjg 3 hours ago
      There is nothing inherently illegal about filling a small store to occupancy capacity with all of your friends and never buying anything.

      Just because something is technically possible and not illegal does NOT make it the right thing to do.

      • riffraff 3 hours ago
        as the saying goes, "it's not illegal" is a very low bar for morality.
    • agmater 4 hours ago
      From the Wayback Machine [0] it seems they had a normal "open" set-up. They wanted to be indexed, but it's probably a fair concern that OpenAI isn't going to respect their image license. The article describes the robot.txt [sic] now "properly configured", but their solution was to block everything except Google, Bing, Yahoo, DuckDuckGo. That seems to be the smart thing these days, but it's a shame for any new search engines.

      [0] https://web.archive.org/web/20221206134212/https://www.tripl...

      • peterldowns 3 hours ago
        The argument about image/content licensing is, I think, distinct from the one about how scrapers should behave. I completely agree that big companies running scrapers should be good citizens — but people hosting content on the web need to do their part, too. Again, without any details on the timing, we have no idea if OpenAI made 100k requests in ten seconds or if they did it over the course of a day.

        Publicly publishing information for others to access and then complaining that ~1 rps takes your site down is not sympathetic. I don't know what the actual numbers and rates are because they weren't reported, but the fact that they weren't reported leads me to assume they're just trying to get some publicity.

        • dghlsakjg 1 hour ago
          > Publicly publishing information for others to access and then complaining that ~1 rps takes your site down is not sympathetic. I don't know what the actual numbers and rates are because they weren't reported, but the fact that they weren't reported leads me to assume they're just trying to get some publicity.

          They publicly published the site for their customers to browse, with the side benefit that curious people could also use the site in moderation since it wasn't affecting them in any real way. OpenAI isn't their customer, and their use is affecting them in terms of hosting costs and lost revenue from downtime.

          The obvious next step is to gate that data behind a login, and now we (the entire world) all have slightly less information at our fingertips because OpenAI did what they do.

          The point is that OpenAI, or anyone doing massive scraping ops, should know better by now. Sure, the small company that doesn't do web design had a single file misconfigured, but that shouldn't be a four- or five-figure mistake. OpenAI knows what bandwidth costs. There should be a mechanism that says: hey, we have asked for many gigabytes or terabytes of data from a single domain; that is a problem.

    • icehawk 2 hours ago
      > although the article mentions tens/hundreds of thousands of requests, they don't say anything about them being made unreasonably quickly.

      It's the first sentence of the article.

      > On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down.

      If a scraper is making enough requests to take someone else's website down, the scraper's requests are being made unreasonably quickly.

    • nitwit005 3 hours ago
      Let us flip this around: If your crawler regularly knocks websites offline, you've clearly done something wrong.

      There's no chance every single website in existence is going to have a flawless setup. That's guaranteed simply from the number of websites, and how old some of them are.

    • JohnMakin 4 hours ago
      robots.txt is, as of right now, a complete honor system, so I think it's reasonable to conclude that you shouldn't rely on it to protect you; odds are overwhelming that scraping behavior will only get worse in the near to mid term.
    • fzeroracer 4 hours ago
      > If you put your content on the web, and don't require authentication to access it, it's going to be crawled and scraped. Most of the time you're happy about this — you want search providers to index your content

      > The default-public accessibility of information on the internet is a net-good for the technology ecosystem. Want to host things online? Learn how.

      These two statements are at odds, I hope you realize. You say public accessibility of information is a good thing, while blaming someone for being effectively DDoS'd as a result of having said information public.

      • hd4 4 hours ago
        They're not at odds. "Default-public accessibility of information" doesn't necessarily translate into "default-public accessibility of content", i.e. media. Content should be served behind an authentication layer.

        The clickbaity hysteria here misses that this sort of scraping has been possible since long before AI agents showed up a couple of years back.

        • macintux 2 hours ago
          Of course it was possible, but the incentives have changed. Now anyone can use the accumulated knowledge of the world to build something new, so more independent actors are doing so, often very badly.
    • j45 4 hours ago
    It's less about sympathy and more about understanding that they might not be experts in tech matters, relied on hired help that seemed to be good at what they did, and the most basic thing (setting up a free Cloudflare account or something) was missed.

    Learning how is sometimes actually learning who's going to get you online in a good way.

    In this case, when you have non-tech people building WordPress sites, it's about what they can understand and do, and the rate of learning doesn't always keep up relative to client work.