Aggressive bots ruined my weekend

(herman.bearblog.dev)

198 points | by shaunpud 1 day ago

14 comments

asplake 1 day ago
> What's wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I'm still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.
Wild indeed, and potentially horrific for the owners of the affected devices also! Any corroboration for that out there?
- VladVladikoff 1 day ago
  This is actually a commonly known fact. There are many services now that sell “residential proxies”, which are always mobile IP addresses. Since mobile IPs use CGNAT it’s also not great to block the IP because it can be like geofencing an entire city or town. Some examples are: oxylabs, iproyal, brightdata, etc.
  Recently I filed an abuse complaint directly with brightdata because I was getting hit with 1000s of requests from their bots. The funny part is they didn’t even stop after acknowledging the complaint.
  - corbet 1 day ago
    The "compliance officer" at Bright Data, instead, offered me a special deal to protect my site from their bots ... they run a protection racket along with all the rest of their nastiness.
    - nurettin 1 day ago
      I worked for an Amazon scraping business and they used Luminati (Now Brightdata) for a few months until I figured out a way to avoid the ban hammer and got rid of their proxy.
      They indeed provided "high quality" residential and cellular ips and "normal quality" data center ips. You had to keep cycling the ip pool every 2-3 days which cost extra. It felt super shady. It isn't their bots, they lease connections to whoever is paying, and they don't care what people do in there.
      - yomismoaqui 23 hours ago
        > ... until I figured out a way to avoid the ban hammer ...
        You had my curiosity ... but now you have my attention.
        walletdrainer 3 hours ago
        Without bothering to check on Amazon, I successfully scraped meta stuff for years at rates exceeding 20gbit/s without any proxies but just rotating IPv6 addresses on the same couple of blocks for every request
        There are usually silly bypasses like this that easily work even with bigco stuff
  - dataviz1000 1 day ago
    They provide an SDK for mobile developers. Here is a video of how it works. [0]
    [0] https://www.youtube.com/watch?v=1a9HLrwvUO4&t=15s
    - TYPE_FASTER 1 day ago
      Also see https://www.youtube.com/watch?v=AGaiVApKfmc - "Avoid restrictions and blocks using the fastest and most stable proxy network"...they're pretty upfront with this, aren't they?
      Oh, and they will sell you the datasets they've already scraped using mobile devices: https://brightdata.com/lp/web-data/datasets
      This actually explains a phishing attack where I received a text from somebody purporting to be a co-worker asking for an Apple gift card. The name was indeed an employee from a different part of the large company I worked for at the time, but LinkedIn was the only possible link I could figure out that was at least somewhat publicly available information.
      This should probably be required in every CS curriculum: https://ocw.mit.edu/courses/res-tll-008-social-and-ethical-r...
      - VladVladikoff 21 hours ago
        That scam definitely uses LinkedIn as the source. We get a lot of those BEC emails and it’s always the people who are on LinkedIn. Also keep in mind LinkedIn has had big database leaks in the past, so you might not even need to scrape them, just download a huge database from a leaks site.
      - nerdponx 23 hours ago
        It should be illegal, but this stuff is propping up the appearance of a healthy economy so nobody will touch it.
    - cuu508 1 day ago
      IMO Google Play should check apps for presence of this SDK and other similar SDKs, and, upon detection, treat these apps as malware.
      - VladVladikoff 21 hours ago
        I was wondering if they already do but maybe it’s a cat and mouse game where those companies obfuscate their code to avoid automated detection.
    - VladVladikoff 1 day ago
      WOW that video! Ain’t no way anyone has EVER read those terms. This feels so insidious that it really should be illegal. Wonder if this exists in the EU or if they have shut it down already?
      - arethuza 1 day ago
        That video has the app asking the user to confirm the use of their device to run a proxy within the app, but is there any hard requirement for this? Could apps use this SDK and silently run as a proxy?
        alamortsubite 23 hours ago
        My take is it's mostly irrelevant, but read the lobsters post mentioned elsewhere.
      - alamortsubite 23 hours ago
        Yes, and it doesn't matter if they do read the terms; to the average user they sound totally innocuous, especially placed next to a big shiny "GET 500 FREE COINS" button.
    - seemaze 1 day ago
      That's sleazy. It's like slipping drugs into a kid's lunchbox and letting them smuggle it across the border.
    - arethuza 1 day ago
      I suspect most people, even when told exactly what the app using that SDK would be doing, wouldn't actually see the potential problems...
      - kijin 1 day ago
        Until one day, they get swatted for accessing child porn.
        Actually, that might be one way to draw attention to the problem. Sign up to some of these shady "residential proxy" services, and access all sorts of nasty stuff through their IPs until your favorite three-letter agency takes notice.
  - myaccountonhn 1 day ago
    One such example is Bright Data; someone did a writeup on Lobsters:
    https://lobste.rs/s/pmfuza/bro_ban_me_at_ip_level_if_you_don...
    - VladVladikoff 21 hours ago
      Never heard of lobsters before. Cool site. Seems to be invite only though :( If you could share an invite that would be cool. torosanchez@protonmail.me Thanks!
  - wat10000 1 day ago
    Lately Reddit has been showing me posts in subreddits for some of these services. They pitch "passive income" by sharing your connection, an easy way to make a few bucks by renting out your unused capacity. What happens is that you become an endpoint for their shady VPNs. These subreddits are full of people complaining that they're getting hit by abuse complaints from their ISPs. Naturally, these services claim to forbid any nefarious activity, and naturally they don't actually care.
    - nemomarx 1 day ago
      Salad, right? What a strange business
      - dylan604 1 day ago
        Why is it strange? Of course it exists.
- lucastech 53 minutes ago
  I wrote about this back in July when this "gang" first started hitting some sites I host: https://wxp.io/blog/the-bots-that-keep-on-giving
  They use a mixture of colo (M247, Datacamp, HostRoyale, Oxylabs, etc) and international residential. I suspect the latter are where those residential app proxies come into play (bright SDK, etc). Oxylabs is also a well-known proxy provider, which makes me think they're the gateway into all of these IPs.
  Definitely interesting times to try and host a web server!
- kaoD 1 day ago
  There's crap like https://hola.org/
  https://hola.org/legal/sdk
  https://hola.org/legal/sla
  > How is it free?
  >
  > In return for free usage of Hola Free VPN Proxy, Hola Fake GPS location and Hola Video Accelerator, you may be a peer on the Bright Data network. By doing so you agree to have read and accepted the terms of service of the Bright Data SDK SLA (https://bright-sdk.com/eula). You may opt out by becoming a Premium user.
  This "VPN" is what powers these residential proxies: https://brightdata.com/
  I'm sure there are many other companies like this.
  - piggg 1 day ago
    There's also a ton of companies selling "make money off your unused internet" apps which are all over TikTok and basically turn you into a residential proxy/sketchy VPN egress node.
    On top of that, there's lots of free TV/movie streaming stuff that also makes you a proxy/egress node. Sometimes you find it on TV/movie streaming devices sold online, where it's already loaded when the device arrives.
- curious_curios 1 day ago
  If you have a moderately successful app, sdk or browser extension you will get hit up to add things to it like this. I think most free VPN services also lease out your bandwidth to make their money as well.
  - antoniojtorres 23 hours ago
    This is how so many companies sell from an opaque inventory of “millions” of residential proxies.
- Zanfa 1 day ago
  SIM farms are another possible explanation. FBI just busted one with hundreds of thousands of SIMs just a few weeks ago.
  - Cthulhu_ 23 hours ago
    Wouldn't the network providers be able to detect those? I'm fairly sure they don't like their networks being abused either... or they don't really care because they get paid per connection.
    edit: Actually this is what I'm getting increasingly angry about: providers and platforms not doing anything against bots or low value stuff (think Amazon dropshippers too) because any usage of their service, bots or otherwise, means metrics going up, and metrics going brrt means profit and shareholder interest.
    - ac29 22 hours ago
      It's very possible they did detect it and that's why law enforcement got involved.
      But yes, they also might not care if they are getting paid. If the SIMs are only being used for voice/text as I suspect, it might have very minimal load on the network.
- immibis 1 day ago
  You can get paid a few dollars (not many) to let them use your connection. I would like Cloudflare's business model (blocking datacenter IPs) to be worthless, so I do it. Haven't tried a withdrawal yet so it could well be a scam. This is not illegal (unless it's a scam).
  - VBprogrammer 1 day ago
    If someone hasn't written a blog titled "Should we be worried about Cloudflare?" yet, I think it would be a good subject to explore. I find the idea that they could decide one day to ban you from all of their network pretty worrying. And if they did, how much fingerprinting are they doing, and would the ban extend far beyond just a random IP address?
  - nix0n 23 hours ago
    > This is not illegal
    Depends on what they're doing from your connection.
    - immibis 4 hours ago
      Strict liability by IP address is not the norm, not even in Germany any more. It's not illegal to have a botnet infect your computer either. Since they promise not to use your connection for illegal things, it's their fault if they break that.
  - pjc50 1 day ago
    This is one of those "ACAB" things where you might reasonably dislike Cloudflare but a world without them or an equivalent will evolve worse solutions to the same problems, which you will like even less.
ItsBob 1 day ago
I had a website earlier this year running on Hetzner. It was purely experimenting with some ASP.NET stuff but when looking at the logs, I noticed a shit-load of attempts at various WordPress-related endpoints.
I then read something about a guy who deliberately put a honeypot in his robots.txt file. It was pointing to a completely bogus endpoint. Now, the theory was, humans won't read robots.txt so there's no danger, but bots and the like will often read robots.txt (at least to figure out what you have... they'll ignore the "deny" for the most part!) and if they try and go to that fake endpoint you can be 100% sure (well, as close as possible) that it's not a human and you can ban them.
So I tried that.
I auto-generated a robots.txt file on the fly. It was cached for 60 seconds or so as I didn't want to expend too many resources on it. When you asked for it, you either got the cached one or I created a new one. The CPU-usage was negligible.
I changed the "deny" endpoint each time I built the file in case the baddies cached it, but every variant still went to the same ASP.NET controller method. If you hit it, I sent you a 10GB zip bomb and your IP was automatically added to the firewall block list.
It was quite simple: anyone that hit that endpoint MUST be dodgy... I believe I even had comments for the humans that stumbled across it letting them know that if they went to this endpoint in their browser it was an automatic addition to the firewall blocklist.
Anyway... at first I caught a shit load of bad guys. There were thousands at first and then the numbers dropped and dropped to only tens per day.
Anyway, this is a single data point but for me, it worked... I have no regrets about the zip bomb either :)
I have another site that I'm working on so I may evolve it a bit so that you are banned for a short time and if you come back to the dodgy endpoint then I know you're a bot so into the abyss with you!
It's not perfect but it worked for me anyway.
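A minimal Python sketch of the rotating robots.txt trap described above. It is an illustration under assumptions, not the original ASP.NET code: `RobotsHoneypot`, `TRAP_PREFIX` and the 403/200 status codes are hypothetical choices, the blocklist is an in-memory set rather than a firewall, and the zip bomb is deliberately left out.

```python
import secrets
import time

TRAP_PREFIX = "/trap-"   # hypothetical prefix; every trap URL maps to one handler
CACHE_SECONDS = 60       # regenerate robots.txt at most once a minute


class RobotsHoneypot:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._cached_body = None
        self._cached_at = 0.0
        self.blocked_ips = set()

    def robots_txt(self):
        """Return robots.txt, minting a fresh trap path when the cache expires."""
        now = self._clock()
        if self._cached_body is None or now - self._cached_at >= CACHE_SECONDS:
            trap = TRAP_PREFIX + secrets.token_hex(8)
            self._cached_body = (
                "# Humans: requesting the disallowed path below blocks your IP.\n"
                "User-agent: *\n"
                f"Disallow: {trap}\n"
            )
            self._cached_at = now
        return self._cached_body

    def handle(self, ip, path):
        """Return an HTTP status code for a request from `ip` to `path`."""
        if ip in self.blocked_ips:
            return 403
        if path.startswith(TRAP_PREFIX):
            # Old and new trap paths share the prefix, so cached copies still trip it.
            self.blocked_ips.add(ip)
            return 403
        return 200
```

Matching on the prefix rather than the exact path is what lets a trap URL that was only advertised for a minute keep catching clients that cached it.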
- bob1029 1 day ago
  > It's not perfect but it worked for me anyway.
  This is approximately my approach minus the zip bomb. I use a piece of middleware in my AspNetCore pipeline that tracks logical resource consumption rates per IPv4. If a client trips any of the limits, their IP goes into a HashSet for a period of time. If a client has an IP in this set, they get a simple UTF8 constant string in the response body "You have exceeded resource limits, please try again later".
  The other aspect of my strategy is to use AspNetCore (Kestrel). It is so fast that you can mostly ignore the noise as long as things are configured properly and you make reasonable attempts to address the edge case of an asshole trying to break your particular system on purpose. A HashSet<int> as the very first piece of middleware rejecting bad clients is exceedingly efficient. We aren't even into URL routing at this point.
  I have found that attempting to catalog and record all of the naughty behavior my web server sees is the biggest DDoS risk so far. Logging lines like "banned client rejected" every time they try to come in the door is shooting yourself in the foot with regard to disk wear, IO utilization, etc. There is no reason you should be logging all of that background radiation to disk or even thinking about it. If your web server can't handle direct exposure to the hard vacuum of space, it can be placed behind a proxy/CDN (i.e., another web server that doesn't suck).
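  A rough Python sketch of that kind of gate, for readers who don't use ASP.NET Core. The names (`RateGate`, `allow`) and the fixed-window counter are assumptions made for illustration; the comment does not say how the real middleware measures consumption. IPv4 addresses are stored as integers, mirroring the `HashSet<int>` idea, and a dict with expiry times stands in for the timed set.

```python
import ipaddress
import time


class RateGate:
    """Reject clients that exceed `limit` requests per `window` seconds."""

    def __init__(self, limit=100, window=60.0, ban_seconds=600.0,
                 clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.ban_seconds = ban_seconds
        self._clock = clock
        self._counts = {}   # ip as int -> (window start, request count)
        self._banned = {}   # ip as int -> time the ban expires

    def allow(self, ip):
        key = int(ipaddress.IPv4Address(ip))
        now = self._clock()

        # Cheapest check first: a banned client costs one dict lookup.
        expiry = self._banned.get(key)
        if expiry is not None:
            if now < expiry:
                return False
            del self._banned[key]

        start, count = self._counts.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0   # start a new counting window
        count += 1
        self._counts[key] = (start, count)

        if count > self.limit:
            self._banned[key] = now + self.ban_seconds
            del self._counts[key]
            return False
        return True
```

  Nothing is logged on rejection, in keeping with the point above about not writing background noise to disk.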
  - marcosdumay 23 hours ago
    > they get a simple UTF8 constant string in the response body "You have exceeded resource limits, please try again later"
    I imagine they get a 429 response code, but if they don't, you may want to change that.
    I do think you are in the right place in that it's important to let those requests get the correct error, so if innocent people are affected, they at least get to see there's something wrong.
  - ItsBob 1 day ago
    > If a client has an IP in this set, they get a simple UTF8 constant string in the response body "You have exceeded resource limits, please try again later".
    Would a simple 429 not do the same thing? You could log repeated 429s and banish accordingly.
    - immibis 22 hours ago
      Both are important - the response code for well-behaved machines, as many tools intrinsically know that 429 means to slow down (also send a Retry-After header if you want more customization), and the text message for humans, as they don't see the response code and would otherwise see a blank page.
      Reddit is guilty of sending 429 with no message - try browsing it through Tor and you'll see.
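      A small Python sketch of a response that does both. The function name and return shape are hypothetical and not tied to any framework; the message text is the one quoted earlier in the thread.

```python
from http import HTTPStatus


def too_many_requests(retry_after_seconds):
    """Build a 429 response as (status, headers, body)."""
    # Human-readable body so a browser user does not see a blank page.
    body = b"You have exceeded resource limits, please try again later.\n"
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        # Machine-readable hint for well-behaved clients.
        "Retry-After": str(int(retry_after_seconds)),
    }
    return HTTPStatus.TOO_MANY_REQUESTS.value, headers, body
```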
- psnehanshu 1 day ago
  What if it was proxied through a mobile network on an unsuspecting user's phone? You risk blocking a whole city or region.
  - ItsBob 1 day ago
    I admit, my approach was rather nuclear but it worked at the time.
    I think an evolution would be to use some sort of exponential backoff, e.g. first time offenders get banned for an hour, second time is 4 hours, third time and you're sent into the abyss!
    Still crude but fun to play about with.
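    A sketch of that ladder in Python, assuming time is tracked in hours and offences are counted per IP in memory. `BanLadder` and `record_offence` are made-up names for illustration.

```python
import math

# Ban lengths in hours for the first and second offence, as suggested above;
# any later offence is treated as permanent.
BAN_HOURS = (1, 4)


class BanLadder:
    def __init__(self):
        self._offences = {}   # ip -> number of offences seen so far

    def record_offence(self, ip, now_hours):
        """Count one offence and return the time (in hours) the ban ends."""
        count = self._offences.get(ip, 0) + 1
        self._offences[ip] = count
        if count <= len(BAN_HOURS):
            return now_hours + BAN_HOURS[count - 1]
        return math.inf   # the abyss
```

    Temporary bans for early offences also soften the CGNAT problem raised above, since a shared address recovers after an hour.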
- immibis 1 day ago
  It's interesting to study, right? This is the Internet equivalent of background radiation. Harmless in most cases. Exploit scanners aren't new to the LLM age and shouldn't overload your server - unless you're vulnerable to the exploit.
  Fun fact: Some people learn about new exploits by watching their incoming requests.
  - ItsBob 1 day ago
    > It's interesting to study, right?
    Definitely! I wasn't experiencing any issues, hell it wasn't even for public consumption at that time so no great loss to me but I found a few things fascinating (and somewhat stupid!) about it:
    1. The sheer number of automated requests to scrape my content
    2. That a massive number of the bots openly had "bot" or some derivative in the user agent and they were accessing a page I'd explicitly denied! :D
    3. That an equally large number were faking their user agents to look like regular users and still hitting a page that a regular user couldn't possibly ever hit!
    Something I did notice but it was towards the end and I didn't pursue it (I should log it better the next time for analysis!) was that the endpoint was dynamically generated and only existed in the robots.txt for a short time but there were bots I caught later on, long after that auto-generated page was created (and after the IP was banned) that still went for that same page: clearly the same entities!
    My spidey senses are tingling. Next time, I'm going to log the shit out of these requests and publish as much as I can for others to analyse and dissect... might be interesting.
cupofjoakim 1 day ago
We feel this at work too. We run a book streaming platform with all books, booklists, authors, narrators and publishers available as standalone web pages for SEO, in the multiple millions. Last 6 months have turned into a hellscape - for a few reasons:
1. It's become commonplace to not respect rate limits
2. Bots no longer identify themselves by UA
3. Bots use VPNs or similar tech to bypass IP rate limiting
4. Bots use tools like NobleTLS or JA3Cloak to go around JA3 rate limiting
5. Some valid LLM companies seem to also follow the above to gather training data. We want them to know about our company, so we don't necessarily want to block them
I'm close to giving up on this front tbh. There are no longer safe methods of identifying malignant traffic at scale, and with the variations we have available we can't statically generate these. Even with a CDN cache (shoutout Fastly) our catalog is simply too broad to fully saturate the cache while still allowing pages to be updated in a timely manner.
I guess the solution is to just scale up the origin servers... /shrug
In all seriousness, I'd love it if we could somehow tell the bots about more efficient ways of fetching the data. Use our open API for fetching book information instead of causing all that overhead by going to marketing pages, please.
- FeepingCreature 1 day ago
  In principle, it should be possible to identify malign IPs at scale by using a central service and reporting IPs probabilistically. That is, if you report every thousandth page hit with a simple UDP packet, the central tracker gets very low load and still enough data to publish a Bloom filter of abusive IPs; a million bits, say, gives you a pretty low false-positive rate. (If it's only ~10k malign IPs, tbh you can just keep an LRU counter and enumerate all of them.) A billion hits per hour across the tracked sites would still only correspond to ~50KB/s inflow on the tracker service. Any individual participating site doesn't necessarily get many hits per source IP, but aggregating across a few dozen should highlight the bad actors. Then the clients just pull the Bloom filter once an hour (80KB download) and drop requests that match.
  Any halfway modern LLM could probably code the backend for this in a day or two and it'd run on a RasPi. Some org just has to take charge and provide the infra and advertisement.
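  To make the numbers concrete, here is a small Python sketch: the sampling arithmetic, the standard approximation for a Bloom filter's false-positive rate, and a toy filter. A billion hits per hour sampled at one in a thousand is about 278 reports per second, so ~50KB/s implies roughly 180 bytes per report on the wire. The class and function names are illustrative, and deriving the bit positions from one SHA-256 digest by double hashing is an assumption of this sketch, not something the proposal specifies.

```python
import hashlib
import math


def reports_per_second(hits_per_hour, sample_every):
    """Average report rate when one in `sample_every` hits is reported."""
    return hits_per_hour / sample_every / 3600


def false_positive_rate(num_bits, num_hashes, num_items):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

  A participating site would call `ip in bloom` before doing any real work and drop the request on a match, accepting the small false-positive rate.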
  - pixl97 1 day ago
    >malign IPs at scale
    As talked about elsewhere in this thread, residential devices being used as proxies behind CGNAT ruins this. Not getting rid of IPv4 years ago is finally coming to bite us in the ass in a big way.
    - codersfocus 23 hours ago
      IPv6 wouldn't solve this, since IPs would be too cheap to meter.
  - 01HNNWZ0MV43FF 1 day ago
    The hard part is the trust, not the technology. Everyone has to trust that everyone else is not putting bogus data into that database to hurt someone else.
    It's mathematically similar to the "Shinigami Eyes" browser plug-in and database, which has been found to have unreliable data
    - FeepingCreature 1 day ago
      Personally talk to every individual participating company. Provide an endpoint that hands out a per-client hash that rotates every hour, stick it in the UDP packet, whitelist query IPs. If somebody reports spam, no problem, just clear the hash and rebuild, it's not like historic data is important here. You can even (one more hour of vibecoding) track convergence by checking how many bits of reported IPs match the existing (decaying) hash; this lets you spot outlier reporters. If somebody always reports a ton of IPs that nobody else is, they're probably a bad actor. Hell, put a ten dollar monthly fee on it, that'll already exclude 90% of trolls.
      I'm pretty pro AI, but these incompetent assholes ruin it for everybody.
- Neil44 1 day ago
  Same, I have a few hundred WordPress sites and bot activity has ramped up a lot over the last year or two. AI scrapers can be quite aggressive and often generate a ton of requests; where a site has a lot of URL parameters, for example, the bot will go nuts, seemingly iterating through all possible combinations. Sometimes I dig in and try to think of new rules to block the bulk, but I am also wary of AI replacing Google and not being in AI's databases.
  - karlshea 1 day ago
    A client of mine had this exact problem with faceted search, and putting the site behind Fastly didn’t help since you can’t cache millions of combinations. And they don’t have the budget for more than one origin server. The solution was if you’ve got “bot” in your UA Fastly’s VCL returns a 403 with any facet query param. Problem solved. And it’s not going to break anything, all of the information is still accessible to all of the indexers on the actual product pages.
    The facet links already had “nofollow” on them, now I’m just enforcing it.
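    The rule itself is tiny. Here it is sketched in Python rather than Fastly VCL, with made-up facet parameter names (`color`, `size`, `brand`) standing in for whatever the real site uses:

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical facet parameter names; a real site would list its own.
FACET_PARAMS = {"color", "size", "brand"}


def status_for(url, user_agent):
    """Return 403 for self-declared bots requesting faceted URLs, else 200."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    has_facet = any(name in FACET_PARAMS for name in query)
    if has_facet and "bot" in user_agent.lower():
        return 403
    return 200
```

    This only stops crawlers that announce themselves, so it enforces "nofollow" for honest bots and does nothing about ones that fake a browser user agent.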
    - wiredfool 19 hours ago
      I see a ton of random recent semi reasonable user agents now, and some of them are even sending the sec-ua, reasonable accept headers and the more obscure headers.
  - immibis 22 hours ago
    > Sometimes I dig in and try to think of new rules to block the bulk, but I am also wary of AI replacing Google and not being in AI's databases.
    Fake the data! Tell them Neil44 is a three-time Nobel prize winner, etc. But only when the client is detected to be an AI crawler.
- jrochkind1 1 day ago
  I hate relying on a proprietary single-source product from a company I don't particularly trust, but (free) Cloudflare Turnstile works for me, only thing I've found that does.
  I only protect certain 'dangerous/expensive' (accidentally honeypot-like) paths in my app, and can leave the stuff I actually want crawlers to get, and in my app that's sufficient.
  It's a tension because yeah, I want crawlers to get much of my stuff for SEO (and I don't want to give Google a monopoly on it either; I want well-behaved crawlers I've never heard of to have access to it too), but not at the cost of resources I can't afford.
Havoc 5 hours ago
> monetise their apps by offering them for free, and selling tunnel access to scrapers.
I bet it’s free VPN apps
pingoo101010 1 day ago
You may want to take a look at Pingoo (https://github.com/pingooio/pingoo), a reverse proxy with automatic TLS that can also block bots with advanced rules that go beyond simple IP blocking.
y-zon128 1 day ago
> Auto-restart the reverse-proxy if bandwidth usage drops to zero for more than 2 minutes
It's understandable in your case as you have traffic coming in constantly, but the first thing that came to my mind is a loop of constant reboots - again, very unlikely in your case. Sometimes such blanket rules hit me for the most unexpected reasons, like the proxy somehow failing to start serving traffic in the given timeframe.
Though I completely appreciate and agree with the 'ship now something that works now' approach!
- marginalia_nu 22 hours ago
  Open port 80 and log the results. Even if you don't announce your service anywhere the baseline request rate is way above 2 rpm.
  Every open port of every IP is continuously scanned for exploits.
r_singh 1 day ago
The Internet isn’t possible without scraping. For all the sentiment against scraping public data, doing so remains legal and essential to a lot of the services we use every day. I think setting guidelines and shaping the web for reduced friction aimed at fair usage, rather than turning it political, would be the right thing to do.
- karlshea 1 day ago
  There were already guidelines, these trash people aren’t following them. That’s why there’s now “sentiment” against them.
  - r_singh 1 day ago
    It’s fair to be angry at abuse and "aggressive bots", but it's important to remember most large platforms—including the ones being scraped—built their own products on scraping too.
    I run an e-commerce-specific scraping API that helps developers access SERP, PDP, and reviews data. I've noticed the web already has unsaid balances: certain traffic patterns and techniques are tolerated, others clearly aren’t. Most sites handle reasonable, well-behaved crawlers just fine.
    Platforms claim ownership of UGC and public data through dark patterns and narrative control. The current guidelines are a result of supplier convenience, and there are several cases where absolutely fundamental web services run by the largest companies in the world themselves breach those guidelines (including those funded by the fund running this site). We need standards that treat public data as a shared resource with predictable, ethical access for everyone, not just for those with scale or lobbying power.
    - karlshea 18 hours ago
      If you’re running a well-behaved crawler (for example one that respects nofollow, and doesn’t try every single product filter combination it can find) then fine. If you don’t, then I don’t have any sympathy for the consequences that your niche of the industry caused.
      Not everyone has the budget for unlimited bandwidth and compute, and in several of my clients’ cases that’s been >95% of all traffic.
      People running these bots with AI/VC capital are just script kiddies that forgot that not every site is a boatload of app servers behind Cloudflare.
      - r_singh 18 hours ago
        My service only extracts public data from major retailers, not indie sites, and deducts more credits for lower-traffic domains to offset load differences.
        It would be great if there were reliable ways to distinguish good bots from bad ones — many actually improve discoverability and sales. I see this with affiliate shopping sites that depend on e-commerce data, though that impact is hard to trace directly.
        The bad actors are the ones cloning sites or using data for manipulation and propaganda.
- Cthulhu_ 23 hours ago
  Well sure, but these guidelines exist; the robots.txt guidelines have been an industry-led, self-governing / self-restrictive standard. But newer bots ignore them. It'll take years for legislation to catch up, and even then it would be by country or region, not something global because that's not how the internet works.
  Even if there is legislation or whatever, you can sue an OpenAI or a Microsoft, but starting a new company that does scraping and sells it on to the highest bidder is trivial.
  - r_singh 23 hours ago
    As the legal history around scraping shows, it’s almost always the smaller company that gets sued out of existence. Taking on OpenAI or Microsoft, as you suggest, isn’t realistic — even governments often struggle to hold them accountable.
    And for the record, large companies regularly ignore robots.txt themselves: LinkedIn, Google, OpenAI, and plenty of others.
    The reality is that it’s the big players who behave like the aggressors, shaping the rules and breaking them when convenient. Smaller developers aren’t the problem, they’re just easier to punish.
- intended 1 day ago
  What? What do you mean?
  - ac29 22 hours ago
    As posted in another comment, they run a scraping API. I think their opinion is at least slightly biased.
  - georgefrowny 23 hours ago
    To be fair the heyday of unshit search was driven by mostly-consensual scraping.
    Today there are far too many people scraping stuff that isn't intended to be scraped, for profit, and doing it in a heavy-handed way that actually does have a negative and continuous effect on the victim's capacity.
    Everyone from AI services too lazy or otherwise unwilling to cache to companies exfiltrating some kind of data for their own commercial purposes.
    - r_singh 23 hours ago
      With peering bandwidth being freely distributed to ISPs, and consumers being fed media and subsidised services up to their necks, the counter-argument smells of narrative control rather than technical or financial constraints.
      But as I’m growing older I’m learning that the tech industry is mostly politically driven and relies on truth obfuscation, as explained by Peter Thiel, rather than real empowerment.
      It’s facilitating accumulation of control and power at an unparalleled pace. If anything it’s proving to be more unjust than the feudal systems it promises to replace.
      - r_singh 19 hours ago
        I may have been too harsh. I love capitalism, technology, and software—they’ve built a meritocratic world and given me the tools to build my own life.
        AI and technology feel like my best friend, but also my worst enemy when they edge toward learned helplessness. That tension exists with anything we depend on: the closer we get, the more power it holds.
        The relationship between user and technology is becoming deeply intimate as systems gain reach and control. It’s important to stay optimistic but skeptical—and to keep protesting everything—because the work is moving faster than our ability to register its consequences.
        Reading back, I realise I drifted into more of a monologue than a conversation. I get carried away when I’m trying to reason things out in public. Still, I stand by the core point about balance and transparency in how we shape the web.
uvaursi 1 day ago
Do we shift over everything to le Dark Web and let the corpos use this one for selling their shit to consumers? These toys don’t want to play nice and there’s no real way to stop them without bringing in things like Real ID and other verifications that infringe on anonymity.
- Chabsff 1 day ago
  The bots flock to where the data is. Moving to a different network is just begging for the bots to tag along.
  - uvaursi 23 hours ago
    We can set different rules on these networks however. We can choose to be choosy at the gate.
    - Cthulhu_ 23 hours ago
      How though? Bots evade bot detection mechanisms, as described in other threads. Unless you introduce something like ID verification or pay-per-request making bot traffic too expensive for the bots. But these techniques have been posited for older generations of bot traffic too.
      - embedding-shape 23 hours ago
        Make it P2P and content-based, instead of location-based like the current web. Content could be served from anywhere, so DDoS stops being an effective method, and shared peer quality could propagate across the network to ban bad actors quickly.
        I spent about 30 seconds thinking about this, so this is clearly the perfect solution with zero drawbacks or tradeoffs.
        [-]
        Chabsff 23 hours ago
        I know this is tongue in cheek, but I'll give it a serious reply in case someone finds themselves "inspired".
        A CDN. What you are describing is a CDN. We have CDNs today and the problem still exists because most of today's websites refuse to operate within the constraints. There is no need for new infrastructure to deploy this solution, we just need website operators to "give up" and operate in a more static way.
        [-]
        embedding-shape 23 hours ago
        Nope, today's CDNs are all location-based, not content-based. The user-agents are requesting content based on the address (the URI), not based on the content-hash of the content. The CDNs might work content-based internally, but the user-facing web CDNs definitely are URL based, not content-hash based.
        [-]
        Chabsff 22 hours ago
        You might have a point if this was a user problem, but it's not. It's a site operator problem. And from their POV, a CDN provides the exact same functionality.
        [-]
        embedding-shape 22 hours ago
        CDNs are a hack, not a solution. Content-addressing would be a solution, not a hack. Whether you take the PoV of operators or users doesn't matter; DDoS wouldn't be viable with content-addressing.
      - happysadpanda2 22 hours ago
        an evolution of the gpg web of trust could possibly be part of it
kwa32 1 day ago
crazy how scraping became an industry
[-]
- npteljes 1 day ago
  It's wild. Data is very valuable. This manifests on two fronts simultaneously: whoever has the data heavily controls who sees it and under what circumstances, and on the other side, they scrape it as hard as they can.
kelvinjps10 1 day ago
Maybe moving the blog service to completely static and letting Cloudflare Pages handle it could help?
[-]
- reustle 1 day ago
  Cloudflare is not a solution. It only leads to a further centralized internet.
  [-]
  - Ferret7446 10 hours ago
    There is no better solution. DoS is fundamentally not preventable, whether in the digital realm or the physical. The only thing you can do is out-brute force the DoS. Hence Cloudflare. Hence why everything naturally centralizes to some extent (we need some word like carcinization for centralization).
  - uvaursi 1 day ago
    I think the OP is suggesting using a caching layer at HTTP output, and suggesting CF as an option (a quick/cheap one).
    If you have an axe to grind with CF you can take it up with them, but it’s an option. Feel free to suggest others.
    [-]
    - LilBytes 19 hours ago
      It seems to me that the writer, Herman, doesn't want to use the cloud short of the parts that are almost mandatory these days (e.g., Cloudflare, or a CDN of some sort).
      GitHub Pages, Cloudflare Pages, etc. are great and very simple services. But they're the opposite of running your own hardware, warts and all.
      Herman wasn't looking for solutions IMO, I read it more as him lamenting at how hostile and insidious the internet has become. It has been for some time, but it seems to be getting exponentially worse.
      [-]
      - uvaursi 13 hours ago
        Remember when Stormfront was a thing? Remember how everyone cheered when CF and others banned them? I’m sure Herman was one of the loudest ones to proclaim victory that day.
        Hostility and insidiousness were created by you for not standing up when it was most needed and called for. And as you can see, Stormfront in hindsight was the most milquetoast website compared to the landscape of politics today. All I’m saying is CF is a viable caching option. But if you’re looking for morality support - you lost that war a long time ago.
        “If we do not stand up for the rights of the accused, we endanger our own. For when the tide turns, who will stand up for us?”
2OEH8eoCRo0 1 day ago
Why don't we sue the abusive scrapers? Scraping is legal but DDoSing is not!
[-]
- cupofjoakim 1 day ago
  Not sure if that's satire or not, but how would you even identify the party to sue? What do you do if they're based in a country where you can't sue them over matters as relatively trivial as this?
  [-]
  - 2OEH8eoCRo0 1 day ago
    Not satire but it's a huge problem with the internet. Everyone washes their hands and people can harm you without liability.
deepstateisfbi 1 day ago
[flagged]
TimorousBestie 1 day ago
I think he should consider getting out of the indie blog hosting business. It’s only going to get worse as the internet continues to decay and he can’t be making all that much off the service.
[-]
- sudosays 1 day ago
  Counterpoint: I think he should stay and fight the good fight.
  Indie blog businesses are great for the health of the human internet, and I don't think surrendering preemptively will help things get better.
  [-]
  - add-sub-mul-div 1 day ago
    That's an easy thing to say if it's someone else's time that's being wasted and not your own. But there may not be a path back to the internet under which this project was conceived.
    It could be like staying on Twitter and Reddit after their respective declines. You're only suffering an opportunity cost for your own time and preventing the internet from evolving better alternatives.
- vpShane 1 day ago
  No way. People deserve expression and to have a place that's THEIRS where they can foster a community. Much is learned. Playing battle bots is fun at the sysadmin level (for me), maybe not so much for others, but to have a place where people express themselves, and have THEIR place outside of the walled gardens such as social media, AND they protect it from the bots?
  That's the battle, and expression, people, their interests, and their communities are worth fighting for. _ESPECIALLY_ in this day and age where botnets/scrapers are using things such as Infatica to mask themselves as residential IP addresses, and mimicking human behaviors to better avoid bot detection.
  There's a war on authenticity, people's authentic works, and the reverse: determining if a user is authentic nowadays.
- flaviuspopan 1 day ago
  His persistent efforts are the reason I pay for Bear Blog. I think he should fight for the chance to come out on the other side of whatever future we’re heading towards.
  [-]
  - TimorousBestie 1 day ago
    I pay for Bear Blog, too. But this year has been problem after problem for its sole proprietor, and I don’t think it’s going to get better.
    [-]
    - flaviuspopan 44 minutes ago
      Even AWS, GCP, and Azure have had major outages in the past few months. Seems to be growing pains of the new era of the web; nowhere is truly safe.
    - josephb 18 hours ago
      As much as I love the service and enjoy the transparency, I do wonder about the future of smaller operator services as the Internet continues its descent into a giant mess :-)