Cloudflare is known to use fingerprinting to detect scrapers. For example, they use JA3 fingerprints and match them against the UA to block stuff like cURL while allowing OkHttp (Android clients) - but this can easily be spoofed with packages such as CycleTLS [1].
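For anyone unfamiliar with JA3: the fingerprint is an MD5 hash over five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats) joined in a fixed order. A minimal sketch of that idea follows. The field values and the helper names `ja3_string` and `ja3_hash` are made up for illustration, and this says nothing about how Cloudflare actually matches fingerprints against user agents, which isn't public.

```python
import hashlib

def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: fields joined by ',', values inside a field by '-'."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return ",".join(fields)

def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    s = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(s.encode("ascii")).hexdigest()

# Hypothetical ClientHello values, NOT captured from real clients.
# The two "clients" offer the same ciphers and extensions in a different order.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 10, 11, 13], [29, 23, 24], [0])
client_b = ja3_hash(771, [4865, 4867, 4866], [0, 11, 10, 13], [29, 23, 24], [0])

print(client_a != client_b)  # ordering alone changes the fingerprint
```

Because the hash depends only on what the client puts in its ClientHello, a tool that lets you choose those values can reproduce any other client's fingerprint, which is why spoofing is easy.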
I don't want to defend them, because they gate away a good chunk of the internet with their "bot protection", but unless you do PoW (which is also ecologically a nightmare), probably fingerprinting is the way to go - completely destroying the privacy of everyone involved.
Cromite, a privacy-conscious fork of Chromium for Android, constantly has issues with Cloudflare Turnstile [2] because Cloudflare tries to fingerprint the browser in multiple ways before it lets it pass the challenge. The only way to get it to work would be to join the Cloudflare Browser Developer program - which requires signing an NDA. Rightfully so, the project maintainer didn't want to do it.
If you want to see the extent of what Cloudflare does to fingerprint browsers, just have a look at the issue [2] and see which flags need to be disabled in order to pass Cloudflare's challenge.
I understand both sides, but at least Cloudflare could be flexible enough to fall back to PoW instead of just blocking people from sending forms or accessing websites...
it's all for nothing, because Cloudflare's scraping protection works about as well as a $5 padlock - good enough to dissuade bored teens, not good enough to dissuade even an amateur burglar. if someone wants to scrape your publicly visible data, they will. there's nothing you can do.
Exactly. I’m constantly amazed at how little you actually need to bypass CF, Amazon, Azure WAFs and so on (Incapsula springs to mind too). When you look at the code you’ve come up with, it’s actually quite small and compact.
More to the point, these systems actually help scraping because proof of work unlocks essentially unlimited scraping, in my experience.
That said - from my experience on the other side, sure you can’t stop people like me or you, but you can stop 99% of the others. That’s more than worth it operationally.
> Cloudflare's scraping protection works about as well as a $5 padlock
It sure seems to keep me, the casual visitor, far away from just about any site they "protect". I have zero desire to alter my browsing configuration or use extra tools to get around turnstile, I'd rather not even visit the site in the first place.
The overwhelming majority of customers don't even know they could care. And most of them wouldn't anyway. So your vote doesn't matter to anyone but you, sadly.
> Is the value provided by Cloudflare to public so great
turnstile is not a public good, it's a private product, promoted to private entities that want to achieve a certain outcome that is beneficial to them privately.
The mass surveillance is a side-effect - an externality that cloudflare does not have to pay for (but we as netizens pay collectively).
It is the role and responsibility of gov't to regulate away externality (or make those who benefit from it pay a cost somehow, to equalize said externality). Unfortunately, like with climate change, nothing has been forthcoming, and only a few people care about the actual damage enough to even talk about it.
So it will go on, and the masses do not have a say.
That’s a better question. So far the answer seems to be yes.
Large companies and banks see >95% fraud on sign in / sign up flows. It’s a constant battle and the law of large numbers says even a tiny false negative rate can be catastrophic.
A bogus GCP or AWS or Azure account costs those companies hundreds to thousands of dollars. I don’t know what the average loss is on fraudulent bank signins, but probably on that order. And there are millions, sometimes billions of attempts per day.
I worked at a tech company that used an off-brand, truly awful captcha provider. Think “drag the mammal to the habitat it lives in, avoiding the wiggly lines”. When this awful provider went down (frequently), we fell back to recaptcha. Fraud rates were 100x higher in those minutes-to-hours outages. Though of course real users were also able to get in at higher rates.
Talking about mass surveillance: after taking the usual measures against cross-site browser tracking, who knows most about my website visits? Meta, Google or Cloudflare?
Blocking my site visits when fingerprinting is shut off forces all my traffic back into the CF funnel. And the number of websites behind it is soaring.
Try it yourself
https://sereneblue.github.io/chameleon
https://github.com/kkapsner/CanvasBlocker/
and you're increasingly off.
I would like my browser to not pass their challenge and then flush support of services I cannot reach. This is the only way for them to stop, to really get on the nerves of their customers.
Those might ignore it, but there are always alternatives.
It’s always amusing when someone brings up the “just tell banks that if they reduce account takeovers by 80%, it will drive off 3 customers a year (and those are the same 3 customers who call site support to complain the website doesn’t work well on their homebrew Chromium when running on BSD)” argument.
Cloudflare only exists in its current form because banks and such already enthusiastically accepted that trade off.
> but unless you do PoW (which is also ecologically a nightmare)
Can you expand? I don't see a problem with some napkin math.
5 W load for 2 seconds is about 0.0028 Wh (we have to let smartphones pass, and not by doing PoW for tens of seconds). 8 billion checks a day for a year ≈ 8 GWh.
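The napkin math can be reproduced in a few lines. All three inputs are the assumptions stated in the comment (5 W, 2 seconds, 8 billion checks a day), not measurements:

```python
WATTS = 5              # assumed device load while solving the challenge
SECONDS = 2            # assumed solve time per check
CHECKS_PER_DAY = 8e9   # assumed: roughly one check per person per day

# Watt-seconds to watt-hours: divide by 3600.
wh_per_check = WATTS * SECONDS / 3600

# Scale up to a year and convert Wh to GWh.
gwh_per_year = wh_per_check * CHECKS_PER_DAY * 365 / 1e9

print(f"{wh_per_check:.4f} Wh per check, {gwh_per_year:.1f} GWh per year")
# prints: 0.0028 Wh per check, 8.1 GWh per year
```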
I stand corrected. It's not a nightmare scenario (as for Bitcoins) - but I'm still of the idea that "useless" computations should be avoided (as we should avoid having 10MB websites).
In any case, according to some napkin math done by Kimi 2.6 (which by itself is probably already consuming more than all of my PoW challenges for the upcoming 5 years) - the situation looks incredibly in favor of PoW:
https://www.kimi.com/share/19e7ef40-a432-8912-8000-0000b4a71...
Which makes me wonder why CloudFlare isn't switching to this already
There's a saying that if an idea is stupid, but it works, it's not stupid.
If some computation is "useless" but it serves its purpose, it's not useless.
The reason why bitcoin network expends so much energy is down to tokenomics, not the system of PoW itself. At equilibrium we expect the power usage to be (blocks/hr) x (BTC/block) x ($/BTC) x (kWh/$), so it's a function of the BTC price and emission rate.
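To make the units in that product concrete, here is a small sketch. The function name is made up, and the price and electricity inputs are placeholders chosen only for illustration; the block rate and subsidy are the protocol's nominal values, and transaction fees are ignored.

```python
def equilibrium_power_kw(blocks_per_hour, btc_per_block, usd_per_btc, kwh_per_usd):
    """kWh bought per hour (i.e. average kW) if all block revenue goes to electricity."""
    return blocks_per_hour * btc_per_block * usd_per_btc * kwh_per_usd

power_kw = equilibrium_power_kw(
    blocks_per_hour=6,      # nominal 10-minute block target
    btc_per_block=3.125,    # block subsidy after the 2024 halving, fees ignored
    usd_per_btc=100_000,    # PLACEHOLDER price, not a quote
    kwh_per_usd=20,         # PLACEHOLDER: electricity at $0.05 per kWh
)

print(f"{power_kw / 1e6:.1f} GW")  # prints: 37.5 GW
```

Nothing in the product depends on how many transactions are processed: double the price and the equilibrium power doubles, which is the point about tokenomics rather than PoW itself.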
PoW in other contexts has way different driving factors. In this case, the marginal improvement of fetching the site again for AI bots isn't enough to cover the PoW cost. The PoW cost is outweighed by the net bandwidth cost of all the parties.
I mean coal power plants work, so building new ones is not stupid by that standard.
I think we have to expand the definition of stupid to include things that work but have net negative externalities. Not sure where PoW falls in that way of looking at things, but we should at least consider it.
(Thinking about it, a captcha is PoW too, just with the work theoretically done by a human)
Why not? PoW challenge doesn't whitelist botnets. If the dumb scraper makes only get requests and doesn't solve the challenge, it doesn't matter how it connects, even if it's a perfectly hidden tor exit node.
Because the work would be done by the compromised residential device. No botnet owner is going to care if their 100,000 rooted routers have to do a little more work. It’s still “free” from their perspective.
The goal of Cloudflare’s fingerprinting is to detect whether a user agent appears to be a legitimate human. It’s not to identify human users across websites.
Every HN thread is full of people who think webmasters should just pay through the nose to handle bot traffic to preserve the sacred rights of turbonerds to visit their website using Lynx on their toaster.
I should think that there should be a better way (e.g. port knocking, instructions for manually correcting the URL that cannot easily be automated, additionally supporting alternative protocols, etc).
Because you can't have both a difficulty with a reasonable page load time and a difficulty that stops bad actors. Attackers have stronger machines and are willing to wait as long as they need to.
8 billion checks per day sounds on the low end. I can imagine it being ten or hundred times more. That still seems pretty fine though. On the other hand, it's hard to see that such a modest energy cost would dissuade any attacks.
> I can imagine it being ten or hundred times more
I don't think I average even 2 captchas a day being terminally online, so 10 across every soul in the world sounds way too much for me. (we're ignoring bots it's meant to deter?)
> it's hard to see that such a modest energy cost would dissuade any attacks.
It's not against targeted attacks, but scrapping.
And not about energy cost, but available compute power -- it requires scrapper to use browser with JS (or time commitment to reimplement PoW outside of JS), limits their request rate by CPU core count.
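A hashcash-style challenge shows why the request rate ends up bound to CPU time. This is a generic sketch, not the scheme any particular product uses; the function names, the 8-byte nonce encoding and the 16-bit difficulty are arbitrary choices for illustration.

```python
import hashlib
import os
import time

def digest_value(challenge: bytes, nonce: int) -> int:
    """SHA-256 of challenge plus nonce, read as a 256-bit integer."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while digest_value(challenge, nonce) >= target:
        nonce += 1
    return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash, no matter how long the client searched."""
    return digest_value(challenge, nonce) < (1 << (256 - difficulty_bits))

challenge = os.urandom(16)        # server would issue a fresh one per request
start = time.perf_counter()
nonce = solve(challenge, 16)      # about 65k hashes on average
elapsed = time.perf_counter() - start

print(verify(challenge, nonce, 16), f"solved in {elapsed:.3f}s")
```

Each extra difficulty bit doubles the expected solve time, while verification stays at one hash. A scraper that wants N pages per second needs roughly N times the solve time in CPU-seconds every second, so throughput scales with cores rather than with bandwidth.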
> I don't think I average even 2 captchas a day being terminally online, so 10 across every soul in the world sounds way too much for me. (we're ignoring bots it's meant to deter?)
You're mixing up checks, fingerprinting, and PoW with a captcha being triggered because those didn't pass. The less abnormal your setup is, the fewer captchas you'll get.
I agree with the rest of what you said.
Also I think you mean "scraper" and not "scrapper".
> but unless you do PoW (which is also ecologically a nightmare), probably fingerprinting is the way to go
Only as long as legislation and law enforcement are off the table. Almost like we have those because everyone doing their own policing is not a reasonable way to run a society.
Not sure what you mean, Brave blows Firefox out of the water in terms of privacy protections. Firefox has milquetoast fingerprint protection and it doesn't even block ads. uBlock is worse than Brave's blocking by virtue of not being natively integrated.
PoW doesn't fix anything if you have an army of zombie CCTV cameras and smart fridges at your disposal.
It's either proof-of-humanity (increasingly hard to get in this day and age, particularly if accessibility is a concern), proof of identity (even worse) or proof of system integrity, which is the least bad out of all the terrible options.
Why wouldn't PoW help? If it's tuned so that each device in that army takes 10 seconds instead of 10 milliseconds to make a request, have you not slowed the army down by a factor of 1000?
Micropayments would be another one, but then governments and banks have to give up ~~financial control & surveillance~~ AML essentially to make it financially viable. AML also has a horrible track record of how much money is spent compared to the amount recovered.
Wouldn't micropayments be aggregated into a money laundering operation by some third party? (Wasn't IIRC even Spotify used for that?) Or would Cloudflare take all the money in this hypothetical scenario?
>probably fingerprinting is the way to go - completely destroying the privacy of everyone involved
your doctor seeing you naked does not destroy your privacy, it's your doctor sharing the photos with everybody that does. i.e. the problem here is that intermediaries like cloudflare don't work for you, they work for somebody else or sell the data themselves.
> I don't want to defend them, because they gate away a good chunk of the internet with their "bot protection"
They also gate away a good many people with their "bot protection". I am extremely worried about how so many seem to have outsourced the control over who can access their websites to a company, with no second thoughts whatsoever.
The problem is what is the alternative? I'm (not) defending them or this practice by any measure, but we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system. I've hated CAPTCHAs ever since I first encountered them and I can't wait for them to just finally die a permanent death, but I also don't know how we solve the "how do you identify a human and a bot" in a way which doesn't require server admins to have extremely beefy servers or similar setups to handle the extra load. I'm not going to do the "there HAS to be a way thing" either because, for all I know, this could just be one of those impossible-to-solve problems.
> we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system
No, we don't know. I honestly do not understand the problem. I run websites, both static and non-static. Granted, my sites aren't exactly the most popular internet go-to destinations, but I should be seeing this DDoS too, right?
I do see lots of requests. Nothing that any modern system can't handle. Computers are stupid fast these days. Unless you are doing something unreasonable, it's really hard to even notice this "extra load".
I understand there are sites for whom this causes problems, but I think these are rare and could be optimized not to do unreasonable things.
I think too many people are annoyed by AI companies (arguably understandable position), look at their logs and speak of "hammering", "DDoS" and "extra load", while in reality it doesn't matter much.
We do know, just ask anyone who runs a more popular site or does anything where abuse can be monetized (shopping, reviews, etc.). Avoiding that due to obscurity isn’t an answer because it’s saying you’re safe until something, possibly outside of your control, causes the bots to descend and give you an extra 500M requests with no chance of revenue.
I’m with OP: I don’t like this but the alternatives all look like the death of the open web.
The person you're responding to already said they ran a modestly sized site. What actual scale opens one up to abuse? If only the top 1% of sites need it, then it seems silly to say "everyone" needs it.
Stack Overflow was outside of the Cloudflare network for years, and anti-abuse was maybe 3 or 4 full-time jobs – much of which still needs to be done, because Cloudflare's anti-bot protection hasn't actually stopped it. Most UGC sites are not as big as Stack Overflow was at its peak.
I'm referring specifically to the activities of Charcoal (https://charcoal-se.org/) and their Stack Exchange staff counterparts, taken together. This is about large-scale platform abuse, of the sort that Cloudflare is alleged to prevent (but doesn't, really), not the more mundane (and laborious) task of manual quality control.
errr... so anything related to UGC now has a lower bound of 3-4 FTE? Sure, I'll hire a team of content moderators next time I think about putting a comment form under my blog...
Yes? Cloudflare doesn't replace moderators. At all. It only allegedly filters bot generated content, it doesn't filter user generated content and doesn't even intend to.
Please read their last sentence again and think about how much it understates the difference between stack overflow in its prime and a normal website. Also the "much of which still needs to be done".
It might depend on the tech stack. I run a small niche website but it has PHP and a database (MediaWiki/PHPBB) and without Cloudflare I'd estimate I'd need to spend several hundred dollars a month to handle the traffic. Traffic used to be tens of thousands of requests a day. AI has increased that to between 400k and 3M requests per day but it's not a smooth distribution. This is with bot fight mode on that greatly reduces traffic.
I adopted Cloudflare because it was getting DDoSed by the AI crawlers. I'm pretty sure all of them are vibe coding their crawlers and don't bother adding rate limiting as a requirement.
That was my point. I was trying to be gentle by mentioning "unreasonable" things, but seriously — how did we get to the point where less than 6 requests per second (that's 500k requests per day) is considered a DDoS?
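For reference, converting the daily figures mentioned in this subthread (400k, 500k and 3M requests per day) into average requests per second:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_day_to_rps(requests_per_day):
    """Average requests per second for a given daily total."""
    return requests_per_day / SECONDS_PER_DAY

for per_day in (400_000, 500_000, 3_000_000):
    print(f"{per_day:,} req/day = {per_day_to_rps(per_day):.1f} rps on average")
```

That works out to roughly 4.6, 5.8 and 34.7 rps. These are averages only; as noted above the traffic isn't a smooth distribution, so peaks can be much higher than the daily mean suggests.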
I've spent some effort on optimizing my sites, but most of the effort was focused on avoiding unreasonable (stupid) work. Do I need a session for every request? No, I don't! Do I need a database fetch for every access to my homepage? No, I don't! Is it a problem to actually load all of my static content in all supported languages (24) into memory and serve it from memory? No, it isn't!
I use Clojure behind nginx on the server for my sites. Oh, and I also pre-compress all static assets to Brotli, so anything that handles brotli gets a static file served directly from nginx. I also use immutable assets with unlimited caching semantics.
Really — the problem is that we've grown lax and our software has become bloated, slow, and with unreasonable code paths. If every page fetch does 12 database accesses and runs through a slow interpreter, that is surely going to be a problem.
I second this. My website exposes a cgit and 99% of the traffic now is AI scraping the sources, but the load is nowhere near DoS territory. And this is running on the cheapest VPS I could find.
Not saying I'm not annoyed by the scraping; I am looking to block them, but I'm also not going to put the site behind the gatekeeper. If anything, Cloudflare must love AI scraping now for the same reason AV companies love malware.
Now, if you are running a PHP stack...yeah, maybe that's the problem right there.
> 99% of the traffic now is AI scraping the sources
I wonder if we should stop fighting this and instead create an API specifically for this purpose? Or, a central repository that you could send your data to and say to anyone wanting to scrape, "save yourself some time and just get my data from this other place"
Is there actually any plausible theory why "AI" would repeatedly scrape the same sites? Are there that many competing, completely independent AI labs? Is it cheaper to repeatedly scrape than to buffer the scraped data locally? (I find it very hard to imagine that it's easier to deal with changing/disappearing content than it is to stand up such a cache.)
The PHP stack isn't even the problem, it's having unauthenticated requests getting past the cache in the first place, something that most sites should be able to prevent.
Consider yourself lucky. But don't let yourself fall into the trap of thinking it's a nonissue for everyone else until it happens to you.
People shouldn't have to be experts or provision a larger server to run a UGC service that can withstand the sort of 30x more traffic I'm seeing from AI bots. Or rather, you didn't render the argument for why they should have to do that if they can just use CloudFlare's free tier.
Either way, it's easy to have all the answers when you've never had the problem.
Has anyone pointed an AI scraper at your server at all? Unless your website appears in search engine listings I don't think the AI scrapers will slam it. My server has never been hit by them but my server is also practically unknown. All of this said, I'm not going to claim that server loads can handle it because many sysadmins have claimed otherwise, and I would like to think that their claims are reliable.
As soon as you get your TLS certificate you get bombarded with scraping. You don't need someone to "point a scraper at you".
What matters most is usually how much there is to scrape. If you have like 5 pages that's nothing. For forum like websites where each thread, each user profile, etc. gets scraped that's when traffic increases. I just let them have at it with no issues though, computers are fast.
That's really weird. My experience is quite different: I have several subdomains and all of them have TLS certs and I haven't (yet) seen this (thankfully). Either that, or my server is masking it. The weird thing is that my server is an OVH dedicated box that doesn't exactly have top-tier specs, so I have no idea what's going on there. Very weird indeed.
I mean... It may be that most of the things I run aren't really scrape-able. I run Matrix (which requires authentication), an XWiki instance, Zulip, Terraria, Forgejo, Nextcloud, a Mastodon server... Most of those require auth behind my Kanidm instance to actually do anything. Well and most of them have APIs that are much better than "scrape the universe".
They showed up when the AI money did. The evidence is circumstantial, but… some of them are remarkably well engineered (from a “how difficult is it to identify this traffic” perspective), in a way that never existed before (I have been running a quite sizeable site for 8 years, over 200k registered users, and you don’t need to register to use 99% of it).
I run a quite large website and there are a few patterns.
The usage is extremely quick, and follows easy-to-spot patterns. We noticed a spike in bounce rate.
They never come from Google, and the bad programmed ones just crawl several pages at a time, faster than a user could do.
Then there's the crazy spikes in visits from specific countries, pretty much scraping the entire content. Often from pools of IPs. In some cases we had 30% unexplained (meaning: it wasn't viral or a marketing campaign) random sustained increases in traffic.
There's also the fact they don't interact with the complicated widgets, so zero XHR requests other than analytics pings.
They also don't cause spikes in Google Analytics, so I assume it's blocked, but they show up in logs and in the internal analytics.
It's not enough to DDOS the website at all, but it's a lot of noise in statistics that we gotta learn to filter.
> They never come from Google, and the bad programmed ones just crawl several pages at a time, faster than a user could do.
I’ve triggered this kind of “bot protection” right here on Hacker News many times. I did that by having a bunch of Hacker News pages open and then closing and reopening my browser. I’ve also triggered it by opening a bunch of links in the background too quickly. I’ve also triggered it by reading the article, then clicking back and upvoting/favouriting too quickly. I’m also located in Singapore, which people have started to advocate for blocking here recently.
A single non-bot legitimate user can easily trigger these kinds of heuristics just by using the site in a way you don’t expect. This can affect some users disproportionately more than others, e.g. disabled people who need to use assistive technology.
It's circumstantial evidence, but Occam's Razor also applies.
It's not a hostile DOS in the traditional sense (I've mitigated a few of those) - no "pay us to make it stop", no pattern to the requests other than "fetch every unique URL a few times".
It wasn't happening until financial incentives to gather large datasets for AI training appeared.
Bad actors (using residential proxies & claiming to be a real browser) mostly showed up after folk started blocking ones that identified themselves as AI scrapers.
It's obvious to blame AI training because there's a shortage of better explanations. Who else would be paying for these (expensive) residential botnets, only to use them to (eg) web-scrape wikipedia (which offers free downloads of its content in a structured format)?
The simplest explanation of the technical behavior is "a bot coded to follow every link it sees & save the results", and the simplest explanation of the motive to run such a bot is "to train a large language model".
Exactly. They (and most of all, Big G) stand to profit greatly from this browser discrimination. What better than to make more sites use them by launching DDoS attacks in the name of "AI scraping".
A small, non-static e-commerce site focused on a single EU country, with proper robots.txt instructions that worked perfectly well in the "era" of search-engine-only bots and with rate limiting on its nginx/php-fpm setup, is kinda struggling without CF to handle 15000 requests per 15 minutes coming from Chrome "users" over IPv6. Best so far was an avg. server load in htop = 40 on an 8-core server x_x
That's 16.7 rps. A single guy holding the F5 key on chrome can generate that much traffic and take down your website. That kind of performance was never acceptable.
People will always reframe their request numbers to avoid stating their pitiful requests per second numbers, it's hilarious. "This thing is handling hundreds of thousands of requests per day!" Like cool, you're barely making it double digit requests per second.
Maybe a plain WordPress install. Run something like WooCommerce and install a bunch of plugins to get the functionality that WordPress and WooCommerce should have built-in, and suddenly a cheap VPS can only handle 2 or 3 requests per second.
It's phenomenal how inefficient the WordPress/WooCommerce stack is.
Though the main issue I'm seeing is credit card testing, not scraping.
And I'm ideologically opposed to using a CDN (because it shouldn't be needed for such a small site!) so it's somewhat a self-inflicted problem...
"Security" plugins are also a HUGE problem here. Most of them turn a "few cached DB SELECTs" (or a static file read if you use a caching plugin) into a bunch of inserts, just to log/analyze the "offender" IP and maybe block it, in many cases making "blocking the offender" more costly than serving the page without the security plugin would be.
You can calculate traffic stats for a day by IPs/subnets and probably bots will stand out. If they are using IPv6 you can figure out the ASN and block it completely.
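A rough sketch of the per-subnet tally, assuming you have already pulled the client IPs out of your access log. The /24 and /48 grouping sizes and the function names are arbitrary choices, and the sample addresses are documentation ranges, not real traffic. Mapping a subnet to its ASN needs an external dataset and is not shown.

```python
import ipaddress
from collections import Counter

def subnet_key(ip: str) -> str:
    """Collapse an address to its /24 (IPv4) or /48 (IPv6) network."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48  # assumed grouping sizes
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def top_subnets(client_ips, n=3):
    """Return the n busiest subnets as (network, request_count) pairs."""
    return Counter(subnet_key(ip) for ip in client_ips).most_common(n)

# Stand-in for one day of log entries.
sample = [
    "192.0.2.10", "192.0.2.77", "192.0.2.200",
    "198.51.100.5",
    "2001:db8:1:2::1", "2001:db8:1:ffff::9",
]

print(top_subnets(sample))
# prints: [('192.0.2.0/24', 3), ('2001:db8:1::/48', 2), ('198.51.100.0/24', 1)]
```

This only helps when the bot traffic is concentrated; a scraper spreading single requests across unrelated residential networks won't stand out in a tally like this.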
You get downvoted for these opinions but I agree. Most people that complain that their servers get hammered by AI bots are those that run very unoptimized servers that can only handle like 100 rps. I've never had any issues with any of my moderately optimized websites. A $10 VPS can handle sooo much traffic.
I think people get annoyed when it's suggested they spend time optimising or even re-writing their websites to handle high traffic loads just to cater to AI bots ripping their content.
It's also not always easy to do. I run a small wiki which is fairly optimised, nearly every page manages at least ~3k rps on a small VPS. The only exception is the diff page which is ~150 rps. Optimising that while still giving good output isn't that easy, but the wiki doesn't have many users so that would be fine if it wasn't for the AI bots.
The AI bots ignore robots.txt and were initially hitting the site with ~1k rps crawling every combination. Even that would be manageable as there's currently ~150,000 combinations, except they kept re-crawling the whole lot each day. The server could manage it but it was a massive waste of resources.
They were using residential IPs and only sending 1 request from each IP, making it impossible to block. In the end I gave up and put a Cloudflare challenge in front of it. I don't want to use Cloudflare but the alternative is forcing users to log in to view diffs or removing them entirely.
What I do is have more strict rate limits for non logged in users. You tell them to log in if they hit the rate limit. For non logged in users, you have a rate limit not just for IP, but also for /24 and /16. Forget about IPv6, IPv4 scarcity is a feature not a bug.
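A minimal in-memory sketch of that layered limit for IPv4. The window length and the three thresholds are made-up numbers, `allow` is a hypothetical helper, and a real deployment would keep the counters somewhere shared rather than in a module-level dict.

```python
import ipaddress
import time
from collections import defaultdict, deque

WINDOW = 60.0                           # seconds, assumed
LIMITS = {32: 60, 24: 300, 16: 1500}    # prefix length -> max hits per window, assumed

hits = defaultdict(deque)               # network string -> timestamps of recent hits

def allow(ip, now=None):
    """Allow the request only if the IP, its /24 and its /16 are all under their limits."""
    now = time.monotonic() if now is None else now
    keys = {p: str(ipaddress.ip_network(f"{ip}/{p}", strict=False)) for p in LIMITS}

    # Drop timestamps that have fallen out of the sliding window.
    for key in keys.values():
        q = hits[key]
        while q and now - q[0] >= WINDOW:
            q.popleft()

    # Reject if any level is already at its limit.
    if any(len(hits[keys[p]]) >= limit for p, limit in LIMITS.items()):
        return False

    # Count the request at every level.
    for key in keys.values():
        hits[key].append(now)
    return True

results = [allow("203.0.113.7", now=0.0) for _ in range(61)]
print(results.count(True), results.count(False))  # prints: 60 1
```

The /24 and /16 buckets are what catch a client hopping between neighbouring addresses; the cost is that legitimate users sharing a busy range hit the limit sooner, which is where the "log in to continue" prompt comes in.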
The bot I had was using unique IPs for each request. Some were from cloud providers but most were just random residential ISPs. I couldn't see any obvious connections so rate limiting would've had to be a global rate limit.
Each IP only makes ~1 request though so easy to detect after the fact.
I guess they will run out of IPs at some point so maybe if I had logged each one forever and shown a challenge only to them, it would have fixed it eventually. Just depends how big their pool of IPs is.
You were getting 1k rps, and each request was from a unique IP? So after an hour you got hit by 3.6M different IPs? And all from uncorrelated /16s? That seems hard to believe. Not that I don't believe you, it's just hard for me to grasp that whoever was scraping you had such a large and distributed swarm.
This is called a rotating residential proxy service. You can buy it off grey market sites that are probably getting it from botnet operators. It costs about $2-$5 per GB.
There really isn't a good reason for a wiki (or git host) to provide diffs between arbitrary revisions to unauthenticated users. Limit it to diffs compared to previous (which can be cached) and this problem goes away.
In any case, such labyrinths of expensive dynamically generated pages are no excuse for subjecting people requesting the start page to bot checks.
Curious, but how do the bots figure out the combinations? Or do you have links to the diffs from other sites? I assume the diff takes two files in query parameters or something.
I managed to solve my scraper problems without optimizing much, but if I had to optimize I think the only option might be "don't use mediawiki" and that's an extremely obnoxious solution. Though maybe I could get there by throttling specific kinds of pages.
Same. Tritium and the blog have done stints on the front page here and high traffic subreddits and that plus bots has never been a problem. UX could be improved through a CDN but even that isn’t worth the trade-off for us at the moment.
If you're in any way semi-popular and a decent size, you're gonna get hammered. PortableApps.com was partially offline for weeks due to China-based AI scrapers. You block the useragent, they start hitting you with another one from the same IP in the same way. You block the IP, they switch to another. You block the subnet, they use another. At one point it was nearly a thousand different IPs from around China hammering away. For all intents and purposes, a DDoS. This wasn't a little "extra load", this was load that was thousands of times beyond what our legitimate userbase was using.
And if you're thinking about blocking all of China, while this particular AI bot didn't use them, a bunch of other ones I've encountered use VPNs and hacked clients worldwide.
I don't think it's just privacy, it also increasingly turns the web itself into a walled garden. The end result is that websites can only ever be accessed by "approved" clients - the latest Chrome, Edge, Safari and if you're lucky Firefox - and nothing else.
I haven't ever noticed Cloudflare having any issues on Firefox, so presumably that implies any unilateral actions in web standards have been worked around by CF to provide the service to Firefox as well.
I'm pretty frequently blocked by Cloudflare when I use Firefox on OpenBSD -- apparently it's too suspicious of a combination for their liking, or something. Even on Linux I've occasionally had issues. I've had to email site operators to ask them to change their configuration so I can actually be a customer of their business.
> we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system
So delegalize it. Strip searching everyone to paper over the fact that the societal contract has been broken only delays that.
I think there's some chance we get a "proof of purchase" system where there is some entity that takes a $10 payment to give out a unique identity token that you need to present to visit most sites. if you have a revocation process for ones used for bad actors, it seems like it would work pretty well.
If the bad guys also had to pay $50/month/IP it would probably work.
The bad guys don't pay that much. And sometimes the bad guys actually use the IPs of other people (botnets on residential IPs) and don't pay anything at all.
You can easily calculate which IPs/networks bots are using by looking at where most traffic comes from and who requests a lot of pages at non-human speed.
The most plausible near-term path is probably micropayments embedded invisibly in AI agents. Your agent that has learned what you value and can make a reasonable decision to allow a micropayment for certain content pays on your behalf without requiring a conscious decision each time, eliminating the mental transaction cost problem entirely. It's the mental transaction cost that arguably led to the failure of the micropayment model back in the early 2000s.
Although the cynical part of me says that this will result in malicious actors trying to trick agents into giving out a bunch of micropayments. There are counter-defenses that can help detect and compensate for that, but perhaps the best we will be able to do is prompt the user with the default agent recommendation.
I can no longer access any website that's "protected" by Cloudflare. As soon as a website enables that stuff… "Shoot, another one bites the dust." I wonder if the website owners realise at all how many actual users they lose by this sort of "protection."
Cloudflare will just tell them that 70% traffic drop is because 70% of their traffic was bots, and everything is working fine, and hey, don't you want to upgrade to a paid plan to block 50% of the remainder? Think about how many bots will be blocked with that upgrade!
I'm one of those who have enabled Cloudflare on all of the sites I maintain. Additionally, I added Turnstile on every form.
I know some actual users get blocked. But the amount of spam we get without it, the amount of bot traffic simply overwhelming the server... It is just too much.
Recently I also hard blocked all IPs from China, Singapore, India, Pakistan, Russia and the whole of Africa. Do I want to do it? No. But the amount of bot traffic and corresponding spam is a bigger problem :(
> I know some actual users get blocked. But the amount of spam we get without it, the amount of bot traffic simply overwhelming the server... It is just too much.
So why not just shut down the website? Or remove the form entirely? That will ensure that you get no spam, right?
One of the core tenets of system design is Availability. If your service is not available - if your forms are blocking legitimate users - then why are you pretending to have a form submission feature at all? Just to frustrate users?
> One of the core tenets of system design is Availability. If your service is not available
The service won't be available to anybody because of overwhelming unwanted traffic. Now it's available for most potential users. You're speaking econ 101 when everyone else has played out iterated prisoner's dilemmas.
> So why not just shut down the website? Or remove the form entirely? That will ensure that you get no spam, right?
Turns out that people have a tolerance for a non-zero amount of work, but still have a limit.
Suggesting "turn off your website" does not account for the desire to also provide some access.
Treat people who host content as humans, just as we must treat users as humans. There are tradeoffs, suggesting "shut down your website unless you provide access everywhere" is worse on all fronts for everyone.
> There are tradeoffs, suggesting "shut down your website unless you provide access everywhere" is worse on all fronts for everyone.
Maybe, maybe not.
If block-heavy websites shut down entirely, we lose some content, but other content moves to block-minimal sites and the average user might be able to access more.
Also if there's no blocking crutch, and people get pushed into shutdown and are mad about it, they might fight harder for anti-spam technology and legal enforcement, which could improve the situation.
>I wonder if the website owners realise at all how many actual users they lose by this sort of "protection."
How many people do you think are browsing with a weird enough config (eg. custom browser like OP, or some weird config like Firefox with fingerprinting protection on a Raspberry Pi) to trip Cloudflare's protection?
Well… I know plenty of people in my circle affected by this. Just have a slightly outdated system you simply can't afford to update, and it's way too easy to get cut off like this. IMHO, a rather systematic discrimination against poorer people.
I got locked out of some websites by Cloudflare Turnstile on some very standard configurations, like an iPhone on Safari, or a Windows 11 desktop with Firefox or Edge, neither with a VPN on. I never found out why.
it's probably because a scraper farm updated their services to latest, and there was a window where fingerprinting was unable to differentiate.
We had all of our devs' Pixels get blocked, and after talking to CF, it was because the Internet Archive had rebooted its scraping farm: all the devices stampeded and overwhelmed the known-bot safeguards, and those tags were added across the board. CF gives sites the tools to tune what is getting blocked; we bumped the sensitivity down to 25 and haven't had many complaints (despite having a very vocal community).
The most common complaint is users' IP address getting blocked because of compromised devices
It does not have to be weird; at least once it happened to me that their strictest settings simply banned something like a major portion of internet users in my country - to the point that if you had FTTH you were likely blocked.
And no, it wasn't due to a country-based block selected by site operator.
I use a plain Firefox on a plain Windows 11 PC on a plain regular mass market ISP in a developed country and I get completely blocked by websites daily.
At least let me complete a "prove you are human" challenge or something, but don't outright ban my IP address?
Do you by chance have that installed? I don't use Cloudflare but I am curious if that code can scrape my silly blog? [1] Trying to pick the appropriate article... I'm guessing it can. I don't do the fancy javascript or TLS fingerprint inspections, just some janky hill-billy protections, silly redirects and Antarctic voodoo.
As someone responsible for mitigating card testing "attacks", account harvesting, and DDOS attacks..
It is unfortunate, but the ISP industries(from telco up to transit) and CC industries aren't providing a lot of great options. This idea that people are doing things "without a second thought" is usually false when it comes to businesses.
They sometimes have to comply with legal requests (which I understand), but at the same time they have a huge market share - which means that the internet is becoming less and less decentralized and more in their control. We've seen the effects of that in previous outages...
I think what gives me anxiety about the whole situation is:
1. If X% of the population gets wrongly branded with the scarlet letter B[ot], how do they appeal and get it fixed?
2. How will sites notice and know if their choice of "bot protection" is losing them X% of users/customers/job-seekers etc.? If it's a really robust system, they'll never even see the complaints either...
3. If everyone does detect that something is awry, will it be such a monopoly that there's no choice but to let it happen?
I use a cellphone internet provider, and there have been many sites I couldn't access because of Cloudflare or stupid reCAPTCHA. I know damn well what a bicycle, bus, traffic light or stairs is.
>I am extremely worried about how so many seem to have outsourced the control over who can access their websites to a company, with no second thoughts whatsoever.
I think the Web is on its last legs, anyway. Generative AI and LLM-instead-of-search has destroyed what little value remained.
Governments too. It's inevitable that the international network will fracture into multiple national networks with heavy filtering at the borders as each country scrambles to impose their laws on it.
I'm glad to have known the true internet before its demise. Truly one of the wonders of humanity.
It's just one more facet of the enshittoscene, the era where actual product quality is completely irrelevant. Put it in the same bucket as websites that lag when you scroll, apps that refuse to show you video without a huge play/pause button overlaid in the middle of it that never goes away, and the movie Melania. My hypothesis is that billion-dollar businesses no longer exist to sell things to customers, but only to impress other billionaires to get their investment money.
You can use Firefox with different profiles and configure it to launch a particular profile directly, without launching the default profile and using about:profiles.
Firefox with a non-default profile can be created like that:
./firefox -CreateProfile "profile-name /home/user/.mozilla/firefox/profile-dir/"
# For, say, cloudflare that would be:
./firefox -CreateProfile "cloudflare /home/user/.mozilla/firefox/cloudflare/"
And you can launch it like that:
./firefox -profile "/home/user/.mozilla/firefox/profile-dir/"
# For cloudflare that would be:
./firefox -profile "/home/user/.mozilla/firefox/cloudflare/"
So, given that /usr/bin/firefox is just a shell script, you can
- create a copy of it, say, /usr/bin/firefox-cloudflare
- adjust the relevant line, adding the -profile argument
If you use an icon to run Firefox (say, /usr/share/applications/firefox.desktop), you'll need to do the same copy-and-adjust for the icon's file.
Of course, "./firefox" from examples above should be replaced with the actual path to executable. For default installation of Firefox the path would be in /usr/bin/firefox script.
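If you'd rather not copy and edit the launcher script itself, the same idea can be sketched as a tiny Python wrapper. This is a swapped-in alternative to the shell-script copy described above, not the commenter's method; it only relies on the -profile flag already shown, and the function names and default binary path are assumptions for illustration:

```python
import subprocess


def firefox_command(profile_dir, firefox_bin="/usr/bin/firefox"):
    """Build the argv that launches Firefox straight into one profile.
    firefox_bin defaults to the path mentioned above; adjust as needed."""
    return [firefox_bin, "-profile", profile_dir]


def launch(profile_dir, firefox_bin="/usr/bin/firefox"):
    # Popen returns immediately; the browser keeps running on its own.
    return subprocess.Popen(firefox_command(profile_dir, firefox_bin))
```

Calling launch("/home/user/.mozilla/firefox/cloudflare/") would then be the equivalent of the last command above.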
So, you can have separate profiles for something sensitive/invasive (linkedin, cloudflare, shops, banks, etc.) and then a separate profile for everything else.
And each profile can have its own set of extensions.
I'd argue, that for some, CLI path is actually cleaner.
You see, the way described above creates entirely separate points of entry, and you don't have to go to the central menu to launch specific profile.
It eliminates one step (Profile Manager, about:profiles or whatever) allowing you to get faster to the desired profile - same way you'd launch a default profile.
It's logical separation too. It's like separate browsers from UX standpoint (they do use the same distribution though ...unless they aren't - you can configure different distributions for different profiles - nothing stops you from that).
I'm just leaving the information about the GUI option for others who may not be aware that it can be done from the GUI too, and think it's difficult to do in Firefox.
I think the idea is that they have the functionality that cloudflare is using to generate the fingerprint (like webGL in this case) disabled in their non-cloudflare profile and only use the cloudflare profile to do things they have to that are behind cloudflare
that's why I use completely different browsers with different settings. my CF-friendly one (not my daily driver) is `firejail --private chromium` so it always starts with a clean temporary profile
They actually have at least 3 kinds of profile:
1. containers - as they say, it's some kind of sandbox; technically a profile
2. profiles that are accessible through about:profiles, which they've had for years, and probably the one you are talking about...
3. new profiles that come with a pop-up, much like how Chromium browsers show it
Odd - they've had that for years, but only on the command line. Wonder if it's different under the hood? They also have firefox containers which also never quite became a first-class feature (you have to install a plugin).
I mean all bot protection is useless at the end of the day; every time I have to bypass it I can do so in roughly 3 to 5 hours, both 2 years ago and more recently, around 1 month ago. 2 years ago it was an absolute joke and only took me 30 minutes.
Well I mean maybe it wasn't useless 2 years ago, but in the age of AI it definitely is.
> I don't want to defend them, because they gate away a good chunk of the internet with their "bot protection", but unless you do PoW (which is also ecologically a nightmare), probably fingerprinting is the way to go - completely destroying the privacy of everyone involved.
Bot protection with fingerprinting is just an illusion. Any client-side signal like this can be spoofed by an above-average person. Fingerprinting is just a way to consolidate the market for the advertising business. Assigning reputation to residential IP addresses and commercial blocks is another approach to achieve the desired result. Providers would be a lot more careful about letting their IP addresses be misused; however, it turns out that this would bring down the DDOS business on both sides, attackers and protectors.
Ironically, more often than not it's the same companies that invest in building their own bots and in finding ways to stop bots from other companies.
> Bot protection with fingerprinting is just an illusion. Any client-side signal like this can be spoofed by an above-average person.
At the upper bound, fraud can always be committed by paying real people with real accounts to perform the desired action in a way that is 100% truly indistinguishable from organic. There's fundamentally no actual prevention technique at the limit.
So the entire game is only "increasing the costs until it's not viable ROI", not "holistically prevent", which is why fingerprinting is a relevant technique here.
> entire game is only "increasing the costs until it's not viable ROI", not "holistically prevent", which is why fingerprinting is a relevant technique here.
As per Cloudflare's own report, about 78% of the DDOS attacks are at the network layer, where the fingerprinting technique is not useful.
DDOS is done against targets for certain reasons, most businesses are not even viable targets for everyone.
However, letting everyone be fingerprinted on the pretext of solving DDOS is where privacy gets compromised (not much of it is left though). Some search engines did it indirectly by letting people use tag managers for free on their websites and then utilizing the data for their advertising business.
The end game is relatively the same; it's just how these companies are approaching it.
JA3 fingerprinting is really not a serious deterrent, there are many ways to get around that. curl-impersonate works. You can even just use an actual Chrome instance with the devtools protocol, seems to pass as long as you don't use headless mode.
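For anyone wondering why JA3 is so easy to imitate: the fingerprint is just an MD5 over five ClientHello fields joined into a string, so any client that sends the same values in the same order gets the same hash. A minimal sketch of the calculation; the sample numbers in the usage note are illustrative rather than a real browser's hello, and real implementations also strip GREASE values first, which is omitted here:

```python
import hashlib


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Compute a JA3-style fingerprint from ClientHello fields given as
    decimal integers. Returns (ja3_string, md5_hex)."""
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    # Fields are comma-separated; values inside a field are dash-separated.
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()
```

For example, ja3_hash(771, [4865, 4866], [0, 23], [29, 23], [0]) hashes the string "771,4865-4866,0-23,29-23,0". Reordering the ciphers changes the hash, which is the handle that tools like curl-impersonate use: send Chrome's exact lists in Chrome's exact order.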
The WebGL fingerprinting thing is cute, too. I guess it'll buy them some time since off-the-shelf solutions are going to probably not handle this well yet. That said, as long as the reward for bypassing turnstile and other anti-bot protections remains high, these things really can't do much. A decently resourced adversary can probably come up with a dozen different approaches to make this less useful. Without really looking into it much, my kneejerk is you could probably tweak Mesa to have deterministically random behavior for whatever edge cases it looks for, but you could also just have lots of different GPU/driver combos to proxy to. The web gets less open, but in an asymmetrical way. If you really have an incentive to keep botting, you'll surely find a way.
The next step is to fully give up and just essentially implement WEI. And then the bot problem disappears?
Nope. Botting will still hold tremendous value, so likely there will be many crafty workarounds and bypasses over time. And there will be countermeasures for those and workarounds for that. Guess we'll start to find out who actually has the resources and incentives to keep botting in this environment.
So what's the real solution? Well the most obvious thing to do would be to make botting less valuable. Can we? I dunno. It may have been a mistake to move so many important things to the Internet after all. I mean, some of this is just threat actors catching up with what's possible and was inevitable to begin with. But, some of it is just trying to find solutions to problems that were unnecessary to begin with. Or failing to implement solutions despite an obvious need to do so.
There are a lot of threads to pull on, here. Account takeover still holds tremendous value to threat actors. Why? In my opinion, it's because passkeys were a tremendous failure, no matter what adoption shows. If we wanted to just improve security for users, I think we didn't need to restructure the internet around another authentication mechanism that, of course, provides attestation capabilities; we could've just improved on passwords. For more secure handling of passwords, PAKEs exist. Password managers exist. For anti-phishing, TOTPs exist. What if you could have the exact same passkey experience, but in such a way that everything can gracefully fall back to just passwords and TOTP, because they're the real key material at the end of it? Add a web standard that lets browsers and browser extensions hook into the login process, standardize PAKEs as part of the web. Cross-vendor synchronization? A problem easily solved if we ever wanted to.
Instead of that, we got the dumbest possible world. Passkeys are sometimes available, but often not. Can you sync your passkeys across devices? Probably, maybe they have blacklisted KeepassXC by now so maybe I can't :)
But a lot of stuff doesn't even offer me the option to use passkeys, so they still use passwords. Can I enter my password to log in still? No, of course not. See, I will helpfully get the option to enter my password, in addition to the option to use email or SMS, the most secure authentication scheme known to Man, but if I actually select password and enter my secure password from my secure password manager, what I get to find out is that the password option is actually password and email or SMS and there's no option to use TOTP. Oh, and you randomly get logged out for no reason sometimes.
Some of the bots will probably disappear. Like, whatever bot is throwing me several terabytes of nonsense traffic every month will probably eventually disappear since they're wasting so much bandwidth on doing literally nothing. I have no idea what the point is, but I know it can't be terribly valuable for them, and it's not terribly expensive for me. I'd love to know who the hell is doing that and why, though.
But since the web is run mostly by crap companies like Google, it will never get its shit together, and we will get solutions like WEI and identity verification to solve problems that were entirely manufactured (or caused by a significant lack thereof) in the first place.
> I don't want to defend them, because they gate away a good chunk of the internet with their "bot protection", but unless you do PoW (which is also ecologically a nightmare), probably fingerprinting is the way to go - completely destroying the privacy of everyone involved.
I hate what the anti-scraper mechanisms have become, but it really is the lesser evil. The alternative for many small operators is to just shut down completely.
> Plus privacy.resistfingerprinting isn't enabled even when selecting "Strict" "Enhanced Privacy Protection" in the settings, great job there Mozilla.
For good reason. I've run that setting for ages but I kept having to disable it and add workarounds because websites would break in weird ways. Timezones in scheduling websites being messed up nearly made me miss a couple of appointments. There's no way to tell the user Firefox isn't broken without displaying a permanent banner like "if websites are broken in any way or you see weird glitches or your computer's time is wrong or fonts look weird or videos don't always work right, click here to disable fingerprinting protection".
Interestingly, Turnstile breaks with resistfingerprinting but works with fingerprintingProtection, I guess the latter takes this crap into account.
> Timezones in scheduling websites being messed up nearly made me miss a couple of appointments.
The reason for spoofing the time zone (to UTC) is that it is one of the many things used to fingerprint users. There is an unintended side effect however: a mismatch with the IP geolocation could out you as a VPN user even if no VPN is actually used.
It's quite incredible how much you can learn about a person just by knowing their user agent, country of origin and list of preferred languages.
When Youtube still supported trends, visiting the trends page from Poland with Safari set to English gave you really interesting results. Mostly intellectually-stimulating content from channels like Veritasium, with a smattering of reviews, trailers, music and focus soundtracks thrown in. Meanwhile, visiting that same page on (Windows) Chrome set to Polish gave you the typical "you won't believe what this man just did!!!" crap.
Even with resistFingerprinting, websites will be able to fingerprint you. There is no full immunity against fingerprinting.
Websites already break often with the strictest protections enabled, adding a "super duper strict protections" mode will just lead to bug reports. Even more-than-bare-basic tracking prevention has HN threads full of comments like "doesn't work on <Firefox fork>" because they don't see the connection between fingerprinting protection, WebRTC/WebGL/WebGPU, and websites not working.
People who are willing to take that bet can enable it in about:config.
> Websites already break often with the strictest protections enabled, adding a "super duper strict protections" mode will just lead to bug reports.
That's what I'm saying. They already break because of other effects of the strict settings, so what is the benefit of leaving resistFingerprinting turned off?
> There is no full immunity against fingerprinting.
There is 0 immunity if you don’t even try.
Strict means do what you can, not do some things strictly, others not so strictly, and ignore others completely.
"If they know you're spoofing, you're not spoofing hard enough."
This stupid "war against bots" is going to lead to the downfall of the Internet and effectively turn it into another walled garden where only "approved" (anti-)user agents are allowed. Don't fall for the nonsense about "AI scrapers" --- it's just a way to manufacture consent.
Idk, if bots are hammering your server then set up rate limits. If you have content that you don't want others to have access to, don't serve it with a webserver.
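For anyone who hasn't set one up, a rate limit can be as small as a token bucket per client. A minimal sketch, not tied to any particular webserver; the class name and the numbers in the usage note are illustrative assumptions:

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True and spend a token if the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice you'd keep one bucket per IP or session in a dict, e.g. TokenBucket(rate=1, capacity=3), and answer with a 429 whenever allow() returns False. The `now` parameter is only there so the behaviour can be checked without sleeping.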
I used to just start giving any IP downloading way too much a redirect to multi-TB NASA images. This was a long time ago, but it was surprising how many would follow redirects and never time out. Wouldn't see a request again for hours and then it's right back to downloading a new part of the sky.
Those images also used to crash all the early GUI irc and chat clients that showed inline images without size checks...
How were you tracking each IP address's data usage? Did you parse the logs every request? Store usage in a database? At the application or webserver level?
Webalizer! I'm not sure there were really any other options at the time other than writing your own. It parsed the Apache logs and gave you pretty detailed results, and you could see the usage (in KB, which tells you how long ago this was!) broken down by date and IP.
Once you added a redirect rule for the IP to apache you'd just check your log and see the IP that was hitting you every couple of minutes poofed for a good few hours.
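If you did have to write your own, the per-IP usage part is not much code. A rough stand-in sketch (not what the commenter ran), assuming Apache common/combined log format; the byte limit is whatever you consider "way too much":

```python
import re
from collections import Counter

# client ident user [timestamp] "request" status bytes ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} (\d+|-)')


def bytes_per_ip(lines):
    """Sum the response sizes per client IP; unparseable lines are skipped."""
    totals = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, size = m.groups()
        if size != "-":  # "-" means no body was sent (e.g. a 304)
            totals[ip] += int(size)
    return totals


def heavy_downloaders(lines, limit_bytes):
    """Return the IPs whose total transfer exceeds limit_bytes, sorted."""
    return sorted(ip for ip, total in bytes_per_ip(lines).items() if total > limit_bytes)
```

The list that comes out of heavy_downloaders(open("access.log"), 10_000_000) is what you would then feed into the redirect rule by hand.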
There is something to be said for "one way indexes."
Imagine you run a company register for a local government. You want to let people look up companies by their registration number (which they must disclose in all communications to you) to see if they're legit and whether any warnings have been raised against them. You don't want unscrupulous marketers to just be able to `SELECT * FROM companies WHERE type='nail_salon' AND city='london'`.
If you aren't super strict about scraping, some shadowy business in Neverland, completely unconcerned with following your laws, will build that database.
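To make the "one way index" idea concrete, here is a minimal sketch in which the only query the service exposes is an exact match on the registration number. The table layout, sample rows and function names are invented for illustration:

```python
import sqlite3


def make_register():
    """Build a tiny in-memory register with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE companies (reg_no TEXT PRIMARY KEY, name TEXT, warnings INTEGER)"
    )
    db.executemany(
        "INSERT INTO companies VALUES (?, ?, ?)",
        [("01234567", "Example Nails Ltd", 0), ("07654321", "Sample Trading Ltd", 2)],
    )
    return db


def lookup(db, reg_no):
    """Exact-match lookup by registration number; the only query exposed.
    No wildcard, prefix, type or city search is offered to callers."""
    row = db.execute(
        "SELECT name, warnings FROM companies WHERE reg_no = ?", (reg_no,)
    ).fetchone()
    return None if row is None else {"name": row[0], "warnings": row[1]}
```

The catch is that this is only one-way as long as nobody can walk the whole number space, which is exactly where the scraping protection comes in.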
> Imagine you run a company register for a local government.
Is this data not public for some reason? I think it will not hurt if there are multiple copies spread between public offices and private companies. What really hurts is a private company hammering your webserver for their own profit. They should get their own copy.
If the purpose of the index is to allow people to lookup registration and warnings, probably just serve the list. This is public information and doesn't need to be gated. CSV header could be:
This. What even is the point of blocking scrapers if Google consumes your content anyway and serves it as an AI answer?
These are sad times we're living as far as openness of the web goes. People would have less of a scraping problem if their websites didn't ship with 20MB of JS.
I'm maintaining a minority browser[0] and as of a couple of weeks this is affecting several of our users[1]. While I'm currently not considering this a browser bug (one could be involved, of course), more eyes are better and any help or ideas on improving or mitigating the situation would be appreciated.
"Your browser appears suspicious because it looks like you are trying to hide your identity"
Another case of the much predicted downfall of freedom due to "people who hide themselves must have something to hide, so they are automatically suspicious"
CF's business model relies heavily on fearmongering, so what can we expect?
They send these emails, you know? "CF saved you XXX Gb of data and protected you from YYY attacks". I have a few high-load web sites which I turned CF on for a while. Knowing my traffic pretty well, I can say these "CF saved you XXX Gb of data and protected you from YYY attacks" claims are absolute bullshit, with numbers greatly exaggerated.
Since we can't catch them in this lie, they can put any number they like to make their "service" attractive.
I can only assume that every time I back out of these sites because I don't want to check the box or just don't want to wait a few seconds that is marketed to the site owner as a GREAT VICTORY as I am clearly an EVIL BOT that they have defended the site from.
Is there a deal between Google and Cloudflare to make non-Chrome browsers harder to use? The pressure to use Chrome keeps increasing, and the amount of ad filtering you can do in Chrome keeps decreasing.
I assume it's business people finding it to be a better "bang for their buck" implementation time-wise or lazy developers who don't use Firefox for their testing phase. I've seen it so many times. At a previous company, I was the only person using Firefox daily and I would catch bugs a few times a year during PRs for things that worked fine in Chrome, but not in Firefox. Oftentimes the suggestion was just to leave it because "who uses Firefox?"
As someone who runs Firefox on both Linux and Android, with Enhanced Tracking Protection enabled, and tries to use web over native mobile apps wherever possible ... I really don't feel this at all?
This is a concerning trend. Turnstile was marketed as a privacy-respecting CAPTCHA alternative, but requiring WebGL fingerprinting undermines that entirely. At this point what's the actual difference between this and the tracking they claimed to replace?
At the time, reCAPTCHA was the alternative and it was effectively working as a giant ad targeting data collection tool. I'm pretty sure Google have now backtracked from this.
WebGL fingerprinting is just one of many things you need to do if you actually want to stop automation. There is no way round it other than requiring ID of some sort.
It is very similar to kernel modules for game anti-cheats. Soon, websites will work on unmodified Windows and Mac computers only, with a signed Cloudflare kernel driver installed. :/ They are completely destroying the web.
You're not quite going far enough. Cloudflare requires that you allow it to attack your browser, as a sort of virtual hazing ritual, before you're allowed into the club. That this hazing makes your browser vulnerable to attacks by others too is a side effect that bothers them not at all.
I've been concerned about Cloudflare Turnstile fingerprinting ever since I started being forced to "prove I was human" on my anonymous X/Twitter accounts anytime I'd say something anti police/government/military.
I would get locked out of the account on all devices after saying these things until I completed their Turnstile. For many accounts I just never used them again.
I could go more into this, but I'm highly suspicious of Cloudflare and of course X/Twitter in this regard. On anonymous Twitter accounts I've been recommended people to follow whom I went to elementary school with, haven't spoken to in years, and have no digital connection to. It's very weird.
WebGL fingerprinting is of course an attack and an unintended use of the WebGL API. Browser vendors should respond to this misuse somehow (reputation-based blacklist?).
I always like the axiom with crime that once X% of the population are violating a statute then it should probably be struck off. Recreational drugs being the obvious example.
If randomized canvas stuff was cracked down upon as a bot thing but now everyone with a copy of Firefox is doing it, maybe Cloudflare should just “legalize” it?
I tested this extension that I've been using for a long time on the turnstile page and it got through, fwiw. I think it's a bit more subtle than how resistfingerprinting works but not sure what the privacy tradeoff is.
JShelter looks cool. I'm not familiar with it, but this makes it seem like it operates more like resistfingerprinting, by blocking outright instead of noise injection, at the expense of more broken sites?
What all security extensions do you run? After running into issues over the years, with extensions doing multiple things that fight each other, I switched to trying to block via ublock origin as much as possible, then prefer other extensions to just do one thing to extend coverage, like this one. Makes it much easier to troubleshoot/exclude/disable when it breaks something vs. fiddling in settings.
Speaking from the scraper’s perspective, I like proof of work; a ten year old 96-core server will cost a couple of quid to run for a few hours and will grab an absurd number of pages thanks to the access granted by repeatedly solving proofs of work. Small slick codebases too!
There's also the Anubis idea where your PoW is persistent until your IP address or session cookie changes, so you get to skip PoW in exchange for making yourself identifiable, which means the PoW can then be ramped up to take a couple of minutes.
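For readers who haven't looked at what the "work" actually is: a hashcash-style sketch where the client searches for a nonce whose SHA-256 digest starts with enough zero bits, and the server checks it with a single hash. This is a generic illustration, not Anubis's actual scheme; the challenge format and function names are assumptions:

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count the zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, nonce, difficulty_bits):
    """Server side: one hash to check the client's answer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge, difficulty_bits):
    """Client side: brute-force nonces; about 2**difficulty_bits hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce
```

Each extra bit of difficulty doubles the expected client work while verification stays at one hash, which is why the cost can be "ramped up" per client. It is also why a big multi-core box, as in the parent comment, chews through these cheaply.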
I don't use Anubis though. I just make sure my site doesn't take five seconds to render a page, so bots can't easily overload it? It's not actually that hard?
I think we're talking about 2 different things. PoW is annoying for basic scrapers but it really doesn't affect enterprise grade bot operations with access to unlimited residential proxies.
Depends on what type of scraping you're trying to stop. For the dumb scrapers that would try to scrape every page on a git forge (for which there are a bazillion pages for a modest project, because of how the site works), yeah it might deter them enough to stop. For anything high value (eg. reddit comments or retail prices), 10s of cpu time isn't going to stop them.
It will not scare away bots but 10 seconds of wait (CPU or only a sleep) will turn away many real users. "This site is so slow, I'll use something else." A kind of reverse captcha.
Sure, the whole premise is exactly that proof of work reduces the value of scraping, while having negligible impact on users. If the data is so valuable that bot operators are willing to pay 10s of cpu, then other measures are necessary.
Nevertheless even for these high value cases, you can still argue that it disincentivizes the business model, it becomes less efficient.
If it's high value, there isn't really much you can do that will be completely effective. Traditional captchas can often be beaten by AI, or by "captcha farms" where impoverished people are paid pennies to complete captchas. Fingerprinting can be beaten by using a full browser to make the requests. Basically anything you do is just a matter of making it more expensive for bots to access it.
Beating fingerprinting and beating traditional captchas is far more expensive than solving PoW. PoW doesn't stop anyone, not even the most novice bot operators.
Behavioral signals are the usual answer: risk-scored, invisible challenges; proof-of-work (cost without identity, though it taxes mobile); and signup-velocity/rate limits that stop cheap abuse before any challenge fires. The reason fingerprinting wins anyway is that it requires less operator effort, not that it is the only thing that works.
With a tuned cool down period this isn't a problem, especially if you frequent the sites. OpenWRT uses Anubis and usually when I need to peruse their site I'm on a very low-end device. I prefer waiting much more over finding Waldos
But in principle I agree that there's no good answer to this. Scraping _is_ useful, and I bet most of us here have scraped something. It is AI companies and their use of human-made material for training, without consent or anything in return, that led us to this. (I know botting has existed on forums for as long as forums have existed, but it is easily handled by human moderators and keyword filters.)
It also requires JavaScript. I like to have JS off by default since running code on my machine is a privilege, one that I opt into, not the site owner’s choice. This is frustrating since these blockers don’t let me know if the site is trustworthy first before needing to solve a Sudoku for Cloudflare or calculating useless hashes for Anubis.
But after you’ve completed the Anubis PoW challenge for a site, it remains valid for some amount of time.
So it’s not quite as horrible as it sounds.
I have setting up Anubis for my own sites on my todo list. And I wish more people did it too. I don’t really mind waiting a little bit extra every now and then before the page loads. What I do mind is ReCaptcha asking me to click all the pictures with buses in them etc. And especially when I have to do it several times over before it’s happy. I’d rather wait a minute for a page to load than to ever solve a ReCaptcha again, if given the choice.
I don't know about you, but if a random webpage takes 60+ seconds to load, I just close it and choose to never interact with that site again (unless it's my bank, which is a real and annoying occurrence).
My guess is it's an implementation error, not a hardware limitation. I have two 10-year-old devices and one passes instantaneously while the other halts for a good half minute every time.
One of the unexpected outcomes of the AI-induced hardware shortage may be that compute won’t be getting cheaper and may in fact get more expensive…
Anubis is designed to stop a certain class of badly behaved bots. It intentionally doesn't run if a bot identifies itself with a UA, such as Googlebot, because then you can rate limit it or block by UA and with other tools.
Anubis is active when a user agent looks like a web browser (e.g. contains the "Mozilla" substring every major browser uses). The reverse proxy serves an interstitial page that does a proof-of-work check, validated server side, setting a cookie if it passes.
This means a legitimate user won't constantly get the proof of work check, because they already passed it. But AI bots rotating through tons of residential IPs to scrape your forum or git forge or whatever will be slowed down.
Overall, I like the idea. It's unobtrusive, privacy preserving, and seems to be working out well for a lot of sites.
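To make the flow described above concrete, here is a minimal hashcash-style sketch. It is not Anubis's actual code: the function names, the `challenge:nonce` encoding, and the difficulty of 12 bits are all assumptions chosen for illustration, and the cookie step is left out.

```python
import hashlib
import secrets


def should_challenge(user_agent: str) -> bool:
    """Only browser-like user agents get the interstitial (assumed rule)."""
    return "Mozilla" in user_agent


def make_challenge() -> str:
    """Server side: issue a random challenge string."""
    return secrets.token_hex(16)


def meets_difficulty(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: cheap check that SHA-256(challenge:nonce) starts with
    at least `difficulty_bits` zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce. This loop is the 'work'."""
    nonce = 0
    while not meets_difficulty(challenge, nonce, difficulty_bits):
        nonce += 1
    return nonce


challenge = make_challenge()
nonce = solve(challenge, 12)  # about 4096 hashes on average
print(meets_difficulty(challenge, nonce, 12))
```

Solving costs roughly 2^difficulty_bits hashes while verifying costs one, which is the asymmetry the scheme relies on. After a successful check, the server would set a cookie so the same visitor is not challenged again for a while.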
It doesn't. It slows them down. To stop bots you need to employ the full suite of tools, fingerprinting, IP rep, behavioural analysis. Anubis will slow down your basic scrapers that try to crawl the entire web but it is useless against actual bots
Bots don't [currently] execute JavaScript or follow complicated redirects.
They don't now, but enough "high value to the bots" pages turning on JS or complicated redirects will simply result in the bot authors adding JS execution or redirect following so they can continue "botting" the sites they want to scrape.
It's a hole with no bottom. Each one-up on the anti-bot side will eventually be handled on the bot side.
Doesn't this mean we just need to make the webgl fingerprint resistance implementation smarter? Instead of explicitly rejecting webgl access or responding with dummy data, respond with data that is random within space of N common and reproducible patterns. E.g. emulate webgl implementation of some low spec but actually popular devices.
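A rough sketch of that idea, assuming the browser keeps a small pool of profiles copied from popular devices and picks one per site in a stable way. The profile names below are placeholders, not real device fingerprints, and `pick_profile` is a hypothetical helper rather than any browser's API.

```python
import hashlib

# Placeholder pool; a real implementation would ship full, internally
# consistent profiles taken from popular low-spec devices.
COMMON_PROFILES = ["profile-a", "profile-b", "profile-c"]


def pick_profile(session_secret: bytes, site: str, profiles=COMMON_PROFILES) -> str:
    """Pick one common profile per (session, site): stable across repeated
    visits within a session, but unlinkable across sites without the secret."""
    digest = hashlib.sha256(session_secret + site.encode()).digest()
    return profiles[int.from_bytes(digest[:8], "big") % len(profiles)]


print(pick_profile(b"per-session-random-bytes", "example.com"))
```

Because the answer is drawn from a set of real, reproducible configurations rather than random noise, repeated reads stay consistent, which is the property that noise-injecting extensions tend to lack.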
The last screenshot in the OP article mentions that "a browser extension... adding random noise to canvas data" can be detected. Which isn't to say this perfectly detects all such randomization, but it's certainly an active part of the arms race.
Yes but the idea is that the protection should be part of the browser itself, then it becomes the expected norm AND isn't really "detectable" since there's no extension to redefine javascript variables. Scraper-friendly solutions like Camoufox or CloakBrowser make such changes to avoid having the same fingerprint every time while still appearing normal.
All of those advanced features should be enabled on a per-website basis but unfortunately even browsers whose marketing focuses on privacy don't allow you to do that. Same with TLS root CA certificates, there is no way to configure that a certain CA can only create certificates for certain domains.
Privacy and Bot defense are opposite ends of the same fulcrum. If you permit privacy, the site/service has to trust users to behave and follow the rules. If you track users, then the users have to trust the site/service owners not to abuse that trust. There isn't really an in between.
So if you want privacy, you have to accept poor and sometimes insecure services.
Bad optics aside, it doesn't actually reflect reality. See my other comment. You can enable basically all the privacy settings and still pass turnstile. Tor browser in a VM passes it, of all things.
It tripped "Canvas Randomization Detected". See the last screenshot.
Cloudflare's demo page still treats that as a pass, but complains about it. As is often the case with Cloudflare, I expect that they'll then take no responsibility for sites that use more aggressive settings.
Hiding things from them automatically bins you with agents that have a reason to hide things from them.
Which, to be clear, is the entire problem: given how much of the internet goes through them, they should have enough alternative signals as to whether you’re a bad actor that are stronger than this specific one.
However, this also presents the problem that there are barely any users in their base with your exact configuration, so getting any actual solutions might just take forever.
...in the age of AI, does anyone have an actual solution for keeping out bots while preserving the privacy of humans?
Obviously this is terrible, but I think there's a possibility it's the least terrible option? Another option is IP reputation, which I think is worse. Or scanning a code with a non-rooted phone, which I think is even worse than that!
The only solution is regulation. If all content created by anyone has a copyright, how does an implicit opt-in (which is what happens if you don't create a robots.txt file for your website) for scraping make any sense? Moreover, even if you have a robots.txt, AI (or whatever) bots often don't respect it (or use workarounds - they outsource scraping of such "restricted" sites to unethical third parties to get the data; Meta has even resorted to piracy, openly!). So clearly, the logic and the "honour system" have failed.
Cloudflare, Google Captcha, HCaptcha etc. are all shitty technical solutions because, as we are all discovering, they come at the cost of our privacy (i.e. these services may monetise our personal data) and / or our computing resources and time. If current copyright laws aren't sufficient to prevent this, we have to acknowledge the system is broken. The answer could be enhancing it with some kind of Digital Millennium Copyright Act (DMCA)-like laws, but in favour of the creators against BigTech or rogue actors.
Or you could let information be free, at least the stuff that’s on the public net.
As for issues like bots overloading websites or using too many resources, scaling laws will take care of it quickly; it’s not like you can’t serve thousands of RPS from a Raspberry Pi these days.
I don't think regulation will stop web scraping, not least because it can be done from locations outside the jurisdiction of the regulations.
> we have to acknowledge the system is broken
The system is broken. It probably takes, what, 10 seconds or less to use a residential or foreign proxy, 6+ months to internationally track and prosecute a single offender? So like a million times more effort going the regulatory route.
Just as criminal laws don't end all crimes, copyright laws and anti-scraping regulation won't end all scraping. But they will greatly reduce it and limit it to rogue actors. Two examples I can cite here are the laws against email spam and the laws against unsolicited marketing calls - they had a definite impact in reducing both (even in India, where I am from, where implementation of laws is often lax).
I basically agree that the idea should be to reduce, not eliminate, bots.
However, a big difference with crimes involving the internet is that they can be launched from anywhere. In the real world, I can't steal from someone unless I'm physically present in the same country as my victim. On the internet, the US could outlaw scraping and Russia would keep doing it.
The reason Cloudflare was created isn't AI scrapers. These are just the latest development... the original reason why Cloudflare got created, and why it experienced such meteoric growth, is DDoS and botnets.
Yes. We need regulation in the AI space. But it will be useless as long as bad actors aren't held accountable - and a lot of the bad actors aren't in our jurisdictions. You've got hacked devices all over the world in giant botnets, controlled by Russian, Chinese, Iranian and North Korean actors. You've got Chinese AI scraper bots, as China is heavily investing in training its own models. You've got Indian, Filipino and Myanmar-based scammers.
And frankly I have no idea how to get all of that under control. As much as I'd like to see sanctions against both domestic and foreign enablers of abuse (which includes residential ISPs) - it's going to be one giant ass whack-a-mole game.
Unfortunately, I think the solution will be invite only services, communities, etc.
Someone needs to invite you to have access to it.
If you host your own blog then that might be okay to have public access, you would want everyone and everything to read it.
But if you're hosting your own photos, we might need tailscale like services to only allow certain people to access that.
Just implement caches, add indices to your DBs, use CDNs. Servers are very fast nowadays, have quite a lot of RAM and can handle huge amount of clients. No need to implement this anti-bot bullshit, it is mainly marketing, providing solutions to a problem which doesn't exist for most websites.
And identifying a bot that is acting on my behalf. "Claude, go search this topic" is basically the same as Googling something and clicking on the results. Human-driven AI searching needs to be in a different box than AI scraping for training data.
Hopefully it stays that way; "a bot acting on my behalf" is still a bot. At least it's often a well-behaved bot and uses a user-agent that can be detected and blocked.
You don't need a non-rooted phone to pass captcha checks, I have a rooted phone and can pass the captchas that ask you to scan a qr code. But I doubt phones without google services would manage.
Remote attestation should still be possible with a rooted phone if phone manufacturers weren't so shit. If the attestation happens at hardware level, it doesn't matter what programs or kernels you're running.
They are not a problem unless you "believe" it is a problem. I estimate around 20-25K hits to my website from bots per day and I have all cloudflare protections disabled. Any decently optimized server should be able to easily handle that. (it's roughly 1 request every 3 seconds).
Yes, and that is just the bot background radiation of the internet. I run a primary-source information site and these botnets are aggressive to a DDoS level, all to do some sort of scraping. They have sophisticated enough tactics to DDoS us if they wanted to. However, I am not sure what their objective is, as they have wasted enough of our resources to have scraped all our content 1000s of times over. That 25k of traffic is a couple of minutes for us. And that adds up. 80-90% of our traffic is this.
What resources are you concerned about? An n100 minipc should be capable of serving something like a blog at 20k+ requests/second (or saturating its network).
Sure. But 25,000 (bot hits / day) x 50kB (static webpage) = 1.25 GB / day / page x 30 (days) = 37.5 GB of data transfer / month / page. And that's assuming a static resource - if any of the resource is dynamically generated by the server, there's a CPU and memory cost too. Overall, even if you treat the impact on the server as "negligible", it's still an unnecessary waste of resource was the point I was trying to convey.
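For anyone who wants to plug in their own numbers, the same arithmetic as a small sketch. The function names are made up for illustration, and decimal units (1 GB = 1,000,000 kB) are assumed.

```python
def monthly_transfer_gb(hits_per_day: int, page_kb: float, days: int = 30) -> float:
    """Data transferred per month in GB (decimal units)."""
    return hits_per_day * page_kb * days / 1_000_000


def seconds_between_requests(hits_per_day: int) -> float:
    """Average gap between requests if hits were spread evenly over a day."""
    return 86_400 / hits_per_day


print(monthly_transfer_gb(25_000, 50))   # 37.5
print(seconds_between_requests(25_000))  # 3.456
```

Real bot traffic tends to arrive in bursts rather than evenly, so the average gap understates peak load.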
Do it like plane tickets do, tie a ticket to an identity + buyback up to a week or so before the concert in case someone wants to cancel (or authorize the transfer and capture only a week before). Ask for ID and ticket at the entrance.
- behavioural fingerprinting
- ja4
- IP rep
- queue mechanism
- card country to IP country checks
- app attestation
- custom metrics based on knowledge of past scalpers
It's hard but it's not impossible. You can make it very inconvenient for scalpers. They need to poll at volume so their behaviour is very much detectable. A hard stance is required on IP rep, especially for more in demand concerts.
It's either that or you tie tickets to government ID like in France. If the arbitrage opportunity is more than the cost of automation then someone will exploit it.
I'd simply check filling speed; even with the browser's autocomplete, humans are slow because they need to click submit.
Then when it's "processing", do them in bulk and prioritize slower users. There's a huge opportunity to do bot checks after checkout without affecting user experience.
Also, on product launches you could add a unique field that requires user input, so that bots can't prepare for launches in advance.
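A minimal sketch of the fill-speed idea, assuming the server records when the form was rendered and when it was submitted. The 3-second threshold and the `fill_seconds` field are assumptions for illustration and would need tuning against real traffic.

```python
MIN_FILL_SECONDS = 3.0  # assumed threshold, tune per form


def is_suspiciously_fast(rendered_at: float, submitted_at: float,
                         min_seconds: float = MIN_FILL_SECONDS) -> bool:
    """Flag submissions that arrive faster than a human could plausibly click."""
    return (submitted_at - rendered_at) < min_seconds


def prioritize_slower(orders: list) -> list:
    """Bulk-process queued orders, slowest fill time first."""
    return sorted(orders, key=lambda o: o["fill_seconds"], reverse=True)


orders = [{"id": 1, "fill_seconds": 0.4}, {"id": 2, "fill_seconds": 12.0}]
print([o["id"] for o in prioritize_slower(orders)])  # [2, 1]
```

A bot can of course add a sleep, so this only raises the cost of polling at volume rather than preventing it.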
Adding noise to a canvas element is a mistake anyway. It means you can't develop a proper paint program using web technologies because your browser will mess with the image.
Brazenly requiring the abuse of a browser feature's intended use against the user. What an age.
I'd like to hear from someone who worked on WebGL and how they feel about their ambitions being utterly subverted. Remember when the dream was playing games in the browser?
They use all kinds of obscure APIs, which you'll learn if you're privacy/security conscious and disable random web APIs that are of no use to YOU as a web user, but only can ever serve the people who serve you stuff or want to hack you or track you.
Normally websites feature test and just skip using obscure disabled APIs, or more likely, websites don't use those APIs at all or only tracking scripts use it, which are already optional usually.
Problem with CF is that if you want increased security they'll prevent you from gaining it everywhere, even on sites they don't protect, or prevent you from accessing services even the ones you paid for. Browsers don't allow disabling APIs per domain, so you're either at risk everywhere or you're blocked from accessing a lot of things for no particular reason.
I'm no CF advocate but those random APIs are literally what differentiates people running Chrome on their computer from a bot operation with a load of containers. Kubernetes clusters don't have GPUs. This is why it's used in bot detection (I use Brave with no hardware acceleration and I get captchas everywhere).
For the malicious bot authors, if WebGL is a "free pass" so that their browser is not detected as a bot, they'll simply switch to a chrome based browser such as CloakBrowser, which already passes CloudFlare Turnstile.
So no real benefit for bot detection here. Just a privacy nightmare for everyone else.
Firefox has so much built-in tracking it seems they want to push me to build my own browser. For example every time you open the settings there are several ways they are sending out pings to certain extensions.
Also by default addons.mozilla.org is a privileged site so of course they include google tracking in it and they get the proper fingerprint no matter what you have configured.
If you are this motivated (I am!), how about joining forces on Konform Browser? Radio silence and remote third-party integrations disabled by default and generally sane and conservative defaults respecting old-fashioned notions like individual consent and data-protection regulations.
Aside from general dev, could use a hand in bringing it to more platforms (mobile and flatpak are frequently asked) and taking a closer look at fingerprinting protections and what's currently tripping up the turnstile.
Your very sarcastic, uninteresting comment getting downvoted is not an indication that forum isn't intellectual. It's an indication that you aren't behaving intellectually.
I wondered about that too. So they allege that bots require that everyone now has to ID to the big service providers. Very dystopian situation. Skynet is currently winning the war.
A better solution would be to make webgl, webgpu and (especially) webrtc have some sort of prompt before they can be in any way used in that fashion, but this will absolutely destroy web ux Windows Vista style.
It's about explicitly deciding to allow certain capabilities on a per-website basis. No major browser allows defense-in-depth via fine-grained website permissions.
Even simply changing the user agent was sabotaged at Firefox, and choosing one user agent per domain is wishful thinking.
Fingerprinting is just an implementation, banning it will just drive these companies to invent new tricks. That's why the GDPR doesn't specify any technical tracking methods, whether you're using cookies or fingerprinting or a camera drone looking at the user's screen, tracking without consent or good reason is banned.
I doubt politicians care much about fingerprinting, though. They're more afraid of actual businesses getting attacked by bots than they are about Linux users with weird setups not being able to access some websites.
>Turns out it's because Cloudflare wants to have a fingerprint of your device via WebGL, the only reason for doing this would be tracking.
> So Cloudflare just banned all WebKitGTK browsers as I guess they put an exception for Safari.
This is false. I ran firefox with:
* hardware acceleration disabled (so software renderer, nothing to fingerprint)
* resistfingerprinting enabled, including letterboxing with default window size
* webgl disabled
* VPN enabled
* In a Windows VM
By all accounts this should be the most suspicious fingerprint ever, but turnstile happily lets me through. If they want to track people, they're doing a pretty bad job. My guess is that OP's browser is getting banned because his WebKitGTK has a weird fingerprint, not because of webgl or whatever.
> Such things are blocked in WebKit, and have been for years. Meaning it's tracking so awful that even Apple would block it, and as far as I can tell it's not the kind of privacy protection you can easily disable in it.
This is also false. Webgl fingerprinting works just fine on Safari. They might try to mitigate it by adding some noise, but that's not so different than what firefox does, and is certainly not "blocked".
I think your comment is also making plenty of assumptions.
Official Firefox can be leaky unless you build it yourself with some build-time changes or use a fork with such[0]. Am I guessing right that you still have Webcompat, RemoteSettings, and Nimbus enabled still? How do you know a compatibility intervention isn't causing your browser to open the kimono just enough to "unbreak the page"?
> My guess is that OP's browser is getting banned because his WebKitGTK has a weird fingerprint, not because of webgl or whatever.
My guess is a different flavor of the same: Not matching an expected fingerprint (simplified: whitelist vs blacklist approach) combined with other factors.
[0]: I'm currently aware of Tor Browser, Konform Browser (am dev), Mullvad Browser, and to a certain extent Waterfox, LibreWolf, and r3df0x doing that.
>Official Firefox can be leaky unless you build it yourself with some build-time changes or use a fork with such[0]. Am I guessing right that you still have Webcompat, RemoteSettings, and Nimbus enabled still? How do you know a compatibility intervention isn't causing your browser to open the kimono just enough to "unbreak the page"?
> My guess is that OP's browser is getting banned because his WebKitGTK has a weird fingerprint, not because of webgl or whatever.
So why is Cloudflare saying the author got blocked because of WebGL?
> > Such things are blocked in WebKit, and have been for years. Meaning it's tracking so awful that even Apple would block it, and as far as I can tell it's not the kind of privacy protection you can easily disable in it.
> This is also false. Webgl fingerprinting works just fine on Safari. They might try to mitigate it by adding some noise, but that's not so different than what firefox does, and is certainly not "blocked".
Yep. Cloudflare and cloudflare's customers don't care about blocking people that use non-standard browsers (or accessible browsers, or feed readers, or whatever). Using cloudflare defaults is basically saying, "Only major corporate browsers released in the last year or two can access this site."
[1]: https://github.com/Danny-Dasilva/CycleTLS
[2]: https://github.com/uazo/cromite/issues/2365
It sure seems to keep me, the casual visitor, far away from just about any site they "protect". I have zero desire to alter my browsing configuration or use extra tools to get around turnstile, I'd rather not even visit the site in the first place.
Until your bank, airline, and tax ministry start using them.
Tools are inherently amoral; only people can have motives we can celebrate or condemn.
turnstile is not a public good, it's a private product, promoted to private entities that want to achieve a certain outcome that is beneficial to them privately.
The mass surveillance is a side-effect - an externality that cloudflare does not have to pay for (but we as netizens pay collectively).
It is the role and responsibility of gov't to regulate away externality (or make those who benefit from it pay a cost somehow, to equalize said externality). Unfortunately, like with climate change, nothing has been forthcoming, and only a few people care about the actual damage enough to even talk about it.
So it will go on, and the masses do not have a say.
Large companies and banks see >95% fraud on sign in / sign up flows. It’s a constant battle and the law of large numbers says even a tiny false negative rate can be catastrophic.
A bogus GCP or AWS or Azure account costs those companies hundreds to thousands of dollars. I don’t know what the average loss is on fraudulent bank signins, but probably on that order. And there are millions, sometimes billions of attempts per day.
I worked at a tech company that used an off-brand, truly awful captcha provider. Think “drag the mammal to the habitat it lives in, avoiding the wiggly lines”. When this awful provider went down (frequently), we fell back to recaptcha. Fraud rates were 100x higher in those minutes-to-hours outages. Though of course real users were also able to get in at higher rates.
Those might ignore it, but there are always alternatives.
Cloudflare only exists in its current form because banks and such already enthusiastically accepted that trade off.
Can you expand? I don't see a problem with some napkin math. A 5W load for 2 seconds is about 0.003Wh (we have to let smartphones pass, and not by doing PoW for 10s of seconds). 8 billion checks a day for a year ≈ 8GWh.
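Spelling the napkin math out: 5 W for 2 s is 10 J, which is 10/3600 ≈ 0.0028 Wh per check. The sketch below just multiplies that through; the function name is made up, and the inputs are the figures from the comment above.

```python
def pow_energy_gwh_per_year(watts: float, seconds: float,
                            checks_per_day: float, days: int = 365) -> float:
    """Yearly energy spent on PoW checks, in GWh."""
    wh_per_check = watts * seconds / 3600  # joules -> watt-hours
    return wh_per_check * checks_per_day * days / 1e9


print(round(pow_energy_gwh_per_year(5, 2, 8e9), 2))  # 8.11
```

So the roughly 8 GWh per year figure holds for those inputs.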
In any case, according to some napkin math done by Kimi 2.6 (which by itself is probably already consuming more than all of my PoW challenges for the upcoming 5 years) - the situation looks incredibly in favor of PoW: https://www.kimi.com/share/19e7ef40-a432-8912-8000-0000b4a71...
Which makes me wonder why CloudFlare isn't switching to this already
If some computation is "useless" but it serves its purpose, it's not useless.
The reason why bitcoin network expends so much energy is down to tokenomics, not the system of PoW itself. At equilibrium we expect the power usage to be (blocks/hr) x (BTC/block) x ($/BTC) x (kWh/$), so it's a function of the BTC price and emission rate.
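As a sketch, that equilibrium expression is just a product of four rates, and the units work out to kWh per hour, i.e. kW. The price and electricity cost below are illustrative assumptions, not current market data.

```python
def equilibrium_power_kw(blocks_per_hr: float, btc_per_block: float,
                         usd_per_btc: float, kwh_per_usd: float) -> float:
    """(blocks/hr) x (BTC/block) x ($/BTC) x (kWh/$) = kWh/hr = kW."""
    return blocks_per_hr * btc_per_block * usd_per_btc * kwh_per_usd


# Assumed inputs: 6 blocks/hr, 3.125 BTC/block, $50,000 per BTC,
# and 10 kWh per dollar (electricity at $0.10/kWh).
print(equilibrium_power_kw(6, 3.125, 50_000, 10))  # 9375000.0
```

With those assumed inputs the result is about 9.4 GW. It scales linearly with the BTC price and the emission rate, and nothing in it depends on the hash function itself.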
PoW in other context has way different driving factors. In this case, the marginal improvement of fetching the site again for AI bots isn't enough to cover the PoW cost. The PoW cost is outweighed by the net bandwidth cost of all the parties.
I think we have to expand the definition of stupid to include things that work but have net negative externalities. Not sure where PoW falls in that way of looking at things, but we should at least consider it.
(Thinking about it, Captcha is PoW, just theoretically work by the human)
I don't think I average even 2 captchas a day being terminally online, so 10 across every soul in the world sounds way too much for me. (we're ignoring bots it's meant to deter?)
> it's hard to see that such a modest energy cost would dissuade any attacks.
It's not against targeted attacks, but scrapping.
And not about energy cost, but available compute power -- it requires scrapper to use browser with JS (or time commitment to reimplement PoW outside of JS), limits their request rate by CPU core count.
You're mixing up checks, fingerprinting, and PoW with a captcha being triggered because those didn't pass. The less abnormal your setup is, the fewer captchas you'll get.
I agree with the rest of what you said.
Also I think you mean "scraper" and not "scrapper".
Only as long as legislation and law enforcement is off the table. Almost like we have those because everyone doing their own policing is not a reasonable way to run a society.
It's either proof of humanity (increasingly hard to get in this day and age, particularly if accessibility is a concern), proof of identity (even worse) or proof of system integrity, which is the least bad out of all the terrible options.
Your doctor seeing you naked does not destroy your privacy; it's your doctor sharing the photos with everybody that does. I.e. the problem here is that intermediaries like Cloudflare don't work for you; they work for somebody else or sell the data themselves.
They also gate away a good many people with their "bot protection". I am extremely worried about how so many seem to have outsourced the control over who can access their websites to a company, with no second thoughts whatsoever.
No, we don't know. I honestly do not understand the problem. I run websites, both static and non-static. Granted, my sites aren't exactly the most popular internet go-to destinations, but I should be seeing this DDoS too, right?
I do see lots of requests. Nothing that any modern system can't handle. Computers are stupid fast these days. Unless you are doing something unreasonable, it's really hard to even notice this "extra load".
I understand there are sites for whom this causes problems, but I think these are rare and could be optimized not to do unreasonable things.
I think too many people are annoyed by AI companies (arguably understandable position), look at their logs and speak of "hammering", "DDoS" and "extra load", while in reality it doesn't matter much.
I’m with OP: I don’t like this but the alternatives all look like the death of the open web.
The person you're responding to already said they ran a modestly sized site. What actual scale opens one up to abuse? If only the top 1% of sites need it, then it seems silly to say "everyone" needs it.
I adopted Cloudflare because it was getting DDoSed by the AI crawlers. I'm pretty sure all of them are vibe coding their crawlers and don't bother adding rate limiting as a requirement.
I've spent some effort on optimizing my sites, but most of the effort was focused on avoiding unreasonable (stupid) work. Do I need a session for every request? No, I don't! Do I need a database fetch for every access to my homepage? No, I don't! Is it a problem to actually load all of my static content in all supported languages (24) into memory and serve it from memory? No, it isn't!
I use Clojure behind nginx on the server for my sites. Oh, and I also pre-compress all static assets to Brotli, so anything that handles brotli gets a static file served directly from nginx. I also use immutable assets with unlimited caching semantics.
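The setup above is Clojure behind nginx with Brotli; purely as an illustration of the same two ideas (pre-compressing once, and content-hashed names with immutable caching), here is a Python sketch. It uses gzip from the standard library as a stand-in for Brotli, and `build_asset_cache` is a hypothetical helper.

```python
import gzip
import hashlib


def build_asset_cache(assets: dict) -> dict:
    """Pre-compress each asset once at startup and key it by a
    content-hashed name, so it can be cached forever by clients."""
    cache = {}
    for name, body in assets.items():
        digest = hashlib.sha256(body).hexdigest()[:16]
        stem, dot, ext = name.rpartition(".")
        hashed_name = f"{stem}.{digest}.{ext}" if dot else f"{name}.{digest}"
        cache[hashed_name] = {
            "body": gzip.compress(body),  # stand-in for Brotli
            "headers": {
                "Content-Encoding": "gzip",
                "Cache-Control": "public, max-age=31536000, immutable",
            },
        }
    return cache


cache = build_asset_cache({"app.css": b"body { margin: 0 }"})
print(list(cache))
```

Since the name changes whenever the content changes, a repeat visitor or crawler that honours caching never needs to refetch an unchanged asset.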
Really — the problem is that we've grown lax and our software has become bloated, slow, and with unreasonable code paths. If every page fetch does 12 database accesses and runs through a slow interpreter, that is surely going to be a problem.
Not saying I'm not annoyed by the scraping; I am looking to block them, but I'm also not going to put the site behind the gatekeeper. If anything, Cloudflare must love AI scraping now for the same reason AV companies love malware.
Now, if you are running a PHP stack...yeah, maybe that's the problem right there.
I wonder if we should stop fighting this and instead create an API specifically for this purpose? Or a central repository that you could send your data to and say to anyone wanting to scrape, "save yourself some time and just get my data from this other place".
People shouldn't have to be experts or provision a larger server to run a UGC service that can withstand the sort of 30x more traffic I'm seeing from AI bots. Or rather, you didn't render the argument for why they should have to do that if they can just use CloudFlare's free tier.
Either way, it's easy to have all the answers when you've never had the problem.
What matters most is usually how much there is to scrape. If you have like 5 pages that's nothing. For forum like websites where each thread, each user profile, etc. gets scraped that's when traffic increases. I just let them have at it with no issues though, computers are fast.
The usage is extremely quick, and follows easy-to-spot patterns. We noticed a spike in bounce rate.
They never come from Google, and the badly programmed ones just crawl several pages at a time, faster than a user could.
Then there's the crazy spikes in visits from specific countries, pretty much scraping the entire content. Often from pools of IPs. In some cases had 30% unexplained (meaning: it wasn't viral or a marketing campaign) random sustained increases in traffic.
There's also the fact they don't interact with the complicated widgets, so zero XHR requests other than analytics pings.
They also don't cause spikes in Google Analytics, so I assume it's blocked, but they show up in logs and in the internal analytics.
It's not enough to DDOS the website at all, but it's a lot of noise in statistics that we gotta learn to filter.
I’ve triggered this kind of “bot protection” right here on Hacker News many times. I did that by having a bunch of Hacker News pages open and then closing and reopening my browser. I’ve also triggered it by opening a bunch of links in the background too quickly. I’ve also triggered it by reading the article, then clicking back and upvoting/favouriting too quickly. I’m also located in Singapore, which people have started to advocate for blocking here recently.
A single non-bot legitimate user can easily trigger these kinds of heuristics just by using the site in a way you don’t expect. This can affect some users disproportionately more than others, e.g. disabled people who need to use assistive technology.
What I mean by "too fast" is opening 50 pages in the span of two or three milliseconds.
Either way, I'm not blocking. The CDN is handling the traffic alright.
A) you'd have to open >200 tabs, and B) if any tab solves the proof-of-work, any that are still waiting to do so reload in the background.
It's not a hostile DOS in the traditional sense (I've mitigated a few of those) - no "pay us to make it stop", no pattern to the requests other than "fetch every unique URL a few times".
It wasn't happening until financial incentives to gather large datasets for AI training appeared.
Bad actors (using residential proxies & claiming to be a real browser) mostly showed up after folk started blocking ones that identified themselves as AI scrapers.
It's obvious to blame AI training because there's a shortage of better explanations. Who else would be paying for these (expensive) residential botnets, only to use them to (eg) web-scrape wikipedia (which offers free downloads of its content in a structured format)?
The simplest explanation of the technical behavior is "a bot coded to follow every link it sees & save the results", and the simplest explanation of the motive to run such a bot is "to train a large language model".
"use Cloudflare to make it stop"
Cloudflare are merely the cheapest of the bunch.
that's just ~17 req/sec
That's "cheap VPS running wordpress" level of traffic
It's phenomenal how inefficient the WordPress/WooCommerce stack is.
Though the main issue I'm seeing is credit card testing, not scraping.
And I'm ideologically opposed to using a CDN (because it shouldn't be needed for such a small site!) so it's somewhat a self-inflicted problem...
It's easier and better to just block 0.0.0.0/1 half of the time, and 128.0.0.0/1 for the other half of the time. Switch every day at noon.
Bot traffic will be cut by 50%, and humans are all treated equally! It's a total win!
It's also not always easy to do. I run a small wiki which is fairly optimised, nearly every page manages at least ~3k rps on a small VPS. The only exception is the diff page which is ~150 rps. Optimising that while still giving good output isn't that easy, but the wiki doesn't have many users so that would be fine if it wasn't for the AI bots.
The AI bots ignore robots.txt and were initially hitting the site with ~1k rps crawling every combination. Even that would be manageable as there's currently ~150,000 combinations, except they kept re-crawling the whole lot each day. The server could manage it but it was a massive waste of resources.
They were using residential IPs and only sending 1 request from each IP, making it impossible to block. In the end I gave up and put a Cloudflare challenge in front of it. I don't want to use Cloudflare, but the alternative is forcing users to log in to view diffs, or removing them entirely.
Similar to the one SQLite had: https://www2.sqlite.org/forum/forumpost/7d3eb059f81ff694?t=h
Each IP only makes ~1 request though so easy to detect after the fact.
I guess they will run out of IPs at some point so maybe if I had logged each one forever and shown a challenge only to them, it would have fixed it eventually. Just depends how big their pool of IPs is.
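The "log each one and challenge only those" idea could be sketched roughly as below. This is a hypothetical after-the-fact pass over already-parsed access-log entries; the `(ip, path)` tuple format, the `/diff` prefix, and the function name `one_shot_ips` are assumptions for illustration, not the wiki's actual setup.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def one_shot_ips(entries: Iterable[Tuple[str, str]],
                 expensive_prefix: str = "/diff") -> Set[str]:
    """Return IPs that made exactly one request, and that request hit an expensive page."""
    entries = list(entries)
    hits = Counter(ip for ip, _ in entries)
    return {
        ip for ip, path in entries
        if hits[ip] == 1 and path.startswith(expensive_prefix)
    }


# Hypothetical log sample: two one-shot diff fetchers and one ordinary visitor.
log = [
    ("203.0.113.5", "/diff/12/13"),
    ("203.0.113.9", "/diff/40/41"),
    ("198.51.100.7", "/"),
    ("198.51.100.7", "/page/Foo"),
    ("198.51.100.7", "/diff/1/2"),
]
challenge_list = one_shot_ips(log)
print(sorted(challenge_list))  # ['203.0.113.5', '203.0.113.9']
```

A real visitor who follows a single link to a diff would land on the same list, so this only makes sense paired with a challenge rather than a hard block.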
In any case, such labyrinths of expensive dynamically generated pages are no excuse for subjecting people requesting the start page to bot checks.
I did try removing some of the links without success. I guess once they have them they just keep checking.
And if you're thinking about blocking all of China, while this particular AI bot didn't use them, a bunch of other ones I've encountered use VPNs and hacked clients worldwide.
There are. They're not. They can't (without significant effort)
I haven't had any problems with Firefox so far. Why do you say this?
So delegalize it. Strip searching everyone to paper over the fact that the societal contract has been broken only delays that.
The bad guys don't pay that much. And sometimes the bad guys actually use the IPs of other people (botnets on residential IPs) and don't pay anything at all.
You can easily calculate which IPs/networks bots are using by looking at where most traffic comes from and who requests a lot of pages at non-human speed.
Never needed it. Just put the worst offenders in penalty bucket and that's usually enough
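A minimal sketch of the "penalty bucket" approach, assuming a simple per-IP sliding window. The class name, the thresholds, and the idea of keeping penalized IPs in an in-memory set are all illustrative assumptions; a real setup would more likely do this in the reverse proxy or firewall.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set


class PenaltyBucket:
    """Flags IPs that exceed max_requests within window_seconds."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 10.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)
        self.penalized: Set[str] = set()

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if the IP is (now) penalized."""
        window = self._recent[ip]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_requests:
            self.penalized.add(ip)
        return ip in self.penalized


bucket = PenaltyBucket(max_requests=5, window_seconds=1.0)
# One client fires 10 requests within 0.1 s; another sends one every 2 s.
fast = [bucket.record("192.0.2.1", t * 0.01) for t in range(10)]
slow = [bucket.record("192.0.2.2", t * 2.0) for t in range(10)]
print(fast[-1], any(slow))  # True False
```

What happens to a penalized IP (throttling, a challenge, a temporary block) is left open here; the sketch only covers the detection side.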
Although the cynical part of me says that this will result in malicious actors trying to trick agents into giving out a bunch of micro payments. There are counter defenses that can help detect and compensate for that, but perhaps the best we will be able to do is prompt user with the default agent recommendation.
I know some actual users get blocked. But the amount of spam we get without it, the amount of bot traffic simply overwhelming the server... It is just too much.
Recently I also hard blocked all IPs from China, Singapore, India, Pakistan, Russia and the whole of Africa. Do I want to do it? No. But the amount of bot traffic and corresponding spam is a bigger problem :(
At least for China, I imagine most of the real humans might use a VPN anyway
One of the core tenets of system design is Availability. If your service is not available - if your forms are blocking legitimate users - then why are you pretending to have a form submission feature at all? Just to frustrate users?
The service won't be available to anybody because of overwhelming unwanted traffic. Now it's available for most potential users. You're speaking econ 101 when everyone else has played out iterated prisoner's dilemmas.
Turns out that people have a tolerance for a non-zero amount of work, but still have a limit.
Suggesting "turn off your website" does not account for the desire to also provide some access.
Treat people who host content as humans, just as we must treat users as humans. There are tradeoffs, suggesting "shut down your website unless you provide access everywhere" is worse on all fronts for everyone.
Maybe, maybe not.
If block-heavy websites shut down entirely, we lose some content, but other content moves to block-minimal sites and the average user might be able to access more.
Also if there's no blocking crutch, and people get pushed into shutdown and are mad about it, they might fight harder for anti-spam technology and legal enforcement, which could improve the situation.
Because those are the only two countries that we've ever in the life of our business, had a legitimate order from.
It prevents the majority of credit card testing, but it is tempting to apply it to the whole site to reduce traffic and server load.
How many people do you think are browsing with a weird enough config (e.g. a custom browser like OP, or some weird config like Firefox with fingerprinting protection on a Raspberry Pi) to trip Cloudflare's protection?
We had all of our devs' Pixels get blocked, and after talking to CF, it turned out that the Internet Archive had rebooted their scraping farm, all the devices stampeded and overwhelmed the known bot safeguards, and those tags were added across the board. CF gives sites the tools to tune what is getting blocked; we bumped the sensitivity down to 25 and haven't had many complaints (despite having a very vocal community).
The most common complaint is users' IP address getting blocked because of compromised devices
And no, it wasn't due to a country-based block selected by site operator.
At least let me complete a "prove you are human" challenge or something, but don't outright ban my IP address?
It takes very little for CF to consider you "weird".
In my experience what really makes it loop every single time though is JShelter. CF doesn't like having your fingerprintable data bits messed with.
There are legitimate uses for non-intrusive, ethical and legal scraping, but some of us have had to resort to extreme measures:
https://roundproxies.com/blog/bypass-bot-detection/
[1] - https://blawg.nochan.net/b/Internet-Crap/20260522-Maybe-AI-B...
Yesterday cloudflare blocked me from visiting the MX-Linux site ... including an old browser with -no- protections ...
I have to wonder - assuming these sites are paying CF for this 'service' - are they getting a list of all the rejected IPs?
As someone responsible for mitigating card testing "attacks", account harvesting, and DDOS attacks..
It is unfortunate, but the ISP industries(from telco up to transit) and CC industries aren't providing a lot of great options. This idea that people are doing things "without a second thought" is usually false when it comes to businesses.
1. If X% of the population gets wrongly branded with the scarlet letter B[ot], how do they appeal and get it fixed?
2. How will sites notice and know if their choice of "bot protection" is losing them X% of users/customers/job-seekers etc.? If it's a really robust system, they'll never even see the complaints either...
3. If everyone does detect that something is awry, will it be such a monopoly that there's no choice but to let it happen?
I think the Web is on its last legs, anyway. Generative AI and LLM-instead-of-search has destroyed what little value remained.
I'm glad to have known the true internet before its demise. Truly one of the wonders of humanity.
Firefox with a non-default profile can be created like that:
And you can launch it like that:
So, given that /usr/bin/firefox is just a shell script, you can
If you use an icon to run firefox (say, /usr/share/applications/firefox.desktop), you'll need to copy/adjust the line for the icon.
Of course, "./firefox" from the examples above should be replaced with the actual path to the executable. For a default installation of Firefox, the path would be in the /usr/bin/firefox script.
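As a rough illustration of the CLI route, here is a small sketch that builds the two command lines, assuming Firefox's standard `-CreateProfile` and `-P` command-line options. The helper names, the profile name `sensitive`, and the profile directory are hypothetical, and these are not necessarily the exact commands the comment had in mind.

```python
import subprocess
from typing import List


def create_profile_cmd(firefox: str, name: str, profile_dir: str) -> List[str]:
    # -CreateProfile takes the profile name and directory as ONE argument: "name path".
    return [firefox, "-CreateProfile", f"{name} {profile_dir}"]


def launch_profile_cmd(firefox: str, name: str) -> List[str]:
    # -P selects an existing profile by name.
    return [firefox, "-P", name]


def run(cmd: List[str]) -> None:
    """Execute a command line built above (not called in this sketch)."""
    subprocess.run(cmd, check=True)


# Hypothetical values; replace ./firefox with the actual path to the executable.
create = create_profile_cmd("./firefox", "sensitive", "/home/user/.ff-sensitive")
launch = launch_profile_cmd("./firefox", "sensitive")
print(create)
print(launch)
```

The same `-P sensitive` argument is what would go on the adjusted launcher line of a copied .desktop file.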
So, you can have a separate profiles for something sensitive/invasive (linkedin, cloudflare, shops, banks, etc.) and then you can have a separate profile for everything else.
And each profile can have its own set of extensions.
Edit: I RTFA'd, containers can't adjust `privacy.resistfingerprinting`. Boo
I'd argue, that for some, CLI path is actually cleaner.
You see, the way described above creates entirely separate points of entry, and you don't have to go to the central menu to launch specific profile.
It eliminates one step (Profile Manager, about:profiles or whatever) allowing you to get faster to the desired profile - same way you'd launch a default profile.
It's logical separation too. It's like separate browsers from UX standpoint (they do use the same distribution though ...unless they aren't - you can configure different distributions for different profiles - nothing stops you from that).
I'm just leaving the information about the GUI option for others who may not be aware that it can be done from the GUI too, and think it's difficult to do in Firefox.
(That said, I still keep separate machines. One for doing "official" things, the other for everything else)
I think this was as recent as 25 years ago?
Recently they added some new UI. There was and still is (I think) classic Profile Manager UI, which you can launch with
or access UI in about:profiles. But you don't have to use any of those anyway - see my comment above (a response to parent).
does it? same binary, same machine, same display, same 781 other heuristics.
Well I mean maybe it wasn't useless 2 years ago, but in the age of AI it definitely is.
Bot protection with fingerprinting is just an illusion. Any client-side signal like this can be spoofed by an above-average person. Fingerprinting is just a way to consolidate the market for the advertising business. Assigning reputation to residential IP addresses and commercial blocks is another approach to achieve the desired result. Providers would be a lot more careful about allowing their IP addresses to be misused; however, it turns out that this would bring down the DDoS business on both sides, attackers and protectors.
Ironically, more often than not it's the same companies that invest in building their own bots and in finding ways to stop bots from other companies.
At the upper bound, fraud can always be committed by paying real people with real accounts to perform the desired action in a way that is 100% truly indistinguishable from organic. There's fundamentally no actual prevention technique at the limit.
So the entire game is only "increasing the costs until it's not viable ROI", not "holistically prevent", which is why fingerprinting is a relevant technique here.
As per Cloudflare's own report, about 78% of DDoS attacks are at the network layer, where the fingerprinting technique is not useful.
DDOS is done against targets for certain reasons, most businesses are not even viable targets for everyone.
However letting everyone being fingerprinted on the pretext of solving the DDOS is where the privacy gets compromised (not much of it is left though). Some search engines did it indirectly by letting people use tag managers for free in their website and then utilize the data for their advertising business.
Either way, the end game is the same; it's just how these companies are approaching it.
The WebGL fingerprinting thing is cute, too. I guess it'll buy them some time since off-the-shelf solutions are going to probably not handle this well yet. That said, as long as the reward for bypassing turnstile and other anti-bot protections remains high, these things really can't do much. A decently resourced adversary can probably come up with a dozen different approaches to make this less useful. Without really looking into it much, my kneejerk is you could probably tweak Mesa to have deterministically random behavior for whatever edge cases it looks for, but you could also just have lots of different GPU/driver combos to proxy to. The web gets less open, but in an asymmetrical way. If you really have an incentive to keep botting, you'll surely find a way.
The next step is to fully give up and just essentially implement WEI. And then the bot problem disappears?
Nope. Botting will still hold tremendous value, so likely there will be many crafty workarounds and bypasses over time. And there will be countermeasures for those and workarounds for that. Guess we'll start to find out who actually has the resources and incentives to keep botting in this environment.
So what's the real solution? Well the most obvious thing to do would be to make botting less valuable. Can we? I dunno. It may have been a mistake to move so many important things to the Internet after all. I mean, some of this is just threat actors catching up with what's possible and was inevitable to begin with. But, some of it is just trying to find solutions to problems that were unnecessary to begin with. Or failing to implement solutions despite an obvious need to do so.
There are a lot of threads to pull on, here. Account takeover still holds tremendous value to threat actors. Why? In my opinion, it's because passkeys were a tremendous failure, no matter what adoption shows. If we wanted to just improve security for users, I think we didn't need to restructure the internet around another authentication mechanism that of course, provides attestation capabilities, we could've just improved on passwords. For more secure handling of passwords, PAKEs exist. Password managers exist. For anti-phishing, TOTPs exist. What if you could have the exact same passkey experience, but in such a way that everything can gracefully fall back to just passwords and TOTP, because they're the real keymatter at the end of it? Add a web standard that lets browsers and browser extensions hook into the login process, standardize PAKEs as part of the web. Cross-vendor synchronization? A problem easily solved if we ever wanted to.
Instead of that, we got the dumbest possible world. Passkeys are sometimes available, but often not. Can you sync your passkeys across devices? Probably, maybe they have blacklisted KeepassXC by now so maybe I can't :)
But a lot of stuff doesn't even offer me the option to use passkeys, so they still use passwords. Can I enter my password to log in still? No, of course not. See, I will helpfully get the option to enter my password, in addition to the option to use email or SMS, the most secure authentication scheme known to Man, but if I actually select password and enter my secure password from my secure password manager, what I get to find out is that the password option is actually password and email or SMS and there's no option to use TOTP. Oh, and you randomly get logged out for no reason sometimes.
Some of the bots will probably disappear. Like, whatever bot is throwing me several terabytes of nonsense traffic every month will probably eventually disappear since they're wasting so much bandwidth on doing literally nothing. I have no idea what the point is, but I know it can't be terribly valuable for them, and it's not terribly expensive for me. I'd love to know who the hell is doing that and why, though.
But since the web is run mostly by crap companies like Google, it will never get its shit together, and we will get solutions like WEI and identity verification to solve problems that were entirely manufactured (or caused by a significant lack thereof) in the first place.
By virtue of incompetent and ignorant devs and middle managers. Or by virtue of greed and maliciousness.
Yeah yeah never attribute to malice what can be explained by stupidity... This time no. It's both.
I hate what the anti-scraper mechanisms have become, but it really is the lesser evil. The alternative for many small operators is to just completely shut down.
For good reason. I've run that setting for ages but I kept having to disable it and add workarounds because websites would break in weird ways. Timezones in scheduling websites being messed up nearly made me miss a couple of appointments. There's no way to tell the user Firefox isn't broken without displaying a permanent banner like "if websites are broken in any way or you see weird glitches or your computer's time is wrong or fonts look weird or videos don't always work right, click here to disable fingerprinting protection".
Interestingly, Turnstile breaks with resistfingerprinting but works with fingerprintingProtection, I guess the latter takes this crap into account.
The reason for spoofing the time zone (to UTC) is that it is one of the many things used to fingerprint users. There is an unintended side effect however: a mismatch with the IP geolocation could out you as a VPN user even if no VPN is actually used.
When Youtube still supported trends, visiting the trends page from Poland with Safari set to English gave you really interesting results. Mostly intellectually-stimulating content from channels like Veritasium, with a smattering of reviews, trailers, music and focus soundtracks thrown in. Meanwhile, visiting that same page on (Windows) Chrome set to Polish gave you the typical "you won't believe what this man just did!!!" crap.
I somewhat expect sites to break with strict settings; I don’t expect a still wide-open tracking path.
That’s deceiving.
Websites already break often with the strictest protections enabled, adding a "super duper strict protections" mode will just lead to bug reports. Even more-than-bare-basic tracking prevention has HN threads full of comments like "doesn't work on <Firefox fork>" because they don't see the connection between fingerprinting protection, WebRTC/WebGL/WebGPU, and websites not working.
People who are willing to take that bet can enable it in about:config.
That’s what I’m saying. They already break because of other effects of the strict settings, so what is the benefit of leaving resistFingerprinting turned off?
> There is no full immunity against fingerprinting.
There is 0 immunity if you don’t even try.
Strict means do what you can, not do some things strictly, others not so strictly, and ignore others completely.
Don’t call it strict if it isn’t strict
> Stronger protections that block more trackers, but may cause some sites to break.
That seems very reasonable to me. Anyone who wants more than that can turn on resistFingerprinting and live with the consequences.
People already expect sites to break, so why hold back?
This stupid "war against bots" is going to lead to the downfall of the Internet and effectively turn it into another walled garden where only "approved" (anti-)user agents are allowed. Don't fall for the nonsense about "AI scrapers" --- it's just a way to manufacture consent.
Those images also used to crash all the early GUI irc and chat clients that showed inline images without size checks...
Once you added a redirect rule for the IP to apache you'd just check your log and see the IP that was hitting you every couple of minutes poofed for a good few hours.
That's nuts. I suppose you had Webalizer on a minutely cron job. It might have been drawing more resources than Apache itself!
I can't, because every request comes from a new IP!!!
Imagine you run a company register for a local government. You want to let people look up companies by their registration number (which they must disclose in all communications to you) to see if they're legit and whether any warnings have been raised against them. You don't want unscrupulous marketers to just be able to `SELECT * FROM companies WHERE type='nail_salon' AND city='london'`.
If you aren't super strict about scraping, some shadowy business in Neverland, completely unconcerned with following your laws, will build that database.
Is this data not public for some reason? I think it will not hurt if there are multiple copies spread between public offices and private companies. What really hurts is a private company hammering your webserver for their own profit. They should get their own copy.
Reg_no, status, no_warnings_last_12m
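A minimal sketch of what such a lookup-only interface might look like, assuming a SQLite table. The schema, the sample rows, and the `lookup` function are all hypothetical; the point is that the only supported query is an exact match on the registration number, and only the three fields above come back.

```python
import sqlite3
from typing import Optional, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory register with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE companies ("
        "reg_no TEXT PRIMARY KEY, name TEXT, type TEXT, city TEXT, "
        "status TEXT, warnings_last_12m INTEGER)"
    )
    conn.executemany(
        "INSERT INTO companies VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("A100", "Example Nails Ltd", "nail_salon", "london", "active", 0),
            ("B200", "Example Roofing Ltd", "roofer", "leeds", "dissolved", 2),
        ],
    )
    return conn


def lookup(conn: sqlite3.Connection, reg_no: str) -> Optional[Tuple[str, str, bool]]:
    """Exact-match lookup returning (reg_no, status, no_warnings_last_12m) or None."""
    row = conn.execute(
        "SELECT reg_no, status, warnings_last_12m FROM companies WHERE reg_no = ?",
        (reg_no,),  # parameterized, so callers cannot widen the query
    ).fetchone()
    if row is None:
        return None
    return (row[0], row[1], row[2] == 0)


conn = build_demo_db()
print(lookup(conn, "A100"))     # ('A100', 'active', True)
print(lookup(conn, "unknown"))  # None
```

This does nothing against someone enumerating registration numbers, which is where the rate limiting discussed elsewhere in the thread would still be needed.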
These are sad times we're living as far as openness of the web goes. People would have less of a scraping problem if their websites didn't ship with 20MB of JS.
Google bot is generally fairly well behaved, but this is not the case for all scrapers and it can cause significant traffic (and expense).
[0]: https://konform-browser.codeberg.page/
[1]: Most? All? Without any telemetry, relying on user reports and our own testing here.
Another case of the much predicted downfall of freedom due to "people who hide themselves must have something to hide, so they are automatically suspicious"
They send these emails, you know? "CF saved you XXX Gb of data and protected you from YYY attacks". I have a few high-load web sites which I turned CF on for a while. Knowing my traffic pretty well, I can say these "CF saved you XXX Gb of data and protected you from YYY attacks" numbers are absolute bullshit, greatly exaggerated.
Since we can't catch them on this lie, they can put any number they like to make their "service" attractive.
WebGL finger printing is just one of many things you need to do if you actually want to stop automation. There is no way round it other than requiring ID of some sort.
You were never entitled to it in the first place
I'll make sure to fail all cloudflare turnshit in the future.
That pref is there for the Tor Browser.
Also enabled by default for Konform Browser and Mullvad Browser, which borrow many of the privacy- and security-related patches from Tor Browser.
Internet Archive passed?
I would get locked out of the account on all devices after saying these things until I completed their Turnstile. For many accounts I just never used them again.
I could go more into this, but I'm highly suspicious of Cloudflare and of course X/Twitter in this regard. On anonymous Twitter accounts I've been recommended people to follow whom I went to elementary school with, haven't spoken to in years, and have no digital connection to. It's very weird.
If randomized canvas stuff was cracked down upon as a bot thing but now everyone with a copy of Firefox is doing it, maybe Cloudflare should just “legalize” it?
https://github.com/kkapsner/CanvasBlocker
https://jshelter.org/fpd/
What all security extensions do you run? After running into issues over the years, with extensions doing multiple things that fight each other, I switched to trying to block via ublock origin as much as possible, then prefer other extensions to just do one thing to extend coverage, like this one. Makes it much easier to troubleshoot/exclude/disable when it breaks something vs. fiddling in settings.
I don't use Anubis though. I just make my site not take five seconds to render a page so bots can overload it easily? It's not actually that hard?
ideally one would pick something a bit more forgiving than a linear function, to avoid penalizing too much users connecting from CGNAT
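To make the "more forgiving than linear" point concrete, here is a small sketch assuming the quantity being scaled is a per-IP penalty (for example, proof-of-work difficulty) as a function of recent request count. The logarithmic curve is one arbitrary choice among many, and both function names are hypothetical.

```python
import math


def linear_penalty(requests: int, per_request: float = 1.0) -> float:
    """Penalty grows in direct proportion to the request count."""
    return per_request * requests


def log_penalty(requests: int, scale: float = 1.0) -> float:
    """Sub-linear penalty: grows quickly at first, then flattens out."""
    return scale * math.log2(1 + requests)


# A CGNAT address shared by many households may legitimately show
# hundreds of requests; compare how the two curves treat it.
for n in (1, 10, 100, 1000):
    print(n, linear_penalty(n), round(log_penalty(n), 2))
```

With these defaults both curves agree at one request, but at 1000 requests the logarithmic penalty is roughly 10 rather than 1000, so a busy shared address is slowed down without being effectively locked out.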
Nevertheless even for these high value cases, you can still argue that it disincentivizes the business model, it becomes less efficient.
But in principle I agree that there's no good answer to this. Scraping _is_ useful, and I bet most of us here have scraped something; it is AI companies and their use of human-made material for training, without consent or anything in return, that led us here. (I know botting has existed on forums for as long as forums have been a thing, but it was easily handled by human moderators and keyword filters.)
So it’s not quite as horrible as it sounds.
I have setting up Anubis for my own sites on my todo list. And I wish more people did it too. I don’t really mind waiting a little bit extra every now and then before the page loads. What I do mind is ReCaptcha asking me to click all the pictures with buses in them etc. And especially when I have to do it several times over before it’s happy. I’d rather wait a minute for a page to load than to ever solve a ReCaptcha again, if given the choice.
I don't know about you, but if a random webpage takes 60+ seconds to load, I just close it and choose to never interact with that site again (unless it's my bank, which is a real and annoying occurrence).
Some sort of decentralized trust web seems like another option, though less viable.
Anubis is active when a user agent looks like a web browser (e.g. contains the "Mozilla" substring every major browser uses). The reverse proxy serves an interstitial page that does a proof-of-work check, validated server side, setting a cookie if it passes.
This means a legitimate user won't constantly get the proof of work check, because they already passed it. But AI bots rotating through tons of residential IPs to scrape your forum or git forge or whatever will be slowed down.
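The general flow described above (challenge, client-side work, server-side validation, cookie) can be sketched as follows. This is a generic hash-based proof of work, not Anubis's actual implementation; the SHA-256 leading-zero-bits rule, the HMAC-signed cookie, and every name here are assumptions for illustration.

```python
import hashlib
import hmac
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def _digest(challenge: str, nonce: int) -> bytes:
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce until the hash has enough leading zero bits."""
    nonce = 0
    while leading_zero_bits(_digest(challenge, nonce)) < difficulty_bits:
        nonce += 1
    return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    return leading_zero_bits(_digest(challenge, nonce)) >= difficulty_bits


def issue_cookie(secret: bytes, challenge: str) -> str:
    """Sign the solved challenge so later requests can skip the check."""
    return hmac.new(secret, challenge.encode(), hashlib.sha256).hexdigest()


def cookie_valid(secret: bytes, challenge: str, cookie: str) -> bool:
    return hmac.compare_digest(issue_cookie(secret, challenge), cookie)


server_secret = secrets.token_bytes(32)
challenge = secrets.token_hex(8)
nonce = solve(challenge, difficulty_bits=12)  # about 4096 hashes on average
assert verify(challenge, nonce, 12)
cookie = issue_cookie(server_secret, challenge)
print(cookie_valid(server_secret, challenge, cookie))  # True
```

The asymmetry is the point: the visitor pays for thousands of hashes once per cookie, the server pays for one, and a scraper rotating through fresh IPs without keeping cookies pays on every request.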
Overall, I like the idea. It's unobtrusive, privacy preserving, and seems to be working out well for a lot of sites.
And there are just not enough sites using Anubis for the people and companies running the bots to care to do that.
If you do care bypassing Anubis is trivial.
They don't now, but enough "high value to the bots" pages turning on JS or complicated redirects will simply result in the bot authors adding JS execution or redirect following so they can continue "botting" the sites they want to scrape.
It's a hole with no bottom. Each one-up on the anti-bot side will eventually be handled on the bot side.
Nearly all of our sites are visited by extremely tech-literate folks, the exact type that may not be using Google Chrome or Firefox.
I use Cloudflare protection on all my websites, but only the account creation page uses Turnstile.
https://abrahamjuliot.github.io/creepjs/
So if you want privacy, you have to accept poor and sometimes insecure services.
Yeah, this needs to be burned to the ground.
https://litter.catbox.moe/gaizpk692bhhs6b7.png
Cloudflare's demo page still treats that as a pass, but complains about it. As is often the case with Cloudflare, I expect that they'll then take no responsibility for sites that use more aggressive settings.
Which, to be clear, is the entire problem: given how much of the internet goes through them, they should have enough alternative signals as to whether or not you’re a bad actor that are stronger than this specific one.
However, this also presents the problem that there’s barely any users in their base with your exact configuration, so getting any actual solutions might just take forever.
I keep getting the turnstile and having to click the "I a human" button.
Obviously this is terrible, but I think there's a possibility it's the least terrible option? Another option is IP reputation, which I think is worse. Or scanning a code with a non-rooted phone, which I think is even worse than that!
There isn't one, and pretending otherwise is nonsense because humans will always provide their credentials to something to act on their behalf.
In the limit you end up with Chinese phone farms.
Cloudflare, Google Captcha, HCaptcha etc. are all shitty technical solutions because, as we are all discovering, it comes at the cost of our privacy (i.e. our personal data may monetise these services) and / or our computing resource and time. If current copyright laws aren't sufficient to prevent this, we have to acknowledge the system is broken. The answer could be enhancing it with some kind of Digital Millennium Copyright Act (DMCA) -like laws, but in favour of the creators against BigTech or rogue actors.
- Web-scraping and copyright law - https://www.neudata.co/blog/web-scraping-and-copyright-law
- Why DMCA Claims Against Web Scrapers Face Long Odds - https://capstonedc.com/insights/why-dmca-claims-against-web-...
As for issues like bots overloading websites or using too many resources scaling laws will take care of it quickly, it’s not like you can’t serve thousands of RPS from a Raspberry Pi these days.
> we have to acknowledge the system is broken
The system is broken. It probably takes, what, 10 seconds or less to use a residential or foreign proxy, 6+ months to internationally track and prosecute a single offender? So like a million times more effort going the regulatory route.
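A quick back-of-the-envelope check of the "million times" figure, taking the comment's rough numbers at face value and assuming 30-day months. It compares elapsed time only, which is a loose stand-in for effort.

```python
SECONDS_PER_DAY = 24 * 60 * 60

proxy_setup_seconds = 10                          # "10 seconds or less"
prosecution_seconds = 6 * 30 * SECONDS_PER_DAY    # "6+ months", as six 30-day months

ratio = prosecution_seconds / proxy_setup_seconds
print(f"{ratio:,.0f}")  # 1,555,200
```

So six months against ten seconds is a factor of about 1.5 million, which is consistent with the order of magnitude claimed.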
However, a big difference with crimes involving the internet is that they can be launched from anywhere. In the real world, I can't steal from someone unless I'm physically present in the same country as my victim. On the internet, the US could outlaw scraping and Russia would keep doing it.
The reason Cloudflare was invented isn't AI scrapers. These are just the latest development... the original reason why Cloudflare got created and why it experienced such meteoric growth is DDoS and botnets.
Yes. We need regulation in the AI space. But it will be useless as long as bad actors aren't held accountable - and a lot of the bad actors aren't in our jurisdictions. You got hacked devices all over the world in giant botnets, controlled by Russian, Chinese, Iranian and North Korean actors. You got Chinese AI scraper bots as China is heavily investing into training their own models. You got Indian, Filipino and Myanmar-based scammers.
And frankly I have no idea how to get all of that under control. As much as I'd like to see sanctions against both domestic and foreign enablers of abuse (which includes residential ISPs) - it's going to be one giant ass whack-a-mole game.
Which sounds extremely difficult to differentiate
Here's a more real-world projection of the cost and server impact - The Bandwidth Cost of AI Crawlers: What Scraping Really Costs Publishers - https://aipaypercrawl.com/articles/ai-crawler-bandwidth-cost
You can forget about it. It is not possible. Simple as that.
It's hard but it's not impossible. You can make it very inconvenient for scalpers. They need to poll at volume, so their behaviour is very much detectable. A hard stance is required on IP reputation, especially for more in-demand concerts.
Then when it's "processing", handle the orders in bulk and prioritize slower users. There's a huge opportunity to do bot checks after checkout without affecting user experience.
Also, on product launches you could add a unique field that requires user input; that way bots can't prepare for launches in advance.
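To make the "polling at volume is detectable" point concrete, here is a minimal sliding-window sketch. The class name `PollRateDetector`, the client identifier, and the thresholds are all hypothetical and chosen for illustration; a real ticketing backend would combine this with IP reputation and other signals.

```typescript
// Hypothetical helper: flags clients whose request count inside a sliding
// time window exceeds a threshold. Names and limits are illustrative only.
class PollRateDetector {
  private readonly hits = new Map<string, number[]>();
  private readonly windowMs: number;
  private readonly maxRequests: number;

  constructor(windowMs: number, maxRequests: number) {
    this.windowMs = windowMs;
    this.maxRequests = maxRequests;
  }

  // Records one request; returns true when the client has made more than
  // maxRequests requests within the last windowMs milliseconds.
  record(clientId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only timestamps that are still inside the window.
    const recent = (this.hits.get(clientId) ?? []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.hits.set(clientId, recent);
    return recent.length > this.maxRequests;
  }
}
```

With something like `new PollRateDetector(1000, 3)`, a client is flagged on its fourth request within the same second, while a human refreshing every few seconds never trips it.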
https://browser-compat.turnstile.workers.dev/
I'd like to hear from someone who worked on WebGL and how they feel about their ambitions being utterly subverted. Remember when the dream was playing games in the browser?
Normally websites feature test and just skip using obscure disabled APIs, or more likely, websites don't use those APIs at all or only tracking scripts use them, which are usually optional already.
Problem with CF is that if you want increased security they'll prevent you from gaining it everywhere, even on sites they don't protect, or prevent you from accessing services even the ones you paid for. Browsers don't allow disabling APIs per domain, so you're either at risk everywhere or you're blocked from accessing a lot of things for no particular reason.
CF can't be bothered to feature test.
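For anyone unfamiliar with the term, feature testing just means probing for an API and falling back when it is missing, instead of failing outright. A minimal sketch for WebGL follows; `CanvasLike`, `supportsWebGL` and `chooseRenderer` are hypothetical names made up for this example, and in a real page you would pass an actual canvas element rather than a stand-in object.

```typescript
// Minimal structural type so the sketch also runs outside a browser.
// In a page, a real <canvas> element would be passed instead (assumption).
interface CanvasLike {
  getContext(contextId: string): unknown;
}

// Returns true only when a WebGL context can actually be obtained.
function supportsWebGL(canvas: CanvasLike): boolean {
  try {
    // getContext yields null when the context type is unavailable.
    return canvas.getContext("webgl") != null;
  } catch {
    // Treat a throwing getContext the same as "unavailable".
    return false;
  }
}

// Degrade gracefully instead of blocking the user.
function chooseRenderer(canvas: CanvasLike): "webgl" | "fallback" {
  return supportsWebGL(canvas) ? "webgl" : "fallback";
}
```

The point is the second function: a site that feature tests picks the fallback path when WebGL is disabled, rather than refusing to load.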
So no real benefit for bot detection here. Just a privacy nightmare for everyone else.
Also by default addons.mozilla.org is a privileged site so of course they include google tracking in it and they get the proper fingerprint no matter what you have configured.
Aside from general dev, could use a hand in bringing it to more platforms (mobile and flatpak are frequently asked) and taking a closer look at fingerprinting protections and what's currently tripping up the turnstile.
https://codeberg.org/konform-browser/source
this can mean WebContent process is crashing
I'm not good at creating petitions but can happily sign it. Also with stop killing games and anti-chat control.
I can imagine this getting traction if it's explained in a YouTube video to "normal" people.
And then legislation required those consent boxes back, so everyone built their own, instead of demanding that the default should be changed back.
Even simply changing the user agent was sabotaged at Firefox, and choosing one user agent per domain is wishful thinking.
I doubt politicians care much about fingerprinting, though. They're more afraid of actual businesses getting attacked by bots than they are about Linux users with weird setups not being able to access some websites.
b. Accept Only Necessary Fingerprinting
>Turns out it's because Cloudflare wants to have a fingerprint of your device via WebGL, the only reason for doing this would be tracking.
> So Cloudflare just banned all WebKitGTK browsers as I guess they put an exception for Safari.
This is false. I ran firefox with:
* hardware acceleration disabled (so software renderer, nothing to fingerprint)
* resistfingerprinting enabled, including letterboxing with default window size
* webgl disabled
* VPN enabled
* In a Windows VM
By all accounts this should be the most suspicious fingerprint ever, but turnstile happily lets me through. If they want to track people, they're doing a pretty bad job. My guess is that OP's browser is getting banned because his WebKitGTK has a weird fingerprint, not because of webgl or whatever.
> Such things are blocked in WebKit, and have been for years. Meaning it's tracking so awful that even Apple would block it, and as far as I can tell it's not the kind of privacy protection you can easily disable in it.
This is also false. Webgl fingerprinting works just fine on Safari. They might try to mitigate it by adding some noise, but that's not so different than what firefox does, and is certainly not "blocked".
Official Firefox can be leaky unless you build it yourself with some build-time changes or use a fork with such[0]. Am I guessing right that you still have Webcompat, RemoteSettings, and Nimbus enabled? How do you know a compatibility intervention isn't causing your browser to open the kimono just enough to "unbreak the page"?
> My guess is that OP's browser is getting banned because his WebKitGTK has a weird fingerprint, not because of webgl or whatever.
My guess is a different flavor of the same: Not matching an expected fingerprint (simplified: whitelist vs blacklist approach) combined with other factors.
[0]: I'm currently aware of Tor Browser, Konform Browser (am dev), Mullvad Browser, and to a certain extent Waterfox, LibreWolf, and r3df0x doing that.
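The whitelist vs blacklist distinction above can be sketched in a few lines. This is purely illustrative and makes no claim about how Turnstile actually works; the fingerprint strings and function names are hypothetical.

```typescript
type Verdict = "allow" | "challenge";

// Blocklist approach: everything passes unless it is known to be bad.
function blocklistVerdict(fp: string, knownBad: ReadonlySet<string>): Verdict {
  return knownBad.has(fp) ? "challenge" : "allow";
}

// Allowlist approach: everything is challenged unless it is known to be good.
function allowlistVerdict(fp: string, knownGood: ReadonlySet<string>): Verdict {
  return knownGood.has(fp) ? "allow" : "challenge";
}

// Hypothetical fingerprint labels, for illustration only.
const knownGood = new Set(["mainstream-browser-a", "mainstream-browser-b"]);
const knownBad = new Set(["headless-scraper"]);

// An unusual but harmless browser is in neither set.
const rareBrowser = "niche-webkitgtk-build";
const underBlocklist = blocklistVerdict(rareBrowser, knownBad);  // "allow"
const underAllowlist = allowlistVerdict(rareBrowser, knownGood); // "challenge"
```

Under a blocklist the rare browser gets through because nobody has listed it as bad; under an allowlist it gets challenged simply for not being a recognised fingerprint, which is the failure mode being guessed at here.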
See my other comment, tor browser works fine too: https://news.ycombinator.com/item?id=48346659
fingerprintingProtection works fine on the other hand, but then again that's intentionally less intrusive.
So why is Cloudflare saying the author got blocked because of WebGL?
> > Such things are blocked in WebKit, and have been for years. Meaning it's tracking so awful that even Apple would block it, and as far as I can tell it's not the kind of privacy protection you can easily disable in it.
> This is also false. Webgl fingerprinting works just fine on Safari. They might try to mitigate it by adding some noise, but that's not so different than what firefox does, and is certainly not "blocked".
While I don't have an iDevice to try, the assumption that they are special cased is fair... because they are: https://blog.cloudflare.com/eliminating-captchas-on-iphones-...
(Yes, this is basically WEI in a shinier package.)
No idea. I can't even reproduce the error OP got with webgl disabled.
https://litter.catbox.moe/y42l22k97tgv96nx.png