How much of my observability data is waste?

(usetero.com)

120 points | by binarylogic 24 days ago

21 comments

karianna 24 days ago
Hard agree on the data waste, noise to signal ratio is typically very high and processing, shipping and storing all of that data costs a ton.
Previous start-up I worked on (jClarity, exited to Microsoft) mitigated much of this by having a model of only collecting the tiny amount of data that really mattered for a performance bottleneck investigation in a ring buffer and only processing / shipping and storing that data if a bottleneck trigger occurred (+ occasional baselines).
It allowed our product at the time (Illuminate) to run at massive scale without costing our customers an arm and a leg or impacting their existing infrastructure. We charged on the value of the product reducing MTTR and not on how much data was being chucked around.
There was the constant counterargument in favor of always-on observability or “collect all data JIC”, but with a good model (in our case something called the Java Performance Diagnostic Method) we never missed having the noise.
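A minimal sketch of the ring-buffer idea described above, in Python. This is not jClarity's implementation: the `TriggeredBuffer` class, the latency-threshold trigger, and the `ship` callback are hypothetical stand-ins for whatever the real bottleneck trigger and shipping path were.

```python
from collections import deque
from typing import Callable, Deque, List


class TriggeredBuffer:
    """Keep only the last `capacity` samples; ship them when a trigger fires."""

    def __init__(self, capacity: int, threshold_ms: float,
                 ship: Callable[[List[float]], None]) -> None:
        # A deque with maxlen silently drops the oldest sample once it is full
        self._buf: Deque[float] = deque(maxlen=capacity)
        self._threshold_ms = threshold_ms
        self._ship = ship

    def record(self, latency_ms: float) -> None:
        self._buf.append(latency_ms)
        if latency_ms > self._threshold_ms:
            # Trigger: ship the recent context (including the offending
            # sample), then start collecting from scratch
            self._ship(list(self._buf))
            self._buf.clear()


shipped: List[List[float]] = []
buf = TriggeredBuffer(capacity=3, threshold_ms=100.0, ship=shipped.append)
for sample in [10.0, 12.0, 11.0, 9.0, 250.0, 10.0]:
    buf.record(sample)
print(shipped)
```

Nothing leaves the process during normal operation; only the window around the slow sample is shipped.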
- akshayshah 23 days ago
  In broad strokes, I think this is similar to Bitdrift (https://bitdrift.io) - though they’re focused on mobile observability.
  - moderation 22 days ago
    And looks similar to Grepr [0].
    0. https://www.grepr.ai/
- stephenlf 24 days ago
  That’s awesome
jldugger 24 days ago
>Turns out you can compile tens of thousands of patterns and still match at line rate.
Well, yeah, that's sort of the magic of the regular expression <-> NFA equivalence theorem. Any regex can be converted to a state machine. And since you can combine regexes (and NFAs!) procedurally, this is not a surprising result.
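To illustrate the "combine regexes procedurally" point, here is a small sketch using Python's standard `re` module. The pattern names and patterns are made up. Note that `re` is a backtracking engine, not an automaton-based one like Hyperscan/Vectorscan, so this shows only the combination step, not the line-rate performance.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns, keyed by a label for the kind of log line they match
PATTERNS: Dict[str, str] = {
    "health_check": r"GET /healthz? ",
    "debug": r"\bDEBUG\b",
    "heartbeat": r"heartbeat (?:ok|sent)",
}

# One alternation of named groups: the caller issues a single search per line
# instead of looping over every pattern
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS.items())
)


def classify(line: str) -> Optional[str]:
    """Return the label of the pattern that matched, or None."""
    m = COMBINED.search(line)
    return m.lastgroup if m else None


print(classify("GET /healthz HTTP/1.1 200"))
print(classify("ERROR payment failed"))
```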
> I ran it against the first service: ~40% waste. Another: ~60%. Another: ~30%. On average, ~40% waste.
I'm surprised it's only 40%. Observability seems to be treated like fire suppression systems: all important in a crisis, but looks like waste during normal operations.
> The AI can't find the signal because there's too much garbage in the way.
There are surprisingly simple techniques to filter out much of the garbage: compare logs from known good to known bad, and look for the stuff that's strongly associated with bad. The precise techniques seem Bayesian in nature, as the more evidence (logs) you get, the more strongly associated it will appear.
More sophisticated techniques will do dimensional analysis -- are these failed requests associated with a specific pod, availability zone, locale, software version, query string, or customer? etc. But you'd have to do so much pre-analysis, prompting and tool calls that the LLMs that comprise today's AI won't provide any actual value.
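A rough sketch of the known-good vs known-bad comparison, assuming logs have already been reduced to templates. The scoring choice (a smoothed log ratio of relative frequencies) and the function name `rank_by_association` are assumptions for illustration; the comment above does not name a specific technique.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_association(good: Iterable[str], bad: Iterable[str],
                        smoothing: float = 1.0) -> List[Tuple[str, float]]:
    """Rank log templates by how strongly they are associated with `bad`.

    The score is log(p_bad / p_good) with additive smoothing: positive means
    over-represented in the bad set, negative means over-represented in good.
    """
    good_counts, bad_counts = Counter(good), Counter(bad)
    templates = set(good_counts) | set(bad_counts)
    good_total = sum(good_counts.values()) + smoothing * len(templates)
    bad_total = sum(bad_counts.values()) + smoothing * len(templates)

    scores = []
    for t in templates:
        p_bad = (bad_counts[t] + smoothing) / bad_total
        p_good = (good_counts[t] + smoothing) / good_total
        scores.append((t, math.log(p_bad / p_good)))
    # Most bad-associated first; ties broken by template text for determinism
    return sorted(scores, key=lambda s: (-s[1], s[0]))


good_logs = ["health ok"] * 50 + ["request served"] * 50
bad_logs = ["health ok"] * 50 + ["request served"] * 10 + ["db timeout"] * 40
ranked = rank_by_association(good_logs, bad_logs)
for template, score in ranked:
    print(f"{score:+.2f}  {template}")
```

High-volume lines that appear only on the bad side float to the top, while lines that are equally common in both sets score near zero.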
- binarylogic 24 days ago
  Yeah, it's funny, I never went down the regex rabbit hole until this, but I was blown away by Hyperscan/Vectorscan. It truly changes the game. Traditional wisdom tells you regex is slow.
  > I'm surprised it's only 40%.
  Oh, it's worse. I'm being conservative in the post. That number represents "pure" waste without sampling. You can see how we classify it: https://docs.usetero.com/data-quality/logs/malformed-data. If you get comfortable with sampling the right way (entire transactions, not individual logs), that number gets a lot bigger. The beauty of categories is you can incrementally root out waste in a way you're comfortable with.
  > compare logs from known good to known bad
  I think you're describing anomaly detection. Diffing normal vs abnormal states to surface what's different. That's useful for incident investigation, but it's a different problem than waste identification. Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever? A health check log isn't anomalous, it's just not worth keeping.
  You're right that the dimensional analysis and pre-processing is where the real work is. That's exactly what Tero does. It compresses logs into semantic events, understands patterns, and maps meaning before any evaluation happens.
  - zahlman 24 days ago
    > Traditional wisdom tells you regex is slow.
    Because it's uncomfortably easy to create catastrophic backtracking.
    But just logical-ORing many patterns together isn't one of the ways to do that, at least as far as I'm aware.
  - jldugger 24 days ago
    > I think you're describing anomaly detection.
    Well it's in the same neighborhood. Anomaly detection tends to favor finding unique things that only happened once. I'm interested in the highest volume stuff that only happens on the abnormal state side. But I'm not sure this has a good name.
    > Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever?
    I get your point but: if sorting by the most strongly associated yields root causes (or at least, maximally interesting logs), then sorting in the opposite direction should yield the toxic waste we want to eliminate?
  - pstuart 24 days ago
    Vectorscan is impressive. It makes a huge difference if you're looping through an eval of dozens (or more) regexps. I have a pending PR to fix it so it'll run as a wasm engine -- this is a good reminder to take that to completion.
  - nextaccountic 24 days ago
    But if you don't do anomaly detection, how can you possibly know which data is useful for anomaly detection? And thus, which data is valuable to keep?
smithclay 24 days ago
Kudos to Ben for speaking to one of the elephants in the room in observability: data waste and the impact it has on your bill.
All major vendors have a nice dashboard and sometimes alerts to understand usage (broken down by signal type or tags) ... but there's clearly a need for more advanced analysis which Tero seems to be going after.
Speaking of the elephant in the room in observability: why does storing data on a vendor cost so much in the first place? With most new observability startups choosing to store data in columnar formats on cheap object storage, I think this is also getting challenged in 2026. The combination of cheap storage with meaningful data could breathe some new life into the space.
Excited to see what Tero builds.
- binarylogic 24 days ago
  Thank you! And you're right, it shouldn't cost that much. Financials are public for many of these vendors: 80%+ margins. The cost to value ratio has gotten way out of whack.
  But even if storage were free, there's still a signal problem. Junk has a cost beyond the bill: infrastructure works harder, pipelines work harder, network egress adds up. And then there's noise. Engineers are inundated with it, which makes it harder to debug, understand their systems, and iterate on production. And if engineers struggle with noise and data quality, so does AI.
  It's all related. Cheap storage is part of the solution, but understanding has to come first.
- nishantmodak 24 days ago
  The problem has never been the storage. It's running those queries so they return in milliseconds, whether it's for a dashboard, an alert, or your new AI agent trying to make sense of it.
gmuslera 24 days ago
Reminded me of a note I heard about backups. You don't want backups: they are a waste of time, bandwidth and disk space, and by far most if not all of them will end up being discarded without ever being used. What you really want is something to restore from if anything breaks. That is the cost that should matter to you. What if you don't have anything meaningful to restore from?
With observability, what matters is not the volume of data, time and bandwidth spent on it, but being able to understand your system and properly diagnose and solve problems when they happen. Can you do that with less? For the next problem that you don't know about yet? If you can't, because of missing information or information you didn't collect, then maybe spending so much was still not enough.
Of course, some ways of doing it are more efficient (towards the end result) than others. But having the needed information available, even if it is never used, is the real goal here.
- binarylogic 24 days ago
  I agree with the framing. The goal isn't less data for its own sake. The goal is understanding your systems and being able to debug when things break.
  But here's the thing: most teams aren't drowning in data because they're being thorough. They're drowning because no one knows what's valuable and what's not. Health checks firing every second aren't helping anyone debug anything. Debug logs left in production aren't insurance, they're noise.
  The question isn't "can you do with less?" It's "do you even know what you have?" Most teams don't. They keep everything just in case, not because they made a deliberate choice, but because they can't answer the question.
  Once you can answer it, you can make real tradeoffs. Keep the stuff that matters for debugging. Cut the stuff that doesn't.
  - bluGill 24 days ago
    The problem is that until I hit a specific bug, I don't know what logs might be useful. For every bug I've had to fix, 99% of the logs were useless, but I've had to fix many bugs over the years and each one needed a different set of logs. Sometimes I know in the code "this can't happen but I'll log an error just in case" - when I see those in a bug report they are often a clue, but I often need a lot of info logs that happen normally all the time to figure out how my system got into that state.
    "disk getting full" isn't useful unless you understand how/why it got full, and that requires logging things that might or might not matter to the problem.
  - gmuslera 24 days ago
    There is a lot of crap that is and always will be useless when debugging a problem. But there is also a lot that you don't know whether you will need, at least not yet, not when you are defining what information you collect, and it may become essential when something in particular (usually unexpected) breaks. And then you won't have the past data you didn't collect.
    You can take a discovery path: can the data you collect explain how and why the system is running the way it is now? Are there things that are just not relevant, whether things are normal or not? Understanding the system, and all the moving parts, is a good guide for tuning what you collect, what you should not, and what the missing pieces are. And keep cycling on that, because your understanding and your system will keep changing.
stackskipton 24 days ago
Ops (DevOps/Sysadmin/SREish) person here. Excellent article.
However, as always, the problem is more political than technical, and those are the hardest problems to solve. Another service with more cost IMO won't solve it. However, there is plenty of money to be made in attempting to solve it so go get that bag. :)
At the end of the day, it's back to the DevOps mentality, and it's never caught on at most companies. Devs don't care, the Project Manager wants us to stop blocking feature velocity, and we are not properly staffed since we are a "massive wasteful cost center".
- binarylogic 24 days ago
  100% accurate. It is very much political. I'd also add that the problem is perpetuated by a disconnection between engineers who produce the data and those who are responsible for paying for it. This is somewhat intentional and exploited by vendors.
  Tero doesn't just tell you how much is waste. It breaks down exactly what's wrong, attributes it to each service, and makes it possible for teams to finally own their data quality (and cost).
  One thing I'm hoping catches on: now that we can put a number on waste, it can become an SLO, just like any other metric teams are responsible for. Data quality becomes something that heals itself.
  - stackskipton 24 days ago
    I'd be shocked if you can accurately identify waste since you are not intimately familiar with the product.
    Sure, I've kicked over what I thought was waste but was told it's not, or "It is, but deal with it, Ops."
    - binarylogic 24 days ago
      You're right, it's not always binary. That's why we broke it down into categories:
      https://docs.usetero.com/data-quality/logs/malformed-data
      You'd be shocked how much obviously-safe waste (redundant attributes, health checks, debug logs left in production) accounts for before you even get to the nuanced stuff.
      But think about this: if you had a service that was too expensive and you wanted to optimize the data, who would you ask? Probably the engineer who wrote the code, added the instrumentation, or whoever understands the service best. There's reasoning going on in their mind: failure scenarios, critical observability points, where the service sits in the dependency graph, what actually helps debug a 3am incident.
      That reasoning can be captured. That's what I'm most excited about with Tero. Waste is just the most fundamental way to prove it. Each time someone tells us what's waste or not, the understanding gets stronger. Over time, Tero uses that same understanding to help engineers root cause, understand their systems, and more.
      - nextaccountic 24 days ago
        I would like to just have a storage engine that can be very aggressive at deduplicating stuff. If some data is redundant, why am I storing it twice?
        - HumanOstrich 24 days ago
          That's already pretty common, but the goal isn't storing less data for its own sake.
          - nextaccountic 24 days ago
            > the goal isn't storing less data for its own sake.
            Isn't it? I was under the impression that the problem is the cost of storing all this stuff.
            - HumanOstrich 23 days ago
              Nope, you can't just look at cost of storage and try to minimize it. There are a lot of other things that matter.
              - nextaccountic 23 days ago
                What I am asking is, what are the concerns other than literally the cost? I have interest in this area and I am seeing everyone saying that observability companies are overcharging their customers.
                - HumanOstrich 23 days ago
                  We're currently discussing the cost of _storage_, and you can bet the providers already are deduplicating it. You just don't get those savings - they get increased margins.
                  I'm not going to quote the article or other threads here to you about why reducing storage just for the sake of cost isn't the answer.
                  - nextaccountic 23 days ago
                    Well, that's a weirdly confrontational reply. But thanks
- xmprt 24 days ago
  The first step to solving this is correct cost attribution. Once you do that, it's easy to go to org leads and tell them that their logs are costing them $X and you can save them 40% by applying these suggestions. They'll be happy to accept your help at that point. But if the costs are all on the Ops team, then why would the product teams care about any cost optimizations, which just take development time away from them?
peterldowns 24 days ago
Ben, you probably don't remember me but you hired me ages ago to help out with the Python client for Timber. Just want to say thanks for that opportunity — it's been amazing to watch you guys succeed.
Also, I've ended up being responsible for infra and observability at a few startups now, and you are completely correct about the amount of waste and unnecessary cost. Looking forward to trying out Tero.
- binarylogic 24 days ago
  Hey Peter, I absolutely remember you! Thanks for the nice comment.
  And yes, data waste in this space is absurdly bad. I don't think people realize how bad it actually is. I estimate ~40% of the data (being conservative) is waste. But now we know - and knowing is half the battle :)
physicles 23 days ago
I can't get over how expensive these observability platforms are.
Last I looked (and looked again just now), if we were to take all our structured logs from all services and send them to Datadog with our current retention policy, it would just about double our current IT spend.
Instead, we use Grafana + Loki + ClickHouse and it's been mostly maintenance-free for years. Costs under $100/month.
What am I missing? What's the real value that folks are getting out of these platforms?
hinkley 24 days ago
One of the problems described here seems to be that the people building the dashboards aren’t the ones adding the instrumentation. Admittedly I’ve only worked on one project that was all in on telemetry instead of using log analysis. And even that one had one foot in Splunk and one in Grafana, but I worked there long enough to see that we mostly only had telemetry for charts at least someone on call used regularly. I got most of them out of Splunk but that wasn’t that hard. We hadn’t bought enough horsepower from them that it didn’t jam up if too many people got involved in diagnosing production issues.
Occasionally I convinced them that certain charts were wrong and moved them to other stats to answer the same question, and some of those could go away.
I also wrote a little tool to extract all the stats from our group’s dashboard so we could compare used to generated, and I cut, I’d say, about a third? Which is in line with his anecdote. I then gave it to Ops and announced it at my skip level’s staff meeting so other people could do the same.
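A sketch of what a "used vs generated" comparison could look like. The original tool is not shown in this thread, so everything here is an assumption: the dashboard shape (`panels` -> `targets` -> `expr`, loosely modeled on Grafana-style exports), the metric names, and the function names.

```python
import re
from typing import Dict, Iterable, Set

# Assumption: metric names are dotted/underscored identifiers
METRIC_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def metrics_used(dashboard: Dict, known: Set[str]) -> Set[str]:
    """Collect every known metric name referenced by any panel query."""
    used: Set[str] = set()
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            tokens = METRIC_NAME.findall(target.get("expr", ""))
            # Function names like sum/rate are tokens too; keep only real metrics
            used.update(t for t in tokens if t in known)
    return used


def unused_metrics(emitted: Iterable[str], dashboards: Iterable[Dict]) -> Set[str]:
    """Metrics that are emitted but never referenced by any dashboard."""
    emitted_set = set(emitted)
    used: Set[str] = set()
    for d in dashboards:
        used |= metrics_used(d, emitted_set)
    return emitted_set - used


emitted = {"api.latency", "api.errors", "cache.evictions"}
dashboard = {
    "panels": [
        {"targets": [{"expr": "sum(rate(api.errors[5m]))"},
                     {"expr": "avg(api.latency)"}]},
    ]
}
print(unused_metrics(emitted, [dashboard]))
```

The result is a list of candidates, not a verdict: a metric referenced only by an alert rule would also show up as unused here.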
- binarylogic 24 days ago
  What you're describing is very real and it works to a degree. I've seen this same manual maintenance play out over and over for 10 years: cleaning dashboards, chasing engineers to align on schemas, running cost exercises. It never gets better, only worse.
  It's nuts to me that after a decade of "innovation," observability still feels like a tax on engineers. Still a huge distraction. Still requires all this tedious maintenance. And I genuinely think it's rooted in vendor misalignment. The whole industry is incentivized to create more, not give you signal with less.
  The post focuses on waste, but the other side of the coin is quality. Removing waste is part of that, but so is aligning on schemas, adhering to standards, catching mistakes before they ship. When data quality is high and stays high automatically, everything you're describing goes away.
  That's the real goal.
- srean 24 days ago
  This.
  I also think that a lot of the waste can be done away with by using application specific codecs. Yes, even gzip compresses logs and metrics by a lot, but one can go further with specialized codecs to home in on the redundancy much quicker (than what a generic lossless compressor eventually would).
  However to build these one can't have a "throw it over the 3rd party wall" mode of development.
  One way to do this for stable services would be to build hi-fidelity (mathematical/statistical) models for the logs and metrics, then serialize what is non-redundant. This applies particularly well for numeric data where gzip does not do as well. What we need is the analogue of jpeg for the log type.
  At my workplace there has been political buy-in for the idea that if a log / metric stream has not been used in 2-3 years, then throw it away and stop collecting. This rubs me the wrong way because so many times I have wished there was some historic data for my data-science project. You never know what data you might need in the future. You, however, do know that you do not need redundant data.
dabinat 24 days ago
Observability vendors massively overcharge. I got tired of paying an ever-increasing amount of money per month, so my solution now is a self-hosted SigNoz instance on a cheap Hetzner box. It costs me $30/month and I can throw large quantities of data at it and it doesn’t break a sweat.
matanyall 24 days ago
It's so funny, I've never done a cost-benefit analysis of having "good monitoring" and then still not being able to figure out what broke and needing to pull in someone who doesn't need the monitoring at all because they built the thing.
- pixl97 24 days ago
  It's probably something along the lines of "Monitoring solves the problems you expect to have".
  For example, you don't even question it when you see latency going up on some service: you can see DB load going up, and you either manually start another instance or script it.
  Monitoring all this stuff allows you to call the DBA/app team/etc 20 minutes sooner when you see some component screw off and you have no idea why. Hopefully that person on the app team puts in a new means of showing what the problem was if it ever happens again, and then it turns into the first type of problem that you never think about again (or hope was actually fixed in the application).
binarylogic 24 days ago
I spent a decade in observability. Built Vector, spent three years at Datadog. This is what I think is broken with observability and why.
- otterley 24 days ago
  And how are you solving the problem? The article does not say.
  > I'm answering the question your observability vendor won't
  There was no question answered here at all. It's basically a teaser designed to attract attention and stir debate. Respectfully, it's marketing, not problem solving. At least, not yet.
  - quadrature 24 days ago
    There's more information here: https://docs.usetero.com/introduction/how-tero-works (the link in the article is broken).
    They determine what events/fields are not used and then add filters to your observability provider so you don't pay to ingest them.
    - otterley 24 days ago
      What’s the differentiation vs., say, Cribl? Telemetry pipeline providers abound.
  - binarylogic 24 days ago
    The question is answered in the post: ~40% on average, sometimes higher. That's a real number from real customer data.
    But I'm an engineer at heart. I wanted this post to shed light on a real problem I've seen over a decade in this space that is causing a lot of pain; not write a product walkthrough. But the solution is very much real. There's deep, hard engineering going on: building semantic understanding of telemetry, classifying waste into verifiable categories, processing it at the edge. It's not simple, and I hope that comes through in the docs.
    The docs get concrete if you want to peruse: https://docs.usetero.com/introduction/how-tero-works
    - otterley 24 days ago
      I would contend that it is impossible to know a priori what is wasted telemetry and what isn’t, especially over long time horizons. And especially if you treat your logs as the foundational source of truth for answering critical business questions as well as operational ones.
      And besides, the value isn’t knowing that the waste rate is 40% (and your methodology isn’t sufficiently disclosed for anyone to evaluate its accuracy). The value is in knowing what is or will be wasted. It’s reminiscent of that old marketing complaint: “I know that half my advertising budget is wasted; I just don’t know which half.”
      Storage is actually dirt cheap. The real problem, in my view, is not that customers are wasting storage, but that storage is being used inefficiently, that the storage formats aren’t always mechanically sympathetic and cloud-spend-efficient to the ways the data is read and analyzed, and that there’s still this culturally grounded disparate (and artificial) treatment of application and infrastructure logs vs business records.
- yorwba 24 days ago
  I'm curious about the deep details, but the link 404s.
  - binarylogic 24 days ago
    My apologies, I fixed the link. So much for restructuring the docs the night before posting this.
    You can read more here: https://docs.usetero.com/data-quality/overview
    To loosely describe our approach: it's intentionally transparent. We start with obvious categories (health checks, debug logs, redundant attributes) that you can inspect and verify. No black box.
    But underneath, Tero builds a semantic understanding of your data. Each category represents a progression in reasoning, from "this is obviously waste" to "this doesn't help anyone debug anything." You start simple, verify everything, and go deeper at your own pace.
perrygeo 22 days ago
I don't know what it is about observability that brings out the over-engineering in us. I haven't actually measured it but I suspect many startups doing things the "modern" way (i.e. logs, distributed traces, metrics, ci/prod/stage/dev environments etc. going to a dozen different services) generate more metadata than actual data. I mean, we need to observe our systems, but ultimately that data is a second class citizen to what we're observing in the first place, the application.
At the most absurd, I've seen observability systems replicated across 3 data centers for an application that hadn't been built yet. But don't worry, by the time it was released, they'd have an observability system so good it couldn't fail (narrator: "It did, in fact, fail.")
whazor 24 days ago
We store the data because we might need to know it. We only discover we didn’t need to know it once we’ve finished knowing it.
- binarylogic 24 days ago
  Agree to an extent. There are absolutely unknown unknowns. But I think you'd be surprised how much data is obviously waste. Not the grey area, just pure garbage: health checks, debug logs left in production, redundant attributes.
  That's why we break waste down into categories: https://docs.usetero.com/data-quality/categories/overview
  But we don't stop there. You can go deeper with reasoning to root out the more nuanced waste. It's hard, but it's possible. That's where things get interesting.
Veserv 24 days ago
You should not even need a regex; no serious logging system should be emitting formatted strings, JSON, etc. as a storage format. You are immediately incurring on the order of a 5-100x log size and 5-100x log performance overhead with any serialization format that poor. A properly performant logging system should be able to generate on the order of 100 million logs per second per core (assuming relatively small payloads).
At a minimum you should be using message template [1] serialization which is trivial to implement transparently on any logging system/statement with zero code changes to the emitter itself.
Any filtering done on top of that would then just be parsing structured data which is way easier than a regex, though of course that is somewhat beside the point of the article.
[1] https://messagetemplates.org/
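A small sketch of the template idea: store each template once and have events carry only a template id plus arguments. The `TemplateLog` class is hypothetical, it keeps everything in memory, and it uses Python's positional `{}` placeholders rather than the named-hole syntax described at [1].

```python
from typing import Any, Dict, List, Tuple


class TemplateLog:
    """Store each message template once; events carry only an id and arguments."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self.templates: List[str] = []
        self.events: List[Tuple[int, Tuple[Any, ...]]] = []

    def log(self, template: str, *args: Any) -> None:
        tid = self._ids.get(template)
        if tid is None:
            # First time this template is seen: register it once
            tid = len(self.templates)
            self._ids[template] = tid
            self.templates.append(template)
        # No string formatting when the event is recorded
        self.events.append((tid, args))

    def render(self, index: int) -> str:
        """Format an event lazily, only when a human needs to read it."""
        tid, args = self.events[index]
        return self.templates[tid].format(*args)

    def where(self, template: str) -> List[Tuple[Any, ...]]:
        """Filter by template with an integer comparison instead of a regex."""
        tid = self._ids.get(template)
        return [a for t, a in self.events if t == tid]


log = TemplateLog()
log.log("user {} logged in from {}", "alice", "10.0.0.1")
log.log("cache miss for key {}", "k1")
log.log("user {} logged in from {}", "bob", "10.0.0.2")
print(log.render(2))
print(log.where("cache miss for key {}"))
```

Filtering then becomes a lookup on structured fields, which is the point made above about not needing a regex.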
glenjamin 24 days ago
This pitch seems OK for people using simple log aggregation tools or metric tools that have to be wary of tag cardinality.
But how does it compare to an actual modern observability stack built on a columnar datastore like Honeycomb?
tot19 24 days ago
Lurked on HN for years, and finally a post that made me excited enough to create an account.
First of all, thanks for your (and the team’s) work on Vector. It is one of my favorite pieces of software, and I rave about it pretty much daily.
The new endeavor sounds very exciting, and I definitely can relate to the problem. Are there plans to allow Tero to be used in on-premises environments and self-hosted?
Thank you and good luck!
- binarylogic 24 days ago
  Thank you for the nice comment. I'm glad you enjoy Vector. I poured myself into that software for many years. I'm a bit bummed with its current trajectory, though. We hope to bring the next evolution with Tero. There were many problems with Vector that I wished I could have fixed but was unable to. I hope to do those things with Tero (more to come!)
  And yes, Tero is fundamentally a control plane that hooks into your data plane (whatever that is for you: OTel Collector, Datadog Agent, Vector, etc). It can run on-prem, use your own approved AI, completely within your network, and completely private.
  - tot19 24 days ago
    Appreciate the reply! Have you decided on a license yet?
time4tea 22 days ago
A lot of "regular expressions" are just text searches, or can be, and then you can use Aho-Corasick - which is implemented in many languages - and check a few regular expressions for the ones that really are.
Sure, not perfect, but works surprisingly well.
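A compact sketch of that split, assuming a hand-rolled Aho-Corasick automaton for the literal strings and the standard `re` module for the few patterns that really need a regex. The literals and the regex are made-up examples; in practice you would likely reach for an existing Aho-Corasick library instead of this class.

```python
import re
from collections import deque
from typing import Dict, List, Set


class AhoCorasick:
    """Match many literal strings in a single pass over the text."""

    def __init__(self, words: List[str]) -> None:
        # Trie stored as parallel lists indexed by node id; node 0 is the root
        self.goto: List[Dict[str, int]] = [{}]
        self.fail: List[int] = [0]
        self.out: List[Set[str]] = [set()]
        for w in words:
            self._insert(w)
        self._build_failure_links()

    def _insert(self, word: str) -> None:
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].add(word)

    def _build_failure_links(self) -> None:
        # Breadth-first, so a node's failure target is complete before the node
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit words that end at the failure target (proper suffixes)
                self.out[child] |= self.out[self.fail[child]]
                queue.append(child)

    def find(self, text: str) -> Set[str]:
        """Return the set of words that occur anywhere in `text`."""
        found: Set[str] = set()
        node = 0
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            found |= self.out[node]
        return found


LITERALS = ["connection reset", "healthz", "heartbeat"]
REGEXES = [re.compile(r"took \d{4,}ms")]  # the few that really are regexes
matcher = AhoCorasick(LITERALS)


def matches(line: str) -> bool:
    """Cheap literal scan first; fall back to the real regexes."""
    return bool(matcher.find(line)) or any(r.search(line) for r in REGEXES)


print(matches("GET /healthz 200"), matches("query took 12ms"))
```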
mr-karan 24 days ago
Just want to say thanks for creating Vector. We use it heavily at Zerodha and wrote about our setup here: https://zerodha.tech/blog/logging-at-zerodha/
It replaced both Filebeat and Logstash for us with a single binary that actually has sane resource usage (no more JVM nightmares). VRL turned out to be way more powerful than we could imagine - we do all our log parsing, metadata enrichment, and routing to different ClickHouse tables in one place. The agent/aggregator topology with disk buffering is pretty dope.
Genuinely one of my favorite pieces of infra software. Good luck with Tero.
- binarylogic 24 days ago
  Thanks for the comment! Yes, I read that post. Great post. Feel free to reach out if you ever need help with Vector or have questions.
hinkley 24 days ago
> You run observability at your company. But really, you're the cost police. You wake up to a log line in a hot path, a metric tag that exploded cardinality. You chase down the engineer. They didn't do anything wrong, they're just disconnected from what any of this costs.
Somebody didn’t math right when calculating if moving off hostedgraphite and StatsD was going to save us money or boil us alive. We moved from an inordinate number of individual stats with interpolated names to much simpler names but with cardinality and then the cardinality police showed up and kept harping on me to fix it. We were the user and customer facing portion of a SaaS company and I told them to fuck off when we were 1/7 of the overall stats traffic. I’d already reduced the cardinality by 400x and we were months past the transition date and I just wanted to work on anything that wasn’t stats for a while. Like features for the other devs or for our customers.
Very frustrating process. I suspect there’s a Missing Paper out there somewhere on how to compress stat cardinality. I’ve done a bit of work in that area but my efforts are in the 20% range and we need an order of magnitude. My changes were more about reducing the storage for the tags, and they reduced string arithmetic a bit in the process.
kxbnb 15 days ago
The noise problem is real. Most observability tools optimize for "capture everything" which leads to exactly this waste.
We took a different approach with toran.sh - instead of instrumenting your entire stack, you create a read-only proxy for a specific upstream API. You see exactly what's being sent and received, nothing else. No SDK, no log parsing, just the raw request/response for that one integration.
Works well for debugging third-party API issues where you need to see what actually hit the wire, not what your code thought it sent.