13 comments

  • jldugger 30 minutes ago
    >Turns out you can compile tens of thousands of patterns and still match at line rate.

    Well, yeah, that's sort of the magic of the regex <-> NFA equivalence theorem. Any regex can be converted to a state machine. And since you can combine regexes (and NFAs!) procedurally, this is not a surprising result.
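
    To make that concrete, here's a minimal sketch of the "combine patterns, match in one pass" idea in plain Python (re is a backtracking engine; Hyperscan compiles the union into a real automaton, but the principle is the same):

        import re

        # Hypothetical patterns you'd otherwise loop over one at a time.
        patterns = [r"GET /healthz", r"connection reset by peer", r"OOMKilled"]

        # Union them into one alternation with a named group per pattern,
        # so a single pass over each line checks all of them at once.
        combined = re.compile("|".join(
            f"(?P<p{i}>{p})" for i, p in enumerate(patterns)))

        for line in ["GET /healthz 200", "worker OOMKilled after 3s"]:
            m = combined.search(line)
            if m:
                print(patterns[int(m.lastgroup[1:])], "->", line)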

    > I ran it against the first service: ~40% waste. Another: ~60%. Another: ~30%. On average, ~40% waste.

    I'm surprised it's only 40%. Observability seems to be treated like fire suppression systems: all-important in a crisis, but looking like waste during normal operations.

    > The AI can't find the signal because there's too much garbage in the way.

    There are surprisingly simple techniques to filter out much of the garbage: compare logs from known good to known bad, and look for the stuff that's strongly associated with bad. The precise techniques seem Bayesian in nature, as the more evidence (logs) you get, the more strongly associated it will appear.
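
    A rough sketch of that comparison, using simple log-odds on made-up logs rather than a full Bayesian model:

        import math
        from collections import Counter

        def token_scores(good_logs, bad_logs, k=1.0):
            """Log-odds of each token appearing in bad vs good logs."""
            good = Counter(t for line in good_logs for t in line.split())
            bad = Counter(t for line in bad_logs for t in line.split())
            n_good, n_bad = sum(good.values()), sum(bad.values())
            return {tok: math.log(((bad[tok] + k) / (n_bad + k)) /
                                  ((good[tok] + k) / (n_good + k)))
                    for tok in set(good) | set(bad)}

        good = ["request ok user=1", "request ok user=2"]
        bad = ["request timeout upstream=payments",
               "request timeout upstream=payments"]
        top = sorted(token_scores(good, bad).items(), key=lambda kv: -kv[1])
        print(top[:3])  # "timeout" and "upstream=payments" float to the top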

    More sophisticated techniques will do dimensional analysis -- are these failed requests associated with a specific pod, availability zone, locale, software version, query string, or customer? etc. But you'd have to do so much pre-analysis, prompting, and tool calling that the LLMs that comprise today's AI won't provide any actual value.
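
    And a toy version of that dimensional analysis, with hypothetical fields (a real version would also compare each dimension against the success distribution):

        from collections import defaultdict

        events = [  # pre-parsed requests; made-up data
            {"ok": False, "pod": "api-7f", "az": "1a", "version": "1.4.2"},
            {"ok": False, "pod": "api-7f", "az": "1b", "version": "1.4.2"},
            {"ok": True,  "pod": "api-9c", "az": "1a", "version": "1.4.1"},
            {"ok": True,  "pod": "api-9c", "az": "1b", "version": "1.4.1"},
        ]

        # For each dimension, how concentrated are the failures?
        for dim in ("pod", "az", "version"):
            fails = defaultdict(int)
            for e in events:
                if not e["ok"]:
                    fails[e[dim]] += 1
            val, n = max(fails.items(), key=lambda kv: kv[1])
            print(f"{dim}: {n / sum(fails.values()):.0%} of failures on {val!r}")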

    • binarylogic 15 minutes ago
      Yeah, it's funny, I never went down the regex rabbit hole until this, but I was blown away by Hyperscan/Vectorscan. It truly changes the game. Conventional wisdom says regex is slow.

      > I'm surprised it's only 40%.

      Oh, it's worse. I'm being conservative in the post. That number represents "pure" waste without sampling. You can see how we classify it: https://docs.usetero.com/data-quality/logs/malformed-data. If you get comfortable with sampling the right way (entire transactions, not individual logs), that number gets a lot bigger. The beauty of categories is you can incrementally root out waste in a way you're comfortable with.
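
      To illustrate what sampling entire transactions looks like, a minimal sketch (assuming a trace_id to key on; not Tero's actual implementation):

          import hashlib

          def keep_transaction(trace_id: str, rate: float = 0.10) -> bool:
              """Hash the trace ID once so every log in a transaction gets
              the same keep/drop verdict; sampled traces stay complete."""
              h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
              return (h % 10_000) / 10_000 < rate

          logs = [{"trace_id": "abc123", "msg": "request start"},
                  {"trace_id": "abc123", "msg": "db query failed"},
                  {"trace_id": "def456", "msg": "request start"}]
          kept = [l for l in logs if keep_transaction(l["trace_id"])]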

      > compare logs from known good to known bad

      I think you're describing anomaly detection. Diffing normal vs abnormal states to surface what's different. That's useful for incident investigation, but it's a different problem than waste identification. Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever? A health check log isn't anomalous, it's just not worth keeping.

      You're right that the dimensional analysis and pre-processing is where the real work is. That's exactly what Tero does. It compresses logs into semantic events, understands patterns, and maps meaning before any evaluation happens.

  • hinkley 11 minutes ago
    One of the problems described here seems to be that the people building the dashboards aren’t the ones adding the instrumentation. Admittedly I’ve only worked on one project that was all in on telemetry instead of log analysis. Even that one had one foot in Splunk and one in Grafana, but I worked there long enough to see that we mostly only had telemetry for charts that at least someone on call used regularly. I got most of them out of Splunk, but that wasn’t that hard. We hadn’t bought enough horsepower from them, so it jammed up if too many people got involved in diagnosing production issues.

    Occasionally I convinced them that certain charts were wrong and moved them to other stats to answer the same question, and some of those could go away.

    I also wrote a little tool to extract all the stats from our group’s dashboard so we could compare used to generated, and I cut, I’d say, about a third? Which is in line with his anecdote. I then gave the tool to Ops and announced it at my skip level’s staff meeting so other people could do the same.

  • smithclay 35 minutes ago
    Kudos to Ben for speaking to one of the elephants in the room in observability: data waste and the impact it has on your bill.

    All major vendors have a nice dashboard, and sometimes alerts, to understand usage (broken down by signal type or tags) ... but there's clearly a need for the more advanced analysis that Tero seems to be going after.

    Speaking of the elephant in the room in observability: why does storing data with a vendor cost so much in the first place? With most new observability startups choosing to store data in columnar formats on cheap object storage, I think this is also going to get challenged in 2026. The combination of cheap storage and meaningful data could breathe some new life into the space.

    Excited to see what Tero builds.

  • glenjamin 13 minutes ago
    This pitch makes sense to people using simple log aggregation tools, or metrics tools where you have to be wary of tag cardinality.

    But how does it compare to an actual modern observability stack built on a columnar datastore like Honeycomb?

  • whazor 21 minutes ago
    We store the data because we might need to know it. We only discover we didn’t need to know it once we’ve finished knowing it.

  • peterldowns 1 hour ago
    Ben, you probably don't remember me but you hired me ages ago to help out with the Python client for Timber. Just want to say thanks for that opportunity — it's been amazing to watch you guys succeed.

    Also, I've ended up being responsible for infra and observability at a few startups now, and you are completely correct about the amount of waste and unnecessary cost. Looking forward to trying out Tero.

    • binarylogic 1 hour ago
      Hey Peter, I absolutely remember you! Thanks for the nice comment.

      And yes, data waste in this space is absurdly bad. I don't think people realize how bad it actually is. I estimate ~40% of the data (being conservative) is waste. But now we know - and knowing is half the battle :)

  • stackskipton 1 hour ago
    As an Ops (DevOps/Sysadmin/SRE-ish) person here: excellent article.

    However, as always, the problem is more political than technical, and those are the hardest problems to solve - another service with more cost IMO won't solve it. That said, there is plenty of money to be made in attempting to solve it, so go get that bag. :)

    At the end of the day, it's back to the DevOps mentality, and that's never caught on at most companies. Devs don't care, Project Managers want us to stop blocking feature velocity, and we are not properly staffed since we are a "massive wasteful cost center".

    • binarylogic 1 hour ago
      100% accurate. It is very much political. I'd also add that the problem is perpetuated by a disconnect between the engineers who produce the data and those who are responsible for paying for it. This is somewhat intentional and exploited by vendors.

      Tero doesn't just tell you how much is waste. It breaks down exactly what's wrong, attributes it to each service, and makes it possible for teams to finally own their data quality (and cost).

      One thing I'm hoping catches on: now that we can put a number on waste, it can become an SLO, just like any other metric teams are responsible for. Data quality becomes something that heals itself.

      • stackskipton 8 minutes ago
        I'd be shocked if you can accurately identify waste, since you are not intimately familiar with the product.

        Sure, I've kicked over what I thought was waste, only to be told it's not, or "It is, but deal with it, Ops."

  • matanyall 2 hours ago
    It's so funny, I've never done a cost-benefit analysis of having "good monitoring" and then still not being able to figure out what broke and needing to pull in someone who doesn't need the monitoring at all because they built the thing.

  • binarylogic 2 hours ago
    I spent a decade in observability. Built Vector, spent three years at Datadog. This is what I think is broken with observability and why.
    • otterley 1 hour ago
      And how are you solving the problem? The article does not say.

      > I'm answering the question your observability vendor won't

      There was no question answered here at all. It's basically a teaser designed to attract attention and stir debate. Respectfully, it's marketing, not problem solving. At least, not yet.

      • binarylogic 6 minutes ago
        The question is answered in the post: ~40% on average, sometimes higher. That's a real number from real customer data.

        The post doesn't go deep on how because it's more about why this question matters. But happy to get concrete.

        The short version: we connect to your observability platform (read-only), build a semantic understanding of your data, and surface specific rules about what's waste. Each rule maps to a category (health checks, debug logs, redundant attributes, etc.) that you can inspect and verify before acting on.

        The longer version: https://docs.usetero.com/introduction/how-tero-works
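
        Purely to illustrate the shape, a rule might look something like this (a hypothetical structure, not our actual format):

            # Hypothetical: an inspectable waste rule tied to one category.
            rule = {
                "category": "health_checks",
                "match": r"GET /health(z|check)",  # derived from your data
                "evidence": {"matched_lines": 48_210, "pct_of_volume": 0.07},
                "action": "drop",  # applied only after you approve it
            }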

      • quadrature 28 minutes ago
        There's more information here: https://docs.usetero.com/introduction/how-tero-works (the link in the article is broken).

        They determine which events/fields are not used and then add filters to your observability provider so you don't pay to ingest them.
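
        As a toy illustration of that kind of filter (hypothetical patterns and field names, not their actual mechanism):

            import re

            DROP = re.compile(r"GET /health(z|check)")      # events nobody reads
            UNUSED_FIELDS = {"hostname_fqdn", "build_sha"}  # fields nobody queries

            def filter_event(event: dict):
                """Drop waste events and strip unused fields pre-ingest."""
                if DROP.search(event.get("message", "")):
                    return None  # never shipped, never billed
                return {k: v for k, v in event.items() if k not in UNUSED_FIELDS}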

    • yorwba 1 hour ago
      I'm curious about the deep details, but the link 404s.
      • binarylogic 9 minutes ago
        My apologies, I fixed the link. So much for restructuring the docs the night before posting this.

        You can read more here: https://docs.usetero.com/data-quality/overview

        To loosely describe our approach: it's intentionally transparent. We start with obvious categories (health checks, debug logs, redundant attributes) that you can inspect and verify. No black box.

        But underneath, Tero builds a semantic understanding of your data. Each category represents a progression in reasoning, from "this is obviously waste" to "this doesn't help anyone debug anything." You start simple, verify everything, and go deeper at your own pace.

  • hinkley 28 minutes ago
    > You run observability at your company. But really, you're the cost police. You wake up to a log line in a hot path, a metric tag that exploded cardinality. You chase down the engineer. They didn't do anything wrong, they're just disconnected from what any of this costs.

    Somebody didn’t math right when calculating whether moving off hostedgraphite and StatsD was going to save us money or boil us alive. We moved from an inordinate number of individual stats with interpolated names to much simpler names but higher cardinality, and then the cardinality police showed up and kept harping on me to fix it. We were the user- and customer-facing portion of a SaaS company, and I told them to fuck off when we were 1/7 of the overall stats traffic. I’d already reduced the cardinality by 400x, we were months past the transition date, and I just wanted to work on anything that wasn’t stats for a while. Like features for the other devs or for our customers.

    Very frustrating process. I suspect there’s a Missing Paper out there somewhere on how to compress stat cardinality. I’ve done a bit of work in that area, but my efforts are in the 20% range and we need an order of magnitude. My changes were more about reducing the storage for the tags, and reduced string arithmetic a bit in the process.
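
    For what it's worth, a minimal sketch of one common version of the tag-storage trick, dictionary-encoding repeated tag sets so each combination is stored once:

        # Intern each unique tag set; series then carry a small integer ID
        # instead of re-storing (and re-concatenating) the strings each time.
        interned: dict[tuple, int] = {}

        def tagset_id(tags: dict) -> int:
            key = tuple(sorted(tags.items()))
            return interned.setdefault(key, len(interned))

        a = tagset_id({"az": "us-east-1a", "version": "1.4.2"})
        b = tagset_id({"version": "1.4.2", "az": "us-east-1a"})
        assert a == b  # same combination, one stored copy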

  • tot19 1 hour ago
    Lurked on HN for years, and finally a post that made me excited enough to create an account.

    First of all, thanks for your (and the team’s) work on Vector. It is one of my favorite pieces of software, and I rave about it pretty much daily.

    The new endeavor sounds very exciting, and I can definitely relate to the problem. Are there plans to allow Tero to be used in on-premises, self-hosted environments?

    Thank you and good luck!

    • binarylogic 1 hour ago
      Thank you for the nice comment. I'm glad you enjoy Vector. I poured myself into that software for many years, and I'm a bit bummed about its current trajectory, though. We hope to bring the next evolution with Tero. There were many problems with Vector that I wish I could have fixed but was unable to. I hope to do those things with Tero (more to come!)

      And yes, Tero is fundamentally a control plane that hooks into your data plane (whatever that is for you: OTel Collector, Datadog Agent, Vector, etc.). It can run on-prem and use your own approved AI, entirely within your network and completely private.

      • tot19 1 hour ago
        Appreciate the reply! Have you decided on a license yet?

  • mr-karan 41 minutes ago
    Just want to say thanks for creating Vector. We use it heavily at Zerodha and wrote about our setup here: https://zerodha.tech/blog/logging-at-zerodha/

    It replaced both Filebeat and Logstash for us with a single binary that actually has sane resource usage (no more JVM nightmares). VRL turned out to be way more powerful than we could have imagined - we do all our log parsing, metadata enrichment, and routing to different ClickHouse tables in one place. The agent/aggregator topology with disk buffering is pretty dope.

    Genuinely one of my favorite pieces of infra software. Good luck with Tero.