8 comments

  • 8note 17 minutes ago
    as an iteration: what i'd want from an SRE agent is that it sets up and tests automated alarms

    i don't want non-determinism in whether my pager goes off when something breaks.

    I also want the agent to get a first look at issues once a ticket has been written. Find relevant logs, metrics, and dashboards, and put them into the ticket.

    then, i want it to take a first guess at an RCA, and whether it will solve itself by waiting.

    such that by the time i actually am awake, i can read through and decide if anything actually needs to be done.

    I'd also be fine writing up agent skills for how to solve common problems, and be able to run through those, but only if it's rock solid. I don't want the agent to make a second issue when I just woke up.

  • blutoot 2 hours ago
    I'm a little confused. An agent's value-add is to automate what a human actor (in this case, an SRE) does and thus reduce the time to recovery, etc. A human SRE never manually detects an error - we already have well-established anomaly detection implementations and wiring them to some ticket generation tool is also an established pattern. My confusion is what value the "agent" is bringing here. Nothing wrong with competing with the Datadogs of the world.
    • kemotep 2 hours ago
      I guess if you don’t want to have to pay for Rapid7 or are too lazy to configure the Teams/Slack integration for your EDR?

      But I mean, you still have to pay for a Claude API with Moltclaw or whatever, no?

    • nisegami 1 hour ago
      >A human SRE never manually detects an error - we already have well-established anomaly detection implementations and wiring them to some ticket generation tool is also an established pattern.

      I'm currently dealing with fallout at my job because we were doing all this with humans and no alerts, and we missed a couple of major issues. This product could have prevented a lot of stress in my case, but it'd be a bit like a bandage on a missing limb.

      • lmkg 45 minutes ago
        That still begs the question though: there are existing tools and solutions that do this. Why not use them, and would this being AI make a difference?

        "My boss would be more likely to approve it" is a cynical but valid answer.

    • esseph 2 hours ago
      Logs are pretty dry sometimes.

      INFO gives you a ton but it's low SNR.

      WARN/ERROR may tell you that something could happen or is happening, but it doesn't tell you what the ramifications may be. It could be nothing!

      Now imagine you're getting hundreds, thousands, millions of messages like this an hour. How do you determine what's really important? For instance, if a Kubernetes pod on a single node runs out of space, that could be a problem if your app is only running on that node. But what if your app is spread across 30 nodes?

      It's a triage system with context, at least it sounds like it. It's helping you classify based on actual current or potential problems with the app in the ways that a plain log message does not.
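
To make the single-node vs. 30-node point concrete, here is a minimal sketch of triage with context. Everything in it is hypothetical: the `NodeEvent` type, the `placement` mapping, the `classify` function, and the 25% threshold are illustrative assumptions, not anything from the product being discussed.

```python
from dataclasses import dataclass


@dataclass
class NodeEvent:
    app: str
    node: str
    message: str


def classify(event: NodeEvent, placement: dict[str, set[str]]) -> str:
    """Rate an event by how much of the app's footprint the failing node is.

    `placement` maps an app name to the set of nodes it runs on (assumed to
    come from somewhere like the cluster API; not modeled here).
    """
    nodes = placement.get(event.app, set())
    if event.node not in nodes:
        return "ignore"  # the app does not run on this node at all
    share = 1 / len(nodes)  # fraction of the app's nodes that is affected
    if share == 1.0:
        return "page"  # the only node is in trouble
    if share >= 0.25:  # hypothetical threshold
        return "ticket"
    return "log"  # one node out of many, probably fine to wait
```

The same "disk full" message pages someone when the app lives on one node and is only logged when the app is spread over 30.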

      • xorcist 1 hour ago
        Deciphering ramifications from a log message alone is a pretty unusual way to approach a problem. You still have your 1990s Nagios-style application monitoring, right? So when you wake up to a message that the web monitor says it's not possible to add items to the shopping basket right now, the database monitor signals an unusually long response time, and the application metrics tell you the number of buys is at a fraction of what is normal for this time of day, then that WARN log message from the application telling you that a foreign index constraint is violated is pretty informative!
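
A rough sketch of that correlation idea, assuming each monitor emits a timestamped signal: a WARN log line is treated as actionable only when enough independent monitors fired around the same time. The `Signal` type, the source names, the 5-minute window, and the `min_sources` threshold are all made up for illustration and are not Nagios's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    source: str       # e.g. "web_monitor", "db_monitor", "app_metrics"
    timestamp: float  # seconds since epoch
    detail: str


def correlate(log_signal: Signal, monitor_signals: list[Signal],
              window_s: float = 300.0) -> list[Signal]:
    """Return the monitor signals within window_s seconds of the log line."""
    return [
        s for s in monitor_signals
        if abs(s.timestamp - log_signal.timestamp) <= window_s
    ]


def is_actionable(log_signal: Signal, monitor_signals: list[Signal],
                  min_sources: int = 2, window_s: float = 300.0) -> bool:
    """True when at least min_sources distinct monitors fired near the log line."""
    nearby = correlate(log_signal, monitor_signals, window_s)
    return len({s.source for s in nearby}) >= min_sources
```

With the basket, database, and purchase-rate monitors all firing within a few minutes of the constraint warning, `is_actionable` returns True; the same warning with no monitors firing returns False.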
  • maknee 1 hour ago
    How effective are LLMs at triaging issues? Has anyone found success using them to find the root cause? I've only been able to triage effectively for toy examples.
  • mrweasel 1 hour ago
    LLMs aren't the fastest thing in the world. How much data can you realistically parse per second?
  • ramon156 2 hours ago
    You forgot to remove the bottom part, which is the same message but shortened. Did people just give up in general? I hate this world so much
  • f311a 1 hour ago
    Why is this upvoted? The author did not even bother to read what he wrote.

    > SOC 2 Type II ready

    Huh? You vibecoded the repo in a week and claim it's ready?

  • rob 2 hours ago
    Hey bud, forgot to delete the original prompt at the end.
  • gostsamo 2 hours ago
    when are you renaming it to LogMolt?