Formal Verification Gates for AI Coding Loops

(reubenbrooks.dev)

73 points | by pyrex41 5 hours ago

8 comments

alexpotato 1 hour ago
I work in DevOps at a firm that has been very enthusiastic about using LLMs (in the good sense).
The phases were basically:
- try out having the LLM do "a lot"
- now even more
- now run multiple agents
- back to single agents but have the agents build tools
- tools that are deterministic AND usable by both the humans (EDIT: and the LLMs)
The reasons:
1. Deterministic tools (for both deployments and testing) get you a binary answer and it's repeatable
2. In the event of an outage, you can always fall back to the tool that a human can run
3. It's faster. A quick script can run in <30 seconds but "confabulating" always seemed to take 2-3 minutes.
Really, we are back to this article: https://spawn-queue.acm.org/doi/10.1145/3194653.3197520 aka "make a list of tasks, write scripts for each task, combine the scripts into functions, functions become a system"
[-]
- worik 0 minutes ago
  Sounds like the bleeding edge.
  LLMs are tools, but unreliable
  They can magnify the reach of a person, but not replace them
  Having LLMs write the tools is the correct approach for magnifying the reach of a Dev Ops programmer
solomonb 1 hour ago
This is really cool, but why wouldn't you just use a more richly typed target language and skip this process? You could use Liquid Haskell (for refinement types) or Lean (for full dependent types) and be able to put these invariants directly in your program rather then in a sidecar.
[-]
- pyrex41 1 hour ago
  If you are the kind of person that immediately reaches for this solution -- then I agree, yes you should. You could even do it in Shen! (https://news.ycombinator.com/item?id=39602472, https://news.ycombinator.com/item?id=9297665)
  But, for everyone else: even if you skipped the sidecar entirely, didn't use the codegen, you just had the AI interpret the spec'd application into a short Shen proof, iterate until it's internally consistent / compiles...now you have a spec that is internally consistent, straightforward for human to understand, and much stronger context for the LLM than English language spec alone.
- pyrex41 59 minutes ago
  I would add, re: Shen -- it's sequent calculus and unique type system (type system itself is Turing complete) give you a lot of flexibility here.
  [-]
  - solomonb 29 minutes ago
    Shen is one of those projects i've always wanted to dig into a bit but have never found the time sadly.
appstorelottery 3 hours ago
I think it's all about keeping state in the determinant space. I've come across the same issue, the key was to not rely on LLM performing workflow - the runtime needs to enforce.
singron 3 hours ago
These guard types are great and I've heavily used them in the past. But why codegen them?
E.g. the jwt auth example has some major problems since the verification rules aren't fully specified in the spec. The jwt-token verified rule only checks that the string isn't empty, but it doesn't actually verify that it is correctly parsed, non-expired, and signed by a trusted key. The authenticated-user rule doesn't check that the user-id actually came from the jwt. If you hand-wrote your constructor, you would ensure these things. Similarly, all the other constructors allow passing in whatever values you like instead of checking the connections of the real objects.
By calling the constructor for these types, you are making an assertion about the relationship of the parameter values. If AI is calling the constructor, then it's able to make it's own assertions and derive whatever result it wants. That seems backwards. AI should use the result of tenant-access to deduce that a user is a member of tenant, but if they can directly call `(tenant-access user-id tenant-id true)`, then they can "prove" tenant-access for anything. In the past, we have named the constructors for these types `TenantAccess.newUnverified`, and then heavily scrutinized all callers (typically just jwt-parsers and specific database lookups). You can then use `TenantAccess.{userId,tenantId}` without scrutiny elsewhere.
[-]
- pyrex41 2 hours ago
  [dead]
eximius 3 hours ago
So, capabilities/type systems. Building code architecture guardrails steep enough the AI won't jump the fence/take shortcuts.
pyrex41 5 hours ago
Author here. The TL;DR: move rules from prompts into types the compiler refuses to violate, then bounce the AI coding loop off those refusals. The repo is github.com/pyrex41/Shen-Backpressure. Builds a lot on Geoff Huntley's backpressure idea -- none of this is rocket science, just an effort to apply sound programming principles in a world of LLM coding agents.
[-]
- ajb 1 hour ago
  TBH something like this sounds useful even without LLMs (although I haven't fully grokked this yet). The problem with the operational level is that you can't express the invariants etc at the type level - not least because you're working across multiple languages - so the kind of dumb issues that we're beginning to rule out at the level of the language at the process level still require lots of diligence in operational code. Some kind of shared "operational type system" that could be integrated into relevant languages would potentially help a lot.
  [-]
  - pyrex41 48 minutes ago
    Shen has some really unique properties that are under-developed here. It's type system itself is Turing complete and very flexible / expressive. Also, the Shen kernel is extremely compact, and easy to port into a wide variety of runtime languages (C, Lisp, Ruby, Python, JS, Go, etc https://shen-language.github.io/#downloads). What I discussed about using it as a compile-time gate + codegen is just scratching the surface, I think.
    Now, a lot of the ports haven't been maintained. But the underlying Shen kernel is only 4-5k lines of code...remains extremely portable. More discussion here https://news.ycombinator.com/item?id=39602472
    I didn't focus a ton on Shen in the blog post, because the underlying principles aren't really about the implementation. Shen is very cool tho.
- Mikhail_K 3 hours ago
  Thank you, interesting work. Please, clarify what is possibly a naive question - your README states that the constraints imposed by your tool are weaker than the formal verification guarantees. Why not implement the backpressure as the full formal verification barrier? Too complex to implement?
  [-]
  - pyrex41 2 hours ago
    [flagged]
- alasano 3 hours ago
  [dead]
KaiShips 1 hour ago
[flagged]
max_unbearable 1 hour ago
[flagged]