Interesting approach. In our experience, most failures weren’t about which interface agents used, but about how much implicit authority they accumulated across steps. Control boundaries mattered more than the abstraction layer.
I’d like to see this other browser plugin’s API be exposed via your same CLI, so I don’t have to only control a separate browser instance.
https://github.com/remorses/playwriter
(I haven’t investigated enough to know how feasible it is, but as I was reading about your tool, I immediately wanted to control existing tabs from my main browser, rather than “just” a debug-driven separate browser instance.)
At this point I'm fully down the path of the agent just maintaining its own tools. I have a browser skill that continues to evolve as I use it. Beats every alternative I have tried so far.
Same. Claude Opus 4.5 one-shots the basics of the Chrome DevTools Protocol, and then you can go from there.
Plus, now it is personal software... just keep asking it to improve the skill based on your usage. Bake in domain knowledge or business logic or whatever you want.
I'm using this for e2e testing and debugging Obsidian plugins and it is starting to understand Obsidian inside and out.
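For readers who haven't touched the Chrome DevTools Protocol mentioned above: commands are JSON messages with an `id`, a `method`, and `params`, sent over a WebSocket to the browser's debugging endpoint. A minimal sketch of building such messages follows. The helper name `cdp_command` is made up for illustration, and the transport (the WebSocket connection itself) is deliberately left out.

```python
import itertools
import json

# Each command needs a unique id so the response can be matched to it.
_next_id = itertools.count(1)


def cdp_command(method, **params):
    """Serialize one CDP command as a JSON string (hypothetical helper)."""
    return json.dumps({"id": next(_next_id), "method": method, "params": params})


# Page.navigate and Runtime.evaluate are real CDP methods; the URL and
# expression are placeholder values.
navigate = cdp_command("Page.navigate", url="https://example.com")
evaluate = cdp_command("Runtime.evaluate", expression="document.title")
```

An agent-maintained skill would typically wrap a handful of these commands and grow from there.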
Creator of Browser Use here. This is cool, a really innovative approach with ARIA roles. One idea we have been playing around with a lot is just giving the LLM raw HTML and a really good way to traverse it: no heuristics, just BS4. Seems to work well, but it's much more expensive than the current prod-ready [index]<div ... notation.
Cool to see lots of people independently come to "CLIs are all you need". I'm still not sure if it's a short-term band-aid because agents are so good at terminal use or if it's part of a longer-term trend, but it's definitely felt much more seamless to me than MCPs.
100% - sharing CLIs with the agent has felt like another channel to interact with them once I’ve done it enough, like a task manager the agent and I can both use through the same interface.
Yes, I'm using Dagger and it has great secret support. It obfuscates secrets, so even if the agent, for example, cats the contents of a key file, it will never be able to read or print the secret value itself.
TL;DR: there are a lot of ways to keep secret contents away from your agent, some without actually having to keep them "physically" separate.
If you look at the Elixir keynote for Phoenix.new -- a cool agentic coding tool -- you'll see some hints about browser control using an API tool call. It's called "web" in the video.
The main difference is likely the targeting philosophy. webctl relies heavily on ARIA roles/semantics (e.g. role=button name="Save") rather than injected IDs or CSS selectors. I find this makes the automation much more robust to UI changes.
Also, I went with Python for V1 simply for iteration speed and ecosystem integration. I'd love to rewrite in Rust eventually, but Python was the most efficient way to get a stable tool working for my specific use case.
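To illustrate why role-based targeting survives UI changes, here is a small standard-library sketch. It is not webctl's implementation: `RoleFinder` and `find_by_role` are hypothetical names, the implicit-role table covers only two tags, and the accessible-name logic (aria-label, else text content) is far simpler than what real browsers compute.

```python
from html.parser import HTMLParser

# Illustrative subset of tags that carry an implicit ARIA role.
IMPLICIT_ROLES = {"button": "button", "a": "link"}
# Void elements never get an end tag, so they are kept off the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class RoleFinder(HTMLParser):
    """Collects (role, name) pairs for every element that has a role."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._stack = []  # one entry per currently open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        role = attrs.get("role") or IMPLICIT_ROLES.get(tag)
        self._stack.append({"role": role, "label": attrs.get("aria-label"), "text": []})

    def handle_data(self, data):
        # Text counts toward the name of every open ancestor.
        for entry in self._stack:
            entry["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        entry = self._stack.pop()
        if entry["role"]:
            name = entry["label"] or " ".join("".join(entry["text"]).split())
            self.found.append((entry["role"], name))


def find_by_role(html, role, name):
    """Return every (role, name) match in the given HTML string."""
    finder = RoleFinder()
    finder.feed(html)
    finder.close()
    return [match for match in finder.found if match == (role, name)]


# Two versions of the same UI: ids, classes, and even the tag have changed.
before = '<div><button id="btn-1" class="primary">Save</button></div>'
after = '<section><div role="button" id="x9" class="cta"><span>Save</span></div></section>'
```

A selector like `#btn-1` only matches `before`, while `find_by_role(..., "button", "Save")` matches both versions.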
I don't have an objective benchmark yet. I tried several existing solutions, especially the MCP servers for browser automation, and none of them were able to reproducibly solve my specific task.
An objective benchmark is a great idea, especially to compare webctl against other similar CLI-based tools. I'll definitely look into how to set that up.
A background daemon holds the session state between different CLI calls. This daemon is started automatically on the first webctl call and auto-closes after a timeout period of inactivity to save resources.
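The idle-timeout part of that design can be sketched roughly as below. This is an assumption-laden illustration, not webctl's actual code: `IdleTimer`, `touch`, and `expired` are made-up names, and the 30-second value is arbitrary. The clock is injectable so the logic can be tested without sleeping.

```python
import time


class IdleTimer:
    """Tracks time since the last CLI call (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_activity = clock()

    def touch(self):
        """Record activity; a daemon would call this on every request."""
        self._last_activity = self._clock()

    def expired(self):
        """True once the daemon has been idle for at least timeout_s."""
        return self._clock() - self._last_activity >= self.timeout_s


# Simulated clock so the example runs instantly.
fake_now = [0.0]
timer = IdleTimer(timeout_s=30, clock=lambda: fake_now[0])

fake_now[0] = 29.0
still_alive = not timer.expired()   # idle for 29s, below the timeout
timer.touch()                       # a new CLI call arrives at t=29
fake_now[0] = 59.0
should_exit = timer.expired()       # idle for 30s since the last call
```

In a real daemon, the serve loop would check `expired()` between requests and shut down the browser session when it returns true.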
(my contribution, one of many: https://github.com/caesarnine/binsmith)
Nevertheless, I prefer the CLI for other reasons: it is built for humans and is much easier to debug.
Video: https://youtu.be/ojL_VHc4gLk?t=2132
More discussion: https://simonwillison.net/2025/Jun/23/phoenix-new/
https://github.com/rumca-js/crawler-buddy
More like a framework for other mechanisms
How is it different?
"browser automation for ai agents" is a popular idea these days.