16 comments

  • serjester 0 minutes ago
    There's a potentially amazing use case here around parsing PDFs to markdown. It seems like a task with insane volume requirements, low budget, and the kind of thing that doesn't benefit much from autoregression. Would be very curious if your team has explored that.
  • cjbarber 2 hours ago
    It could be interesting to measure intelligence per second.

    i.e., intelligence per token, times tokens per second.

    My current feeling is that if Sonnet 4.6 were 5x faster than Opus 4.6, I'd primarily be using Sonnet 4.6. That wasn't true for me with prior model generations: back then, the Sonnet-class models didn't feel good enough compared to the Opus-class models. And it might shift again when I'm doing things that feel more intelligence-bottlenecked.

    But fast responses have an advantage of their own: they give you faster iteration. Kind of like how I used to like OpenAI Deep Research, but then switched to o3-thinking with web search enabled once it came out, because it was 80% of the thoroughness in 20% of the time, which tended to be better overall.
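
    The metric proposed above could be sketched like this (the scores, throughputs, and `tokens_per_task` are illustrative assumptions, not real benchmark numbers):

```python
# Hypothetical benchmark scores and measured throughputs -- illustrative only.
models = {
    "opus-class":   {"score": 0.80, "tokens_per_sec": 60},
    "sonnet-class": {"score": 0.70, "tokens_per_sec": 300},
}

def intelligence_per_second(score, tokens_per_sec, tokens_per_task=2000):
    # "Intelligence per token" times "tokens per second": benchmark score
    # divided by the wall-clock time a fixed-size task would take.
    seconds = tokens_per_task / tokens_per_sec
    return score / seconds

for name, m in models.items():
    print(f"{name}: {intelligence_per_second(m['score'], m['tokens_per_sec']):.3f}")
```

    Under these made-up numbers the faster model wins on this metric despite the lower score, which matches the "I'd primarily be using Sonnet" intuition.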

    • estsauver 17 minutes ago
      I think there's clearly a "speed is a quality of its own" axis. When you use Cerebras (or Groq) to develop against an API, the turnaround speed of iterating on jobs is so much faster (and cheaper!) than using the frontier high-intelligence labs that it's almost a different product.

      Also, I put together a little research paper recently--I think there's probably an underexplored option of "use a frontier AR model for a little bit of planning, then switch to diffusion for generating the rest." You can get really good improvements with diffusion models! https://estsauver.com/think-first-diffuse-fast.pdf

      • refulgentis 12 minutes ago
        I'm very worried for both.

        Cerebras requires a $3K/year membership to use APIs.

        Groq's been dead for about 6 months, even pre-acquisition.

        I hope Inception is going well; it's the only real democratic contender here. Gemini 2.5 Flash Lite was promising, but it never really went anywhere, even by the standards of a Google preview.

        • freeqaz 2 minutes ago
          FYI: you can call Cerebras APIs via OpenRouter if you specify them as the provider in your request. It's a bit pricier, but it exists!
    • volodia 32 minutes ago
      We agree! In fact, there is an emerging class of models aimed at fast agentic iteration (think of Composer, the Flash versions of proprietary and open models). We position Mercury 2 as a strong model in this category.
    • bigbuppo 40 minutes ago
      Maybe make that intelligence per token per relative unit of hardware per watt. If you're burning 30 tons of coal to be 0.0000000001% better than the 5 tons of coal option because you're throwing more hardware at it, well, it's not much of a real improvement.
    • josephg 53 minutes ago
      Yeah, I agree with this. We might be able to benchmark it soon (if we can’t already) by asking different agentic code models to produce some relatively simple pieces of software. Fast models can iterate faster. Big models will write better code on the first attempt and need fewer debugging loops. Who will win?

      At the moment I’m loving Opus 4.6, but I have no idea if its extra intelligence makes it worth using over Sonnet. Some data would be great!

    • nubg 1 hour ago
      Interesting perspective. Perhaps the user would also adapt his queries, knowing the model can only take small (but very fast) steps. I wonder who would win!
  • volodia 56 minutes ago
    Co-founder / Chief Scientist at Inception here. If helpful, I’m happy to answer technical questions about Mercury 2 or diffusion LMs more broadly.
    • nowittyusername 14 minutes ago
      How does the whole KV-cache situation work for diffusion models? Are there latency and computation/monetary savings from caching? Is the curve similar to autoregressive caching? Or maybe such things don't apply at all, and you can just mess with the system prompt and dynamically change it every turn because there are no savings to be had? Or maybe you can make dynamic changes to the head but still get cache savings because of the diffusion-based architecture?... So many ideas...
      • volodia 6 minutes ago
        There are many ways to do it, but the simplest approach is block diffusion: https://m-arriola.com/bd3lms/

        There are also more advanced approaches, for example FlexMDM, which essentially predicts the length of the "canvas" as it "paints" tokens onto it.
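
        A toy sketch of the block-diffusion idea (not Mercury's actual implementation; `ToyBlockDiffusion` is a stand-in whose "denoising" just reveals tokens from a fixed target, where a real model would predict them):

```python
MASK = -1  # mask token id

class ToyBlockDiffusion:
    # Hypothetical stand-in for a diffusion LM.
    def __init__(self, target):
        self.target = target  # the tokens the "model" will reveal

    def denoise_step(self, prefix, block, start):
        # Unmask one token per step, conditioned on the completed prefix.
        # (A real model would attend to `prefix` via its KV cache rather
        # than recomputing it, and would unmask many tokens in parallel.)
        out = list(block)
        for i, tok in enumerate(out):
            if tok == MASK:
                out[i] = self.target[start + i]
                break
        return out

def generate(model, prompt, block_size, num_blocks):
    prefix = list(prompt)            # completed tokens: cacheable, like an AR prefix
    for b in range(num_blocks):
        start = b * block_size
        block = [MASK] * block_size  # fresh all-mask "canvas" for this block
        while MASK in block:         # iterative denoising within the block
            block = model.denoise_step(prefix, block, start)
        prefix += block              # the finished block joins the cached prefix
    return prefix[len(prompt):]

model = ToyBlockDiffusion(target=list(range(8)))
print(generate(model, prompt=[100, 101], block_size=4, num_blocks=2))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

        The cache-relevant point is that finished blocks join `prefix` and are never revisited, so their KV entries can be cached exactly as in an AR model; only the block currently being denoised needs recomputation.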

    • techbro92 19 minutes ago
      Do you think you will be moving towards drifting models in the future for even more speed?
      • volodia 17 minutes ago
        Not imminently, but it's hard to predict where the field will go.
    • kristianp 45 minutes ago
      How big is Mercury 2? How many tokens is it trained on?

      Is its agentic accuracy good enough to operate, say, coding agents without needing a larger model for the more difficult tasks?

      • volodia 38 minutes ago
        You can think of Mercury 2 as roughly in the same intelligence tier as other speed-optimized models (e.g., Haiku 4.5, Grok Fast, GPT-Mini–class systems). The main differentiator is latency — it’s ~5× faster at comparable quality.

        We’re not positioning it as competing with the largest models (Opus 4.5, etc.) on hardest-case reasoning. It’s more of a “fast agent” model (like Composer in Cursor, or Haiku 4.5 in some IDEs): strong on common coding and tool-use tasks, and providing very quick iteration loops.

    • CamperBob2 49 minutes ago
      Seems to work pretty well, and it's especially interesting to see answers pop up so quickly! It is easily fooled by the usual trick questions about car washes and such, but seems on par with the better open models when I ask it math/engineering questions, and is obviously much faster.
      • volodia 42 minutes ago
        Thanks for trying it and for the thoughtful feedback, really appreciate it. And we’re actively working on improving quality further as we scale the models.
  • dvt 2 hours ago
    What excites me most about these new four-figure tokens/second models is that you can essentially do multi-shot prompting (+ nudging) and the user doesn't even feel it, potentially fixing some of the weird hallucinatory/non-deterministic behavior we sometimes end up with.
    • volodia 36 minutes ago
      That is also our view! We see Mercury 2 as enabling very fast iteration for agentic tasks. A single shot at a problem might be less accurate, but because the model has a shorter execution time, it enables users to iterate much more quickly.
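
      A minimal sketch of that sample-and-verify loop (`call_model` and `is_valid` are hypothetical stand-ins for a fast-model API call and a task-specific check such as running the tests):

```python
import time

def call_model(prompt, attempt):
    # Hypothetical stand-in: a fast model returns a candidate in tens of ms.
    return f"candidate-{attempt}"

def is_valid(candidate):
    # Hypothetical check: e.g. "the patch compiles and the tests pass".
    return candidate.endswith("2")

def multi_shot(prompt, budget_s=1.0, max_attempts=8):
    # Keep sampling until something passes the check or the budget runs out;
    # at 1000+ tok/s, several attempts fit in one perceived response time.
    deadline = time.monotonic() + budget_s
    for attempt in range(max_attempts):
        if time.monotonic() > deadline:
            break
        candidate = call_model(prompt, attempt)
        if is_valid(candidate):
            return candidate
    return None  # fall back to a bigger model, or surface the failure

print(multi_shot("fix this function"))  # -> candidate-2
```
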
  • exabrial 11 minutes ago
    I believe Jimmy Chat is still faster by an order of magnitude…
  • nowittyusername 21 minutes ago
    Nice, I'm excited to try this for my voice agent, at worst it could be used to power the human facing agent for latency reduction.
    • volodia 16 minutes ago
      Would love to hear about your experience. Send us an email.
  • nylonstrung 1 hour ago
    I'm not sold on diffusion models.

    Other labs like Google have them but they have simply trailed the Pareto frontier for the vast majority of use cases

    Here's more detail on how price/performance stacks up

    https://artificialanalysis.ai/models/mercury-2

    • volodia 50 minutes ago
      I’d push back a bit on the Pareto point.

      On speed/quality, diffusion has actually moved the frontier. At comparable quality levels, Mercury is >5× faster than similar AR models (including the ones referenced on the AA page). So for a fixed quality target, you can get meaningfully higher throughput.

      That said, I agree diffusion models today don’t yet match the very largest AR systems (Opus, Gemini Pro, etc.) on absolute intelligence. That’s not surprising: we’re starting from smaller models and gradually scaling up. The roadmap is to scale intelligence while preserving the large inference-time advantage.

  • chriskanan 40 minutes ago
    I can see some promise in diffusion LLMs, but getting them comparable to the frontier is going to require a ton of work, and these closed-source solutions probably won't really invigorate the field to find breakthroughs. It's too bad that, as far as I can tell, they're following OpenAI's path: closed models, no details.
  • lprimeisafk 40 minutes ago
    It fails the car wash test
    • emi2k01 35 minutes ago
      The test:

      > The car wash is 100m away from my house. Should I walk or drive?

      I didn't find a way to share the conversation, so here's the full response on the `medium` setting:

      ---

      Walking 100 m is a very short distance—roughly a minute or two at a normal pace. In most situations, walking will be quicker overall because you avoid the time needed to get into the car, start it, and find parking. It also saves fuel and is better for the environment.

      *Consider these factors before deciding:*

      | Factor | Walking | Driving |
      |--------|---------|---------|
      | *Time* | ~1–2 min (plus a few seconds to step out) | ~1 min driving, plus ~1–2 min to get in/out and park |
      | *Energy/Fuel* | None | Uses a small amount of fuel |
      | *Weather* | Comfortable if dry & mild | Better if it’s raining, very hot, or icy |
      | *Physical ability* | Easy for most people | Not needed for such a short trip |
      | *Car availability* | Not required | Must have a car nearby and be willing to use it |
      | *Environmental impact* | Zero emissions | Small emissions |

      *Recommendation:* If the weather is pleasant and you don’t need the car for anything else right away, walking is the simplest, fastest, and most eco‑friendly choice. Drive only if you’re dealing with inclement weather, have heavy items to carry, or need the car immediately after the wash.

      Do you have any specific constraints (e.g., rain, heavy bags, time pressure) that might affect the decision?

      • rtfeldman 15 minutes ago
        If a stranger asks me, "Should I walk or drive to this car wash?" then I assume they're asking in good faith and both options are reasonable for their situation. So it's a safe assumption that they're not going there to get their car washed. Maybe they're starting work there tomorrow, for example, and don't know how pedestrian-friendly the route is.

        Is the goal behind evaluating models this way to incentivize training them to assume we're bad-faith tricksters even when asking benign questions like how best to traverse a particular 100m? I can't imagine why it would be desirable to optimize for that outcome.

        (I'm not saying that's your goal personally - I mean the goal behind the test itself, which I'd heard of before this thread. Seems like a bad test.)

  • ilaksh 2 hours ago
    It seems like the chat demo is really suffering from everything going into a queue. You can't actually tell that it's fast at all; the latency is not good.

    Assuming that's what's causing this, they might show some kind of feedback when a request actually makes it out of the queue.

    • volodia 44 minutes ago
      Thank you for your patience. We are working to handle the surge in demand.
  • mhitza 1 hour ago
    Comment retracted. My bad, missed some details.
    • selcuka 1 hour ago
      I think your comment is a bit unfair.

      > no reasoning comparison

      Benchmarks against reasoning models:

      https://www.inceptionlabs.ai/blog/introducing-mercury-2

      > no demo

      https://chat.inceptionlabs.ai/

      > no info on numbers of parameters for the model

      This is a closed model. Do other providers publish the number of parameters for their models?

      > testimonials that don't actually read like something used in production

      Fair point.

      • volodia 45 minutes ago
        Just to clarify one point: Mercury (the original v1, non-reasoning model) is already used in production in mainstream IDEs like Zed: https://zed.dev/blog/edit-prediction-providers

        Mercury v1 focused on autocomplete and next-edit prediction. Mercury 2 extends that into reasoning and agent-style workflows, and we have editor integrations available (docs linked from the blog). I’d encourage folks to try the models!

      • mhitza 1 hour ago
        You are right; I edited my post (twice, actually). I missed the chat the first time around (though it's hard to see it as a reasoning model when the chain of thought is hidden or not obvious; I guess this is the new normal), and I also missed the reasoning table because the text is pretty small on mobile and I thought it was another speed benchmark.
    • pants2 1 hour ago
      Reading such obvious LLM-isms in the announcement just makes me cringe a bit too, ex.

      > We optimize for speed users actually feel: responsiveness in the moments users experience — p95 latency under high concurrency, consistent turn-to-turn behavior, and stable throughput when systems get busy.

  • tl2do 2 hours ago
    Genuine question: what kinds of workloads benefit most from this speed? In my coding use, I still hit limitations even with stronger models, so I'm interested in where a much faster model changes the outcome rather than just reducing latency.
    • layoric 2 hours ago
      I think it would assist in exploring multiple solution spaces in parallel, and I can see it, with the right user in the loop plus a harness wrapping tools like compilers, static analysis, and tests, being able to iterate very quickly on multiple solutions. An example might be "I need to optimize this SQL query", pointed at a locally running Postgres. Multiple changes could be tested and combined, with the EXPLAIN plan used to validate performance and a test for correct results. Then only valid solutions would be presented to the developer for review. I don't personally care about the model's 'opinion' or recommendations; using them for architectural choices is, IMO, a flawed use of a coding tool.

      It doesn't change the fact that the most important thing is verification/validation of their output, whether from tools or from the developer reviewing and making decisions. But even if you don't want that approach, diffusion models just seem a lot more efficient. I'm interested to see whether they are simply a better match for common developer tasks when paired with validation/verification systems, rather than just writing (likely wrong) code faster.
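
      The verification loop described above, sketched self-contained with SQLite (the comment talks about Postgres; the candidate rewrites here are hard-coded where a fast model would generate them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(i * 1.5,) for i in range(100)])

baseline = "SELECT COUNT(*) FROM orders WHERE total > 50"
# In the workflow above, these would be model-generated rewrites.
candidates = [
    "SELECT COUNT(*) FROM orders WHERE total > 50.0",  # equivalent rewrite
    "SELECT COUNT(*) FROM orders WHERE total > 49",    # subtly wrong rewrite
]

expected = conn.execute(baseline).fetchall()
valid = []
for sql in candidates:
    if conn.execute(sql).fetchall() == expected:       # correctness gate
        plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        valid.append((sql, plan))                      # the plan is what you'd compare for cost

for sql, _ in valid:
    print("passes:", sql)
```

      Only candidates that match the baseline's results survive to the plan-comparison step, so the developer reviews valid rewrites rather than raw model output.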

    • volodia 27 minutes ago
      There are a few: fast agents, deep research, real-time voice, coding. The other thing is that when you have a fast reasoning model, you can spend more effort on thinking within the same latency budget, which pushes up quality.
    • cjbarber 2 hours ago
      I've tried a few computer use and browser use tools and they feel relatively tok/s bottlenecked.

      And in some sense, all of my claude code usage feels tok/s bottlenecked. There's never really a time where I'm glad to wait for the tokens, I'd always prefer faster.

    • irthomasthomas 2 hours ago
      multi-model arbitration, synthesis, parallel reasoning etc. Judging large models with small models is quite effective.
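
      A minimal sketch of that judging pattern (`fast_judge` is a hypothetical stand-in for a call to a small, cheap model; here it just counts word overlap):

```python
def fast_judge(task, candidate):
    # Hypothetical stand-in for a small-model judge call; a real judge
    # would return e.g. a 1-10 rating. Here: crude word overlap.
    return len(set(task.split()) & set(candidate.split()))

def pick_best(task, candidates):
    # Score each big-model output with the cheap judge, keep the best.
    return max(candidates, key=lambda c: fast_judge(task, c))

task = "summarize the release notes"
candidates = ["notes about the weather", "a summary of the release notes"]
print(pick_best(task, candidates))  # -> a summary of the release notes
```
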
    • quotemstr 1 hour ago
      Once you make a model fast and small enough, it starts to become practical to use LLMs for things as mundane as spell checking, touchscreen-keyboard tap disambiguation, and database query planning. If the fast, small model is multimodal, use it in a microwave to make a better DWIM auto-cook.

      Hell, want to do syntax highlighting? Just throw buffer text into an ultra-fast LLM.

      It's easy to overlook how many small day-to-day heuristic schemes can be replaced with AI. It's almost embarrassing to think about all the totally mundane uses to which we can put fast, modest intelligence.

  • dw5ight 29 minutes ago
    this looks awesome!!
  • MarcLore 53 minutes ago
    [dead]
  • dhruv3006 1 hour ago
    I am a little underwhelmed by anything diffusion at the moment - they didn't really deliver.
    • quotemstr 1 hour ago
      What isn't these days? I've found it pointless to get upset about it.
      • dhruv3006 59 minutes ago
        We need a new architecture - i wonder what ilya is cooking.
  • arjie 1 hour ago
    Please pre-render your website on the server. Client-side JS means that my agent cannot read the press-release and that reduces the chance I am going to read it myself. Also, day one OpenRouter increases the chance that someone will try it.