I've had a very similar experience optimising a hidden Markov model prediction tool I work on. I wanted to experiment with an alternative architecture and data structures. Opus 4.7 did the refactor, and eventually the only hot spot became the maths kernel. Over the course of an hour or two it iteratively rewrote that code using all the usual optimisations to improve branching, cache usage, vectorisation, etc. It reviewed the disassembly and the hardware counters with perf to verify that the changes were working as intended. It could have taken me several days to cover that much ground doing low-level optimisations, and I would have spent most of that time grappling with gcc and perf, searching for information about particular SIMD instructions, etc.
How do you know that? What information do you have that would explain your position? We are talking about a specific circumstance and you have brought unsupported generalities to the discussion.
> For context: GLM 5.1 ran the same task and reached 7.3x. Kimi K2.6 reached 5x. DeepSeek V4 Pro reached 3.3x. The models that stopped early did so because they issued no tool calls for five consecutive rounds, they concluded they couldn’t make further progress and stopped. Qwen3.7-Max didn’t stop.
By this reasoning I could release a model that lacks all the basic optimisations. Have it optimise itself for hours to reach 20x the throughput and then claim that the model is superior to the others?
I am not saying that is what happened here, but the reporting is abysmal.
Obligatory: Either written by AI or by a human who has spent so much time with AI that they adopted its writing style. Anyways.
> Over 35 hours it performed 432 kernel evaluations. Each cycle meant writing code, compiling it, running it, reading the profiling output, deciding what to change, and trying again. The model diagnosed compilation failures it hadn’t seen before, identified performance bottlenecks through runtime feedback rather than prior knowledge, and redesigned the kernel architecture multiple times when incremental improvements stopped working.
Anyone remember genetic algorithms? This might be an improvement, but it still feels a little like deja vu.
Yeah, I remember. I still have Usenet postings about the genetic algorithms conference back in the '90s and some magazine clippings about researchers from the University of Sussex, where I first learned about genetic algorithms back in high school.
See the author's Twitter: he speaks English at a rather basic level and certainly did not write this. https://x.com/mohitgeryani/with_replies
In my experience, AI fixes problems by mostly adding more code.
It's a short-term gain for a long-term hurt.
In my experience, humans unfortunately tend to do the same.