GLM 5.2 Performance Benchmarks

(artificialanalysis.ai)

75 points | by theanonymousone 6 hours ago

8 comments

wongarsu 2 hours ago
It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that allows LLMs to elect not to answer if they are unsure, and punishes them for trying to bullshit their way through.
- SilverServer 26 minutes ago
  It took me a while to figure out how to interpret the benchmark correctly, because the overview page says "AA-Omniscience Non-Hallucination Rate," but the benchmark page https://artificialanalysis.ai/evaluations/omniscience#aa-omn...
  says "the lower, the better." Eventually, I realized that the "non" reverses the scores. Read that way, the results are consistent.
- andai 51 minutes ago
  This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?
  - wongarsu 11 minutes ago
    Yes. Most benchmarks just measure how many answers are correct. The best way to optimize that is to confidently state something in the hope that it's correct, which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.
  - whimblepop 27 minutes ago
    Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness or physical access to the world and an actually-lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.
  - Zababa 4 minutes ago
    They are, especially multiple-choice questions. The same happens with human exams:
    Let's say there are 100 questions with 4 answers each, and a correct answer is worth 1 point. By just guessing you get an average of 25/100, way more than the 0/100 from not replying.
    If instead a wrong answer costs 1 point, by just guessing you get on average 25 - 75 = -50/100, way worse than 0/100.
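The guessing arithmetic in this subthread can be checked with a small sketch. The function name and its parameters are illustrative assumptions, not anything taken from the AA-Omniscience benchmark. It assumes uniform random guessing on every question: with 4 choices and no penalty, the expected score is 25 out of 100; with a 1-point penalty per wrong answer, it is 25 correct minus 75 wrong, so -50.

```python
from fractions import Fraction


def expected_guess_score(num_questions, num_choices, reward=1, penalty=0):
    """Expected total score when guessing uniformly at random on every question.

    reward  -- points gained for a correct answer
    penalty -- points lost for a wrong answer (0 means wrong answers are free)
    """
    p_correct = Fraction(1, num_choices)
    per_question = p_correct * reward - (1 - p_correct) * penalty
    return num_questions * per_question


print(expected_guess_score(100, 4))             # no penalty: 25
print(expected_guess_score(100, 4, penalty=1))  # -1 per wrong answer: -50
```

Under these assumptions, guessing beats abstaining (a score of 0) whenever the penalty is below 1/3 of a point, which is the break-even value for 4 choices.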
theturtletalks 2 hours ago
I want to trust their benchmarks but when they have Muse Spark over GPT-5.5, it gives me pause.
- mdasen 16 minutes ago
  Where do you see that? I see they have GPT-5.5 (xhigh) at 55, GPT-5.5 (high) at 53, and Muse Spark at 43. Muse Spark does beat GPT-5.4 mini (xhigh) which scores 40, but the key there is "mini".
  In the coding index, GPT-5.5 gets 59.1, 58.5, 56.2, and 52.1 for xhigh, high, medium, and low while Muse Spark is behind at 47.5. For agentic, GPT-5.5 gets 74.1, 72.0, 69.4, and 59.7 (xhigh, high, medium, low) while Muse Spark gets 62.0 (beating only GPT-5.5 low).
  GPT-5.5 only gets beaten by Opus 4.8 in their general index, is the top spot for coding, and is #3 behind Opus 4.8 and GLM-5.2 for agentic (excluding Fable 5 which takes the top spot, but is unavailable).
XCSme 1 hour ago
I also tested it[0]: quite similar to GLM 5, a few percent better, 30% faster and 50% more expensive.
[0]: https://aibenchy.com/?q=glm
- XCSme 1 hour ago
  PS: Just added a cool feature, so you can filter the leaderboard for multiple models at once, by using a comma, like: https://aibenchy.com/?q=glm,claude
- benxh 46 minutes ago
  benchmark where gemini flash is better than fable btw.
  - XCSme 15 minutes ago
    Well, most people were not liking Fable when it was available anyway, because it refused to answer questions very often.
- lousken 1 hour ago
  still 1/4 of the price of anthropic and openai models though
lanycrost 2 hours ago
It's always nice to see open source models growing. I hope we will have good performance on lower-tier hardware some day.
hemkeshr 1 hour ago
Local models are already useful today. The next milestone is getting this level of performance onto truly affordable hardware.
- SV_BubbleTime 7 minutes ago
  NVidia has less than zero reason to ship cards ideal for this at low prices.
  AMD’s stock price reflects a hope they launch a CUDA alternative. But this is unlikely for the near future.
  There is a lot of interest in preventing China from coming in with cheap AI hardware.
  So I expect the direction to be good local models that few can run effectively.
sourcecodeplz 2 hours ago
still quite verbose at 140m output tokens, but this is on max thinking. high should do better.
ChrisArchitect 2 hours ago
Some more discussion: https://news.ycombinator.com/item?id=48567759
DeathArrow 3 hours ago
One or two more releases and they will reach Fable level.
- vitalyan123 1 hour ago
  by then there will be Fable 5.21, again 5% ahead of every other SotA while still only 500% the size.