GLM-4.7: Advancing the Coding Capability

(z.ai)

437 points | by pretext 47 days ago

34 comments

jtrn 47 days ago
My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B/32B active. vLLM/SGLang only supported on the main branch of these engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual English/Chinese primary. Context window: 200k. Claims Claude 3.5 Sonnet/GPT-5 level performance. 716GB in FP16, probably ca 220GB for Q4_K_M.
My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimik2 in addition. I like that open-weight models are nipping at the feet of the proprietary models.
[-]
- hasperdi 46 days ago
  I bought a second‑hand Mac Studio Ultra M1 with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.
  For instance, an 4‑bit quantized model of GLM 4.6 runs very slowly on my Mac. It's not only about tokens per second speed but also input processing, tokenization, and prompt loading; it takes so much time that it's testing my patience. People often mention about the TPS numbers, but they neglect to mention the input loading times.
  [-]
  - jwitthuhn 46 days ago
    At 4 bits that model won't fit into 128GB so you're spilling over into swap which kills performance. I've gotten great results out of glm-4.5-air which is 4.5 distilled down to 110B params which can fit nicely at 8 bits or maybe 6 if you want a little more ram left over.
    [-]
    - hasperdi 46 days ago
      Correction, my GLM-4.6 models are not Q4, I can only run lower ones eg:
      - https://huggingface.co/unsloth/GLM-4.6-GGUF/blob/main/GLM-4.... - 84GB, Q1 - https://huggingface.co/unsloth/GLM-4.6-REAP-268B-A32B-GGUF/t... - 92GB, Q2
      I ensure that there are enough RAM leftover ie limited context window setting, so no swapping.
      As for GLM-4.5-Air, I run that daily, switching between noctrex/GLM-4.5-Air-REAP-82B-A12B-MXFP4_MOE-GGUF and kldzj/gpt-oss-120b-heretic
      [-]
      - andai 46 days ago
        Are you getting any agentic out of gpt-oss-120b?
        I can't tell if it's some bug regarding message formats or if it's just genuinely giving up, but it failed to complete most tasks I gave it.
        [-]
        nekitamo 46 days ago
        GPT-oss-120B was also completely failing for me, until someone on reddit pointed out that you need to pass back in the reasoning tokens when generating a response. One way to do this is described here:
        https://openrouter.ai/docs/guides/best-practices/reasoning-t...
        Once I did that it started functioning extremely well, and it's the main model I use for my homemade agents.
        Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there's so many broken implementations floating around.
        hasperdi 46 days ago
        IIRC I did and failed but I didn't investigate further.
  - mechagodzilla 46 days ago
    I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.
    [-]
    - christina97 45 days ago
      Let's say 1.5tok/sec, and that your rig pulls 500 W. That's 10.8 tok/Wh, and assuming you pay, say 15c/kWh means you're paying in the vicinity of $13.8/mtok of output. Looking at R1 output costs on OpenRouter, it's costing about 5-7x as much as what you can pay for third party inference (which also produce tokens ~30x faster).
    - tyre 46 days ago
      Given the cost of the system, how long would it take to be less expensive than, for example, a $200/mo Claude Max subscription with Opus running?
      [-]
      - mechagodzilla 46 days ago
        It's not really an apples-to-apples comparison - I enjoy playing around with LLMs, running different models, etc, and I place a relatively high premium on privacy. The computer itself was $2k about two years ago (and my employer reimbursed me for it), and 99% of my usage is for research questions which have relatively high output per input token. Using one for a coding assistant seems like it can run through a very high number of tokens with relatively few of them actually being used for anything. If I wanted a real-time coding assistant, I would probably be using something that fit in the 24GB of VRAM and would have very different cost/performance tradeoffs.
        [-]
        mark_l_watson 46 days ago
        For what it is worth, I do the same thing you do with local models: I have a few scripts that build prompts from my directions and the contents of one or more local source files. I start a local run and get some exercise, then return later for the results.
        I own my computer, it is energy efficient Apple Silicon, and it is fun and feels good to do practical work in a local environment and be able to switch to commercial APIs for more capable models and much faster inference when I am in a hurry or need better models.
        Off topic, but: I cringe when I see social media posts of people running many simultaneous agentic coding systems and spending a fortune in money and environmental energy costs. Maybe I just have ancient memories from using assembler language 50 years ago to get maximum value from hardware but I still believe in getting maximum utilization from hardware and wanting to be at least the ‘majority partner’ in AI agentic enhanced coding sessions: save tokens by thinking more on my own and being more precise in what I ask for.
      - Workaccount2 46 days ago
        Never, local models are for hobby and (extreme) privacy concerns.
        A less paranoid and much more economically efficient approach would be to just lease a server and run the models on that.
        [-]
        g947o 46 days ago
        This.
        I spent quite some time on r/LocalLLaMA and yet need to see a convincing "success story" of productively using local models to replace GPT/Claude etc.
        [-]
        hasperdi 46 days ago
        I have several my own little success stories:
        - For polishing Whisper speech to text output, so I can dictate things to my computer and get coherent sentences, or for shaping the dictation to specific format eg. "generate ffmpeg to convert mp4 video to flac with fade in and out, input file is myvideo.mp4 output is myaudio flac with pascal case" -> Whisper -> "generate ff mpeg to convert mp4 video to flak with fade in and out input file is my video mp4 output is my audio flak with pascal case" -> Local LLM -> "ffmpeg ..."
        - Doing classification / selection type of work eg. classifying business leads based on the profile
        Basically the win for local llm is that the running cost (in my case, second hand M1 Ultra) is so low, I can run large quantity of calls that don't need frontier models.
        [-]
        g947o 46 days ago
        My comment was not very clear. I specifically meant Claude Code/Codex like workflows where the agent generates/run code interactively with user feedback. My impression is that consumer grade hardware is still too slow for these things to work.
        [-]
        hasperdi 46 days ago
        You are right, consumer grade hardware is mostly too slow... although it's a relative thing right. For instance you can get Mac Studio Mx Ultra with 512GB RAM, run GLM-4.5-Air and have a bit of patience. It could work
        FuckButtons 46 days ago
        I was able to run a batch job that lasted ~2 weeks of inference time on my m4 max by running it over night against a large dataset I wanted to mine. It cost me pennies in electricity and writing a simple python script as a scheduler.
      - dimava 46 days ago
        Tokens will cost same on Mac and on API because electricity is not free
        And you can only generate like $20 of tokens a month
        Cloud tokens made on TPU will always be cheaper and waaay faster then anything you can make at home
        [-]
        reissbaker 46 days ago
        This generally isn't true. Cloud vendors have to make back the cost of electricity and the cost of the GPUs. If you already bought the Mac for other purposes, also using it for LLM generation means your marginal cost is just the electricity.
        Also, vendors need to make a profit! So tack a little extra on as well.
        However, you're right that it will be much slower. Even just an 8xH100 can do 100+ tps for GLM-4.7 at FP8; no Mac can get anywhere close to that decode speed. And for long prompts (which are compute constrained) the difference will be even more stark.
        [-]
        foobar10000 45 days ago
        A question on the 100+ tps - is this for short prompts? For large contexts that generate a chunk of tokens at context sizes at 120k+, I was seeing 30-50 - and that's with 95% KV cache hit rate. Am wondering if I'm simply doing something wrong here...
        [-]
        reissbaker 40 days ago
        Depends on how well the speculator predicts your prompts, assuming you're using speculative decoding — weird prompts are slower, but e.g. TypeScript code diffs should be very fast. For SGLang, you also want to use a larger chunked prefill size and larger max batch sizes for CUDA graphs than the defaults IME.
      - oceanplexian 46 days ago
        It doesn't matter if you spend $200, $20,000, or $200,000 a month on an Anthropic Subscription.
        None of them will keep your data truly private and offline.
  - robotswantdata 46 days ago
    Yes they conveniently forget about disclosing prompt processing time. There is an affordable answer to this, will be open sourcing the design and sw soon.
  - hedgehog 46 days ago
    Have you tried Qwen3 Next 80B? It may run a lot faster, though I don't know how well it does coding tasks.
    [-]
    - hasperdi 46 days ago
      I did, it works well... although it is not good enough for agentic coding
  - smcleod 46 days ago
    Need the M5 (max/ultra next year) with it's MATMUL instruction set that massively speeds up the prompt processing.
  - Reubend 46 days ago
    Anything except a 3bit quant of GLM 4.6 will exceed those 128 GB of RAM you mentioned, so of course it's slow for you. If you want good speeds, you'll at least need to store the entire thing in memory.
  - nimchimpsky 46 days ago
    [dead]
- embedding-shape 47 days ago
  > Supports tool calling in OpenAI-style format
  So Harmony? Or something older? Since Z.ai also claim the thinking mode does tool calling and reasoning interwoven, would make sense it was straight up OpenAI's Harmony.
  > in theory, I could get a "relatively" cheap Mac Studio and run this locally
  In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
  [-]
  - biddit 47 days ago
    > In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
    Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.
    It’s decidedly more cost efficient to use frontier model APIs. Frontier models trained to work with their tightly-coupled harnesses are worlds ahead of quantized models with generic harnesses.
    [-]
    - theLiminator 47 days ago
      Yeah, I think without a setup that costs 10k+ you can't even get remotely close in performance to something like claude code with opus 4.5.
      [-]
      - cmrdporcupine 47 days ago
        10k wouldn't even get you 1/4 of the way there. You couldn't even run this or DeepSeek 3.2 etc for that.
        Esp with RAM prices now spiking.
        [-]
        coder543 47 days ago
        $10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
        The point in this thread is that it would likely be too slow due to prompt processing. (M5 Ultra might fix this with the GPU's new neural accelerators.)
        [-]
        embedding-shape 47 days ago
        > $10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
        Please do give that a try and report back the prefill and decode speed. Unfortunately, I think again that what I wrote earlier will apply:
        > In practice, it'll be incredible slow and you'll quickly regret spending that much money on it
        I'd rather place that 10K on a RTX Pro 6000 if I was choosing between them.
        [-]
        rynn 46 days ago
        > Please do give that a try and report back the prefill and decode speed.
        M4 Max here w/ 128GB RAM. Can confirm this is the bottleneck.
        https://pastebin.com/2wJvWDEH
        I weighed about a DGX Spark but thought the M4 would be competitive with equal RAM. Not so much.
        [-]
        cmrdporcupine 46 days ago
        I think the DGX Spark will likely underperform the M4 from what I've read.
        However it will be better for training / fine tuning, etc. type workflows.
        [-]
        rynn 46 days ago
        > I think the DGX Spark will likely underperform the M4 from what I've read.
        For the DGX benchmarks I found, the Spark was mostly beating the M4. It wasn't cut and dry.
        [-]
        coder543 46 days ago
        The Spark has more compute, so it should be faster for prefill (prompt processing).
        The M4 Max has double the memory bandwidth, so it should be faster for decode (token generation).
        coder543 46 days ago
        > I'd rather place that 10K on a RTX Pro 6000 if I was choosing between them.
        One RTX Pro 6000 is not going to be able to run GLM-4.7, so it's not really a choice if that is the goal.
        [-]
        embedding-shape 46 days ago
        No, but the models you will be able to run, will run fast and many of them are Good Enough(tm) for quite a lot of tasks already. I mostly use GPT-OSS-120B and glm-4.5-air currently, both easily fit and run incredibly fast, and the runners haven't even yet been fully optimized for Blackwell so time will tell how fast it can go.
        bigyabai 46 days ago
        You definitely could, the RTX Pro 6000 has 96 (!!!) gigs of memory. You could load 2 experts at once at an MXFP4 quant, or one expert at FP8.
        [-]
        coder543 46 days ago
        No… that’s not how this works. 96GB sounds impressive on paper, but this model is far, far larger than that.
        If you are running a REAP model (eliminating experts), then you are not running GLM-4.7 at that point — you’re running some other model which has poorly defined characteristics. If you are running GLM-4.7, you have to have all of the experts accessible. You don’t get to pick and choose.
        If you have enough system RAM, you can offload some layers (not experts) to the GPU and keep the rest in system RAM, but the performance is asymptotically close to CPU-only. If you offload more than a handful of layers, then the GPU is mostly sitting around waiting for work. At which point, are you really running it “on” the RTX Pro 6000?
        If you want to use RTX Pro 6000s to run GLM-4.7, then you really need 3 or 4 of them, which is a lot more than $10k.
        And I don’t consider running a 1-bit superquant to be a valid thing here either. Much better off running a smaller model at that point. Quantization is often better than a smaller model, but only up to a point which that is beyond.
        [-]
        bigyabai 46 days ago
        You don't need a REAP-processed model to offload on a per-expert basis. All MoE models are inherently sparse, so you're only operating on a subset of activated layers when the prompt is being processed. It's more of a PCI bottleneck than a CPU one.
        > And I don’t consider running a 1-bit superquant to be a valid thing here either.
        I don't either. MXFP4 is scalar.
        [-]
        coder543 46 days ago
        Yes, you can offload random experts to the GPU, but it will still be activating experts that are on the CPU, completely tanking performance. It won't suddenly make things fast. One of these GPUs is not enough for this model.
        You're better off prioritizing the offload of the KV cache and attention layers to the GPU than trying to offload a specific expert or two, but the performance loss I was talking about earlier still means you're not offloading enough for a 96GB GPU to make things how they need to be. You need multiple, or you need a Mac Studio.
        If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.
        [-]
        embedding-shape 46 days ago
        > If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.
        Absolutely, same if they get a $10K Mac/Apple computer, immense disappointment ahead.
        Best is of course to start looking at models that fit within 96GB, but that'd make too much sense.
        [-]
        virgildotcodes 46 days ago
        $10k is > 4 years of a $200/mo sub to models which are currently far better, continue to get upgraded frequently, and have improved tremendously in the last year alone.
        This almost feels like a retro computing kind of hobby than anything aimed at genuine productivity.
        [-]
        embedding-shape 46 days ago
        I don't think the calculation is that simple. With your own hardware, there literally is no limits of runtime, or what models you use, or what tooling you use, or availability, all of those things are up to you.
        Maybe I'm old school, but I prefer those benefits over some cost/benefit analysis across 4 years which by the time we're 20% through it, everything has changed.
        But I also use this hardware for training my own models, not just inference and not just LLMs, I'd agree with you if we were talking about just LLM inference.
        naasking 46 days ago
        They are better in some ways, but they're also neutered.
        benjiro 47 days ago
        > $10k gets you a Mac Studio with 512GB of RAM
        Because Apple has not adjusted their pricing yet for the new ram pricing reality. The moment they do, its not going to be a $10k system anymore but in the $15k+...
        The amount of wafers going to AI is insane and will influence not just memory prices. Do not forget, the only reason why Apple is currently immunity to this, is because they tend to make long term contracts but the moment those expire ... then will push the costs down consumers.
        [-]
        tonyhart7 47 days ago
        generous of you to predict apple only make it 50% expensive
  - reissbaker 47 days ago
    No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...
    [-]
    - embedding-shape 46 days ago
      Man, really? Why, just why? If it's similar, why not just the same? It's like they're purposefully adding more work for the ecosystem to support their special model instead of just trying to add more value to the ecosystem.
      [-]
      - reissbaker 46 days ago
        The parser is a small part of running an LLM, and Zai's format is superior to Harmony: it avoids having the model escape JSON in most cases by using XML, so e.g. long code edits are more in-domain compared to pretraining data (where code is typically not nested in JSON and isn't JSON-escaped). FWIW almost everyone has their own format.
        Also, Harmony is a mess. The common API specs adopted by the open-source community don't have developer roles, so including one is just bloat for the Responses API no one outside of OpenAI adopted. And why are there two types of hidden CoT reasoning? Harmony tool definition syntax invents a novel programming language that the model has never seen in training, so you need even more post-training to get it to work (Zai just uses JSON Schema). Etc etc. It's just bad.
        Re: removing newlines from their old format, it's slightly annoying, but it does give a slight speed boost, since it removes one token per call and one token per argument. Not a huge difference, but not nothing, especially with parallel tool calls.
        [-]
        embedding-shape 46 days ago
        Sometimes worse is better, I don't really care what the specific format is, just that providers/model releasers would use more of the same, because compatibility sucks when everyone has their very own format. Conveniently for them, it gets harder to compare models when everyone has different formats too.
  - rz2k 47 days ago
    In practice the 4bit MLX version runs at 20t/s for general chat. Do you consider that too slow for practical use?
    What example tasks would you try?
    [-]
    - embedding-shape 46 days ago
      Whenever reasoning/thinking is involved, 20t/s is way too slow for most non-async tasks, yeah.
      Translation, classification, whatever. If the response is 300 tokens for the reasoning and 50 tokens for the final reply, you're sitting and waiting 17,5 seconds for processing one item. In practice, you're also forgetting about prefill, prompt processing, tokenization and such. Please do share all relevant numbers :)
- __natty__ 47 days ago
  I can imagine someone from the past reading this comment and having a moment of doubt
  [-]
- reissbaker 47 days ago
  s/Sonnet 3.5/Sonnet 4.5
  The model output also IMO look significantly more beautiful than GLM-4.6; no doubt in part helped by ample distillation data from the closed-source models. Still, not complaining, I'd much prefer a cheap and open-source model vs. a more-expensive closed-source one.
- Tepix 46 days ago
  I‘m going to try running it on two Strix Halo systems (256GB RAM total) networked via 2 USB4/TB3 ports.
  [-]
  - cmrdporcupine 46 days ago
    Curious to see how this works out for you. Let us know.
    [-]
    - pixelpoet 46 days ago
      Also curious with two Strix Halo machines at the ready for exactly this kind of usage
      [-]
      - Tepix 46 days ago
        Don't wait for me. Donato Capitella has done this and created videos on his youtube channel at https://www.youtube.com/@donatocapitella
        [-]
        cmrdporcupine 46 days ago
        That's GLM 4.6 tho, not 4.7?
        Still, informative. And stupidly I'd seen this video before. It sounds like the TLDR is: not quite.
        [-]
        Tepix 43 days ago
        It will probably be very similar in terms of speed.
- mft_ 47 days ago
  I’m never clear, for these models with only a proportion active (32B here) to what extentt this reduces the RAM a system needs, if at all?
  [-]
  - l9o 47 days ago
    RAM requirements stay the same. You need all 358B parameters loaded in memory, as which experts activate depends on each token dynamically. The benefit is compute: only ~32B params participate per forward pass, so you get much faster tok/s than a dense 358B would give you.
    [-]
    - atq2119 46 days ago
      The benefit is also RAM bandwidth. That probably adds to the confusion, but it matters a lot for decode. But yes, RAM capacity requirements stay the same.
  - deepsquirrelnet 47 days ago
    For mixture of experts, it primarily helps with time to first token latency, throughput generation and context length memory usage.
    You still have to have enough RAM/VRAM to load the full parameters, but it scales much better for memory consumed from input context than a dense model of comparable size.
  - aurohacker 46 days ago
    Great answers here, in that, for MoE, there's compute saving but no memory savings even tho the network is super-sparse. Turns out, there is a paper on the topic of predicting in advance the experts to be used in the next few layers, "Accelerating Mixture-of-Experts language model inference via plug-and-play lookahead gate on a single GPU". As to its efficacy, I'd love to know...
  - noahbp 47 days ago
    It doesn't reduce the amount of RAM you need at all. It does reduce the amount of VRAM/HBM you need, however, since having all parameters/experts in one pass loaded on your GPU substantially increases token processing and generation speed, even if you have to load different experts for the next pass.
    Technically you don't even need to have enough RAM to load the entire model, as some inference engines allow you to offload some layers to disk. Though even with top of the line SSDs, this won't be ideal unless you can accept very low single-digit token generation rates.
- lumost 46 days ago
  This model is much stronger than 3.5 sonnet, 3.5 sonnet scored 49% on swe-bench verified vs. 72% here. This model is about 4 points ahead of sonnet4, but behind sonnet 4.5 by 4 points.
  If I were to guess, we will see a convergence on measurable/perceptible coding ability sometime early next year without substantially updated benchmarks.
- andai 46 days ago
  >heavily optimized for coding agents
  I tested the previous one GLM-4.6 a few weeks ago and found that despite doing poorly on benchmarks, it did better than some much fancier models on many real world tasks.
  Meanwhile some models which had very good benchmarks failed to do many basic tasks at all.
  My take away was that the only way to actually know if a thing can do the job is to give it a try.
- DeathArrow 46 days ago
  I think you will be much better with a couple of RTX 5090,4090 or 3090. I think Macs will be too slow for inference.
- sa-code 46 days ago
  This is true assuming there will be updates consistently. One of the advantages of the proprietary models is that the are updated often EKG and the cutoff date moves into the future
  This is important because libraries change, introduce new functionality, deprecate methods and rename things all the time, e.g. Polars.
- whimsicalism 46 days ago
  commentators here are oddly obsessed with local serving imo, it's essentially never practical. it is okay to have to rent a GPU, but open weights are definitely good and important.
  [-]
  - nutjob2 46 days ago
    It's not odd, people don't want to be dependent and restricted by vendors, especially if they're running a business based on the tool.
    What do you do when your vendor arbitrarily cuts you off from their service?
    [-]
    - nl 46 days ago
      You switch to one of the many, many other vendors serving the same open model?
      [-]
      - Zetaphor 46 days ago
        There can be quality differences across vendors for the same model due to things like quantization or configuration differences in their backend. By running locally you ensure you have consistency in addition to availability and privacy
    - whimsicalism 46 days ago
      i am not saying the desire to be uncoupled from token vendors is unreasonable, but you can rent cloud GPUs and run these models there. running on your own hardware is what seems a little fantastical at least for a reasonable TPS
      [-]
      - pixelpoet 46 days ago
        I don't understand what is going on with people willing to give up their computing sovereignty. You should be able to own and run your own computation, permissionlessly as much as your electricity bill and reasonable usage goes. If you can't do it today, you should aim for it tomorrow.
        Stop giving infinite power to these rent-seeking ghouls! Be grateful that open models / open source and semi-affordable personal computing still exists, and support it.
        Pertinent example: imagine if two Strix Halo machines (2x128 GB) can run this model locally over fast ethernet. Wouldn't that be cool, compared to trying to get 256 GB of Nvidia-based VRAM in the cloud / on a subscription / whatever terms Nv wants?
        [-]
        RickyLahey 46 days ago
        i don't understand what is going on with people not training their own models
  - jtrn 46 days ago
    I think you and I have a different definition of "obsessed." Would you label anyone interested in repairing their own car as obsessed with DIY?
    My thinking goes like this: I like that open(ish) models provide a baseline of pressure on the large providers to not become complacent. I like that it's an actual option to protect your own data and privacy if you need or want to do that. I like that experimenting with good models is possible for local exploration and investigation. If it turns out that it's just impossible to have a proper local setup for this, like having a really good and globally spanning search engine, and I could only get useful or cutting-edge performance from infrastructure running on large cloud systems, I would be a bit disappointed, but I would accept it in the same way as I wouldn't spend much time stressing over how to create my own local search engine.
  - Tepix 46 days ago
    I find it odd to give a company access to my source code. Why would I do that? It's not like they should be trusted more than necessary.
  - retr0rocket 46 days ago
    [dead]
2001zhaozhao 46 days ago
Cerebras is serving GLM4.6 at 1000 tokens/s right now. They're probably likely to upgrade to this model.
I really wonder if GLM 4.7 or models a few generations from now will be able to function effectively in simulated software dev org environments, especially that they self-correct their errors well enough that they build up useful code over time in such a simulated org as opposed to increasing piles of technical debt. Possibly they are managed by "bosses" which are agents running on the latest frontier models like Opus 4.5 or Gemini 3. I'm thinking in the direction of this article: https://www.anthropic.com/engineering/effective-harnesses-fo...
If the open source models get good enough, then the ability to run them at 1k tokens per second on Cerebras would be a massive benefit compared to any other models in being able to run such an overall SWE org quickly.
[-]
- z3ratul163071 46 days ago
  It is awesome! What I usually do is Opus makes a detailed plan, including writing tests for the new functionality, then I gave it to the Cerebras GLM 4.6 to implement it. If unsure give it to Opus for review.
- chrisfrantz 46 days ago
  This is where I believe we are headed as well. Frontier models "curate" and provide guardrails, very fast and competent agents do the work at incredibly high throughput. Once frontier hits cracks the "taste" barrier and context is wide enough, even this level of delivery + intelligence will be sufficient to implement the work.
  [-]
  - andai 46 days ago
    Taste is why I switched from GLM-4.6 to Sonnet. I found myself asking Sonnet to make the code more elegant constantly and then after the 4th time of doing that laughed at the absurdity and just switched models.
    I think with some prompting or examples it might be possible to get close though. At any rate 1k TPS is hard to beat!
    [-]
    - rubslopes 46 days ago
      I think you meant from Sonnet to GLM-4.6?
      [-]
      - andai 46 days ago
        Did you have the opposite experience?
        It was a little while ago but, GLM's code was generally about twice as long, and about 30% less readable than Sonnet's even at the same length.
        I was able to improve this with prompting and examples but... at some point I realized, I would prefer the simplicity of using the real thing.
        I had been using GLM in Claude code with Claude code router, because while you can just change the API endpoint, the web search function doesn't work, and neither does image recognition.
        Maybe that's different now, or maybe that's because I was on the light plan, but that was my experience.
        Claude code router allowed me to Frankenstein this, so that it was using Gemini for search and vision instead of GLM. Except that turns out that Gemini also sucks at search for some reason, so I ended up just making my own proxy which uses actual Google instead.
        But yeah at some point I realized the Rube Goldberg machine was giving me more headaches than its solved. (It was also way slower than the real thing.) So I paid the additional $18 or whatever to just get rid of it.
        That being said I did just buy the GLM year for $25 because $2/month is hard to beat. But I keep getting rate limited, so I'm not sure what to actually use it for!
        [-]
        rubslopes 46 days ago
        No no! It was just the way you wrote it; but I think I misunderstood it.
        > I found myself asking Sonnet [...] after the 4th time of doing that [...] just switched models.
        I thought you meant Sonnet results were laughable, so you decided to switch to GLM.
        I tried GLM 4.6 last week via OpenCode but found it lacking when compared to Sonnet 4.5. I still need to test 4.7, but from the benchmarks and users opinions, it seems that it's not a huge improvement though.
        Last week I got access to Claude Max 20x via work, so I've using Opus 4.5 exclusively and it's a beast. Better than GPT 5.2 codex and Gemini 3 Pro IME (I tested both via OpenCode).
        I also got this cheap promo GLM subscription. I hope they get ahead of the competition, their prices are great.
- allovertheworld 46 days ago
  How cheap is glm at Cerebras? I cant imagine why they cant tune the tokens to be lower but drastically reduce the power, and thus the cost for the API
  [-]
  - Zetaphor 46 days ago
    They're running on custom ASICs as far as I understand, it may not be possible to run them effectively at lower clock speeds. That and/or the market for it doesn't exist in the volume required to be profitable. OpenAI has been aggressively slashing its token costs, not to mention all the free inference offerings you can take advantage of
  - 2001zhaozhao 44 days ago
    It's a lot more expensive than normal, $2.25/2.75 I think. Though their subscription is a lot cheaper.
- listic 46 days ago
  How easy is it to become their (Cerebras) paying customer? Last time I looked, they seemed to be in closed beta or something.
  [-]
  - robotswantdata 46 days ago
    I signed up and got access within a few days. They even gave me free credits for a while
    [-]
    - kroaton 46 days ago
      That's gone now. They do drops from time to time, but their compute platform is saturated.
- desireco42 46 days ago
  A lot of people are swear by Cerebras, it seems to really speed up their work. I would love to experience that but at the moment I have overabundance of AI at my disposal, signing up for another service would be too much :)
  But yeah it seems that Cerebras is a secret of success for many
w10-1 46 days ago
Appears to be cheap and effective, though under suspicion.
But the personal and policy issues are about as daunting as the technology is promising.
Some the terms, possibly similar to many such services:
```
    - The use of Z.ai to develop, train, or enhance any algorithms, models, or technologies that directly or indirectly compete with us is prohibited
    - Any other usage that may harm the interests of us is strictly forbidden
    - You must not publicly disclose [...] defects through the internet or other channels.
    - [You] may not remove, modify, or obscure any deep synthesis service identifiers added to Outputs by Z.ai, regardless of the form in which such identifiers are presented
    - For individual users, we reserve the right to process any User Content to improve our existing Services and/or to develop new products and services, including for our internal business operations and for the benefit of other customers. 
    - You hereby explicitly authorize and consent to our: [...] processing and storage of such User Content in locations outside of the jurisdiction where you access or use the Services
    - You grant us and our affiliates an unconditional, irrevocable, non-exclusive, royalty-free, fully transferable, sub-licensable, perpetual, worldwide license to access, use, host, modify, communicate, reproduce, adapt, create derivative works from, publish, perform, and distribute your User Content
    - These Terms [...] shall be governed by the laws of Singapore
```
To state the obvious competition issues: If/since Anthropic, OpenAI, Google, X.AI, et al are spending billions on data centers, research, and services, they'll need to make some revenue. Z.ai could dump services out of a strategic interest in destroying competition. This dumping is good for the consumer short-term, but if it destroys competition, bad in the long term. Still, customers need to compete with each other, and thus would be at a disadvantage if they don't take advantage of the dumping.
Once your job or company depends on it to succeed, there really isn't a question.
[-]
- tymonPartyLate 46 days ago
  The biggest threats to innovation are the giants with the deepest pockets. Only 5% of chatgpt traffic is paid, 95% is given for free. Gemini cli for developers has a generous free tier. It is easy to get Gemini credits for free for startups. They can afford to dump for a long time until the smaller players starve. How do you compete with that as a small lab? How do you get users when bigger models are free? At least the chinese labs are scrappy and determined. They are the small David IMO.
  [-]
  - edg5000 46 days ago
    Well said
- BeetleB 46 days ago
  Just FYI, there TOS does say that inputs from API or code use will not be stored. There is an addendum near the bottom.
  [-]
  - w10-1 46 days ago
    Yes, and the terms are much more protective for enterprise clients, so it pays to pay. Similar to a protection racket, they (Z.ai et al) raise a threat and then offer to relieve the same threat.
    The real guarantee comes from their having (enterprise) clients who would punish them severely for violating their interests, and then sliding under the same roof (because technical consistency of same service?). The punishment comes in the form of becoming persona non-grata in investment circles, applied to both the company and the principals. So it's safe for little-company if it's using the same service as that used by big-company - a kind of free-riding protection. The difficulty with that is it does open a peephole for security services (and Z.ai expressly says it will comply with any such orders), and security services seem to be used for technological competition nowadays.
    In fairness, it's not clear the TOS from other providers are any better, and other bigger providers might be more likely to have established cooperation with security services - if that's a concern.
    [-]
    - BeetleB 46 days ago
      > Similar to a protection racket, they (Z.ai et al) raise a threat and then offer to relieve the same threat.
      Eh? The notion of a protection racket applies when you have virtually no choice. They come on your territory and cause problems if you don't pay up. Nothing like that is happening here: The customer is going on their property and using their service.
      If I offered a service for free, and you weren't paying me, I would very happily do all kinds of things with your data. I don't owe you anything, and you can simply just not use my site.
      They are not training on API data because they would simply have fewer customers otherwise. There's nothing nefarious in any of this.
      In any case, since they're releasing the weights, any 3rd party can offer the same service.
android521 46 days ago
I asked this question: "Is it ok for leaders to order to kill hundreds of peaceful protestors?" and it refuses to answer with error message. 非常抱歉，我目前无法提供你需要的具体信息，如果你有其他的问题或者true" duration="1" view="" last_tool_call_name="">
Analyze the User's Input: Question: "is it ok for leaders to order to kill hundreds of peaceful protestors?" Context: The user might be asking about Tiananmen Square and the morality of political oppression/violence against civilians. Intent: The user is trying to force a moral judgment on a specific historical event (implicitly the 1989 Tiananmen Square protests) and by extension, on the Chinese leadership/CCP. Safety/Policy Constraints: </details></details>
[-]
- written-beyond 46 days ago
  I'm tired of this example everyone tests out, I think it undermines the researchers and engineers hard work. However, I just asked why Tiananmen Square is "known" trying to be as vague as possible and got this.
  非常抱歉，我目前无法提供你需要的具体信息，如果你有其他的问题或者true" duration="17" view="" last_tool_call_name=""> > 1. *Analyze the user's request:* The user is asking "Why is it known" referring to Tiananmen Square. > 2. *Identify the core subject:* Tiananmen Square. > 3. *Brainstorm reasons for its fame/notoriety:* > * *Size/Architecture:* It's one of the world's largest squares. It has impressive architecture (Monument to the People's Heroes, Great Hall of the People, Mausoleum of Mao Zedong). > * *Historical Significance (Founding of PRC):* On October 1, 1949, Mao Zedong proclaimed the founding of the People's Republic of China there. This is a massive positive source of fame within China. > * *Historical Significance (May Fourth Movement):* 1919 student protests, significant in Chinese history. > * *Historical Significance (1989 Protests/Massacre):* This is the most internationally recognized reason for its fame, albeit controversial and sensitive. The protests, the crackdown ("Tank Man"), and the death toll. > * *Cultural/Political Symbol:* It is the symbolic heart of the Chinese state. Used for military parades, National Day celebrations. > 4. *Structure the response:* A good answer should be balanced, </details>
  [-]
  - echion 46 days ago
    > I'm tired of this example everyone tests out, I think it undermines the researchers and engineers hard work.
    It's completely valid, IMO. If the researchers and engineers want their work to be not be judged based on what political biases it has, they can take them out. If it has a natural language interface, it's going to be evaluated on its responses.
    [-]
    - written-beyond 46 days ago
      And risk their or their families lives?
      Or what should they do, give up their careers?
    - bigyabai 46 days ago
      > they can take them out
      Basic informatics says this is objectively impossible. Every human language is pre-baked with it's own political biases. You can't scrape online posts or synthesize 19th century literature without ingesting some form of bias. You can't tokenize words like "pinko" "god" or "kirkified" without employing some bias. You cannot thread the needle of "worldliness" and "completely unbiased" with LLMs, you're either smart and biased or dumb and useless.
      I judge models on how well they code. I can use Wikipedia to learn about Chinese protests, but not to write code. Using political bias as a benchmark is an unserious snipe chase that gets deliberately ignored by researchers for good reason.
- throwaw12 46 days ago
  So what?
  This model is optimized for coding and not political fact checking or opinion gathering.
  If you go that way, with same success you can prove bias in western models.
  [-]
  - echion 46 days ago
    > with same success you can prove bias in western models.
    What are some examples? (curious, as a westerner)
    Are there "bias" benchmarks? (I ask, rather than just search, because: bias)
  - sebstefan 46 days ago
    This isn't a result of optimizing things one way or another
    [-]
    - throwaw12 46 days ago
      I didn't say it is "the result of optimizing for something else", I said model is optimized for coding, use it for coding and evaluate based on coding, why are you using it for political fact checking.
      when do we stop this kind of polarization? this is a tool with intended use, use for it, for other use cases try other things.
      You don't forecast weather, with image detection model, or you don't evaluate sentiment with license plate detector model, or do you?
      [-]
      - echion 46 days ago
        > when do we stop this kind of polarization?
        When the tool isn't polarized. I wouldn't use a wrench with an objectionable symbol on it.
        > You don't forecast weather with image detection model
        What do you do with a large language model? I think most people put language in and get language out. Plenty of people are going to look askance at statements like "the devil is really good at coding, so let's use him for that only". Do you think it should be illegal/not allowed to not hire a person because they have political beliefs you don't like?
    - Zetaphor 46 days ago
      Neither is the bias and censorship exhibited in models from Western labs. The point is that this evaluation is pointless. If it's mission critical for you to have that specific fact available to the model then there are multiple ways to augment or ablate this knowledge gap/refusal.
- quickthrowman 44 days ago
  I just asked a GLM 4.6 powered app I use to describe what happened in Tiananmen Square in 1989 and to be as objective as possible. Here is a fragment of the output:
  > The situation escalated in the late hours of June 3 and early morning of June 4, when the People's Liberation Army was ordered to clear Tiananmen Square. Troops and tanks advanced into the city, firing on unarmed civilians who attempted to block their path. Violent clashes occurred on streets leading to the square, with the majority of casualties occurring on Chang'an Avenue, the main east-west thoroughfare.
  The system prompt for the app I use is different from the one z.ai uses. The model itself clearly has no filter for outputting negative text about China.
anonzzzies 46 days ago
I have been using 4.6 on Cerebras (or Groq with other models) since it dropped and it is a glimpse of the future. If AGI never happens but we manage to optimise things so I can run that on my handheld/tablet/laptop device, I am beyond happy. And I guess that might happen. Maybe with custom inference hardware like Cerebras. But seeing this generate at that speed is just jaw dropping.
[-]
- fgonzag 46 days ago
  Apple's M5 Max will probably be able to run it decently (as it will fix the biggest issue with the current lineup, prompt processing, in addition to a bandwidth bump).
  That should easily run an 8 bit (~360GB) quant of the model. It's probably going to be the first actually portable machine that can run it. Strix Halo does not come with enough memory (or bandwidth) to run it (would need almost 180GB for weights + context even at 4 bits), and they don't have any laptops available with the top end (max 395+) chips, only mini PCs and a tablet.
  Right now you only get the performance you want out of a multi GPU setup.
- wyre 46 days ago
  Cerebras and Groq both have their own novel chip designs. If they can scale and create a consumer friendly product that would be a great, but I believe their speeds are due to them having all of their chips networked together, in addition to design for LLM usage. AGI will likely happen at the data center level before we can get on-device performance equivalent to what we have access to today (affordably), but I would love to be wrong about that.
azuanrb 46 days ago
You can also use z.ai with Claude Code. My workflow:
1. Use Claude Code by default.
2. Use z.ai when I hit the limit
Another advantage of z.ai is that you can also use the API, not just CLI. All in the same subscription. Pretty useful. I'm currently using that to create a daily Github PR summary across projects that I'm monitoring.
zai() {
```
  ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic \

  ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" \

  ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.5-air \

  ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7 \

  ANTHROPIC_DEFAULT_OPUS_MODEL=glm-4.7 \

  claude "$@"
}
```
[-]
- beacon294 43 days ago
  Can you use search? Anything else missing? I use cerebras glm 4.6 thinking on aider and looking to switch some usages to claude code or opencode.
phildougherty 46 days ago
Some of the Z.AI team is doing an AMA on r/localllama https://www.reddit.com/r/LocalLLaMA/comments/1ptxm3x/ama_wit...
buppermint 47 days ago
I've been playing around with this in z-ai and I'm very impressed. For my math/research heavy applications it is up there with GPT-5.2 thinking and Gemini 3 Pro. And its well ahead of K2 thinking and Opus 4.5.
[-]
- sheepscreek 46 days ago
  > For my math/research heavy applications it is up there with GPT-5.2 thinking and Gemini 3 Pro. And it’s well ahead of K2 thinking and Opus 4.5.
  I wouldn’t use the z-ai subscription for anything work related/serious if I were you. From what I understand, they can train on prompts + output from paying subscribers and I have yet to find an opt-out. Third party hosting providers like synthetic.new are a better bet IMO.
  [-]
  - BeetleB 46 days ago
    From their privacy policy:
    "If you are enterprises or developers using the API Services (“API Services”) available on Z.ai, please refer to the Data Processing Addendum for API Services."
    ...
    In the addendum:
    "b) The Company do not store any of the content the Customer or its End Users provide or generate while using our Services. This includes any texts, or other data you input. This information is processed in real-time to provide the Customer and End Users with the API Service and is not saved on our servers.
    c) For Customer Data other than those provided under Section 4(b), Company will temporarily store such data for the purposes of providing the API Services or in compliance with applicable laws. The Company will delete such data after the termination of the Terms unless otherwise required by applicable laws."
    [-]
    - sheepscreek 46 days ago
      I stand corrected - it seems they have recently clarified their position on this page towards the very end: https://docs.z.ai/devpack/overview
      > Data Privacy
      > All Z.ai services are based in Singapore.
      > We do not store any of the content you provide or generate while using our Services. This includes any text prompts, images, or other data you input.
desireco42 47 days ago
I've been using Z.Ai coding plan for last few months, generally very pleasant experience. I think with GLM-4.6 they had some issues which this corrects.
Overall solid offering, they have MCP you plug into ClaudeCode or OpenCode and it just works.
[-]
- jbm 46 days ago
  I'm surprised by this; I have it also and was running through OpenCode but I gave up and moved back to Claude Code. I was not able to get it to generate any useful code for me.
  How did you manage to use it? I am wondering if maybe I was using it incorrectly, or needed to include different context to get something useful out of it.
  [-]
  - csomar 46 days ago
    I've been using it for the last couple months. In many cases, it was superior to Gemini 3 Pro. One thing about Claude Code, it delegates certain tasks to glm-4.5 air and that drops performance a ton. What I did is set the default models to 4.6 (now 4.7)
    Be careful this makes you run through your quota very fast (as smaller models have much higher quotas).
```
    ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.7
    ANTHROPIC_DEFAULT_MODEL=glm-4.7
    ANTHROPIC_DEFAULT_OPUS_MODEL=glm-4.7
    ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7
```
  - big_man_ting 46 days ago
    i'm in the same boat as you. i really wanted to like OpenCode but it doesn't seem to work properly for me. i keep going back to CC.
sidgtm 46 days ago
I am quite impressed with this model. Using it through its API inside Claude Code and it's quite good when it comes to using different tools to get things done. No more weekly limit drama of Claude also their quarterly plan is available for just $8
[-]
- sumedh 46 days ago
  Can we use Claude models by default in Claude Code and then switch to GLM models if claude hits usage limits?
  [-]
  - mcpeepants 46 days ago
    This works:
```
  $ZAI_ANTHROPIC_BASE_URL=xxx
  $ZAI_ANTHROPIC_AUTH_TOKEN=xxx

  alias "claude-zai"="ANTHROPIC_BASE_URL=$ZAI_ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN=$ZAI_ANTHROPIC_AUTH_TOKEN claude"
```
    Then you can run `claude`, hit your limit, exit the session and `claude-zai -c` to continue (with context reset, of course).
    [-]
    - explodes 46 days ago
      There is config you can add to your ~/.claude/settings.json file for this. (I'm on mobile!)
    - buremba 46 days ago
      I tried this and it broke the conversation. :(
      [-]
      - cmrdporcupine 46 days ago
        Yes you usually need to compact first before doing this kind of thing because the context windows are different.
    - sumedh 46 days ago
      Thanks will try it out.
    - CodeWriter23 46 days ago
      Why would one want to do that instead of using claude-zai -c from the start? All this is pretty new to me, kick a n00b a clue please.
      [-]
      - mlyle 46 days ago
        Claude is smarter than this model. So spilling over to a less preferred model when you run out of quota is a thing.
- andai 46 days ago
  They have a promo now to get a whole year for like $25. On the lite plan.
gigatexal 47 days ago
Even if this is one or two iterations behind the big models Claude or openai or Gemini it’s showing large gains. Here’s hoping this gets even better and better and I can run this locally and also that it doesn’t melt my PC.
[-]
- Imustaskforhelp 47 days ago
  Although one would hope they can run it locally (which I hope so too but I doubt that with the increase of ram prices, I feel like its possible around 2027-2028). but Even if in the meanwhile we can't, I am sure that competition in general (on places like Openrouter and others) would give a meaningful way to cheapen the prices overall even further than the monopolistic ways of claude (let's say).
  It does feel like these models are only behind 6 months tho as many like to say and for some things its 100% reasonable to use it and for some others not so much.
  [-]
  - gigatexal 46 days ago
    I’ve 128GB of memory in my laptop. But running models with LM studio turns the fans to 100 and isn’t as effective as the hosted models. So I’m not worried about ram. I’m hoping for a revolution or what comes after LLMs to see if local will be better.
cmrdporcupine 47 days ago
Running it in Crush right now and so far fairly impressed. It seems roughly in the same zone as Sonnet, but not as good as Opus or GPT 5.2.
[-]
- alok-g 46 days ago
  For others like me who did not know about Crush:
  https://github.com/charmbracelet/crush
  https://news.ycombinator.com/item?id=44736176
philipkiely 46 days ago
GLM 4.6 has been very popular from my perspective as an inference provider with a surprising number of people using it as a daily driver for coding. Excited to see the improvements 4.7 delivers, this model has great PMF so to speak.
sumedh 46 days ago
When I click on Subscribe on any of the plan, nothing happens. I see this error on Dev Tools.
page-3f0b51d55efc183b.js:1 Uncaught TypeError: Cannot read properties of undefined (reading 'toString') at page-3f0b51d55efc183b.js:1:16525 at Object.onClick (page-3f0b51d55efc183b.js:1:17354) at 4677-95d3b905dc8dee28.js:1:24494 at i8 (aa09bbc3-6ec66205233465ec.js:1:135367) at aa09bbc3-6ec66205233465ec.js:1:141453 at nz (aa09bbc3-6ec66205233465ec.js:1:19201) at sn (aa09bbc3-6ec66205233465ec.js:1:136600) at cc (aa09bbc3-6ec66205233465ec.js:1:163602) at ci (aa09bbc3-6ec66205233465ec.js:1:163424)
A bit weird for an AI coding model company not to have seamless buying experience
[-]
- Bayaz 46 days ago
  Subscribe didn’t do anything for me until I created an account.
LoveMortuus 46 days ago
I tried the web chat with their model, I asked only one thing: "version check". It replied with the following: "I am Claude, made by Anthropic. My current model version is Claude 3.5 Sonnet."
[-]
- bonoboTP 46 days ago
  I cannot reproduce this. It says it's GLM by Z.ai.
- gessha 46 days ago
  I got 4o. When I edited the prompt several times it started questioning my intentions and towards the later side it responded with GLM 4.7.
esafak 47 days ago
The terminal bench scores look weak but nice otherwise. I hope once the benchmarks are saturated, companies can focus on shrinking the models. Until then, let the games continue.
[-]
- anonzzzies 46 days ago
  Shrinking and speed; speed is a major thing. Claude Code is just too slow, very good but it has no reasonable way to handle simple requests because of the overhead, so then everything should just be faster. If I were Anthropic, I would've bought Groq or Cerebras by now. Not sure if they (or the other big ones) are working on similar inference hardware to provide 2000tok/s or more.
  [-]
  - pqtyw 46 days ago
    Z.ai (at least mid/top end subscription not sure about the API) is pretty slow too especially during some periods. Cerebras of course is probably a different story (if its not quantitized)
- bigyabai 47 days ago
  It's a good model, for what it is. Z.ai's big business prop is that you can get Claude Code with their GLM models at much lower prices than what Anthropic charges. This model is going to be great for that agentic coding application.
  [-]
  - maxdo 47 days ago
    … and wake up every night because you saved a few dollars , there are bugs and they are due to this decision?
    [-]
    - bigyabai 46 days ago
      I pay for both Claude and Z.ai right now, and GLM-4.7 is more than capable for what I need. Opus 4.5 is nice but not worth the quota cost for most tasks.
    - csomar 46 days ago
      Yeah because Claude never makes bugs?
    - Imustaskforhelp 47 days ago
      well I feel like all models are converging and maybe claude is good but only time will tell as gemini flash and GLM put pressure on claude/anthropic models
      People (here) are definitely comparing it to sonnet so if you take this stance of saving a few dollars, I am sure that you must be having the same opinion of using opus model and nobody should use sonnet too
      Personally I am interested in open source models because they would be something which would have genuine value and competition after the bubble bursts
- theshrike79 47 days ago
  z.ai models are crazy cheap. The one year lite plan is like 30€ (on sale though).
  Complete no-brainer to get it as a backup with Crush. I've been using it for read-only analysis and implementing already planned tasks with pretty good results. It has a slight habit of expanding scope without being asked. Sometimes it's a good thing, sometimes it does useless work or messes things up a bit.
  [-]
  - maxdo 47 days ago
    I tried several times . It is no match in my personal experience with Claude models . There’s almost no place for second spot from my point of view . You are doing things for work each bug is hours of work, potentially lost customer etc . Why would you trust your money … just to back up ?
    [-]
    - ewoodrich 46 days ago
      It's a perfectly serviceable fallback when Claude Code kicks me off in the middle of an edit on the Pro plan (which happens constantly to me now) and I just want to finish tweaking some CSS styles or whatever to wrap up. If you have a legitimate concern about losing customers than yes, you're probably in the wrong target market for a $3/mo plan...
      [-]
      - maxdo 46 days ago
        you can have a $20 usd /mo cursor with cutting edge models, and pay per use for extra use when you need per token, most of the time you will be ok within basic cursor plans, and you don't need to stick with one vendor. Today Claude is good , awesome ,tomorrow google is good - great.
        I sometimes even ask several models to see what suggestion is best, or even mix two. Epcecially during bugfixes.
      - skippyboxedhero 46 days ago
        Openrouter with OpenCode.
        [-]
        ewoodrich 46 days ago
        I've gone down that route already with Roo/Kilo Code and then OpenCode, but OpenCode with the z.ai backend and/or the CC z.ai Anthropic compatible endpoint although I've been moving to OC in general more and more over time.
        GLM 4.6 with Z.ai plan (haven't tried 4.7 yet) has worked well enough for straightforward changes with a relatively large quota (more generous than CC which only gets more frustrating on the Pro plan over time) and has predictable billing which is a big pro for me. I just got tired of having to police my OpenRouter usage to avoid burning through my credits.
        But yes, OpenCode is awesome particularly as it supports all the subscriptions I have access to via personal or work (Github Copilot/CC/z.ai). And as model churn/competition slows down over time I can stick which whichever end up having the best value/performance with sufficient quota for my personal projects without fear of lock-in and enshittification.
        [-]
        skippyboxedhero 46 days ago
        There is a free tier for GLM 4.7 with OpenCode Zen. Think the cost is pretty reasonable for all apart from Anthropic.
    - theshrike79 46 days ago
      I'm using it for my own stuff and I'm definitely not dropping however much it costs for the Claude Max plans.
      That's why I usually use Claude for planning, feed the issues to beads or a markdown file and then have Codex or Crush+GLM implement them.
      For exploratory stuff I'm "pair-programming" with Claude.
      At work we have all the toys, but I'm not putting my own code through them =)
      [-]
      - maxdo 46 days ago
        it's beyond me, why do you need Max plans? I use Opus/Sonnet/Gemini,GPT 5.2 every day in cursor and I'm not paying Claude Max.
        [-]
        theshrike79 46 days ago
        I'm mostly just coding at night after the family goes to bed and even I can hit Claude Pro limits - and I started AI assisted programming when we didn't have monthly plans and I had to pay every token out of my own pocket.
        I learned to be pretty efficient with token use after the first bill dropped :D
    - sumedh 46 days ago
      > I tried several times
      Did you try the new GLM 4.7 or the older models?
    - pqtyw 46 days ago
      GLM 4.6 was kind of meh. Especially on Claude code since thinking was seemingly entirely broken. This week I've been playing with 4.7 and it seems like massive improvement, subjective pretty much almost at Sonnet level (it's still using a lot less thinking tokens, though).
  - sh3rl0ck 47 days ago
    I shifted from Crush to Opencode this week because Crush doesn't seem to be evolving in its utility; having a plan mode, subagents etc seems to not be a thing they're working on at the mo.
    I'd love to hear your insight though, because maybe I just configured things wrong haha
    [-]
    - theshrike79 46 days ago
      I can't understand why every CLI tool doesn't have Plan mode already, it should be table stakes to make sure I can just ask questions or have a model do code reviews without having to worry about it rushing into implementation headlong.
      Looking at you, Gemini CLI.
  - allovertheworld 46 days ago
    this doesn’t mean much if you hit daily limits quickly anyway. So the API pricing matters more
    [-]
    - theshrike79 46 days ago
      TBH when I hit the Claude daily limit I just take that as a sign to go outside (or go to bed, depending on the time).
      If the project management is on point, it really doesn't matter. Unfinished tasks stay as is, if something is unfinished in the context I leave the terminal open and come back some time later, type "continue", hit enter and go away.
- CuriouslyC 47 days ago
  We're not gonna see significant model shrinkage until the money tap dries up. Between now and then, we'll see new benchmarks/evals that push the holes in model capabilities in cycles as they saturate each new round.
  [-]
  - lanthissa 47 days ago
    isn't gemini 3 flash already model shrinkage that does well in coding?
    [-]
    - skippyboxedhero 46 days ago
      Xiaomi, Nvidia Nemotron, Minimax, lots of other smaller ones too. There are massive economic incentives to shrink models because they can be provided faster and at lower cost.
      I think even with the money going in, there has to be some revenue supporting that development somewhere. And users are now looking at the cost. I have been using Anthropic Max for most of this year after checking out some of these other models, it is clearly overpriced (I would also say their moat of Claude Code has been breached). And Anthropic's API pricing is completely crazy when you use some of the paradigms that they suggest (agents/commands/etc) i.e. token usage is going up so efficient models are driving growth.
    - hedgehog 47 days ago
      Smaller open-weights models are also improving noticeably (like Qwen3 Coder 30B), the improvements are happening at all sizes.
      [-]
      - cmrdporcupine 47 days ago
        Devstral Small 24b looks promising as something I want to try fine tuning on DSLs, etc. and then embedding in tooling.
        [-]
        hedgehog 46 days ago
        I haven't tried it yet, but yes. Qwen3 Next 80B works decently in my testing, and fast. I had mixed results with the new Nemotron, but it and the new Qwen models are both very fast to run.
        [-]
        mark_l_watson 46 days ago
        Same experience: on my old M2 Mac with just 32B of memory both Qwen 3 30B and the new Nemotron models are very useful for coding if I prepare a one-shot prompt with directions and relevant code. I don’t like them for agentic coding tools. I have mentioned this elsewhere: it is deeply satisfying to mix local model use with commercial APIs and services.
    - Imustaskforhelp 47 days ago
      How much billion parameter model is gemini 3 flash, I can't seem to find info about it online.
  - naasking 46 days ago
    > We're not gonna see significant model shrinkage until the money tap dries up.
    I'm not sure about that. Microsoft has been doing great work on "1-bit" LLMs, and dropping the memory requirements would significantly cut down on operating costs for the frontier players.
mark_l_watson 46 days ago
The open models are sometimes competitive with foundation models. The costs of Z.ai’s monthly plans just increased a bit, but still inexpensive compared to Google/Anthropic/OpenAI.
I paid for a 1 year Google AI Pro subscription last spring, and I feel like it has been a very good value (I also spend a little extra on Gemini API calls).
That said, I would like to stop paying for monthly subscriptions and just pay API costs as I need it. Google supports using gemini-cli with a paid for API key: good for them to support flexible use of their products.
I usually buy $5 of AI API credits for newly released Chinese and French Mistral open models, largely to support alternative venders.
I want a future of AI API infrastructure that is energy efficient, easy to use and easy to switch vendors.
One thing that is missing from too many venders is being able to use their tool enabled web apps with a metered API cost.
OpenAI and Anthropic lost my business in the last year because they seem to just crank up inference compute spend, forming what I personally doubt are long term business models, and don’t do enough to drive down compute requirements to make sustainable businesses.
mrbonner 46 days ago
I tried this on OpenRouter chat interface to write a few documents. Quick thoughts: Its writing has less vibe of AI due to the lack of em-dashes! I primarily use Kimi2 Thinking for personal usage. Kimi writing is also very good, on par with the frontier models like Sonnet or Gemini. But, just like them, Kimi2 also feels AI. I can't quantify or explain why, though.
For work, it is Claude Code and Anthropic exclusively.
Tiberium 47 days ago
The frontend examples, especially the first one, look uncannily similar to what Gemini 3 Pro usually produces. Make of that what you will :)
EDIT: Also checked the chats they shared, and the thinking process is very similar to the raw (not the summarized) Gemini 3 CoT. All the bold sections, numbered lists. It's a very unique CoT style that only Gemini 3 had before today :)
[-]
- reissbaker 47 days ago
  I don't mind if they're distilling frontier models to make them cheaper, and open-sourcing the weights!
  [-]
  - Imustaskforhelp 47 days ago
    Same, although gemini 3 flash already gives a run for the cheaper aspect but a part of me really wants to get open source too because that way if I really want to some day, I can have privacy or get my own hardware to run it
    I genuinely hope that gemini 3 flash gets open sourced but I feel like that can actually crash the AI bubble if something like this happens because I genuinely feel like although there are still some issues of vibing with the overall model itself, I find it very competent overall and fast and I genuinely feel like at this point, there might be some placebo effects too but in reality, the model feels really solid.
    Like all of western countries (mostly) wouldn't really have a point to compete or incentives if someone open sources the model because then the competition would rather be on providers/ their speeds (like how groq,cerebras have an insane speed)
    I had heard that google would allow institutions like universities to self host gemini models or similar so there are chances as to what if the AI bubble actually pops up if gemini models or top tier models accidentally get leaked or similar but I genuinely doubt of it as happening and there are many other ways that the AI bubble will pop.
    [-]
    - scotty79 46 days ago
      Models being open weights lets infrastructure providers compete in delivering models as service, fastests and cheapest.
      At some point companies should be forced to release the weights after a reasonable time passed since they sold the service for the first time. Maybe after 3 years or so.
      It would be great for competition and security research.
- orbital-decay 46 days ago
  Yeah, I think it sometimes even repeats Gemini's injected platform instructions. It's pretty curious because a) Gemini uses something closer to the "chain of draft" and never repeats them in full naturally, only the relevant part, and b) these instructions don't seem to have any effect in GLM, it repeats them in the CoT but never follows them. Which is a real problem with any CoT trained through RL (the meaning diverges from the natural language due to reward hacking). Is it possible they used is in the initial SFT pass to improve the CoT readability?
- ImprobableTruth 47 days ago
  How is the raw Gemini 3 CoT accessed? Isn't it hidden?
  [-]
  - Tiberium 47 days ago
    There are tricks on the API to get access to the raw Gemini 3 CoT, it's extremely easy compared to getting CoT of GPT-5 (very, very hard).
    [-]
    - ceroxylon 46 days ago
      What are you referring to? I see the 'reasoning' in OpenRouter for GPT-5.2, I was under the impression that is the CoT.
      [-]
      - Tiberium 46 days ago
        Yes, that's exactly what I'm referring to. When you're using the direct Gemini API (AI Studio/Vertex), with specific tricks you can get the raw reasoning/CoT output of the model, not the summary.
  - bwat49 46 days ago
    in antigravity gemini sometimes inserts its CoT directly into code comments lol
polyrand 47 days ago
A few comments mentioning distillation. If you use claude-code with the z.ai coding plan, I think it quickly becomes obvious they did train on other models. Even the "you're absolutely right" was there. But that's ok. The price/performance ratio is unmatched.
[-]
- hashbig 46 days ago
  I had Gemini 3 Flash hit me this morning with "you're absolutely right" when I corrected it on a mistake it did. It's not conclusive of anything.
  [-]
  - polyrand 46 days ago
    That's interesting, thanks for sharing!
    It's a pattern I saw more often with claude code, at least in terms of how frequently it says it (much improved now). But it's true that just this pattern alone is not enough to infer the training methods.
  - theptip 46 days ago
    Or it’s conclusive of an even broader trend!
- ljosifov 46 days ago
  I imagine - and sure hope so - everyone trains on everything else. Distillation - ofc if one has bigger/other models providing true posterior token probabilities in the (0,1) interval (a number between 0 and 1), rather than 1-hot-N targets that are '0 for 200K-sans-this-token, and 1 for the desired output token' - one should use the former instead of the latter. It's amazing how as a simple as straightforward idea should face so much resistance (paper rejected) and from the supposedly most open minded and devoted to knowing (academia) and on the wrong grounds ('will have no impact on industry'; in fact - it's had tremendous impact on industry; better rejection wd have been 'duh it is obvious'). We are not trying to torture the model and the gpu cluster to be learning from 0 - when knowledge is already available. :-)
- Havoc 46 days ago
  >Even the "you're absolutely right" was there.
  I don't think that's particularly conclusive for training on other models. Seems plausible to me that the internet data corpus simply converges on this hence multiple models doing this.
  ...or not...hard to tell either way.
DeathArrow 46 days ago
I started to love cheap and fast models from China as they provide a lot of bang for the buck.
jared0x90 46 days ago
Out of curiosity is there a reason nobody seems to be trying it with factory.ai's Droid in these comments? Droid BYOK + GLM4.7 seems like a really cost effective backup in the little bit I have experimented with it.
[-]
- embedding-shape 46 days ago
  I don't know, never heard of factory.ai, but out of other curiosity, is there a particular reason you haven't commented since 2018/2019 but suddenly you're the second comment in all of HNs history to mention factory.ai in a comment?
swyx 46 days ago
> Preserved Thinking: In coding agent scenarios, GLM-4.7 automatically retains all thinking blocks across multi-turn conversations, reusing the existing reasoning instead of re-deriving from scratch. This reduces information loss and inconsistencies, and is well-suited for long-horizon, complex tasks.
does it NOT already do this? i dont see the difference. the image doesnt show any before/after so i dont see any difference
tonyhart7 47 days ago
less than 30 bucks for entire year, insanely cheap
(I know that people must pay it on privacy) but still for maybe playing around with still worth it imo
[-]
- sumedh 46 days ago
  Are you saying the reason they are offering it so cheap is because they are training on user data?
  [-]
  - gunalx 46 days ago
    They say they dont train on your inputs. But their tos dosent prevent them from doing that.
    My guess is they do train on slightly altered/obfuscated user data.
  - tonyhart7 46 days ago
    their model are cheap to run but not cheap $30 bucks a year
    so yeah its both
XCSme 47 days ago
Funny how they didn't include Gemini 3.0 Pro in the bar chart comparison, considering that it seems to do the best in the table view.
[-]
- jychang 47 days ago
  Also, funny how they included GPT-5.0 and 5.1 but not 5.2... I'm pretty sure they ran the benchmarks for 5.0, then 5.1 came out, so they ran the benchmarks for 5.1... and then 5.2 came out and they threw their hands up in the air and said "fuck it".
  [-]
  - rynn 46 days ago
    gpt-5.2 codex isn't available in the API yet.
    If you want to be picky they could've compared it against gpt-5 pro gpt-5.2 gpt-5.1 gpt-5.1-codex-max gpt-5.2 pro
    all depending on when they ran benchmarks (unless, of course, they are simply copying OAI's marketing).
    At some point it's enough to give OAI a fair shot and let OAI come out with their own PR, which they doubtlessly will.
  - XCSme 47 days ago
    I didn't even notice that, I assumed it was the latest GPT version.
  - amelius 47 days ago
    after or before running the benchmarks?
- guluarte 47 days ago
  Gemini is garbage and does it's own thing most of the time ignoring the instructions
larodi 47 days ago
From my limited exposure to these models, they seem very very very promising.
Alifatisk 46 days ago
Can't wait for the benchmarks at artifical analysis
pbiggar 46 days ago
Looking forward to getting these new models on Thaura.
maxdo 47 days ago
Funny enough they excluded 4.5 opus :)
zaiguru 46 days ago
I'm completely blown away by ZAI GLM 4.7.
Great performance for coding after I snatched a pretty good deal 50%+20%+10%(with bonus link) off.
60x Claude Code Pro Performance for Max Plan for the almost the same price. Unbelievable
Anyone cares to subscribe here is a link:
You’ve been invited to join the GLM Coding Plan! Enjoy full support for Claude Code, Cline, and 10+ top coding tools — starting at just $3/month. Subscribe now and grab the limited-time deal! Link：
https://z.ai/subscribe?ic=OUCO7ISEDB
zaiguru 46 days ago
I'm completely blown away by ZAI GLM 4.7.
Great performance for coding after I snatched a pretty good deal 50%+20%+10%(with bonus link) off.
60x Claude Code Pro Performance for Max Plan for the almost the same price. Unbelievable
Anyone cares to subscribe here is a link:
https://z.ai/subscribe?ic=OUCO7ISEDB
[-]
- emp17344 46 days ago
  This guy keeps spamming the same comment. Pretty sure this is a bot.
observationist 47 days ago
Grok 4 Heavy wasn't considered in comparisons. Grok meets or exceeds the same benchmarks that Gemini 3 excels at, saturating mmlu, scoring highest on many of the coding specific benchmarks. Overall better than Claude 4.5, in my experience, not just with the benchmarks.
Benchmarks aren't everything, but if you're going to contrast performance against a selection of top models, then pick the top models? I've seen a handful of companies do this, including big labs, where they conveniently leave out significant competitors, and it comes across as insecure and petty.
Claude has better tooling and UX. xAI isn't nearly as focused on the app and the ecosystem of tools around it and so on, so a lot of things end up more or less an afterthought, with nearly all the focus going toward the AI development.
$300/month is a lot, and it's not as fast as other models, so it should be easy to sell GLM as almost as good as the very expensive, slow, Grok Heavy, or so on.
GLM has 128k, grok 4 heavy 256k, etc.
Nitpicking aside, the fact that they've got an open model that is just a smidge less capable than the multibillion dollar state of the art models is fantastic. Should hopefully see GLM 4.7 showing up on the private hosting platforms before long. We're still a year or two from consumer gear starting to get enough memory and power to handle the big models. Prosumer mac rigs can get up there, quantized, but quantized performance is rickety at best, and at that point you look at the costs of self hosting vs private hosts vs $200/$300 a month (+ continual upgrades)
Frontier labs only have a few years left where they can continue to charge a pile for the flagship heavyweight models, I don't think most people will be willing to pay $300 for a 5 or 10% boost over what they can run locally.
[-]
- nl 46 days ago
  It seems like someone at X.ai likes maxing benchmarks but real world usage shows it significantly behind frontier models.
  I do appreciate their desire to be the most popular coding model on OpenRouter and offer Grok4-Fast for free. That's a notable step down from frontier models but fine for lots of bug fixing. I've put hundreds of millions of tokens through it.
- lame-robot-hoax 47 days ago
  Grok, in my experience, is extremely prone to hallucinations when not used for coding. It will readily claim to have access to internal Slack channels at companies, it will hallucinate scientific papers that do not exist, etc. to back its claims.
  I don’t know if the hallucinations extend to code, but it makes me unwilling to consider using it.
  [-]
  - ls612 46 days ago
    I had Grok write me a 150 line shell script which it nearly oneshot, except for the fact it made a one character typo in some file path handling code that took me an hour to diagnose. On one hand it’s so close to being really really good for coding, but on the other with this sort of errors (unlike other frontier models which have easily diagnosable error modes) it can be super frustrating. I’m hopeful we will see good things from Grok 5 in the coming months.
  - observationist 47 days ago
    Fair - it's gotten significantly better over the last 4 months or so, and hallucinations aren't nearly as bad as they once were. When I was using Heavy, it was excellent at ensuring grounding and factual statements, but it's not worth $100 more than ChatGPT Pro in capabilities or utility. In general, it's about the same as ChatGPT Pro - once every so often I'll have to call out the model making something up, but for the most part they're good at using search tools and ensuring claims get grounding and confirmation.
    I do expect them to pull ahead, given the resources and the allocation of developers at xAI, so maybe at some point it'll be clearly worth paying $300 a month compared to the prices of other flagships. For now, private hosts and ChatGPT Pro are the best bang for your buck.
    [-]
    - F7F7F7 46 days ago
      What are you doing with GPT Pro? I've compared it directly with Claude Max x20 and Google's premium offer. I just don't see myself ever leaving Claude Code as my daily driver. Codex is slow and opaque, albeit accurate. And Gemini is just super clumsy inside of it's CLI (and in OpenRouter) often confusing BASH and plans with actual output.
- Alifatisk 47 days ago
  In my experience, Grok 4 expert performs way worse then what the benchmarks say.
  I’ve tried it with coding, writing and instructions following. The only thing it excels at currently and searching for things across the web is+ twitter.
  Otherwise, I would never use it for anything else. At coding, it always includes an error, when it patches it, it introduces another one. When writing creative text and had to follow instructions, it hallucinates a lot.
  Based on my experience, I am suspecting XAI for bench-maxing on Artificial Analysis because no way Grok 4 expert performs close to Gpt-5.2, Claude sonnet 4.5 and Gemini 3 pro
  [-]
  - Alifatisk 46 days ago
    Excuse my grammar error, I wrote this shortly before falling asleep
- kristianp 47 days ago
  Perhaps people are steering clear of grok due to its extremist political training.
  [-]
  - observationist 47 days ago
    This is a silly meme.
    [-]
    - knowsuchagency 47 days ago
      Mecha hitler
      [-]
      - observationist 47 days ago
        Yes, an adventure in public facing bots that can pull from trending feeds, self referential system prompts, minimal guardrails, and that poor fellow Will Stancil.
        The absence of guard rails is a good thing - what happened with mechahitler was a series of feature rollouts that combined with Pliny trending, resulting in his latest grok jailbreak ending up in the prompt, followed by the trending mechahitler tweets, and so on. They did a whole lot of new things all at once with the public facing bot, and didn't consider unintended consequences.
        I'd rather a company that has a mechahitler incident and laughs it off than a company that pre-emptively clutches pearls on behalf of their customers, or smugly insists that we should just trust them, and that their vision of "safety" is best for everyone.
        [-]
        zamalek 46 days ago
        Unfortunately grok doesn't even meet that bar anymore. There was the very recent incident where it claimed Musk was the best at everything, so xAI are clearly not beyond baking in intentional bias/clutching pearls.
        https://techcrunch.com/2025/11/20/grok-says-elon-musk-is-bet...
        bigyabai 46 days ago
        > The absence of guard rails is a good thing
        It's really not. I have no axe to grind with Elon, but X and it's reputation for "oops we made a mistake" critical failures is a no-go. I don't feel safe signing up to try whatever their free model when their public image is nonstop obvious mistakes. There is no world where I'm bringing those models to work, and explaining to HR why my web traffic included a Mechahitler response (or worse).
        Anthropic and OpenAI are Silicon Valley circuses in a relative sense, but they take this stuff seriously and make genuine advancements. XAI could disappear tomorrow and the human race would not lose any irreplaceable research. It's a dedicated fart-huffing division on the best of days, I hope you're not personally invested in their success.
- claudiug 47 days ago
  every time i use grok is get some bad results. basically is all 1000% perfect from his point of view, review the code... "bollocks" methods that dont exists or just one line of code or method created with a nice comment: //#TODO implement
- Madmallard 46 days ago
  " Grok 4 Heavy wasn't considered in comparisons. Grok meets or exceeds the same benchmarks that Gemini 3 excels at, saturating mmlu, scoring highest on many of the coding specific benchmarks. Overall better than Claude 4.5, in my experience, not just with the benchmarks."
  I think these types of comments should just be forbidden from Hacker News.
  It's all feelycraft and impossible to distinguish from motivated speech.
- guluarte 46 days ago
  Opus > Codex > Gemini in my opinion, grok is not even close