How to run Qwen 3.5 locally

(unsloth.ai)

101 points | by Curiositry 8 hours ago

4 comments

  • moqizhengz 2 hours ago
    Running 3.5 9B on my ASUS 5070ti 16G with lm studio gives a stable ~100 tok/s. This outperforms the majority of online llm services and the actual quality of output matches the benchmark. This model is really something, first time ever having usable model on consumer-grade hardware.
    • throwdbaaway 1 hour ago
      There are Qwen3.5 27B quants in the range of 4 bits per weight, which fits into 16G of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp.
      • codemog 30 minutes ago
        Can someone explain how a 27B model (quantized no less) ever be comparable to a model like Sonnet 4.0 which is likely in the mid to high hundreds of billions of parameters?

        Is it really just more training data? I doubt it’s architecture improvements, or at the very least, I imagine any architecture improvements are marginal.

        • otabdeveloper4 14 minutes ago
          There's diminishing returns bigly when you increase parameter count.

          The sweet spot isn't in the "hundreds of billions" range, it's much lower than that.

          Anyways your perception of a model's "quality" is determined by careful post-training.

          • zozbot234 1 minute ago
            More parameters improves general knowledge a lot, but you have to quantize higher in order to fit in a given amount of memory, which if taken to extremes leads to erratic behavior. For casual chat use even Q2 models can be compelling, agentic use requires more regularization thus less quantized parameters and lowering the total amount to compensate.
      • zozbot234 21 minutes ago
        With MoE models, if the complete weights for inactive experts almost fit in RAM you can set up mmap use and they will be streamed from disk when needed. There's obviously a slowdown but it is quite gradual, and even less relevant if you use fast storage.
      • teaearlgraycold 43 minutes ago
        Qwen3.5 35B A3B is much much faster and fits if you get a 3 bit version. How fast are you getting 27B to run?

        On my M3 Air w/ 24GB of memory 27B is 2 tok/s but 35B A3B is 14-22 tok/s which is actually usable.

        • ece 21 minutes ago
          The 27B is rated slightly higher for SWE-bench.
    • yangikan 2 hours ago
      Do you point claude code to this? The orchestration seems to be very important.
      • teaearlgraycold 34 minutes ago
        I loaded Qwen into LM Studio and then ran Oh My Pi. It automatically picked up the LM Studio API server. For some reason the 35B A3B model had issues with Oh My Pi's ability to pass a thinking parameter which caused it to crash. 27B did not have that issue for me but it's much slower.

        Here's how I got the 35B model to work: https://gist.github.com/danthedaniel/c1542c65469fb1caafabe13...

        The 35B model is still pretty slow on my machine but it's cool to see it working.

    • lukan 1 hour ago
      What exact model are you using?

      I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B and 8-bit -> 13 GB and 27B and 3-bit seem to fit inside the memory. Or is there more space required for context etc?

      • vasquez 8 minutes ago
        It depends on the task, but you generally want some context. These models can do things like OCR and summarize a pdf for you, which takes a bit of working memory. Even more so for coding CLIs like opencode-ai, qwen code and mistral ai.

        Inference engines like llama.cpp will offload model and context to system ram for you, at the cost of performance. A MoE like 35B-A3B might serve you better than the ones mentioned, even if it doesn't fit entirely on the GPU. I suggest testing all three. Perhaps even 122-A10B if you have plenty of system ram.

        Q4 is a common baseline for simple tasks on local models. I like to step up to Q5/Q6 for anything involving tool use on the smallish models I can run (9B and 35B-A3B).

        Larger models tolerate lower quants better than small ones, 27B might be usable at 3 bpw where 9B or 4B wouldn't. You can also quantize the context. On llama.cpp you'd set the flags -fa on, -ctk x and ctv y. -h to see valid parameters. K is more sensitive to quantization than V, don't bother lowering it past q8_0. KV quantization is allegedly broken for Qwen 3.5 right now, but I can't tell.

  • Curiositry 2 hours ago
    Qwen3.5 9b seems to be fairly competent at OCR and text formatting cleanup running in llama.cpp on CPU, albeit slow. However, I have compiled it umpteen ways and still haven't gotten GPU offloading working properly (which I had with Ollama), on an old 1650 Ti with 4GB VRAM (it tries to allocate too much memory).
    • acters 1 hour ago
      I have a 1660ti and the cachyos + aur/llama.cpp-cuda package is working fine for me. With about 5.3 GB of usable memory, I find that the 35B model is by far the most capable one that performs just as fast as the 4B model that fits entirely on my GPU. I did try the 9B model and was surprisingly capable. However 35B still better in some of my own anecdotal test cases. Very happy with the improvement. However, I notice that qwen 3.5 is about half the speed of qwen 3
    • WhyNotHugo 1 hour ago
      If you’re building from source, the vulkan backend is the easiest to build and use for GPU offloading.
      • Curiositry 55 minutes ago
        Yes, that's what I tried first. Same issue with trying to allocate more memory than was available.
  • Twirrim 5 hours ago
    I've been finding it very practical to run the 35B-A3B model on an 8GB RTX 3050, it's pretty responsive and doing a good job of the coding tasks I've thrown at it. I need to grab the freshly updated models, the older one seems to occasionally get stuck in a loop with tool use, which they suggest they've fixed.
    • fy20 2 hours ago
      I guess you are doing offloading to system RAM? What tokens per second do you get? I've got an old gaming laptop with a RTX 3060, sounds like it could work well as a local inference server.
      • manmal 54 minutes ago
        In the article, they claim up to 25t/s for the LARGEST model with a 24GB VRAM card. Need a lot of RAM obviously
    • ufish235 4 hours ago
      Can you give an example of some coding tasks? I had no idea local was that good.
      • hooch 2 hours ago
        Changed into a directory recently and fired up the qwen code CLI and gave it two prompts: "so what's this then?" - to which it had a good summary across stack and product, and then "think you can find something todo in the TODO?" - and while I was busy in Claude Code on another project, it neatly finished three HTML & CSS tasks - that I had been procrastinating on for weeks.

        This was a qwen3-coder-next 35B model on M4 Max with 64GB which seems to be 51GB size according to ollama. Have not yet tried the variants from the TFA.

        • manmal 50 minutes ago
          3.5 seems to be better at coding than 3-coder-next, I’d check it out.
    • fragmede 4 hours ago
      Which models would that be?
  • vvram 33 minutes ago
    What would be optimal HW configurations/systems recommended?