Phi-4 Bug Fixes

(unsloth.ai)

60 points | by danielhanchen 4 hours ago

6 comments

  • danielhanchen 4 hours ago
    Hey HN family! I found a few bugs in Phi-4 - Microsoft's latest MIT-licensed LLM, said to be on par with GPT-4o mini:

    1. The end-of-sequence (EOS) token should be <|im_end|>, not <|endoftext|>

    2. The chat template should not auto-add an assistant prompt

    3. The padding token should not reuse EOS but should be <|dummy_87|> (padding with EOS masks real EOS tokens out of the training labels)
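
    If you want to apply fixes 1 and 3 by hand, here's a rough sketch with Hugging Face transformers (just a sketch - the uploads below already include everything, and fix 2 is the chat template change shown further down):

        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

        # Fix 1: generation should stop on <|im_end|>, not <|endoftext|>
        tokenizer.eos_token = "<|im_end|>"

        # Fix 3: padding must not reuse EOS, otherwise EOS gets masked out
        # of the labels during finetuning; <|dummy_87|> is an unused token
        tokenizer.pad_token = "<|dummy_87|>"

        tokenizer.save_pretrained("phi-4-fixed")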

    I also converted Phi-4 to the Llama architecture. I uploaded GGUFs, 4-bit quants, dynamic quants, and all the fixes to https://huggingface.co/unsloth

    I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...

    • simonw 2 hours ago
      Huh! That may explain why I kept on getting visible <|im_end|> output when I tried running a Phi-4 GGUF file using llama.cpp.
      • danielhanchen 1 hour ago
        Oh yes exactly! I trimmed it out now :)

        The better chat template should be:

        {% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
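
        To sanity check it, something like this works (a quick sketch; FIXED_TEMPLATE stands for the Jinja string above):

            from transformers import AutoTokenizer

            tokenizer = AutoTokenizer.from_pretrained("unsloth/phi-4")
            tokenizer.chat_template = FIXED_TEMPLATE  # the Jinja string above

            messages = [{"role": "user", "content": "Hello!"}]

            # add_generation_prompt=True appends the trailing assistant header;
            # with False, nothing is auto-added after the last message (fix 2)
            print(tokenizer.apply_chat_template(messages, tokenize=False,
                                                add_generation_prompt=True))
            # -> <|im_start|>user<|im_sep|>Hello!<|im_end|><|im_start|>assistant<|im_sep|>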

    • CGamesPlay 1 hour ago
      > We converted Phi-4 to Llama’s architecture for better accuracy and easier use.

      What does this mean? When I think about "model architecture", I think about the number of weights in each layer, the organization of the layers, etc. And AFAIK, it's untenable to "port" a model from one architecture to another without effectively retraining it. So what does it actually mean to "convert to Llama's architecture"?

      • danielhanchen 46 minutes ago
        Oh, Phi-4's architecture is inspired by Llama itself, except they merged the Q, K and V attention matrices into 1 large matrix for better FLOP utilization, and likewise merged the gate/up matrices in the MLP.

        Phi-3 used to use sliding window attention, but they got rid of that in Phi-4.

        So, you can "Mistral-fy" Phi-3 and convert it to the Mistral arch (by unmerging the merges), and likewise "Llama-fy" Phi-4 to the Llama arch.
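
        Roughly, the conversion is just un-stacking the fused weights into a Llama-style checkpoint. A sketch (module names follow HF's Phi-3 modeling code, so treat them as assumptions):

            import torch
            from transformers import AutoModelForCausalLM

            model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4",
                                                         torch_dtype=torch.bfloat16)
            cfg = model.config
            head_dim = cfg.hidden_size // cfg.num_attention_heads
            q_size = cfg.num_attention_heads * head_dim
            kv_size = cfg.num_key_value_heads * head_dim  # smaller than q_size under GQA

            for layer in model.model.layers:
                # fused attention: output rows are stacked as [Q; K; V]
                q, k, v = torch.split(layer.self_attn.qkv_proj.weight,
                                      [q_size, kv_size, kv_size], dim=0)
                # fused MLP: output rows are stacked as [gate; up]
                gate, up = torch.chunk(layer.mlp.gate_up_proj.weight, 2, dim=0)
                # ...then copy q/k/v/gate/up into q_proj, k_proj, v_proj,
                # gate_proj, up_proj of a freshly initialised LlamaForCausalLM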

        The reason accuracy improves in finetuning is that with LoRA on the merged QKV you learn only 1 A matrix shared across Q, K and V, whereas unmerging creates 3 independent A matrices - this gives the model more freedom to learn new features.
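
        Back-of-the-envelope, ignoring GQA (hidden size h, LoRA rank r):

            h, r = 5120, 16  # Phi-4's hidden size and an arbitrary LoRA rank

            # fused qkv_proj (3h x h): one shared A plus one B
            fused = r * h + (3 * h) * r       # A: (r, h), B: (3h, r)

            # split q/k/v (each h x h): three independent A/B pairs
            split = 3 * (r * h + h * r)       # 3x A: (r, h), 3x B: (h, r)

            print(fused, split)  # B params match; the split gives 3 separate A subspaces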

      • Sn0wCoder 1 hour ago
        I'd guess GGUF, so you can run it in llama.cpp, LM Studio, etc., but OP can hopefully clarify further for you.
        • danielhanchen 45 minutes ago
          Yep, converting to the Llama arch definitely makes accessibility much better - many fast LLM serving libraries support Llama out of the box, so it's easier to port and use!
    • sunaookami 2 hours ago
      Wasn't Phi-3 also bugged/is still bugged? Seems like Microsoft just doesn't care.

      >to be on par with GPT-4o mini

      Phi is known to overfit benchmarks. It's way, way worse than that.

    • sroussey 43 minutes ago
      Can you convert it to ONNX so I can try it in a web browser?
  • t1amat 2 hours ago
    Daniel’s fixes to Phi-4 make it the best-scoring Phi-4 on HF’s Open LLM Leaderboard. Great job on that.

    Unsloth is a masterpiece, keep up the great work!

  • lostmsu 2 hours ago
    The benchmark results of the model before and after the "fixes" do not match the numbers reported in the model card: https://huggingface.co/microsoft/phi-4

    According to Microsoft, the MATH score should be 80.4, while both the original and the "fixed" models, as run by unsloth, score just over 12.3. So either Microsoft made a few huge mistakes, or unsloth was not able to run their model correctly.

    • danielhanchen 1 hour ago
      Oh yes I found this to be a bit strange - I uploaded our versions and Microsoft's own version to Hugging Face's public LLM leaderboard - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

      You can see Microsoft's own original Phi-4 scores 12.31% - I'm unsure why. My fixes at least push it to 20%.

      It's possibly because HF's benchmark uses strict scoring ("Exact match: was the generated solution correct and in the expected format?"), which might be the issue.
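
      i.e. something like this strict check (my guess at the failure mode, not the harness's actual code):

          import re

          def exact_match(generation: str, gold: str) -> bool:
              # hypothetical strict scorer: take the final \boxed{...} answer
              # and compare verbatim - a correct answer in the wrong format
              # (no box, extra text, "1/2" vs "0.5") scores zero
              matches = re.findall(r"\\boxed\{([^}]*)\}", generation)
              return bool(matches) and matches[-1].strip() == gold.strip()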

  • adultSwim 1 hour ago
    Are there alternatives to unsloth?

    I would love to use it but the open/free version only handles one GPU, and it's unclear how much the paid version would cost. I have some limited access to multiple older NVidia cards and would love to make better use of them while I'm still learning. My budget for learning/projects is rather modest.

    Hopefully they succeed. At work I could make a strong case for going with them, since they let you keep data fully local instead of relying on an API.

    • danielhanchen 44 minutes ago
      Multi-GPU support is definitely coming to Unsloth OSS! Our goal was to release it this month, but I'm unsure of exact timelines - maybe next month!!
  • make3 1 hour ago
    "Yes it improves performance!" proceeds to show the most unconvincing stats ever

    you can probably blow on your GPU and get a similar performance change

  • TZubiri 1 hour ago
    Ah yes, drawing ASCII art, the de facto benchmark for evaluating LLM quality.