> We converted Phi-4 to Llama’s architecture for better accuracy and easier use.
What does this mean? When I think about "model architecture", I think about the number of weights in each layer, the organization of the layers, etc. And AFAIK, it's untenable to "port" a model from one to the other without effectively retraining it. So what does it actually mean to "convert to Llama's architecture"?
Oh, Phi-4's architecture is inspired by Llama itself, except they merged the attention matrices into 1 large matrix for better FLOP utilization, and likewise the gate/up matrices in the MLP.
Phi-3 used to use sliding window attention, but they got rid of that in Phi-4.
So, you can "Mistral-fy" Phi-3 and convert it to the Mistral arch (by unmerging the merged matrices), and now you can "Llama-fy" Phi-4 to the Llama arch.
The reason accuracy increases in finetuning is that during LoRA finetuning you learn only 1 A matrix for the merged QKV, whilst unmerging it creates 3 A matrices - this gives the model more freedom to learn new features.
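To make the unmerging concrete, here's a minimal PyTorch sketch of slicing a fused QKV projection back into separate Q/K/V weights - the head counts and hidden size are illustrative assumptions for this sketch, not values read from the actual checkpoint:

    import torch

    # Illustrative dimensions only - assumed for this sketch.
    hidden_size = 5120
    n_q_heads, n_kv_heads, head_dim = 40, 10, 128

    # A fused projection stacks the Q, K and V weights row-wise into one matrix.
    fused_qkv = torch.randn((n_q_heads + 2 * n_kv_heads) * head_dim, hidden_size)

    # "Llama-fying" slices it back into three separate projections, so LoRA can
    # attach an independent A matrix to each of q_proj, k_proj and v_proj.
    q_end = n_q_heads * head_dim
    k_end = q_end + n_kv_heads * head_dim
    q_proj = fused_qkv[:q_end]
    k_proj = fused_qkv[q_end:k_end]
    v_proj = fused_qkv[k_end:]

    # Same idea for the MLP: a fused gate_up_proj splits in half into gate/up.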
Yep, converting to Llama arch definitely makes accessibility much better - many fast LLM serving libraries support Llama out of the box, so it's easier to port and use!
Anecdotally, I've been experimenting with Phi-4 the past hour or so (so, yeah, not very comprehensive) and it's certainly a strong model. Definitely better than the previous Phi models.
According to Microsoft, the MATH score should be 80.4, while both the original and the "fixed" models as run by unsloth only score just over 12.3. So either Microsoft made a few huge mistakes, or unsloth was not able to run their model correctly.
You can see Microsoft's own original Phi-4 scores 12.31% - I'm unsure why. My fixes at least push it to 20%.
It's possibly because HF's benchmark does "Scoring: Exact match: Was the solution generated correct and in the expected format", which might be the issue.
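As a toy illustration of why exact-match scoring punishes formatting (my own example, not the actual harness code):

    def exact_match(prediction: str, gold: str) -> bool:
        # The answer must be both correct and in the expected format, so a
        # correct answer phrased differently still scores zero.
        return prediction.strip() == gold.strip()

    print(exact_match("4", "4"))                 # True
    print(exact_match("The answer is 4.", "4"))  # False - right answer, wrong format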
I would love to use it but the open/free version only handles one GPU, and it's unclear how much the paid version would cost. I have some limited access to multiple older NVIDIA cards and would love to make better use of them while I'm still learning. My budget for learning/projects is rather modest.
Hopefully they succeed. At work I could make a strong case for going with them as they allow keeping data local only, instead of relying on an API.
The multiple bug fixes are separate from the finetuning sections - Unsloth itself makes finetuning 2x faster and uses 70% less memory - the bug fixes are totally detached from finetuning, i.e. you can take the fixed version we uploaded at https://huggingface.co/unsloth/phi-4 and use it in any framework or inference engine.
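For example, loading the fixed upload with plain transformers should just work (a minimal sketch; adjust dtype/device settings for your hardware):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # device_map="auto" needs the accelerate package installed.
    model = AutoModelForCausalLM.from_pretrained(
        "unsloth/phi-4", torch_dtype="auto", device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("unsloth/phi-4")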
Anecdotal evidence was provided to show some Redditors tested it out - but I do agree it's not correct to show that as an example - so I uploaded our fixed versions to Hugging Face's public LLM leaderboard here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_... - this shows the fixes do in fact work!
1. The EOS (end-of-sequence) token should be <|im_end|>, not <|endoftext|>
2. The chat template should not auto-add an assistant prompt
3. The padding token should not be the EOS token but <|dummy_87|>; a sketch applying these fixes follows below
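A minimal sketch of what fixes 1 and 3 look like applied via transformers - the unsloth/phi-4 upload already ships with all of this baked in, so this is purely illustrative:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

    tokenizer.eos_token = "<|im_end|>"    # fix 1: EOS should be <|im_end|>, not <|endoftext|>
    tokenizer.pad_token = "<|dummy_87|>"  # fix 3: padding token must not be the EOS token

    # Fix 2 lives in the chat template itself: only append the assistant header
    # when add_generation_prompt=True (see the corrected template below).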
I also converted Phi-4 to Llama-arch. I uploaded GGUFs, 4bit quants, dynamic quants and all fixes to https://huggingface.co/unsloth
I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...
The better chat template should be:
{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
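As a quick sanity check of the template (assuming the fixed tokenizer; the output is one string, wrapped here for readability):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("unsloth/phi-4")
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Expected output:
    # <|im_start|>system<|im_sep|>You are a helpful assistant.<|im_end|>
    # <|im_start|>user<|im_sep|>What is 2 + 2?<|im_end|>
    # <|im_start|>assistant<|im_sep|>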
> to be on par with GPT-4o mini
Phi is known to overfit benchmarks. It's way, way worse than that.
Phi-3's sliding window should be 2048 and not 2047, and they also had chat template issues - I uploaded correct versions to https://huggingface.co/unsloth/Phi-3.5-mini-instruct
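The config fix itself is a one-liner - a sketch assuming the stock microsoft/Phi-3.5-mini-instruct repo; the linked upload already includes it:

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("microsoft/Phi-3.5-mini-instruct")
    config.sliding_window = 2048  # was 2047 in the original upload
    config.save_pretrained("./phi-3.5-mini-fixed")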
https://huggingface.co/spaces/webml-community/phi-3.5-webgpu
Unsloth is a masterpiece, keep up the great work!
you can probably blow on your GPU and get a similar performance change
I agree it's not super convincing, so I provided anecdotal evidence as well - I'll work with the Phi-4 team to upstream these fixes!
PS: for further credibility, we also fixed 8 bugs in Gemma 1 - see https://x.com/danielhanchen/status/1765446273661075609 - plus multiple bugs in Llama, Mistral, Qwen and other models.
- blowing on a GPU (which I take to mean doing roughly nothing)
- gets roughly the same perf change
- as moving from fp16 to q4
Apologies, I'm confused by the comment, sorry.
If you're questioning the credibility of the bug fixes - we fixed 8 bugs in Gemma https://x.com/danielhanchen/status/1765446273661075609, multiple bugs in Llama, Mistral and Qwen, a gradient accumulation bug https://x.com/danielhanchen/status/1846235913443262891, and much more.