AMÁLIA and the future of European Portuguese LLMs

(duarteocarmo.com)

106 points | by johnbarron 3 days ago

10 comments

  • mariopt 3 hours ago
    This model is a waste of public funds.

    There is no public website to use it, free or paid; the dataset is not public; the code is not public (the GitHub URL in the article returns 404); and the claimed model intelligence is so low that it is pretty much useless at 32K context and massively inferior to GPT‑4o.

    As is tradition in Portugal, some people managed to get 5.5 million euros to produce nothing, and no one is asking questions.

    You want a better idea? Just fine-tune the open-source Kimi 2.6 with an open-source Portuguese dataset; the cost would be under a million and we would get something useful.

    It would be really nice to know what happened to the 5.5 million while not even being able to provide a functional website to use the model.

    • upupupandaway 2 hours ago
      As a pt-BR speaker from across the pond: https://soberania.ai/

      Similar waste.

    • dr_dshiv 3 hours ago
      It’s a way to suck all the money out of the room in the name of nationalism, and it’s happening all over Europe. It’s the only idea everyone has had.
    • vova_hn2 2 hours ago
      I'm not arguing with the rest of your points, but...

      > Just fine-tune the open-source Kimi 2.6 with an open-source Portuguese dataset

      I think the tokenizers of all popular models are heavily biased towards English (or English and Mandarin).

      And I don't think it is possible to replace the tokenizer without full retraining.

      • mcyc 1 hour ago
        You are right about most tokenizers being heavily biased towards English, but the situation is not so bad for Portuguese. Here are some results on the Goldfish corpus [1] with a few different tokenizers. This measures #subwords in the tokenized corpus / #characters in the corpus, so lower means better compression.

        ```
        Llama3
        english,    0.216
        portuguese, 0.285
        italian,    0.287
        greek,      0.592

        Gemma4
        english,    0.219
        portuguese, 0.246
        italian,    0.249
        greek,      0.537

        Kimi2.6
        english,    0.214
        portuguese, 0.310
        italian,    0.308
        greek,      0.716
        ```

        Portuguese is certainly worse than English, but it is on par with Italian (which I think has more overlap with English) and much better than Greek, which doesn't use the Latin script and is definitely not prioritized in tokenizer construction.
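
        For anyone curious, a minimal sketch of how a ratio like this can be computed with Hugging Face tokenizers; the model name and sample texts below are placeholders, not the exact Goldfish setup:

        ```
        from transformers import AutoTokenizer

        def subwords_per_char(tokenizer, texts):
            # Total subword count / total character count; lower = better compression.
            n_subwords = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
            n_chars = sum(len(t) for t in texts)
            return n_subwords / n_chars

        # Placeholder tokenizer (gated on HF; any tokenizer works) and toy samples;
        # swap in a real corpus such as Goldfish fish-food [1].
        tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
        samples = {
            "english": ["The weather is nice today."],
            "portuguese": ["O tempo está agradável hoje."],
        }
        for lang, texts in samples.items():
            print(lang, round(subwords_per_char(tok, texts), 3))
        ```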

        On your second point, tokenizer transfer allows for extending/modifying a tokenizer without retraining the model from scratch. The simplest version of this is tokenizer extension + continual pretraining, where you just add a bunch more tokens to the vocab for the language/domain that you want to improve and train a little more. It's been done for Japanese [2] and Indic languages, but afaik not Portuguese.
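
        A minimal sketch of that extension step with the transformers library; the base model and added tokens here are purely illustrative (a real extension would train a new subword vocabulary on target-language text and merge it in):

        ```
        from transformers import AutoModelForCausalLM, AutoTokenizer

        # Stand-in base model; any causal LM with a trainable embedding works.
        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")

        # Illustrative European Portuguese words to add to the vocab.
        new_tokens = ["frigorífico", "autocarro", "pequeno-almoço"]
        num_added = tokenizer.add_tokens(new_tokens)

        # New embedding rows start randomly initialized; the continual
        # pretraining on target-language text is what makes them useful.
        model.resize_token_embeddings(len(tokenizer))
        print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
        ```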

        So I think continual pretraining of a large base model would probably have been fine for this case, with huge cost savings. But it is good to have the ability to train your own base models, so I don't think this is such a bad idea.

        -----------------------

        [1]: https://huggingface.co/datasets/goldfish-models/fish-food

        [2]: https://arxiv.org/abs/2404.17790

  • pu_pe 6 hours ago
    I'm not sure the direction should be to fine-tune a small local model for each country or language. These models are already not particularly great at information retrieval, so I doubt anyone would use them for questions like the author suggests (i.e. who was the president between X and Y). Similarly, they are a little too lightweight to be used for translation.

    If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all the open cultural artifacts we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.

    • iugtmkbdfil834 4 hours ago
      I agree; the research is complex enough as is without having to worry about splitting it, Babel-like, into multiple languages.
    • TheMagicHorsey 3 hours ago
      This is the way.

      Sovereign SOTA models might also be possible with nation-state involvement. But this is a good stopgap.

    • dyauspitr 4 hours ago
      Yeah, I think India is going the better route with Sarvam, which is trained from scratch and still relatively cheap.
  • alexaholic 1 hour ago
    The Amália model is not yet publicly available. Until it's ready, one can fool around with Anália at https://analia.pt
  • swiftcoder 6 hours ago
    It is definitely an interesting problem, because Portugal is a small enough country that the total corpus of available text in (non-Brazilian) Portuguese is potentially a limiting factor.
    • bobthepanda 4 minutes ago
      I would have to imagine this might not actually be as bad as it seems; at the very least there should be a giant corpus of translated EU texts.
    • embedding-shape 6 hours ago
      I don't think so. Portugal the country might be small, with a small population, but there are ~250 million Lusophones (native Portuguese speakers), making it the fifth-most spoken native language in the world; I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and mutually intelligible, so it's not like text from one cannot be used to train a model for the other, or vice versa.

      All in all, I don't think that's a major issue here.

      • swiftcoder 6 hours ago
        The authors are pretty clearly trying to draw only from European Portuguese sources; I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to).

        I don't personally feel that preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).

        • madaxe_again 6 hours ago
          Man, there’s an attitude up here in Trás-os-Montes that the rest of Portugal has spoken unrecognisable trash for a century. It took me years to realise I’d learned hilariously antique Portuguese by moving there.

          Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their convents to retreat to if they so choose.

        • philipwhiuk 5 hours ago
          > I don't personally feel that preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).

          That's easy to say when you're not on the other end of US defaultism.

      • mghackerlady 5 hours ago
        Right, but most of those speak Brazilian Portuguese. There's so much less European Portuguese text that it becomes impossible for a model not to speak Brazilian Portuguese unless it is trained in a way that ignores Brazilian sources.
      • evandrofisico 2 hours ago
        Portugal has a growing xenophobic attitude towards immigrants, especially Brazilians, and this is reflected in linguistic prejudice.

        They have concerns about Portuguese children learning to "speak Brazilian", because a lot more video content is produced in Brazil than in Portugal, and things like movies, video games, and software in general are available in Brazilian localization/adaptation first.

        • embedding-shape 2 hours ago
          We have the same thing happening here too, on multiple levels. First, some Spanish parents are afraid their children aren't listening to and watching enough Spanish media. Then, additionally, some Catalan parents are afraid their children don't get to use Catalan in school, so they won't become proficient enough to use it in society.
          • darkwater 2 hours ago
            The Catalan situation is completely different and unrelated, being a completely different language and not endangered (with or without scare quotes, as you prefer) by an ex-colony that became independent. Actually, many Catalans would like to be such an ex-colony.
            • embedding-shape 1 hour ago
              > The Catalan situation is completely different and unrelated

              I'm not saying it's the same, but there are definitely similarities in that parents are worrying about what language their children use. And yeah, unrelated; I wasn't trying to claim it's the same or better/worse or anything, just another similar situation that other (curious) people might want to learn more about, regardless of what you think Catalans want or not.

      • KK7NIL 6 hours ago
        The whole point of this project is to have an LLM that speaks European Portuguese, not Brazilian Portuguese.
        • embedding-shape 6 hours ago
          Right, and my point is that if you use 80% Brazilian Portuguese during base model training + 20% European Portuguese in post-training, you pretty much get exactly that, except with a ton more available training data.
          • KK7NIL 6 hours ago
            What's your evidence for that?

            And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming), why not go for English or a mixture of languages, which is essentially what they did by starting with EuroLLM?

            • embedding-shape 6 hours ago
              Evidence? Not so much; I didn't realize I was defending a PhD thesis here.

              I speak Spanish, and have talked with people who only speak Portuguese, either of the variants, and have also talked with Portuguese people about how they see their language compared with Brazilian Portuguese, and vice versa. So basically, based on vibes and experience.

              > And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming), why not go for English

              I'm not sure how many languages you speak or have encountered in the wild, but some languages are VERY different from each other, some are a bit different, and others are basically the same with some differences. Doing what I describe for languages that are similar is easier than for languages that are very different, for what I hope are obvious reasons.

              • KK7NIL 5 hours ago
                > I'm not sure how many languages you speak or have encountered in the wild, but some languages are VERY different from each other, some are a bit different, and others are basically the same with some differences.

                I'm a dual citizen of Portugal and Brazil and I live in the US now, so that's my linguistic background. (Also studied bits of French, Russian, Latin and Greek.)

                > Doing what I describe for languages that are similar is easier than for languages that are very different, for what I hope are obvious reasons.

                Not only are your reasons not obvious, your conclusion is actually wrong.

                If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals), it might actually make more sense to train it in any other language BUT Brazilian Portuguese (say, English), then fine-tune it for European Portuguese.

                LLMs have been shown to be very good at generalizing across languages (the transformer architecture literally comes from work on machine translation, IIRC).

                • embedding-shape 2 hours ago
                  > If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals)

                  Oh, I wasn't aware that was their goal; it would certainly be intuitive to avoid Brazilian Portuguese in that case, although I'm still not sure it makes sense to avoid it 100% in pre-training even if you're trying to avoid Brazilian bias, since you can "skew" things pretty heavily in post-training if you so wish.

                  Where can I read more about this goal? It doesn't seem to be mentioned in the submission article, apart from a short off-hand remark about one of the benchmarks, so I'm guessing there is some resource where they talk more about this specifically?

      • madaxe_again 6 hours ago
        Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is not so much “US vs British English” as “Guyanese English vs British English”. Fundamental points of grammar differ, and the spoken rhythm and syllabic stress differ (poetry does not translate well between them), never mind the vocabulary. Continental Portuguese people tend to find it easier to understand Brazilians than vice versa, largely due to mostly one-way cultural exports, but trying to roll both into a single model would create a creole at best.
        • embedding-shape 6 hours ago
          I agree, they're not the same. But they're far closer to each other than languages that don't come from the same family.
    • fy20 5 hours ago
      European Portuguese is the 13th most spoken language in Europe. Not that small; there are many other European languages in use that are much smaller.

      https://en.wikipedia.org/wiki/List_of_languages_by_number_of...

      • SkeuomorphicBee 2 hours ago
        What makes Portugal's situation unique is that it is a small population that is eclipsed in models by the much bigger population of Brazil.

        Yes, there are much smaller European countries, but those are generally the only source of truth for their specific language, so the context of an LLM query in that language steers the LLM towards facts from that country. For example, if I ask a big generic LLM something in Latvian, it will most likely answer something relevant to the context of Latvia. But Portugal, being the much smaller user of its language, has the somewhat unique problem that if I ask a generic model something in Portuguese, it will probably answer something related to Brazil instead of Portugal.

        Maybe the UK and Spain have somewhat similar struggles, but I suspect that none has it as bad as Portugal in that regard.

      • augusto-moura 4 hours ago
        It is pretty small when considering content output. Portugal has only 11 million people, and only a fraction of them will be writing something that could be used in training datasets. If you look at countries by scientific contribution, for example [1], Portugal is in 28th position, while Brazil is 14th, with more than double the number of contributions.

        Don't get me wrong, it is definitely impressive given Portugal's actual size, but I believe there's a hard limit imposed by population and size that will be difficult to cross.

        [1]: https://en.wikipedia.org/wiki/List_of_countries_by_number_of...

      • depaulagu 5 hours ago
        > European Portuguese is the 13th most spoken language in Europe

        that's not impressive

        • senko 4 hours ago
          Hello from 23rd
  • drivebyhooting 1 hour ago
    I’ve noticed that ChatGPT is noticeably dumber in languages other than English. It will even confidently repeat common but wrong superstitions from the target language as if they were fact.
  • r2ob 2 hours ago
    "This model is a waste of Public Funds". There is no "public funds", this is a waste of money from the tax payers.
  • mt_ 3 hours ago
    5 million for a Llama 2 fine-tune; how is that impressive?
  • algoth1 6 hours ago
    Wouldn't it be easier to fine-tune a model to convert the Brazilian Portuguese corpus into European Portuguese and then use that corpus?
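
    For illustration, a rough sketch of such a conversion fine-tune; everything here is hypothetical (the base model, the toy sentence pairs, and the hyperparameters), so treat it as a shape, not a recipe:

    ```
    from datasets import Dataset
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    # Hypothetical stand-in; a multilingual seq2seq base would fit better.
    model_name = "t5-small"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Toy parallel pt-BR -> pt-PT pairs; a real run needs a large aligned corpus.
    pairs = [
        {"src": "O ônibus está chegando.", "tgt": "O autocarro está a chegar."},
        {"src": "Vou pegar o trem.", "tgt": "Vou apanhar o comboio."},
    ]

    def preprocess(example):
        enc = tokenizer(example["src"], truncation=True, max_length=128)
        enc["labels"] = tokenizer(text_target=example["tgt"], truncation=True,
                                  max_length=128)["input_ids"]
        return enc

    ds = Dataset.from_list(pairs).map(preprocess, remove_columns=["src", "tgt"])

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="ptbr2ptpt", num_train_epochs=1),
        train_dataset=ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    ```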
    • kinow 49 minutes ago
      That idea is different from what most people are discussing here in the other comments.

      The grammar and vocabularies don't match, but I think the worst part is the expressions. Both sides have *a lot* of expressions that vary by context and location.

  • hartator 7 hours ago
    What a waste of time and money.

    Trying to force an LLM into a specific language makes you miss out on most of the world's knowledge.

    • embedding-shape 7 hours ago
      What LLM isn't forced into a specific language? That'd be a weird language model no one could understand; you need to choose at least one language, ideally the same one the creators speak.

      Besides, there is knowledge that is locked behind languages: there are things known in Portuguese that aren't known in other languages, and the same goes for other languages too. More accessibility to those ideas wouldn't hurt.

      • Miraste 6 hours ago
        To my knowledge, all major LLMs are multilingual. This article could really have used an evaluation of existing models' European Portuguese capabilities.
      • numpad0 6 hours ago
        yeah, they all seem confined to being an American-consultant/Chinese-authoritarian split personality with broad second-language capabilities. I suppose they would become too incoherent otherwise.
      • cess11 6 hours ago
        E.g. gemma3:4b can fake simple conversations in several European languages, including Portuguese, Swedish, and Finnish.

        It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.

    • CrimsonRain 4 hours ago
      Europe always has a thing for its languages. They think many languages make them stronger, while spending billions in losses due to communication barriers. It is obvious they will try to do the same with LLMs and call it the best thing since sliced bread.

      I went to JCON EUROPE this year. The amount of "Europe this", "Europe that", "sovereign this, sovereign that" is mind-boggling and just a waste of time and money. The regular people know this and thus left the conference midway. But somehow the people "in charge" really need to push this. Same thing here.

      • lmf4lol 3 hours ago
        What's your suggestion? That we just eradicate all of our culture and languages and go full-on English?

        What's wrong with exploring ways to keep national languages alive in the LLM era?

        • joe_mamba 28 minutes ago
          > we just eradicate all of our culture

          Already happening via low birth rates and mass migration. If you don't have kids, there will be nobody to carry your culture forwards.

          > and go full-on English?

          Nobody is saying you have to swap your culture for English. You can have English as the mandatory language for tech and business across the EU while still keeping your language and culture for your education, leisure, festivities, art, media, etc. This way everyone is happy. But countries like France would rather detonate their entire nuclear arsenal than accept official use of English on their own soil.

          As long as resources are spent across the EU to accommodate every language and bureaucracy, we'll keep falling behind internationally, and the only winners will be the bureaucrats, notaries, lawyers, consultants, translators, etc. We need another Concorde moment. What's wild is that Concorde was made before the EU was even a thing.

    • KK7NIL 6 hours ago
      This is how Europe thinks it can catch up on tech: by having the government fund vanity projects that will be made obsolete by more general techniques in 6 months.
      • xp84 1 hour ago
        It's the European Way
      • lmf4lol 3 hours ago
        Everyone on this project probably learned a lot doing it, don't you think?
        • joe_mamba 11 minutes ago
          I'd also want to get paid to work on stuff not meant to bring any financial returns, just to learn and pad my resume. Sounds like a sweet gig. Where do I sign up?
    • mistrial9 7 hours ago
      > makes you miss out on most of the world's knowledge

      And who knows what will happen to grammar?

    • clear-octopus 7 hours ago
      [dead]
  • simianwords 5 hours ago
    Domain-specific models will never be a thing. You don't get generalised intelligence that way.

    https://simianwords.bearblog.dev/why-domain-specific-llms-wo...