Happy to answer questions if anyone is interested in building translation models for low-resource languages without being a GPT wrapper. Great resources for this are Marian-NMT[1] and the Opus & Tatoeba projects (beware of data quality).
* Unfortunately not functioning right now due to inference costs for the model, but I plan to launch it sometime soon.
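On the data-quality point, here's a minimal sketch of the kind of filtering worth running on OPUS/Tatoeba data before training (the corpus name and thresholds are illustrative assumptions, using the Hugging Face `datasets` loader):

```python
# Minimal sketch: load an OPUS corpus and apply basic quality filters.
# The corpus choice and thresholds are illustrative, not recommendations.
from datasets import load_dataset

ds = load_dataset("opus_books", "en-fr", split="train")

def keep(example):
    src = example["translation"]["en"]
    tgt = example["translation"]["fr"]
    if not src or not tgt:
        return False
    # Drop pairs with wildly different lengths (likely misaligned).
    ratio = len(src) / max(len(tgt), 1)
    if ratio < 0.5 or ratio > 2.0:
        return False
    # Drop pairs where the "translation" is just a copy of the source.
    if src.strip().lower() == tgt.strip().lower():
        return False
    return True

clean = ds.filter(keep)
print(f"kept {len(clean)} of {len(ds)} pairs")
```

Even crude filters like these catch a surprising amount of misaligned and copied-through noise.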
Manual proofreading (and data generation) was a big part of it; it's definitely not a glamorous, magic process. But as I went through it I noticed patterns and wrote some tools to help (a sketch of the idea follows below).
There's a way to leverage LLMs to help with this if your language is supported (my target wasn't at the time), but I still strongly recommend a manual review pass. That's really the secret sauce, and there's no way around it if you're serious about the translation quality of your model.
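To give a flavor of that tooling (a hypothetical sketch, not the actual scripts): cheap, language-agnostic checks can flag pairs that deserve a human look, so manual review time goes where it matters.

```python
import re

def review_flags(src: str, tgt: str) -> list[str]:
    """Flag cheap, language-agnostic signs that a sentence pair needs review."""
    flags = []
    # Numbers should usually survive translation unchanged.
    if sorted(re.findall(r"\d+", src)) != sorted(re.findall(r"\d+", tgt)):
        flags.append("number mismatch")
    # A question should normally stay a question (؟ is the Arabic question mark).
    if src.rstrip().endswith("?") != tgt.rstrip().endswith(("?", "؟")):
        flags.append("question mark mismatch")
    # Extreme length ratios often mean truncation or misalignment.
    if len(tgt) > 3 * len(src) or len(src) > 3 * len(tgt):
        flags.append("length ratio")
    return flags

print(review_flags("I bought 3 cars.", "شريت 4 طوموبيلات."))  # ['number mismatch']
```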
My goal was a Google Translate-style multilingual translation model, and for that the BART architecture ultimately proved better, because you benefit from cross-language transfer learning. If your model learns the meaning of "car" in language pair (A, B), and it knows it in pair (B, C), then it will perform decently when you ask it to translate between A and C. This compounds very quickly as you add language pairs.
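To make that concrete, here's a sketch using the public mBART-50 many-to-many checkpoint as a stand-in (not the model discussed here): you set the source language on the tokenizer and force the target-language token at decode time, and every pair goes through the same shared latent space.

```python
# Sketch with the public mBART-50 many-to-many checkpoint, a stand-in
# for any BART-style multilingual translation model.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(name)
tokenizer = MBart50TokenizerFast.from_pretrained(name)

tokenizer.src_lang = "fr_XX"  # encode French into the shared latent space
encoded = tokenizer("La voiture est rouge.", return_tensors="pt")

# Decode into Arabic by forcing the target-language token first.
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.lang_code_to_id["ar_AR"]
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```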
One big limitation of BART (where LLMs become more attractive) is that it becomes extremely slow for longer sentences, and it's worse at understanding and translating complex ones.
> Any special processing needed for Arabic compared to Latin-script-based languages?
Yes indeed, quite a lot, especially for Moroccan Arabic, which is written in both Arabic and Latin scripts (I made sure to support both, and they're aligned in the model's latent space). For this I developed semantic and phonetic embedding models along the way, which helped a lot. I'm in the process of publishing a paper on the phonetic processing aspect; if you're interested, let's stay in touch and I'll let you know when it's out.
But beyond the pre-processing and data pipeline, the model itself didn't need any special treatment besides the tokenizer.
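To illustrate the script side (a toy sketch of common Arabizi digit conventions, not the actual pipeline): Latin-script Moroccan Arabic uses digits for Arabic letters that have no Latin equivalent, so a normalization pass has to account for them before tokenization.

```python
# Toy sketch of Arabizi (Latin-script Arabic) digit conventions; the real
# pipeline is far more involved, this only shows the kind of mapping needed.
ARABIZI_DIGITS = {
    "2": "ء",  # hamza
    "3": "ع",  # ayn
    "5": "خ",  # kha
    "7": "ح",  # ha
    "9": "ق",  # qaf
}

def normalize_arabizi_digits(text: str) -> str:
    # Naive character-by-character swap; real text also has genuine numbers,
    # so an actual pipeline needs context-aware handling.
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)

print(normalize_arabizi_digits("3lach"))  # digits swapped, letters left for later stages
```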
Asking because I built a translator app[0] for Android, using marian-nmt (via bergamot) with Mozilla's models, and on-device inference performance is very good.
That said, while running it client-side is indeed an option, openly distributing the model is not something I'd like to do, at least at this stage. Unlike the bigger projects in the NMT space, including Marian and Bergamot, I don't have any funding, and my monetization plan is to offer inference via an API[0].
Note that you'd have the larger model; if you wanted a smaller model for just one language pair, I guess you could use distillation?
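For reference, the usual recipe for that in NMT is sequence-level distillation: the big model translates monolingual source text, and the small single-pair model trains on those synthetic pairs. A rough sketch, with placeholder checkpoint paths:

```python
# Rough sketch of sequence-level knowledge distillation for one language pair.
# "path/to/teacher" is a placeholder, not a real checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_tok = AutoTokenizer.from_pretrained("path/to/teacher")
teacher = AutoModelForSeq2SeqLM.from_pretrained("path/to/teacher")

def distill_pairs(source_sentences):
    """Let the teacher translate monolingual source text into synthetic targets."""
    for src in source_sentences:
        inputs = teacher_tok(src, return_tensors="pt")
        out = teacher.generate(**inputs, num_beams=5, max_new_tokens=128)
        yield src, teacher_tok.decode(out[0], skip_special_tokens=True)

# The (source, synthetic target) pairs then become the training set for the
# small student model, trained with ordinary cross-entropy.
```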
For the continuous training itself, yes, I simply continue training the model from the last checkpoint (with a cosine LR scheduler). I'm considering doing a full retraining at some point, once I've collected enough data to compare it against this progressive training.
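In generic PyTorch terms, resuming looks roughly like this (the checkpoint path, its key layout, and the tiny stand-in model are all placeholders):

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in model so the sketch runs; in practice this is the seq2seq model.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=1000)

# Resume: restore model/optimizer/scheduler state from the last checkpoint.
# The file name and dict keys are assumed, not a standard format.
ckpt = torch.load("last_checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])

# ...then keep training on the newly collected data exactly as before,
# calling scheduler.step() each step so the cosine schedule continues.
```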
Apologies for the poor links; it takes a lot of time to work on this, let alone fully document everything.
Maybe translate X to English, and then to Y?
The technique makes sense though, but mostly at the training-data stage. BART-style translation models already represent concepts in latent space regardless of the input and output languages, sidestepping English entirely, so you have something like:
`source lang --encode--> latent space --decode--> target lang`
Works great to get translation support for arbitrary language combinations.
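For contrast, the pivoting suggested above would chain two passes through English, so first-hop errors get baked into the second hop. A sketch with the same public mBART-50 stand-in:

```python
# Sketch of pivot translation (X -> English -> Y) with the public mBART-50
# checkpoint, for contrast with the direct latent-space path above.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(name)
tokenizer = MBart50TokenizerFast.from_pretrained(name)

def translate(text, src, tgt):
    tokenizer.src_lang = src
    encoded = tokenizer(text, return_tensors="pt")
    out = model.generate(
        **encoded, forced_bos_token_id=tokenizer.lang_code_to_id[tgt]
    )
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

# Two hops: any error in the first pass propagates into the second.
english = translate("La voiture est rouge.", "fr_XX", "en_XX")
arabic = translate(english, "en_XX", "ar_AR")
print(arabic)
```

The direct path avoids that compounding, which is the whole point of the shared latent space.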
I'll begin integrating it into my user-facing application for language learners soon: www.abal.ai
GPT-4.1 only costs $2 per 1M input tokens and $8 per 1M output tokens.
LLM translation has been cheaper and better than DeepL for a while.
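Back-of-envelope, assuming DeepL's published rate of roughly $25 per 1M characters and about 4 characters per token (both rough figures, not exact):

```python
# Back-of-envelope cost comparison; the ~$25/1M chars DeepL rate and the
# ~4 chars/token ratio are rough assumptions, not exact pricing.
chars = 4_000_000                   # roughly 1M tokens of text
deepl = chars / 1_000_000 * 25      # ~= $100
gpt41 = 1 * 2 + 1 * 8               # 1M tokens in + ~1M out ~= $10
print(f"DeepL ~= ${deepl:.0f}, GPT-4.1 ~= ${gpt41:.0f}")
```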
My one issue is that the author doesn't try to think about the ways Google Translate is better. It's all about model size: Google Translate's models are around 20 MB when run locally on a phone. That makes them super cheap to run, and translation can happen fully offline.
I'm sure Gemini could translate better than Google Translate, but Google is optimizing for speed and compute. That's why they can offer free translation of any webpage in Chrome.
- I built a new super-app!
- You built it, or is it just another GPT wrapper?
- ... another wrapper
https://preview.redd.it/powered-by-ai-v0-d8rnb2b0ynad1.png