I have been working on a problem most language detection libraries quietly fail at: short, messy, conversational text. The kind you see in chat apps, support tickets, SMS, and mixed-language messages.
FastLangML is my attempt to fix that.
It is a multi-backend ensemble (FastText, Lingua, langdetect, pyCLD3, and others) with a voting layer built for real-world text. It handles:
Short messages with almost no statistical signal
Code-switching like Hinglish or Spanglish
Slang, abbreviations, and emojis
Multi-turn conversations where context matters
Confusable languages like ES vs PT or NO vs DA vs SV
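To make the ensemble idea concrete, here is a minimal sketch of a confidence-weighted voting layer over several backends. This is not FastLangML's actual API (the `vote` function and the score format are assumptions for illustration); it just shows how disagreeing detectors on a short message can be combined:

```python
from collections import defaultdict

def vote(predictions):
    """Confidence-weighted plurality vote.

    `predictions` is a list of (lang, confidence) pairs, one per backend.
    Returns the winning language and its normalized share of total weight.
    """
    scores = defaultdict(float)
    for lang, conf in predictions:
        scores[lang] += conf
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / sum(scores.values())

# Three hypothetical backends disagree on a short ES/PT-confusable message:
preds = [("es", 0.6), ("pt", 0.55), ("es", 0.7)]
print(vote(preds))  # "es" wins on combined weight
```

A weighted sum rather than a plain majority lets a single high-confidence backend override two hesitant ones, which matters on short inputs where most backends emit low-confidence guesses.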
A few design choices:
Context-aware detection so you can pass conversation history and get more stable predictions
A hinting system for slang, abbreviations, and custom rules
Extensible backends so you can plug in your own detectors or voting logic
Optional persistence using Redis or disk for multi-turn conversations
Support for more than 170 languages across the ensemble
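As a rough sketch of what context-aware detection can look like (again, a hypothetical illustration, not FastLangML's real interface), one simple approach is to bias each message's scores toward languages seen earlier in the conversation, with the prior decaying over time:

```python
from collections import defaultdict

class ContextTracker:
    """Hypothetical sketch: stabilize short-message predictions using
    an exponentially decayed prior over the conversation's languages."""

    def __init__(self, decay=0.8, prior_weight=0.5):
        self.decay = decay              # how fast old context fades
        self.prior_weight = prior_weight  # how strongly history biases scores
        self.history = defaultdict(float)

    def update(self, raw_scores):
        """raw_scores: {lang: confidence} for one message. Returns best lang."""
        for lang in self.history:
            self.history[lang] *= self.decay
        combined = {
            lang: score + self.prior_weight * self.history.get(lang, 0.0)
            for lang, score in raw_scores.items()
        }
        best = max(combined, key=combined.get)
        self.history[best] += combined[best]
        return best

tracker = ContextTracker()
tracker.update({"es": 0.9})                      # long, clearly Spanish turn
print(tracker.update({"es": 0.4, "pt": 0.45}))   # ambiguous short turn -> "es"
```

Without the history term, the second message would flip to "pt" on a 0.05 margin; the context prior keeps the conversation-level prediction stable.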
Why I built it: most detectors are tuned for long, clean text. They break on "ok", "jaja", "mdr", "brooo", or anything with mixed languages. I needed something that works on real chat data, not idealized text.
I would love feedback from HN on:
How you evaluate language detection quality in production
Whether context-aware detection helps in your workflows
Ideas for improving code-switching accuracy
Additional backends worth integrating
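On the evaluation question: one harness I find useful for short-text detectors is per-language accuracy plus a ranked list of confusion pairs, since aggregate accuracy hides ES/PT-style mixups. A minimal sketch (the `detector` callable and sample format are assumptions, not part of FastLangML):

```python
from collections import Counter

def evaluate(detector, samples):
    """Per-language accuracy and top confusion pairs on labeled short texts.

    `samples` is a list of (text, true_lang); `detector` maps text -> lang.
    """
    correct, total, confusions = Counter(), Counter(), Counter()
    for text, truth in samples:
        pred = detector(text)
        total[truth] += 1
        if pred == truth:
            correct[truth] += 1
        else:
            confusions[(truth, pred)] += 1
    accuracy = {lang: correct[lang] / total[lang] for lang in total}
    return accuracy, confusions.most_common(5)

# Toy usage with a deliberately naive stand-in detector:
detector = lambda text: "es" if "hola" in text else "pt"
samples = [("hola amigo", "es"), ("tudo bem", "pt"), ("obrigado", "es")]
print(evaluate(detector, samples))
```

The confusion-pair ranking is what tells you whether to invest in better NO/DA/SV discrimination versus, say, slang hints.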
Repo: https://github.com/pnrajan/FastLangML
Happy to share benchmarks, architecture notes, or design tradeoffs if people are interested.