By validating Language Identification data (LangID or LID): https://dynabench.org/tasks/text-language-identification
By contributing URLs for our seed crawl: https://github.com/commoncrawl/web-languages
We're also organizing a Workshop on Multilingual Data Quality Signals (WMDQS) with MLCommons and EleutherAI, with an open call for papers (https://wmdqs.org/cfp/) and an upcoming shared task on language identification (https://wmdqs.org/shared-task/).