"We note, however, that there is, unfortunately, a serious risk that Voyage’s models were trained on some of the evaluation sets in MLEB, particularly SCALR and Consumer Contracts QA, which are both also part of MTEB, due to the fact that Voyage trains on their customers’ private data by default (which would invariably include benchmarks). This is also a risk for Cohere and Jina models."
Wow.
ubutler•3h ago
We were disappointed to discover that, yes, Voyage, Cohere, and Jina all train on their API customers' data by default.
Voyage's terms say:
> you grant Voyage AI (and its successors and assigns) a worldwide, irrevocable, perpetual, royalty-free, fully paid-up, right and license to use, copy, reproduce, distribute, prepare derivative works of, display and perform the Customer Content: ... (iii) to train, improve, and otherwise further develop the Service (such as by training the artificial intelligence models we use).
Cohere's terms say:
> YOU GRANT US A ... RIGHT TO ... USE ... ANY DATA ... TO ... IMPROVE AND ENHANCE THE COHERE SOLUTION AND OUR OTHER OFFERINGS AND BENCHMARK THE FOREGOING, INCLUDING BY SHARING API DATA AND FINETUNING DATA WITH THIRD PARTIES ...
Jina's terms say:
> Jina AI shall, subject to applicable mandatory data protection requirements, be entitled to retain data uploaded to the Jina AI Systems or otherwise provided by the Customer or collected by Jina AI in the course of providing the Services and to use such data in anonymized/pseudonymized format for its business purposes including to improve its artificial intelligence applications.
abksaai•3h ago
This is the most interesting part of this article.
afistfullof•3h ago
Wow.