Most widely available embedding models perform best on English-language text. For high-risk systems deployed across multiple EU member states, uneven embedding performance across languages could cause the system to retrieve more relevant information for queries in some languages than others. This constitutes a materially different quality of service to users in different member states.
The Technical SME evaluates the embedding model’s retrieval performance across all languages in which the system will operate. The evaluation should combine language-specific retrieval benchmarks (for example MIRACL for multilingual information retrieval, and the multilingual tasks of MTEB, the Massive Text Embedding Benchmark) with domain-specific test queries in each language. Performance gaps exceeding a defined threshold are documented as known limitations.
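As a sketch, a per-language evaluation harness might look like the following. The `embed` function, the test sets, and the corpus are hypothetical stand-ins: `embed` here produces a deterministic pseudo-random vector so the example is runnable, whereas a real evaluation would call the production embedding model and use the benchmark datasets named above.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Hypothetical stand-in for the real embedding model: a deterministic
    # pseudo-random unit vector seeded by the text, so the sketch runs.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)

def recall_at_k(queries, relevant_idx, corpus, k=3):
    """recall@k: the fraction of queries whose known-relevant document
    appears among the top-k retrieved documents."""
    doc_vecs = np.stack([embed(d) for d in corpus])
    hits = 0
    for query, rel in zip(queries, relevant_idx):
        scores = doc_vecs @ embed(query)        # cosine similarity (unit vectors)
        top_k = np.argsort(scores)[::-1][:k]
        hits += int(rel in top_k)
    return hits / len(queries)

# Hypothetical domain-specific test queries, one set per deployment language,
# each paired with the index of its relevant document in the shared corpus.
test_sets = {
    "en": {"queries": ["loan eligibility rules"], "relevant": [0]},
    "fr": {"queries": ["règles d'éligibilité au prêt"], "relevant": [0]},
}
corpus = [
    "Eligibility criteria for consumer loans",
    "Data retention policy",
    "Complaints handling procedure",
    "Interest rate schedule",
    "Branch opening hours",
]

per_language = {
    lang: recall_at_k(ts["queries"], ts["relevant"], corpus)
    for lang, ts in test_sets.items()
}
```

The per-language scores feed directly into the gap analysis: any language whose score falls more than the defined threshold below the best-performing language is documented as a known limitation.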
Compensating controls for multilingual performance gaps include language-specific fine-tuning of the embedding model, translation preprocessing for underperforming languages (translating queries into the language where the model performs best before retrieval, then translating results back), or deploying separate embedding models optimised for specific language families. Each approach carries trade-offs: translation introduces latency and potential meaning loss; separate models increase infrastructure complexity.
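The translation-preprocessing control can be sketched as a thin routing layer in front of retrieval. Everything here is illustrative: `translate` is a placeholder for a real machine-translation service, the pivot language and the set of underperforming languages are assumptions, and `search_fn` stands in for whatever retrieval function the system exposes.

```python
PIVOT_LANG = "en"                 # assumed best-performing language for the model
UNDERPERFORMING = {"mt", "ga"}    # illustrative: languages below the gap threshold

def translate(text: str, source: str, target: str) -> str:
    # Placeholder for a real MT service; tags the text so the routing
    # behaviour is visible in this runnable sketch.
    return text if source == target else f"[{source}->{target}] {text}"

def retrieve_with_pivot(query: str, query_lang: str, search_fn):
    """For underperforming languages, translate the query into the pivot
    language before retrieval, then translate results back; all other
    languages go straight to retrieval."""
    if query_lang in UNDERPERFORMING:
        pivot_query = translate(query, query_lang, PIVOT_LANG)
        results = search_fn(pivot_query)
        return [translate(r, PIVOT_LANG, query_lang) for r in results]
    return search_fn(query)
```

The two translation hops are exactly where the trade-offs noted above materialise: each hop adds latency and a chance of meaning loss, so this routing is only worthwhile for languages whose measured gap exceeds the threshold.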
The AI Governance Lead, in consultation with the Technical SME, sets the performance gap threshold. The threshold should reflect the deployment context: a system deployed only in English and French may tolerate larger gaps than a system serving all 24 official EU languages.
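A minimal sketch of how the agreed threshold might be applied to the evaluation results, assuming scores are normalised retrieval metrics (e.g. recall@k) per language; the threshold value and the scores are illustrative only.

```python
def flag_language_gaps(scores: dict[str, float], threshold: float = 0.10) -> dict[str, float]:
    """Return the languages whose score falls more than `threshold` below
    the best-performing language, mapped to the size of the gap."""
    best = max(scores.values())
    return {lang: best - s for lang, s in scores.items() if best - s > threshold}

# Illustrative scores: fr sits within the threshold of en; mt does not,
# so it is flagged and would need a compensating control.
flagged = flag_language_gaps({"en": 0.82, "fr": 0.79, "mt": 0.61})
```

Flagged languages then drive the choice of compensating control described above, and the flag list itself becomes part of the documented known limitations.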
Key outputs
- Multilingual retrieval performance evaluation results
- Performance gap identification per language
- Compensating controls for underperforming languages