Data Efficient Dense Cross-Lingual Information Retrieval

This article is a preprint and has not been peer-reviewed.

For citation:

Chen, L. et al. "Data Efficient Dense Cross-Lingual Information Retrieval." GitData Archive, vol. 2024, no. 12, Dec. 2024, https://archive.gd.edu.kg/20241230074419/

Abstract:

Cross-Lingual Information Retrieval (CIR) remains challenging due to limited annotated data and linguistic diversity, especially for low-resource languages. While dense retrieval models have significantly advanced retrieval performance, their reliance on large-scale training datasets hampers their effectiveness in multilingual settings. In this work, we propose two complementary strategies to improve data efficiency and robustness in CIR model fine-tuning. First, we introduce a paraphrase-based query augmentation pipeline leveraging large language models (LLMs) to enrich scarce training data, thereby promoting more robust and language-agnostic representations. Second, we present a weighted InfoNCE loss that emphasizes underrepresented languages, ensuring balanced optimization across heterogeneous linguistic inputs. Experiments on cross-lingual benchmark datasets demonstrate that our combined approaches yield substantial gains in retrieval quality, outperforming standard training protocols on small and imbalanced datasets. These results underscore the potential of targeted data augmentation and reweighted objectives to build more inclusive and effective CIR systems, even under resource constraints.
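The abstract does not spell out the weighted objective, but a language-weighted InfoNCE loss can be sketched as below. The function name `weighted_info_nce`, the in-batch-negative setup, and the per-example `lang_weights` (e.g., inverse language frequency, so low-resource languages contribute more) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a language-weighted InfoNCE loss (not the authors' code).
# Assumes in-batch negatives: query_emb[i] is paired with doc_emb[i] as its positive.
import torch
import torch.nn.functional as F

def weighted_info_nce(query_emb, doc_emb, lang_weights, temperature=0.05):
    """query_emb, doc_emb: (B, D) L2-normalized embeddings.
    lang_weights: (B,) per-example weights, e.g. larger for low-resource languages.
    """
    # Similarity of every query against every in-batch document.
    logits = query_emb @ doc_emb.T / temperature                       # (B, B)
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    # Per-example InfoNCE (cross-entropy over in-batch candidates), no reduction yet.
    per_example = F.cross_entropy(logits, targets, reduction="none")   # (B,)
    # Reweight so underrepresented languages dominate less-represented updates,
    # then normalize by the total weight to keep the loss scale comparable.
    return (lang_weights * per_example).sum() / lang_weights.sum()
```

With uniform weights this reduces to the standard in-batch InfoNCE objective; the reweighting only changes how much each example's term counts toward the batch loss.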

License:

This work is licensed under CC BY 4.0.