This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Jinrui Yang, School of Computing & Information Systems, The University of Melbourne (Email: jinruiy@student.unimelb.edu.au);
(2) Timothy Baldwin, School of Computing & Information Systems, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence, UAE (Email: (tbaldwin,trevor.cohn)@unimelb.edu.au);
(3) Trevor Cohn, School of Computing & Information Systems, The University of Melbourne.
5 Language Bias Discussion
In light of our findings in a one-vs-many setting, we were keen to delve further into the underlying causes of the disparity between languages.
5.1 Bias Detection
Language bias is likely if the query language aligns better with one document language than another. As mentioned earlier, Pyserini supports different tokenizers, specifically language-specific tokenizers or simple whitespace tokenization. Therefore, in the one-vs-many setting, we analyze the composition of the top-100 rankings for the 100 topics. When indexing the document collection, we used the simple whitespace tokenizer, given the multilingual nature of the collection. For the queries at retrieval time, however, we employed two different tokenizers: a language-specific tokenizer and the whitespace tokenizer.
We conducted a correlation analysis between the language of the topics and the language of the top-100 relevant documents. From Table 2, we can see that relevance judgments in our test cases are consistent across languages, ensuring uniformity in the correlation matrix within the test set. However, Figure 2 reveals that both approaches exhibit strong language bias. In both cases, the query language aligns better with documents in its own language than with documents in other languages. The right plot appears to show that languages from the same family have a strong correlation, e.g., (PL, CS) and (IT, ES), since they may share some vocabulary.
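One way to picture the analysis above is as a matrix of query language against document language, where each cell holds the share of top-k slots filled by documents in that language. The sketch below is only an illustration under assumptions: the function name, the input layout, and the toy data are hypothetical, and the exact statistic plotted in Figure 2 is not specified in the text.

```python
from collections import Counter


def language_composition(rankings, doc_lang, k=100):
    """Share of top-k slots per document language, grouped by query language.

    rankings: dict mapping (query_lang, topic_id) -> ranked list of doc ids
    doc_lang: dict mapping doc id -> language code
    Both input layouts are assumptions made for this sketch.
    """
    counts = {}
    for (q_lang, _topic), docs in rankings.items():
        counter = counts.setdefault(q_lang, Counter())
        # Count the language of every document in the top-k for this topic.
        counter.update(doc_lang[d] for d in docs[:k])

    shares = {}
    for q_lang, counter in counts.items():
        total = sum(counter.values())
        shares[q_lang] = {lang: n / total for lang, n in counter.items()}
    return shares


# Toy, made-up data (not from the paper).
toy_doc_lang = {"d1": "PL", "d2": "PL", "d3": "CS", "d4": "IT"}
toy_rankings = {
    ("PL", 1): ["d1", "d2", "d3", "d4"],
    ("IT", 1): ["d4", "d3"],
}
toy_shares = language_composition(toy_rankings, toy_doc_lang, k=4)
```

With no language bias, each row of such a matrix would roughly follow the language distribution of the relevant documents; a heavy diagonal indicates that queries mostly retrieve documents in their own language.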
5.2 Collection Distribution Factors
Initially, we hypothesized that the disparity in the number of documents per language may be a contributing factor to this bias. Figure 3 presents the regression line between the number of documents in a given language and MRR, which explains much of the variation across languages.
However, note the outlier above the regression line (Polish: PL), which has a substantial number of documents but surprisingly low MRR performance. We refer to this phenomenon as a “BM25 unfriendly” language. According to Wojtasik et al. (2023), the main reason for the low performance of Polish lies in its highly inflected morphology, which gives rise to a multitude of word forms per lexeme, including inflections of proper names, and a complex morphological structure. In such cases, lexical matching is less effective than in morphologically simpler languages. Furthermore, the LUCENE 8.5.1 API does not have a language-specific tokenizer for Polish. Conversely, languages below the regression line can be termed “BM25 friendly” languages, as they require fewer documents to achieve higher MRR in retrieval.
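The "above the line" and "below the line" distinction can be made concrete with regression residuals. The sketch below fits an ordinary least-squares line and reports the residual for each language. It assumes MRR on the horizontal axis and document count on the vertical axis, which is inferred from the description above rather than stated in it. The language labels and numbers are made up for illustration and are not the paper's data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Vertical distance of each point from the fitted line."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up values: hypothetical languages A-E, not results from the paper.
labels = ["A", "B", "C", "D", "E"]
mrr = [10.0, 20.0, 30.0, 40.0, 15.0]   # assumed x-axis
num_docs = [100, 200, 300, 400, 450]   # assumed y-axis

slope, intercept = fit_line(mrr, num_docs)
res = residuals(mrr, num_docs, slope, intercept)

# Under the assumed axes, a large positive residual means many documents
# for the MRR achieved, i.e. the "BM25 unfriendly" side of the line.
most_unfriendly = max(zip(res, labels))[1]
```

In this toy data, language E has far more documents than the trend predicts for its MRR, so it sits well above the line, in the same way Polish is described above.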
5.3 Language Tokenizer Factors
Secondly, we speculated that the choice of language-specific Analyzer in LUCENE might be a contributing factor, as it influences word tokenization, token filtering, synonym expansion and other processing. [7] To investigate this, we conducted a controlled experiment in the one-vs-many setting. When indexing the collection, given its multilingual nature, we employed whitespace as the tokenizer. For the queries, however, we experimented with either a language-specific tokenizer or the whitespace tokenizer. We then compared the linear regression of MRR against the number of documents in Figure 3. On the right side of the plot, we can see a strong correlation when using whitespace tokenization for both the collection and the queries, reducing language bias.
Furthermore, when transitioning from language-specific tokenizers to whitespace tokenizers, the overall MRR across all languages declined modestly, from 15.02 to 14.18. That is, the original performance level was largely preserved, while language bias was diminished by using simple whitespace tokenization.
[7] https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/package-summary.html#package.description
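For reference, the MRR figures compared above follow the standard definition: the mean, over queries, of the reciprocal rank of the first relevant document. The sketch below is a minimal version with hypothetical function names and toy inputs. The reported values appear to be on a 0-100 scale, which is an assumption here, and any rank cutoff used in the paper is not stated in this section.

```python
def reciprocal_rank(ranked_ids, relevant_ids, cutoff=None):
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    candidates = ranked_ids if cutoff is None else ranked_ids[:cutoff]
    for rank, doc_id in enumerate(candidates, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels, cutoff=None):
    """Average reciprocal rank over all queries in `runs`.

    runs:  dict mapping query id -> ranked list of doc ids
    qrels: dict mapping query id -> set of relevant doc ids
    """
    scores = [
        reciprocal_rank(ranked, qrels.get(qid, set()), cutoff)
        for qid, ranked in runs.items()
    ]
    return sum(scores) / len(scores)


# Toy inputs (made up): first relevant hit at rank 2, rank 1, and never.
toy_runs = {"q1": ["a", "b", "c"], "q2": ["x", "y"], "q3": ["m", "n"]}
toy_qrels = {"q1": {"b"}, "q2": {"x"}, "q3": {"z"}}
toy_mrr = mean_reciprocal_rank(toy_runs, toy_qrels)
```

Computing this once per query tokenizer, with the index held fixed, gives the kind of side-by-side comparison described above.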