This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Jinrui Yang, School of Computing & Information Systems, The University of Melbourne (Email: jinruiy@student.unimelb.edu.au);
(2) Timothy Baldwin, School of Computing & Information Systems, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence, UAE (Email: (tbaldwin,trevor.cohn)@unimelb.edu.au);
(3) Trevor Cohn, School of Computing & Information Systems, The University of Melbourne.
3 Multi-EuP
In our approach, we consider the debate topics to be the queries, and the text of each individual speech delivered by an MEP to be a document.
Topics The topics are officially annotated by the EU, and professionally translated into 24 different languages.[5] During preprocessing, we filter out procedural debate topics such as agenda items, leaving 1.1K unique topics. Because the topics are semantically consistent across languages, they are a valuable resource for assessing language bias in multilingual ranking methods.
Documents The 22K multilingual documents within the Multi-EuP dataset originate from MEP speeches during parliamentary debates. Each document is annotated with additional metadata, including the date of the speech, the MEP ID, and a link to the video recording; the video is included for potential multimodal research but is not used here. Table 1 shows a detailed breakdown of the language distribution and descriptive statistics of the dataset. We include documents in our corpus only in the original language, as spoken by the MEP, and not their translations into other languages. Our only use of translations is for the debate topics themselves.
Judgments To assess the relevance of documents to a given query, we use binary relevance judgments based on whether the speech was part of a debate on the given topic. This results in one positive relevance judgment per document, meaning that the document collection is much less sparse than Mr. TYDI and MS MARCO, for example.
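The judgment rule above can be illustrated with a minimal sketch. It assumes each speech record carries a document ID and the ID of the debate topic it was delivered under; the record layout, the IDs, and the function names are hypothetical and are not taken from the released dataset.

```python
from collections import defaultdict

# Hypothetical speech records: (doc_id, topic_id of the debate the speech belongs to).
speeches = [
    ("d1", "t1"),
    ("d2", "t1"),
    ("d3", "t2"),
]


def build_qrels(speeches):
    """Map each topic ID to the set of documents judged relevant to it."""
    qrels = defaultdict(set)
    for doc_id, topic_id in speeches:
        # A speech is relevant to exactly the topic of its own debate.
        qrels[topic_id].add(doc_id)
    return dict(qrels)


def is_relevant(qrels, topic_id, doc_id):
    """Binary relevance: 1 if the speech was part of the debate, else 0."""
    return 1 if doc_id in qrels.get(topic_id, set()) else 0


qrels = build_qrels(speeches)
```

Under this rule every document receives exactly one positive judgment, while a topic may have many relevant documents.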
Languages Multi-EuP covers 24 EU languages from seven families (Germanic, Romance, Slavic, Uralic, Baltic, Semitic, Hellenic), each of which is the official language of one or more member states. Table 1 provides a breakdown of each language’s EU usage, member state distribution, and population, using ISO-639 codes.
MEP Multi-EuP encompasses 705 members elected across the 27 member states of the EU. We constructed the MEP dictionary by collecting MEP attributes such as name, photo, EU ID, nationality, place of birth, party affiliation, and spoken language. We further annotated MEPs with gender and birthdate, based on Wikipedia profiles and Rabinovich et al. (2017), and manually checked cases where the sources differed. Figure 1 illustrates the gender and age distribution across MEPs, with male MEPs being more than twice as numerous as female MEPs, and the majority falling within the 40–70 age range. This corpus is rare, perhaps unique, due to its richly detailed speaker demographic information, which enables research on fairness and bias in information retrieval.
Data Split For data splitting, we select two sets of 100 distinct, language-specific topics as the development and test sets in 24 languages, and assign the remaining topics to the training set. This design choice was made to maintain an ample supply of topics and judgment samples for training deep learning models, and also to facilitate subsequent cross-lingual comparative research.
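One way such a split could be produced is sketched below. The text does not specify how the development and test topics were chosen, so the seeded random shuffle, the function name, and the placeholder topic IDs are all assumptions made for illustration; the count of 1,100 stands in for the approximate 1.1K topics.

```python
import random


def split_topics(topic_ids, n_dev=100, n_test=100, seed=0):
    """Split topic IDs into disjoint train, dev, and test lists.

    Assumption: topics are sampled with a seeded shuffle; the actual
    selection procedure is not described in the text.
    """
    ids = sorted(topic_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    dev = ids[:n_dev]
    test = ids[n_dev:n_dev + n_test]
    train = ids[n_dev + n_test:]
    return train, dev, test


# Placeholder IDs standing in for roughly 1.1K topics.
topic_ids = [f"topic-{i}" for i in range(1100)]
train, dev, test = split_topics(topic_ids)
```

Splitting by topic rather than by document keeps all speeches from one debate in the same partition, so no query seen in training reappears at evaluation time.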
Supported Task Similarly to Mr. TYDI (Zhang et al., 2021), Multi-EuP can be used for monolingual retrieval in English as well as non-English languages (e.g., Swedish queries against Swedish documents). However, unlike Mr. TYDI, Multi-EuP encompasses multilingual documents and identical multilingual topics, ensuring that queries in different languages can be compared. Consequently, Multi-EuP can support diverse information retrieval experimental tasks. These include one-vs-one scenarios with single-language queries against single-language documents (i.e., monolingual or cross-lingual IR), one-vs-many scenarios with single-language queries against multilingual documents (i.e., multilingual IR), and many-vs-many scenarios with multilingual queries against multilingual documents (i.e., mixed multilingual IR).
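The three scenarios differ only in which query languages and document languages are admitted. The sketch below makes that concrete on toy data; the record fields (`topic_id`, `doc_id`, `lang`) and the helper `select_task` are hypothetical, not part of any released Multi-EuP tooling.

```python
# Toy records; field names are assumptions for illustration only.
queries = [
    {"topic_id": "t1", "lang": "en"},
    {"topic_id": "t1", "lang": "sv"},
    {"topic_id": "t1", "lang": "de"},
]
documents = [
    {"doc_id": "d1", "lang": "en"},
    {"doc_id": "d2", "lang": "sv"},
    {"doc_id": "d3", "lang": "de"},
]
ALL_LANGS = {"en", "sv", "de"}


def select_task(queries, documents, query_langs, doc_langs):
    """Return the queries and documents admitted by a language configuration."""
    selected_queries = [q for q in queries if q["lang"] in query_langs]
    selected_docs = [d for d in documents if d["lang"] in doc_langs]
    return selected_queries, selected_docs


# One-vs-one, monolingual: Swedish queries against Swedish documents.
mono = select_task(queries, documents, {"sv"}, {"sv"})
# One-vs-one, cross-lingual: English queries against Swedish documents.
cross = select_task(queries, documents, {"en"}, {"sv"})
# One-vs-many: English queries against documents in every language.
one_vs_many = select_task(queries, documents, {"en"}, ALL_LANGS)
# Many-vs-many: queries in every language against documents in every language.
many_vs_many = select_task(queries, documents, ALL_LANGS, ALL_LANGS)
```

Because the same topic exists in every language, the ranked lists produced for `t1` under different query languages can be compared directly.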
[5] https://www.europarl.europa.eu/translation/en/translation-at-the-european-parliament/