WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia

19/04/2021

Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, Francisco Guzmán

Abstract: We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English; instead, we systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, of which only 34M are aligned with English. This corpus is freely available. To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly useful for training MT systems between distant languages without the need to pivot through English.
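To make the idea of embedding-based mining concrete, here is a minimal sketch, not the authors' pipeline. It assumes sentence embeddings have already been computed by some multilingual encoder (the toy vectors below are made up), and it uses a mutual-nearest-neighbour rule with a cosine threshold as an illustrative simplification. The function names `cosine` and `mine_pairs` and the threshold value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(src_vecs, tgt_vecs, threshold=0.9):
    """Return (src_index, tgt_index, score) for mutual nearest neighbours.

    Assumption: a pair is kept only if each sentence is the other's best
    match and the similarity reaches `threshold` (an illustrative value).
    """
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sims):
        # Best target candidate for source sentence i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source candidate for that target sentence.
        back = max(range(len(src_vecs)), key=lambda k: sims[k][j])
        if back == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy embeddings standing in for encoder output in two languages.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.5, 0.5, 0.5]]

mined = mine_pairs(src, tgt)
for i, j, score in mined:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

In this toy run the third source sentence has no confident counterpart and is left unaligned, which is the behaviour one would want when two Wikipedia articles only partially overlap.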

The talk and the corresponding paper were published at the EACL 2021 virtual conference.

Similar Papers

19/04/2021

Coordinate constructions in English enhanced Universal Dependencies: Analysis and computational modeling

Stefan Grünewald, Prisca Piccirilli, Annemarie Friedrich

12:44

08/12/2020

SpanAlign: Sentence Alignment Method based on Cross-Language Span Prediction and ILP

Katsuki Chousa, Masaaki Nagata, Masaaki Nishino

14:39

06/12/2021

Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

Paul-Ambroise Duquenne, Hongyu Gong, Holger Schwenk

10:52

16/11/2020

Training Question Answering Models From Synthetic Data

Raul Puri, Ryan Spring, Mohammad Shoeybi, Mostofa Patwary, Bryan Catanzaro

Keywords: question generation, squad task, em, data method

11:33

04/07/2020

Should All Cross-Lingual Embeddings Speak English?

Antonios Anastasopoulos, Graham Neubig

Keywords: cross-lingual embeddings, lexicon tagging, lexicon dictionaries, cross-lingual baselines

9:25

16/11/2020

X-SRL: A Parallel Cross-Lingual Semantic Role Labeling Dataset

Angel Daza, Anette Frank

Keywords: generalization learning, multilingual learning, high-quality translation, srl

9:24

16/11/2020

Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information

Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, Lei Li

Keywords: machine mt, mt, rich mt, universal model

12:00

16/11/2020

ToTTo: A Controlled Table-To-Text Generation Dataset

Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, Dipanjan Das

Keywords: controlled task, high-precision generation, totto, dataset process

11:53

19/08/2021

Generating Senses and RoLes: An End-to-End Model for Dependency- and Span-based Semantic Role Labeling

Rexhina Blloshmi, Simone Conia, Rocco Tripodi, Roberto Navigli

Keywords: Natural Language Processing, Natural Language Semantics, Natural Language Generation

15:18

19/08/2021

MultiMirror: Neural Cross-lingual Word Alignment for Multilingual Word Sense Disambiguation

Luigi Procopio, Edoardo Barba, Federico Martelli, Roberto Navigli

Keywords: Natural Language Processing, Natural Language Semantics, Resources and Evaluation

12:25

16/11/2020

End-to-End Slot Alignment and Recognition for Cross-Lingual NLU

Weijia Xu, Batool Haider, Saab Mansour

Keywords: natural understanding, natural, nlu, goal-oriented systems

9:46

18/07/2021

K-shot NAS: Learnable Weight-Sharing for NAS with K-shot Supernets

Xiu Su, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Chang Xu

Keywords: Deep Learning, Architectures

5:08

04/07/2020

Programming in Natural Language with fuSE: Synthesizing Methods from Spoken Utterances Using Deep Natural Language Understanding

Sebastian Weigelt, Vanessa Steurer, Tobias Hey, Walter F. Tichy

Keywords: intelligent systems, information retrieval, Deep Understanding, end-user programming

11:41

04/07/2020

Multilingual Universal Sentence Encoder for Semantic Retrieval

Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, Brian Strope, Ray Kurzweil

Keywords: Semantic Retrieval, translation tasks, monolingual retrieval, translation retrieval

12:02

04/07/2020

Bilingual Dictionary Based Neural Machine Translation without Using Parallel Sentences

Xiangyu Duan, Baijun Ji, Hao Jia, Min Tan, Min Zhang, Boxing Chen, Weihua Luo, Yue Zhang

Keywords: Bilingual Translation, machine MT, MT, dictionary-based translation

14:08

19/08/2021

ALaSca: an Automated approach for Large-Scale Lexical Substitution

Caterina Lacerra, Tommaso Pasini, Rocco Tripodi, Roberto Navigli

Keywords: Natural Language Processing, Natural Language Semantics, Resources and Evaluation

14:27

16/11/2020

CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, Philipp Koehn

Keywords: cross-lingual alignment, mining sentences, cross-lingual nlp, cross-lingual representations

11:47

19/04/2021

Attention can reflect syntactic structure (if you let it)

Vinit Ravishankar, Artur Kulmizev, Mostafa Abdou, Anders Søgaard, Joakim Nivre

11:36

03/05/2021

On Learning Universal Representations Across Languages

Xiangpeng Wei, Rongxiang Weng, Yue Hu, Luxi Xing, Heng Yu, Weihua Luo

Keywords: hierarchical contrastive learning, cross-lingual pretraining, universal representation learning

3:51

01/07/2020

RobertNLP at the IWPT 2020 Shared Task: Surprisingly Simple Enhanced UD Parsing for English

Stefan Grünewald, Annemarie Friedrich

7:40

04/07/2020

Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage

Ashish V. Thapliyal, Radu Soricut

Keywords: Cross-modal Generation, Web-scale Coverage, Cross-modal tasks, Pivot Stabilization

11:43

16/11/2020

Simulated multiple reference training improves low-resource machine translation

Huda Khayrallah, Brian Thompson, Matt Post, Philipp Koehn

Keywords: machine mt, mt, simulated training, simulated

6:56

16/11/2020

X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models

Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, Graham Neubig

Keywords: factual retrieval, language models, lms, probing methods

9:45

04/07/2020

Enhancing Answer Boundary Detection for Multilingual Machine Reading Comprehension

Fei Yuan, Linjun Shou, Xuanyu Bai, Ming Gong, Yaobo Liang, Nan Duan, Yan Fu, Daxin Jiang

Keywords: Multilingual Comprehension, multilingual MRC, MRC, sentence tasks

8:30

16/11/2020

LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space

Tasnim Mohiuddin, M Saiful Bari, Shafiq Joty

Keywords: bilingual induction, bilingual, bli, semi-supervised method

12:09

01/07/2020

Expand and Filter: CUNI and LMU Systems for the WNGT 2020 Duolingo Shared Task

Jindřich Libovický, Zdeněk Kasner, Jindřich Helcl, Ondřej Dušek

4:59

04/07/2020

Named Entity Recognition as Dependency Parsing

Juntao Yu, Bernd Bohnet, Massimo Poesio

Keywords: Named Recognition, NER, Natural Processing, NER research

7:16

02/02/2021

Multilingual Transfer Learning for QA using Translation as Data Augmentation

Mihaela Bornea, Lin Pan, Sara Rosenthal, Radu Florian, Avirup Sil

15:44

16/11/2020

Generating similes effortlessly like a Pro: A Style Transfer Approach for Simile Generation

Tuhin Chakrabarty, Smaranda Muresan, Nanyun Peng

Keywords: human imagination, simile generation, mapping properties, sequence model

11:11

04/07/2020

Learning and Evaluating Emotion Lexicons for 91 Languages

Sven Buechel, Susanna Rücker, Udo Hahn

Keywords: sentiment analysis, downstream applications, lexicon creation, bilingual model

12:42

22/06/2020

Cross-context News Corpus for Protest Events related Knowledge Base Construction

Ali Hürriyetoğlu, Erdem Yörük, Deniz Yüret, Osman Mutlu, Çağrı Yoltar, Fırat Duruşan, Burak Gürel

Keywords: protests, contentious politics, news, text classification, event extraction, social sciences, political sciences, computational social science

4:45

06/12/2020

Cross-lingual Retrieval for Iterative Self-Supervised Training

Chau Tran, Yuqing Tang, Xian Li, Jiatao Gu

3:11

16/11/2020

Semantic Drift in Multilingual Representations

Lisa Beinborn, Rochelle Choenni

Keywords: multilingual representations, computational representations, representational analysis, analysis method

12:44

16/11/2020

With More Contexts Comes Better Performance: Contextualized Sense Embeddings for All-Round Word Sense Disambiguation

Bianca Scarlini, Tommaso Pasini, Roberto Navigli

Keywords: natural processing, english task, word-in-context task, contextualized embeddings

12:11

19/04/2021

Learning coupled policies for simultaneous machine translation using imitation learning

Philip Arthur, Trevor Cohn, Gholamreza Haffari

11:55

08/12/2020

BME-TUW at SR’20: Lexical grammar induction for surface realization

Gábor Recski, Ádám Kovács, Kinga Gémes, Judit Ács, Andras Kornai

15:32

26/04/2020

Neural Machine Translation with Universal Visual Representation

Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao

Keywords: Neural Machine Translation, Visual Representation, Multimodal Machine Translation, Language Representation

4:50

29/06/2020

What is the vocabulary of flaky tests?

Gustavo Pinto, Breno Miranda, Supun Dissanayake, Marcelo Amorim, Christoph Treude, Antonia Bertolino

Keywords: Regression testing, Text classification, Test flakiness

13:04

16/11/2020

PyMT5: multi-mode translation of natural language and Python code with transformers

Colin Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan

Keywords: automated understanding, docstring generation, method generation, docstring summarization

10:43

04/07/2020

Syntactic Search by Example

Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, Yoav Goldberg

Keywords: Syntactic Search, Search, syntax-based queries, syntactic representations

11:23