ParaCrawl: Web-Scale Acquisition of Parallel Corpora

04/07/2020

Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, Jaume Zaragoza

Keywords: sentence alignment, sentence filtering, machine translation systems, ParaCrawl

Abstract: We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

The talk and the corresponding paper were published at the ACL 2020 virtual conference.

