CamemBERT: a Tasty French Language Model

04/07/2020

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, Benoît Sagot

Keywords: Natural Language Processing, part-of-speech tagging, dependency parsing, named entity recognition

Abstract: Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models very limited in all languages except English. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best-performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks.

The talk and the corresponding paper were published at the ACL 2020 virtual conference.

Similar Papers

08/12/2020

Emergent Communication Pretraining for Few-Shot Machine Translation

Yaoyiran Li, Edoardo Maria Ponti, Ivan Vulić, Anna Korhonen

14:42

26/04/2020

Mogrifier LSTM

Gábor Melis, Tomáš Kočiský, Phil Blunsom

Keywords: lstm, language modelling

15:10

04/07/2020

Contextualized Sparse Representations for Real-Time Open-Domain Question Answering

Jinhyuk Lee, Minjoon Seo, Hannaneh Hajishirzi, Jaewoo Kang

Keywords: Real-Time Answering, Open-domain answering, phrase problem, Contextualized Representations

6:35

19/04/2021

Lexical normalization for code-switched data and its effect on POS tagging

Rob van der Goot, Özlem Çetinoğlu

12:13

04/07/2020

Efficient Contextual Representation Learning With Continuous Outputs

Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, Kai-Wei Chang

Keywords: natural tasks, Contextual Learning, Contextual models, language-model-based encoders

11:51

04/07/2020

Hypernymy Detection for Low-Resource Languages via Meta Learning

Changlong Yu, Jialong Han, Haisong Zhang, Wilfred Ng

Keywords: Hypernymy Detection, lexical entailment, natural tasks, monolingual detection

6:53

04/07/2020

Soft Gazetteers for Low-Resource Named Entity Recognition

Shruti Rijhwani, Shuyan Zhou, Graham Neubig, Jaime Carbonell

Keywords: Low-Resource Recognition, named entity recognition, Soft Gazetteers

7:23

16/11/2020

New Protocols and Negative Results for Textual Entailment Data Collection

Samuel R. Bowman, Jennimaria Palomaki, Livio Baldini Soares, Emily Pitler

Keywords: benchmarking, language understanding, transfer applications, crowdsourcing protocol

12:27

16/11/2020

Tackling the Low-resource Challenge for Canonical Segmentation

Manuel Mager, Özlem Çetinoğlu, Katharina Kann

Keywords: morphological generation, canonical segmentation, lstm pointer-generator, sequence-to-sequence model

11:55

05/12/2020

Vocabulary matters: A simple yet effective approach to paragraph-level question generation

Vishwajeet Kumar, Manish Joshi, Ganesh Ramakrishnan, Yuan-Fang Li

8:36

01/07/2020

SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, Mans Hulden

14:11

05/01/2021

Deep Interactive Thin Object Selection

Jun Hao Liew, Scott Cohen, Brian Price, Long Mai, Jiashi Feng

4:48

19/04/2021

Multilingual and cross-lingual document classification: A meta-learning approach

Niels van der Heijden, Helen Yannakoudakis, Pushkar Mishra, Ekaterina Shutova

11:51

06/12/2020

Modular Meta-Learning with Shrinkage

Yutian Chen, Abe Friesen, Feryal Behbahani, Arnaud Doucet, David Budden, Matthew Hoffman, Nando de Freitas

3:21

02/02/2021

KG-BART: Knowledge Graph-Augmented BART for Generative Commonsense Reasoning

Ye Liu, Yao Wan, Lifang He, Hao Peng, Philip S. Yu

17:52

08/12/2020

Learning as Abduction: Trainable Natural Logic Theorem Prover for Natural Language Inference

Lasha Abzianidze

14:56

22/11/2021

The Curious Layperson: Fine-Grained Image Recognition without Expert Labels

Subhabrata Choudhury, Iro Laina, Christian Rupprecht, Andrea Vedaldi

Keywords: fine-grained recognition, weakly-supervised recognition, fine-grained retrieval, unsupervised recognition, image-to-text retrieval, text-to-image retrieval, image classification

8:53

26/04/2020

DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling

Sachin Mehta, Rik Koncel-Kedziorski, Mohammad Rastegari, Hannaneh Hajishirzi

Keywords: sequence modeling, input representations, language modeling, word embedding

4:50

16/11/2020

X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models

Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, Graham Neubig

Keywords: factual retrieval, language models, lms, probing methods

9:45

14/06/2020

The GAN That Warped: Semantic Attribute Editing With Unpaired Data

Garoe Dorta, Sara Vicente, Neill D. F. Campbell, Ivor J. A. Simpson

Keywords: image editing, warping, high resolution, unpaired data, deep neural networks

1:01

19/04/2021

PPT: Parsimonious parser transfer for unsupervised cross-lingual adaptation

Kemal Kurniawan, Lea Frermann, Philip Schulz, Trevor Cohn

11:52

06/12/2020

ConvBERT: Improving BERT with Span-based Dynamic Convolution

Zi-Hang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan

3:20

19/04/2021

On the (in)effectiveness of images for text classification

Chunpeng Ma, Aili Shen, Hiyori Yoshikawa, Tomoya Iwakura, Daniel Beck, Timothy Baldwin

6:15

06/12/2021

On sensitivity of meta-learning to support data

Mayank Agarwal, Mikhail Yurochkin, Yuekai Sun

Keywords: machine learning, robustness, vision, meta learning, few shot learning

14:08

16/11/2020

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation

Charles Welch, Rada Mihalcea, Jonathan K. Kummerfeld

Keywords: nlp applications, language model, language research, byte-pair encoding

5:12

02/02/2021

SARG: A Novel Semi Autoregressive Generator for Multi-turn Incomplete Utterance Restoration

Mengzuo Huang, Feng Li, Wuhe Zou, Weidong Zhang

14:50

16/11/2020

Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses

Simon Flachs, Ophélie Lacroix, Helen Yannakoudakis, Marek Rei, Anders Søgaard

Keywords: gec applications, gec, gec systems, internal model

10:16

04/07/2020

BLEURT: Learning Robust Metrics for Text Generation

Thibault Sellam, Dipanjan Das, Ankur Parikh

Keywords: Learning Metrics, Text Generation, WMT task, pre-training scheme

11:46

18/07/2021

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Wonjae Kim, Bokyung Son, Ildoo Kim

Keywords: Algorithms, Multimodal Learning

19:03

04/07/2020

LINSPECTOR: Multilingual Probing Tasks for Word Representations

Gözde Gül Sahin, Clara Vania, Ilia Kuznetsov, Iryna Gurevych

Keywords: Word Representations, NLP, classification tasks, probing tasks

11:51

06/12/2021

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Yi Ren, Jinglin Liu, Zhou Zhao

Keywords: generative model

10:15

08/12/2020

Try to Substitute: An Unsupervised Chinese Word Sense Disambiguation Method Based on HowNet

Bairu Hou, Fanchao Qi, Yuan Zang, Xurui Zhang, Zhiyuan Liu, Maosong Sun

7:54

08/12/2020

Domain Transfer based Data Augmentation for Neural Query Translation

Liang Yao, Baosong Yang, Haibo Zhang, Boxing Chen, Weihua Luo

10:57

22/11/2021

Audio-Visual Speech Super-Resolution

Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay Namboodiri, C.V. Jawahar

Keywords: speech super-resolution, audio-visual data, audio-visual learning, pseudo-visual stream, multi-modal learning

10:01

08/12/2020

Best Practices for Data-Efficient Modeling in NLG: How to Train Production-Ready Neural Models with Less Data

Ankit Arun, Soumya Batra, Vikas Bhardwaj, Ashwini Challa, Pinar Donmez, Peyman Heidari, Hakan Inan, Shashank Jain, Anuj Kumar, Shawn Mei, Karthik Mohan, Michael White

15:01

25/07/2020

Reranking for efficient transformer-based answer selection

Yoshitomo Matsubara, Thuy Vu, Alessandro Moschitti

Keywords: natural language processing, question answering, transformer models, neural networks, information retrieval, reranking

9:45

05/12/2020

Heads-up! Unsupervised constituency parsing via self-attention heads

Bowen Li, Taeuk Kim, Reinald Kim Amplayo, Frank Keller

13:55

02/02/2021

Commonsense Knowledge Augmentation for Low-Resource Languages via Adversarial Learning

Bosung Kim, Juae Kim, Youngjoong Ko, Jungyun Seo

19:38

16/11/2020

Do sequence-to-sequence VAEs learn global features of sentences?

Tom Bosc, Pascal Vincent

Keywords: generation, memorization, autoregressive models, variational autoencoder

12:00

16/11/2020

Task-oriented Domain-specific Meta-Embedding for Text Classification

Xin Wu, Yi Cai, Yang Kai, Tao Wang, Qing Li

Keywords: natural tasks, downstream tasks, meta-embedding learning, meta-embedding methods

7:03