The Gutenberg dialogue dataset

Abstract: Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., OpenSubtitles). We narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We extract and process dialogues from public-domain books made available by Project Gutenberg. We describe our dialogue extraction pipeline, analyze the effects of the various heuristics used, and present an error analysis of extracted dialogues. Finally, we conduct experiments showing that training on our data yields better response quality in zero-shot and finetuning settings than training on the larger but much noisier OpenSubtitles dataset. Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can be extended to further languages with little additional effort. Researchers can also build their own versions of existing datasets by adjusting various trade-off parameters.

Richard Csaky, Gábor Recski

Similar Papers

Diversifying Dialogue Generation with Non-Conversational Text

Hui Su, Xiaoyu Shen, Sanqiang Zhao, Zhou Xiao, Pengwei Hu, Randy Zhong, Cheng Niu, Jie Zhou

Keywords: Diversifying Generation, low-diversity problem, open-domain generation, dialogue generation

A Structure Self-Aware Model for Discourse Parsing on Multi-Party Dialogues

Ante Wang, Linfeng Song, Hui Jiang, Shaopeng Lai, Junfeng Yao, Min Zhang, Jinsong Su

Keywords: Natural Language Processing, Dialogue, Discourse, Tagging, Chunking, and Parsing

Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation

Biao Zhang, Ankur Bapna, Rico Sennrich, Orhan Firat

Keywords: multilingual transformer, multilingual translation, language-specific modeling, conditional computation

Dialogue Distillation: Open-Domain Dialogue Augmentation Using Unpaired Data

Rongsheng Zhang, Yinhe Zheng, Jianzhi Shao, Xiaoxi Mao, Yadong Xi, Minlie Huang

Keywords: collecting data, automatic evaluation, open-domain systems, neural models

Data Manipulation: Towards Effective Instance Learning for Neural Dialogue Generation via Learning to Augment and Reweight

Hengyi Cai, Hongshen Chen, Yonghao Song, Cheng Zhang, Xiaofang Zhao, Dawei Yin

Keywords: Data Manipulation, Neural Generation, learning, dialogue generation

Learning Nearly Decomposable Value Functions Via Communication Minimization

Tonghan Wang*, Jianhao Wang*, Chongyi Zheng, Chongjie Zhang

Keywords: Multi-agent reinforcement learning, Nearly decomposable value function, Minimized communication

Improving Stylized Neural Machine Translation with Iterative Dual Knowledge Transfer

Xuanxuan Wu, Jian Liu, Xinjie Li, Jinan Xu, Yufeng Chen, Yujie Zhang, Hui Huang

Keywords: Natural Language Processing, Machine Translation, Natural Language Generation

Tuning-free Plug-and-Play Proximal Algorithm for Inverse Imaging Problems

Kaixuan Wei, Angelica I Aviles-Rivero, Jingwei Liang, Ying Fu, Carola-Bibiane Schönlieb, Hua Huang

Keywords: Deep Learning - Algorithms

It's Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information

Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, Naoaki Okazaki

Keywords: Measuring Difficulty, generation, asymmetric difficulty, machine difficulty

Diversity-Based Generalization for Unsupervised Text Classification under Domain Shift

Jitin Krishnan, Hemant Purohit, Huzefa Rangwala

Keywords: text classification, unsupervised domain adaptation, natural language processing, neural networks

Using Context in Neural Machine Translation Training Objectives

Danielle Saunders, Felix Stahlberg, Bill Byrne

Keywords: Neural training, NMT training, document-level training, NMT objective

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

Keywords: Natural Language Processing, BERT, Representation Learning

Educating Text Autoencoders: Latent Representation Guidance via Denoising

Tianxiao Shen, Jonas Mueller, Regina Barzilay, Tommi Jaakkola

Keywords: Deep Learning - Generative Models and Autoencoders

Factorized Higher-Order CNNs With an Application to Spatio-Temporal Emotion Estimation

Jean Kossaifi, Antoine Toisoul, Adrian Bulat, Yannis Panagakis, Timothy M. Hospedales, Maja Pantic

Keywords: tensor methods, deep learning, spatiotemporal, emotion, cnn, tensor decomposition, low-rank, valence, arousal

Joint-Label Learning by Dual Augmentation for Time Series Classification

Qianli Ma, Zhenjing Zheng, Jiawei Zheng, Sen Li, Wanqing Zhuang, Garrison W. Cottrell

Cross-Lingual Unsupervised Sentiment Classification with Multi-View Transfer Learning

Hongliang Fei, Ping Li

Keywords: Cross-Lingual Classification, sentiment classification, unsupervised system, classification

Two training strategies for improving relation extraction over universal graph

Qin Dai, Naoya Inoue, Ryo Takahashi, Kentaro Inui

Adaptive End-to-End Budgeted Network Learning via Inverse Scale Space

Zuyuan Zhong, Chen Liu, Yanwei Fu

Keywords: deep learning, network architecture, growing network, budgeted network learning, pruning

Network-to-Network Translation with Conditional Invertible Neural Networks

Robin Rombach, Patrick Esser, Bjorn Ommer

Recipes for building an open-domain chatbot

Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, Jason Weston
