Building a Japanese Typo Dataset from Wikipedia's Revision History

04/07/2020

Yu Tanaka, Yugo Murawaki, Daisuke Kawahara, Sadao Kurohashi

Keywords: NLP systems, typo correction, data-driven system, spelling checker

Abstract: User-generated texts contain many typos, which must be corrected for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Compared with other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented, so we cannot simply apply a spelling checker, and (2) the way people input kanji logographs results in typos whose surface forms differ drastically from the correct ones. We address these challenges by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.
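The idea of character-based extraction from revision pairs can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes two revisions of the same sentence are already aligned, uses Python's standard difflib for a character-level diff, and keeps only short edits as typo candidates. The function name extract_candidate_pairs and the max_edit_len threshold are hypothetical, and the reading-based checks and other filtering steps mentioned in the abstract are omitted.

```python
from difflib import SequenceMatcher


def extract_candidate_pairs(old_sentence, new_sentence, max_edit_len=3):
    """Return (operation, old_span, new_span) tuples for short character edits.

    Works on raw characters, so no word segmentation is required.
    max_edit_len is an illustrative threshold, not a value from the paper.
    """
    # autojunk=False disables difflib's heuristic that ignores frequent characters.
    matcher = SequenceMatcher(None, old_sentence, new_sentence, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old_span = old_sentence[i1:i2]
        new_span = new_sentence[j1:j2]
        # Long edits are more likely content changes than typo fixes.
        if len(old_span) <= max_edit_len and len(new_span) <= max_edit_len:
            pairs.append((tag, old_span, new_span))
    return pairs


if __name__ == "__main__":
    # A dropped character, and a kanji conversion error between homophones.
    print(extract_candidate_pairs("東京に行きまた", "東京に行きました"))
    print(extract_candidate_pairs("以外な結果", "意外な結果"))
```

The second example shows why a surface diff alone is not enough for Japanese: 以外 and 意外 share a reading but no characters, so confirming that such a pair is a conversion error rather than a content change would need reading information, which is where a morphological analyzer comes in.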

This is an embedded video. The talk and the corresponding paper were published at the ACL 2020 virtual conference.

Similar Papers

16/11/2020

Few-Shot Learning for Opinion Summarization

Arthur Bražinskas, Mirella Lapata, Ivan Titov

Keywords: opinion summarization, automatic text, summary production, summarization mode

11:48

04/07/2020

LINSPECTOR: Multilingual Probing Tasks for Word Representations

Gözde Gül Sahin, Clara Vania, Ilia Kuznetsov, Iryna Gurevych

Keywords: Word Representations, NLP, classification tasks, probing tasks

11:51

04/07/2020

Language Models as an Alternative Evaluator of Word Order Hypotheses: A Case Study in Japanese

Tatsuki Kuribayashi, Takumi Ito, Jun Suzuki, Kentaro Inui

Keywords: Evaluator Hypotheses, analyzing order, Language Models, neural models

9:33

19/04/2021

How to evaluate a summarizer: Study design and statistical analysis for manual linguistic quality evaluation

Julius Steen, Katja Markert

12:04

04/07/2020

Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension

Kai Sun, Dian Yu, Dong Yu, Claire Cardie

Keywords: Chinese Comprehension, Machine tasks, real-world problems, data augmentation

11:47

02/02/2021

Towards Fully Automated Manga Translation

Ryota Hinami, Shonosuke Ishiwatari, Kazuhiko Yasuda, Yusuke Matsui

19:47

19/04/2021

Don’t change me! User-controllable selective paraphrase generation

Mohan Zhang, Luchen Tan, Zihang Fu, Kun Xiong, Jimmy Lin, Ming Li, Zhengkai Tu

6:03

05/12/2020

UnihanLM: Coarse-to-fine Chinese-Japanese language model pretraining with the unihan database

Canwen Xu, Tao Ge, Chenliang Li, Furu Wei

8:52

16/11/2020

ToTTo: A Controlled Table-To-Text Generation Dataset

Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, Dipanjan Das

Keywords: controlled task, high-precision generation, totto, dataset process

11:53

05/12/2020

Sina Mandarin alphabetical words: a web-driven code-mixing lexical resource

Rong Xiang, Mingyu Wan, Qi Su, Chu-Ren Huang, Qin Lu

15:28

08/12/2020

ContraCAT: Contrastive Coreference Analytical Templates for Machine Translation

Dario Stojanovski, Benno Krojer, Denis Peskov, Alexander Fraser

14:09

16/11/2020

Continuity of Topic, Interaction, and Query: Learning to Quote in Online Conversations

Lingzhi Wang, Jing Li, Xingshan Zeng, Haisong Zhang, Kam-Fai Wong

Keywords: persuasions, automatic generation, language generation, encoder-decoder framework

11:43

08/12/2020

Informative Manual Evaluation of Machine Translation Output

Maja Popović

15:26

08/12/2020

Automatic Word Association Norms (AWAN)

Jorge Reyes-Magaña, Gerardo Sierra Martínez, Gemma Bel-Enguix, Helena Gomez-Adorno

14:34

16/11/2020

Iterative Feature Mining for Constraint-Based Data Collection to Increase Data Diversity and Model Robustness

Stefan Larson, Anthony Zheng, Anish Mahendran, Rishi Tekriwal, Adrian Cheung, Eric Guldan, Kevin Leach, Jonathan K. Kummerfeld

Keywords: dialog tasks, intent classification, slot-filling, robust models

6:52

08/12/2020

Synonym Knowledge Enhanced Reader for Chinese Idiom Reading Comprehension

Siyu Long, Ran Wang, Kun Tao, Jiali Zeng, Xinyu Dai

9:58

25/07/2020

Joint aspect-sentiment analysis with minimal user guidance

Honglei Zhuang, Fang Guo, Chao Zhang, Liyuan Liu, Jiawei Han

Keywords: weakly-supervised, autoencoder, aspect-based sentiment analysis

14:01

08/12/2020

Is it Great or Terrible? Preserving Sentiment in Neural Machine Translation of Arabic Reviews

Hadeel Saadany, Constantin Orasan

14:35

08/12/2020

Multi-grained Chinese Word Segmentation with Weakly Labeled Data

Chen Gong, Zhenghua Li, Bowei Zou, Min Zhang

14:48

04/07/2020

2kenize: Tying Subword Sequences for Chinese Script Conversion

Pranav A, Isabelle Augenstein

Keywords: Chinese Conversion, Chinese NLP, mapping sequences, topic classification

10:32

04/07/2020

A Probabilistic Generative Model for Typographical Analysis of Early Modern Printing

Kartik Goyal, Chris Dyer, Christopher Warren, Maxwell G'Sell, Taylor Berg-Kirkpatrick

Keywords: Typographical Printing, clustering images, archiving process, Early printing

7:07

04/07/2020

Unsupervised Opinion Summarization as Copycat-Review Generation

Arthur Bražinskas, Mirella Lapata, Ivan Titov

Keywords: Unsupervised Summarization, Copycat-Review Generation, Opinion summarization, automatically summaries

10:55

16/11/2020

Generating similes effortlessly like a Pro: A Style Transfer Approach for Simile Generation

Tuhin Chakrabarty, Smaranda Muresan, Nanyun Peng

Keywords: human imagination, simile generation, mapping properties, sequence model

11:11

01/07/2020

English-to-Japanese Diverse Translation by Combining Forward and Backward Outputs

Masahiro Kaneko, Aizhan Imankulova, Tosho Hirasawa, Mamoru Komachi

5:00

22/06/2020

Cross-context News Corpus for Protest Events related Knowledge Base Construction

Ali Hürriyetoğlu, Erdem Yörük, Deniz Yüret, Osman Mutlu, Çağrı Yoltar, Fırat Duruşan, Burak Gürel

Keywords: protests, contentious politics, news, text classification, event extraction, social sciences, political sciences, computational social science

4:45

04/07/2020

An Empirical Comparison of Unsupervised Constituency Parsing Methods

Jun Li, Yifan Cao, Jiong Cai, Yong Jiang, Kewei Tu

Keywords: data preprocessing, empirical parsing, unsupervised parsing, Unsupervised Methods

6:16

07/06/2020

WikiHist.html: English Wikipedia’s Full Revision History in HTML Format

Blagoj Mitrevski, Tiziano Piccardi, Robert West

Keywords: languages, links, rest

2:44

02/02/2021

Writing Polishment with Simile: Task, Dataset and A Neural Approach

Jiayi Zhang, Zhi Cui, Xiaoqiang Xia, Yalong Guo, Yanran Li, Chen Wei, Jianwei Cui

13:52

26/04/2020

Plug and Play Language Models: A Simple Approach to Controlled Text Generation

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, Rosanne Liu

Keywords: controlled text generation, generative models, conditional generative models, language modeling, transformer

4:58

07/06/2020

Source Attribution: Recovering the Press Releases behind Health Science News

Ansel MacLaughlin, John Wihbey, Aleszu Bajak, David A. Smith

Keywords: articles, contexts, health, humans, news, news articles, predictions, relationships, representations, sources, texts

9:46

16/11/2020

Multi-resolution Annotations for Emoji Prediction

Weicheng Ma, Ruibo Liu, Lili Wang, Soroush Vosoughi

Keywords: natural tasks, emojis, linguistic components, multi-class setting

11:52

04/07/2020

Paraphrase Generation by Learning How to Edit from Samples

Amirhossein Kazemnejad, Mohammadreza Salehi, Mahdieh Soleymani Baghshah

Keywords: Paraphrase Generation, Neural sequence, sequence generation, retrieval-based method

12:20

01/07/2020

A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with Bilingual Semantic Similarity Rewards

Zi-Yi Dou, Sachin Kumar, Yulia Tsvetkov

4:35

08/12/2020

Effective Use of Target-side Context for Neural Machine Translation

Hideya Mino, Hitoshi Ito, Isao Goto, Ichiro Yamada, Takenobu Tokunaga

13:42

04/07/2020

Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora

Hila Gonen, Ganesh Jawahar, Djamé Seddah, Yoav Goldberg

Keywords: computational science, word embeddings, vector alignment, vector spaces

10:42

04/07/2020

Enabling Language Models to Fill in the Blanks

Chris Donahue, Mina Lee, Percy Liang

Keywords: text infilling, predicting text, writing tools, language modeling

7:01

19/04/2021

Evaluating the evaluation of diversity in natural language generation

Guy Tevet, Jonathan Berant

11:17

07/06/2021

Discovering and Categorising Language Biases in Reddit

Xavier Ferrer, Tom Van Nuenen, Jose M. Such, Natalia Criado

Keywords: Qualitative and quantitative studies of social media, Social network analysis, communities identification, expertise and authority discovery, Subjectivity in textual data, sentiment analysis, polarity/opinion identification and extraction, linguistic analy

8:03

08/12/2020

SpanAlign: Sentence Alignment Method based on Cross-Language Span Prediction and ILP

Katsuki Chousa, Masaaki Nagata, Masaaki Nishino

14:39

08/12/2020

AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations

Lifeng Han, Gareth Jones, Alan Smeaton

14:26