OCR Post Correction for Endangered Language Texts

16/11/2020

Shruti Rijhwani, Antonios Anastasopoulos, Graham Neubig

Keywords: natural models, general-purpose tools, ocr method, recognition rate

Abstract: There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
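The abstract reports a relative reduction in recognition error rate. As a minimal sketch of how such a figure can be computed, the snippet below uses character error rate (edit distance divided by reference length) and a relative reduction between a first-pass OCR output and a post-corrected output. The choice of character error rate, the function names, and the toy strings are illustrative assumptions; the abstract does not specify the exact metric or evaluation code used in the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the length of the reference transcription."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate removed by post-correction."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy strings, invented purely for illustration (not from the benchmark).
    reference = "endangered"
    first_pass = "endamgerod"   # hypothetical raw OCR output
    corrected = "endangerod"    # hypothetical post-corrected output
    cer_before = char_error_rate(reference, first_pass)
    cer_after = char_error_rate(reference, corrected)
    print(cer_before, cer_after, relative_reduction(cer_before, cer_after))
```

Under this definition, a 34% average reduction would mean that post-correction removes about a third of the errors left by the first-pass OCR system, averaged over the three languages.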

The talk and the corresponding paper were published at the EMNLP 2020 virtual conference.

Similar Papers

19/04/2021

Multilingual and cross-lingual document classification: A meta-learning approach

Niels Heijden, Helen Yannakoudakis, Pushkar Mishra, Ekaterina Shutova

11:51

16/11/2020

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation

Charles Welch, Rada Mihalcea, Jonathan K. Kummerfeld

Keywords: nlp applications, language model, language research, byte-pair encoding

5:12

02/02/2021

Simple or Complex? Learning to Predict Readability of Bengali Texts

Susmoy Chakraborty, Mir Tafseer Nayeem, Wasi Uddin Ahmad

16:31

04/07/2020

LINSPECTOR: Multilingual Probing Tasks for Word Representations

Gözde Gül Sahin, Clara Vania, Ilia Kuznetsov, Iryna Gurevych

Keywords: Word Representations, NLP, classification tasks, probing tasks

11:51

04/07/2020

Enabling Language Models to Fill in the Blanks

Chris Donahue, Mina Lee, Percy Liang

Keywords: text infilling, predicting text, writing tools, language modeling

7:01

16/11/2020

IIRC: A Dataset of Incomplete Information Reading Comprehension Questions

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, Pradeep Dasigi

Keywords: reading tasks, reading datasets, iirc, discrete reasoning

10:31

16/11/2020

Text Classification Using Label Names Only: A Language Model Self-Training Approach

Yu Meng, Yunyi Zhang, Jiaxin Huang, Chenyan Xiong, Heng Ji, Chao Zhang, Jiawei Han

Keywords: classification, category understanding, document classification, topic classification

11:38

03/05/2021

Taking Notes on the Fly Helps Language Pre-Training

Qiyu Wu, Chen Xing, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu

Keywords: Natural Language Processing, Pre-training

5:21

19/04/2021

Disfluency correction using unsupervised and semi-supervised learning

Nikhil Saini, Drumil Trivedi, Shreya Khare, Tejas Dhamecha, Preethi Jyothi, Samarth Bharadwaj, Pushpak Bhattacharyya

7:13

16/11/2020

Tackling the Low-resource Challenge for Canonical Segmentation

Manuel Mager, Özlem Çetinoğlu, Katharina Kann

Keywords: morphological generation, canonical segmentation, lstm pointer-generator, sequence-to-sequence model

11:55

16/11/2020

New Protocols and Negative Results for Textual Entailment Data Collection

Samuel R. Bowman, Jennimaria Palomaki, Livio Baldini Soares, Emily Pitler

Keywords: benchmarking, language understanding, transfer applications, crowdsourcing protocol

12:27

16/11/2020

Cross-lingual Spoken Language Understanding with Regularized Representation Alignment

Zihan Liu, Genta Indra Winata, Peng Xu, Zhaojiang Lin, Pascale Fung

Keywords: spoken systems, cross-lingual task, few-shot setting, cross-lingual models

9:40

26/04/2020

Pre-training Tasks for Embedding-based Large-scale Retrieval

Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar

Keywords: natural language processing, large-scale retrieval, unsupervised representation learning, paragraph-level pre-training, two-tower Transformer models

4:39

16/11/2020

IGT2P: From Interlinear Glossed Texts to Paradigms

Sarah Moeller, Ling Liu, Changbing Yang, Katharina Kann, Mans Hulden

Keywords: linguistic analysis, natural systems, igt-to-paradigms, igtp

11:29

08/12/2020

Cross-lingual Transfer Learning for Grammatical Error Correction

Ikumi Yamashita, Satoru Katsumata, Masahiro Kaneko, Aizhan Imankulova, Mamoru Komachi

14:32

08/12/2020

TableGPT: Few-shot Table-to-Text Generation with Table Structure Reconstruction and Content Matching

Heng Gong, Yawei Sun, Xiaocheng Feng, Bing Qin, Wei Bi, Xiaojiang Liu, Ting Liu

8:45

04/07/2020

From English to Code-Switching: Transfer Learning with Strong Morphological Clues

Gustavo Aguilar, Thamar Solorio

Keywords: natural processing, CS, language identification, CS tasks

14:26

16/11/2020

Simultaneous Machine Translation with Visual Context

Ozan Caglayan, Julia Ive, Veneta Haralampieva, Pranava Madhyastha, Loïc Barrault, Lucia Specia

Keywords: simt, multimodal approaches, simt frameworks, visually-grounded models

12:34

19/04/2021

Does she wink or does she nod? A challenging benchmark for evaluating word understanding of language models

Lutfi Kerem Senel, Hinrich Schütze

7:43

16/11/2020

XL-AMR: Enabling Cross-Lingual AMR Parsing with Transfer Learning Techniques

Rexhina Blloshmi, Rocco Tripodi, Roberto Navigli

Keywords: encoding semantics, cross-lingual parsing, english parsing, amr

11:20

01/07/2020

Adapting End-to-End Speech Recognition for Readable Subtitles

Danni Liu, Jan Niehues, Gerasimos Spanakis

22:16

05/12/2020

IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding

Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti

13:55

22/11/2021

The Curious Layperson: Fine-Grained Image Recognition without Expert Labels

Subhabrata Choudhury, Iro Laina, Christian Rupprecht, Andrea Vedaldi

Keywords: fine-grained recognition, weakly-supervised recognition, fine-grained retrieval, unsupervised recognition, image-to-text retrieval, text-to-image retrieval, image classification

8:53

16/11/2020

Do sequence-to-sequence VAEs learn global features of sentences?

Tom Bosc, Pascal Vincent

Keywords: generation, memorization, autoregressive models, variational autoencoder

12:00

04/07/2020

Learning and Evaluating Emotion Lexicons for 91 Languages

Sven Buechel, Susanna Rücker, Udo Hahn

Keywords: sentiment analysis, downstream applications, lexicon creation, bilingual model

12:42

16/11/2020

Effectively pretraining a speech translation decoder with Machine Translation data

Ashkan Alinejad, Anoop Sarkar

Keywords: automatic task, neural task, speech translation, end-to-end approach

6:12

01/07/2020

A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with Bilingual Semantic Similarity Rewards

Zi-Yi Dou, Sachin Kumar, Yulia Tsvetkov

4:35

19/04/2021

Generating syntactically controlled paraphrases without using annotated parallel pairs

Kuan-Hao Huang, Kai-Wei Chang

10:41

02/02/2021

Few-shot Font Generation with Localized Style Representations and Factorization

Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, Hyunjung Shim

14:55

08/12/2020

Text Classification by Contrastive Learning and Cross-lingual Data Augmentation for Alzheimer’s Disease Detection

Zhiqiang Guo, Zhaoci Liu, Zhenhua Ling, Shijin Wang, Lingjing Jin, Yunxia Li

13:12

14/09/2020

Inductive Document Representation Learning for Short Text Clustering

Junyang Chen, Zhiguo Gong, Xiao Dong, Wei Wang, Wei Wang, Weiwen Liu, Cong Wang, Xian Chen

10:45

04/07/2020

Hypernymy Detection for Low-Resource Languages via Meta Learning

Changlong Yu, Jialong Han, Haisong Zhang, Wilfred Ng

Keywords: Hypernymy Detection, lexical entailment, natural tasks, monolingual detection

6:53

19/04/2021

MTOP: A comprehensive multilingual task-oriented semantic parsing benchmark

Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, Yashar Mehdad

11:51

16/11/2020

Multi-resolution Annotations for Emoji Prediction

Weicheng Ma, Ruibo Liu, Lili Wang, Soroush Vosoughi

Keywords: natural tasks, emojis, linguistic components, multi-class setting

11:52

04/07/2020

A Multitask Learning Approach for Diacritic Restoration

Sawsan Alqahtani, Ajay Mishra, Mona Diab

Keywords: Diacritic Restoration, computational processing, restoring diacritics, NLP problems

14:12

08/12/2020

Emergent Communication Pretraining for Few-Shot Machine Translation

Yaoyiran Li, Edoardo Maria Ponti, Ivan Vulić, Anna Korhonen

14:42

19/04/2021

Leveraging end-to-end ASR for endangered language documentation: An empirical study on Yoloxóchitl Mixtec

Jiatong Shi, Jonathan D. Amith, Rey Castillo García, Esteban Guadalupe Sierra, Kevin Duh, Shinji Watanabe

11:23

19/04/2021

On the (in)effectiveness of images for text classification

Chunpeng Ma, Aili Shen, Hiyori Yoshikawa, Tomoya Iwakura, Daniel Beck, Timothy Baldwin

6:15

02/02/2021

Adversarial Meta Sampling for Multilingual Low-Resource Speech Recognition

Yubei Xiao, Ke Gong, Pan Zhou, Guolin Zheng, Xiaodan Liang, Liang Lin

14:04

16/11/2020

Topic Modeling in Embedding Spaces

Adji Bousso Dieng, Francisco Ruiz, David Blei

Keywords: generative documents, topic modeling, topic models, embedded model

12:46