2kenize: Tying Subword Sequences for Chinese Script Conversion

04/07/2020

- Pranav A, Isabelle Augenstein

Keywords: Chinese Conversion, Chinese NLP, mapping sequences, topic classification

Abstract: Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a single simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert a pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code mixing and named entities.

The talk and the corresponding paper were published at the ACL 2020 virtual conference.

Similar Papers

02/02/2021

LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching

Boer Lyu, Lu Chen, Su Zhu, Kai Yu

15:57

16/11/2020

A Joint Multiple Criteria Model in Transfer Learning for Cross-domain Chinese Word Segmentation

Kaiyu Huang, Degen Huang, Zhuang Liu, Fengran Mo

Keywords: natural, chinese segmentation, chinese, chinese tasks

10:49

02/02/2021

FontRL: Chinese Font Synthesis via Deep Reinforcement Learning

Yitian Liu, Zhouhui Lian

13:49

04/07/2020

Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge

Yuanhe Tian, Yan Song, Xiang Ao, Fei Xia, Xiaojun Quan, Tong Zhang, Yonggang Wang

Keywords: Chinese Segmentation, Part-of-speech Tagging, Chinese processing, joint tagging

11:53

01/07/2020

Incorporating Uncertain Segmentation Information into Chinese NER for Social Media Text

Shengbin Jia, Ling Ding, Xiaojun Chen, Shijia E, Yang Xiang

18:51

01/07/2020

A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with Bilingual Semantic Similarity Rewards

Zi-Yi Dou, Sachin Kumar, Yulia Tsvetkov

4:35

04/07/2020

Spelling Error Correction with Soft-Masked BERT

Shaohua Zhang, Haoran Huang, Jicong Liu, Hang Li

Keywords: Spelling Correction, Chinese correction, Chinese CSC, error detection

11:34

04/07/2020

Pre-training via Leveraging Assisting Languages for Neural Machine Translation

Haiyue Song, Raj Dabre, Zhuoyuan Mao, Fei Cheng, Sadao Kurohashi, Eiichiro Sumita

Keywords: Neural Translation, S2S tasks, LOI, low-resource translation

12:04

19/08/2021

Zero-Shot Chinese Character Recognition with Stroke-Level Decomposition

Jingye Chen, Bin Li, Xiangyang Xue

Keywords: Computer Vision, Recognition

8:03

08/12/2020

Multi-grained Chinese Word Segmentation with Weakly Labeled Data

Chen Gong, Zhenghua Li, Bowei Zou, Min Zhang

14:48

05/12/2020

English-to-Chinese transliteration with phonetic auxiliary task

Yuan He, Shay B. Cohen

14:10

02/02/2021

Few-shot Font Generation with Localized Style Representations and Factorization

Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, Hyunjung Shim

14:55

29/06/2020

Traceability support for multi-lingual software projects

Yalin Liu, Jinfeng Lin, Jane Cleland-Huang

Keywords: Traceability, Cross-lingual information retrieval, Generalized Vector Space Model

13:23

08/12/2020

Synonym Knowledge Enhanced Reader for Chinese Idiom Reading Comprehension

Siyu Long, Ran Wang, Kun Tao, Jiali Zeng, Xinyu Dai

9:58

02/02/2021

StrokeGAN: Reducing Mode Collapse in Chinese Font Generation via Stroke Encoding

Jinshan Zeng, Qi Chen, Yunxin Liu, Mingwen Wang, Yuan Yao

17:02

01/07/2020

Robust Neural Machine Translation with ASR Errors

Haiyang Xue, Yang Feng, Shuhao Gu, Wei Chen

8:15

30/11/2020

Self-supervised Learning of Orc-Bert Augmentator for Recognizing Few-Shot Oracle Characters

Wenhui Han, Xinlin Ren, Hangyu Lin, Yanwei Fu, Xiangyang Xue

7:38

05/01/2021

Handwritten Chinese Font Generation With Collaborative Stroke Refinement

Chuan Wen, Yujie Pan, Jie Chang, Ya Zhang, Siheng Chen, Yanfeng Wang, Mei Han, Qi Tian

5:01

05/12/2020

Sina Mandarin alphabetical words: a web-driven code-mixing lexical resource

Rong Xiang, Mingyu Wan, Qi Su, Chu-Ren Huang, Qin Lu

15:28

04/07/2020

A Graph-based Model for Joint Chinese Word Segmentation and Dependency Parsing

Hang Yan, Xipeng Qiu, Xuanjing Huang

Keywords: Joint Segmentation, Joint Parsing, Chinese segmentation, dependency parsing

8:15

16/11/2020

LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space

Tasnim Mohiuddin, M Saiful Bari, Shafiq Joty

Keywords: bilingual induction, bilingual, bli, semi-supervised method

12:09

19/04/2021

Don’t change me! User-controllable selective paraphrase generation

Mohan Zhang, Luchen Tan, Zihang Fu, Kun Xiong, Jimmy Lin, Ming Li, Zhengkai Tu

6:03

08/12/2020

Joint Chinese Word Segmentation and Part-of-speech Tagging via Multi-channel Attention of Character N-grams

Yuanhe Tian, Yan Song, Fei Xia

14:53

25/07/2020

Chinese document classification with bi-directional convolutional language model

Bin Liu, Guosheng Yin

Keywords: text classification, CNN, neural language model

9:17

02/02/2021

Bridging the Domain Gap: Improve Informal Language Translation via Counterfactual Domain Adaptation

Ke Wang, Guandan Chen, Zhongqiang Huang, Xiaojun Wan, Fei Huang

18:24

04/07/2020

Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage

Ashish V. Thapliyal, Radu Soricut

Keywords: Cross-modal Generation, Web-scale Coverage, Cross-modal tasks, Pivot Stabilization

11:43

16/11/2020

Surprisal Predicts Code-Switching in Chinese-English Bilingual Text

Jesús Calvillo, Le Fang, Jeremy Cole, David Reitter

Keywords: code-switching, inhibition language, computational model, surprisal

11:29

02/02/2021

Ideography Leads Us to the Field of Cognition: A Radical-Guided Associative Model for Chinese Text Classification

Hanqing Tao, Shiwei Tong, Kun Zhang, Tong Xu, Qi Liu, Enhong Chen, Min Hou

14:26

16/11/2020

Learning Adaptive Segmentation Policy for Simultaneous Translation

Ruiqing Zhang, Chuanqiang Zhang, Zhongjun He, Hua Wu, Haifeng Wang

Keywords: simultaneous translation, translation, segmentation, chinese-english translation

11:43

19/04/2021

Progressively pretrained dense corpus index for open-domain question answering

Wenhan Xiong, Hong Wang, William Yang Wang

12:15

14/06/2020

Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension

Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, Qi Wu

Keywords: compositional referring expression comprehension, visual reasoning

1:00

14/06/2020

ContourNet: Taking a Further Step Toward Accurate Arbitrary-Shaped Scene Text Detection

Yuxin Wang, Hongtao Xie, Zheng-Jun Zha, Mengting Xing, Zilong Fu, Yongdong Zhang

Keywords: scene text detection, arbitrary shapes, false-positive suppression, large scale variance

1:01

04/07/2020

Paraphrase Generation by Learning How to Edit from Samples

Amirhossein Kazemnejad, Mohammadreza Salehi, Mahdieh Soleymani Baghshah

Keywords: Paraphrase Generation, Neural sequence, sequence generation, retrieval-based method

12:20

02/02/2021

Hierarchical Macro Discourse Parsing Based on Topic Segmentation

Feng Jiang, Yaxin Fan, Xiaomin Chu, Peifeng Li, Qiaoming Zhu, Fang Kong

17:28

30/11/2020

Scale-Aware Polar Representation for Arbitrarily-Shaped Text Detection

Yanguang Bi, Zhiqiang Hu

9:56

04/07/2020

SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check

Xingyi Cheng, Weidi Xu, Kunlong Chen, Shaohua Jiang, Feng Wang, Taifeng Wang, Wei Chu, Yuan Qi

Keywords: Chinese Check, spelling errors, spelling language, CSC

10:27

16/11/2020

Generating Diverse Translation from Model Distribution with Dropout

Xuanfu Wu, Yang Feng, Chenze Shao

Keywords: neural, inference, chinese-english tasks, nmt

11:09

06/12/2021

Few-Shot Segmentation via Cycle-Consistent Transformer

Gengwei Zhang, Guoliang Kang, Yi Yang, Yunchao Wei

Keywords: transformers, vision, few shot learning

11:58

16/11/2020

Entity Enhanced BERT Pre-training for Chinese NER

Chen Jia, Yuefeng Shi, Qinrong Yang, Yue Zhang

Keywords: chinese ner, pre-training, ner fine-tuning, ner

9:39

22/06/2020

XREF: Entity Linking for Chinese News Comments with Supplementary Article Reference

Xinyu Hua, Lei Li, Lifeng Hua, Lu Wang

Keywords: Entity Linking, Chinese social media, Data Augmentation

5:23