08/12/2020

Multi-grained Chinese Word Segmentation with Weakly Labeled Data

Chen Gong, Zhenghua Li, Bowei Zou, Min Zhang

Abstract: In contrast with traditional single-grained word segmentation (SWS), where a sentence corresponds to a single word sequence, multi-grained Chinese word segmentation (MWS) aims to segment a sentence into multiple word sequences so as to preserve all words of different granularities. Due to the lack of manually annotated MWS data, previous work trains and tunes MWS models only on automatically generated pseudo MWS data. In this work, we further take advantage of the rich word boundary information in existing SWS data and in naturally annotated data from dictionary example (DictEx) sentences to advance the state-of-the-art MWS model, based on the idea of weak supervision. In particular, we propose to accommodate two types of weakly labeled data for MWS, i.e., SWS data and DictEx data, by employing a simple yet competitive graph-based parser with local loss. In addition, for better evaluation, we manually annotate a high-quality MWS dataset according to our newly compiled annotation guideline; it consists of over 9,000 sentences from two types of text, i.e., canonical newswire (NEWS) and non-canonical web (BAIKE) data. Detailed evaluation shows that our proposed model with weakly labeled data significantly outperforms the state-of-the-art MWS model, by 1.12 and 5.97 F1 on NEWS and BAIKE data, respectively.
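
To make the SWS/MWS distinction concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an MWS annotation can be represented as the set of character spans of all words at every granularity, and that a single SWS segmentation contributes only a subset of those spans (which is what makes it a weak label). The example phrase, the helper names (`words_to_spans`, `merge_granularities`, `span_f1`), and the span-level F1 are illustrative assumptions; the paper's exact evaluation protocol may differ.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def words_to_spans(words: List[str]) -> Set[Span]:
    """Convert one single-grained word sequence into a set of character spans."""
    spans: Set[Span] = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def merge_granularities(segmentations: List[List[str]]) -> Set[Span]:
    """Union the spans of several segmentations of the same sentence.

    The result keeps every word of every granularity (assumed MWS representation).
    """
    sentence = "".join(segmentations[0])
    merged: Set[Span] = set()
    for words in segmentations:
        if "".join(words) != sentence:
            raise ValueError("all segmentations must cover the same sentence")
        merged |= words_to_spans(words)
    return merged


def span_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Span-level F1 between gold and predicted word spans (assumed metric)."""
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one phrase segmented at two granularities.
coarse = ["全国各地"]        # one coarse-grained word
fine = ["全国", "各地"]      # two fine-grained words
gold = merge_granularities([coarse, fine])

# A single SWS segmentation covers only part of the multi-grained gold spans.
sws_only = words_to_spans(fine)
print(sorted(gold))
print(round(span_f1(gold, sws_only), 3))
```

In this toy case the fine-grained segmentation recovers two of the three gold spans, so it is fully precise but incomplete. This is the sense in which SWS data gives partial word boundary information for MWS.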

The video of this talk cannot be embedded. You can watch it here:
https://underline.io/lecture/6149-multi-grained-chinese-word-segmentation-with-weakly-labeled-data
The talk and the corresponding paper were published at the COLING 2020 virtual conference.
