Chinese content scoring: Open-access datasets and features on different segmentation levels

05/12/2020

Yuning Ding, Andrea Horbach, Torsten Zesch

Abstract: In this paper, we analyse the challenges of Chinese content scoring in comparison to English. Because a review of prior work on Chinese content scoring shows a lack of open-access data in the field, we present two short-answer data sets for Chinese. The Chinese Educational Short Answers data set (CESA) contains 1800 student answers to five science-related questions. As a second data set, we collected ASAP-ZH, which contains 942 answers to three existing prompts re-used from the ASAP data set. We adapt a state-of-the-art content scoring system for Chinese and evaluate it in several settings on these data sets. Results show that features on lower segmentation levels, such as character n-grams, tend to perform better than token-level features.
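
To illustrate the difference between the two segmentation levels mentioned in the abstract, the following is a minimal sketch, not the authors' system. The function names, the example answer, and its hand-made word segmentation are all assumptions introduced here for illustration. Character n-grams can be read directly off the text, whereas token n-grams depend on a prior word segmentation step, since Chinese is written without spaces between words.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return character n-grams of a string, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def token_ngrams(tokens, n):
    """Return n-grams over an already segmented list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical student answer: "Photosynthesis needs sunlight".
answer = "光合作用需要阳光"
# Hand-made segmentation (an assumption); a real system would use a word segmenter.
tokens = ["光合作用", "需要", "阳光"]

# Bag-of-n-grams feature counts on each segmentation level.
char_features = Counter(char_ngrams(answer, 2))
token_features = Counter(token_ngrams(tokens, 2))

print(len(char_features), "distinct character bigrams")
print(len(token_features), "distinct token bigrams")
```

The character-level features need no segmenter and are therefore unaffected by segmentation errors, which is one plausible reason they can be more robust on short answers.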