Enhance word representation for out-of-vocabulary on Ubuntu dialogue corpus.
The Ubuntu dialogue corpus is the largest publicly available dialogue corpus, making it feasible to build end-to-end deep neural network models directly from the conversation data. One challenge of the Ubuntu dialogue corpus is the large number of out-of-vocabulary words. In this paper we propose a method which combines general pre-trained word embedding vectors with those generated on the task-specific training set to address this issue. We integrated character embedding into Chen et al.'s Enhanced LSTM method (ESIM) and used it to evaluate the effectiveness of our proposed method. For the task of next utterance selection, the proposed method demonstrated a significant performance improvement over the original ESIM, and the new model achieved state-of-the-art results on both the Ubuntu dialogue corpus and the Douban conversation corpus. In addition, we investigated the performance impact of end-of-utterance and end-of-turn token tags.
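The core idea of combining general and task-specific embeddings can be sketched as simple vector concatenation, where a word missing from one embedding table falls back to a zero vector for that component. This is a minimal illustration, not the paper's exact construction; the dictionaries, dimensions, and function name below are hypothetical.

```python
import numpy as np

def combine_embeddings(word, general, task_specific, d_gen, d_task):
    """Concatenate a general pre-trained vector with a task-specific one.

    Words missing from either table get a zero vector for that part, so a
    word that is OOV for the general embedding but present in the
    task-specific training set still receives a non-zero representation.
    """
    g = general.get(word, np.zeros(d_gen))
    t = task_specific.get(word, np.zeros(d_task))
    return np.concatenate([g, t])

# Toy tables with hypothetical 3-dim general and 2-dim task embeddings.
general = {"ubuntu": np.array([0.1, 0.2, 0.3])}
task = {"ubuntu": np.array([0.7, 0.8]),
        "apt-get": np.array([0.5, 0.6])}  # seen only in task data

v_known = combine_embeddings("ubuntu", general, task, 3, 2)
v_oov = combine_embeddings("apt-get", general, task, 3, 2)  # OOV for general table
```

Under this sketch, `v_oov` is zero in its first three components but carries task-specific information in the rest, which is the motivation for the combination.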