Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
Luong, Minh-Thang and Manning, Christopher D.
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp


Summary by Denny Britz 9 years ago

TLDR; The authors train a word-level NMT model in which UNK tokens in both the source and target sentences are handled by character-level RNNs that produce word representations. They can thus train a fast word-based system that still generalizes well and doesn't produce unknown words. The best system achieves a new state-of-the-art BLEU score of 19.9 on WMT'15 English to Czech translation.

Key Points
  • Source Sentence: The final hidden state of the character-RNN is used as the word representation (a sketch follows below).
  • Source Sentence: Character-RNNs are always initialized with a zero state, which allows their representations to be precomputed efficiently.
  • Target: Produce the word-level sentence (including UNK tokens) first, then run the char-RNNs to fill them in.
  • Target: Two ways to initialize char-RNN: With same hidden state as word-RNN (same-path), or with its own representation (separate-path)
  • The authors find that the attention mechanism is critical for pure character-based NMT models.
Notes
  • Given that the authors demonstrate the potential of character-based models, is the hybrid approach the right direction? If we had more compute power, would pure character-based models win?
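A minimal PyTorch-style sketch of the source-side idea (hypothetical sizes and vocabularies, not the paper's actual configuration): frequent words use a standard embedding table, while a rare word's characters are run through a zero-initialized character-LSTM whose final top-layer hidden state stands in as its word vector.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
WORD_VOCAB, CHAR_VOCAB, DIM = 10_000, 100, 512

word_emb = nn.Embedding(WORD_VOCAB, DIM)        # embeddings for frequent words
char_emb = nn.Embedding(CHAR_VOCAB, DIM)        # character embeddings for rare words
char_rnn = nn.LSTM(DIM, DIM, batch_first=True)  # character-level encoder

def source_word_vector(word_id, char_ids, unk_id):
    """Return a DIM-sized vector for one source token."""
    if word_id != unk_id:
        return word_emb(torch.tensor([word_id]))[0]
    # Rare word: run the char-LSTM from a zero state and take the
    # final hidden state as an on-the-fly word representation.
    chars = char_emb(torch.tensor(char_ids)).unsqueeze(0)  # (1, num_chars, DIM)
    _, (h_n, _) = char_rnn(chars)
    return h_n[-1, 0]                                      # top-layer final state
```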
Summary by Shagun Sodhani 8 years ago

Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models

Introduction

  • The paper presents a novel open-vocabulary NMT (Neural Machine Translation) system that translates mostly at the word level and falls back to character-level models for rare words.
  • Advantages:
    • Faster and easier to train as compared to character models.
    • Does not produce unknown words in the translations, so no post-hoc unk replacement technique is needed.
  • Link to the paper

Unk Replacement Technique

  • Most NMT systems operate on a constrained vocabulary and represent unknown words with the unk token.
  • A post-processing step replaces unk tokens with actual words using alignment information (a sketch follows this list).
  • Disadvantages:
    • These systems treat words as independent entities, even though many of them are morphologically related.
    • Difficult to capture things like name translation.
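A minimal sketch of that post-processing step, with hypothetical inputs: the alignment (typically derived from attention weights) points each unk to a source position, whose word is then translated with a bilingual dictionary or simply copied.

```python
def replace_unks(target_tokens, source_tokens, alignments, dictionary):
    """Replace each <unk> with a dictionary translation of its aligned
    source word, falling back to copying the source word itself."""
    output = []
    for i, token in enumerate(target_tokens):
        if token == "<unk>":
            src_word = source_tokens[alignments[i]]       # most-attended source position
            output.append(dictionary.get(src_word, src_word))
        else:
            output.append(token)
    return output

# Toy example: the name "Miroslav" is out of vocabulary and gets copied verbatim.
print(replace_unks(["<unk>", "said", "hello"],
                   ["Miroslav", "řekl", "ahoj"],
                   alignments=[0, 1, 2],
                   dictionary={"řekl": "said", "ahoj": "hello"}))
```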

Proposed Architecture

Word-level NMT
  • Deep LSTM encoder-decoder.
  • Global attention mechanism with a bilinear ('general') attention scoring function (a sketch follows this list).
  • Similar to regular NMT system except in the way unknown words are handled.
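The bilinear ('general') score is score(h_t, h_s) = h_t^T W_a h_s, computed against every source state in global attention. A minimal sketch with an assumed hidden size:

```python
import torch
import torch.nn as nn

DIM = 512                               # assumed hidden size
W_a = nn.Linear(DIM, DIM, bias=False)   # learned bilinear matrix

def global_attention(h_t, enc_states):
    """h_t: (DIM,) decoder state; enc_states: (S, DIM) all encoder states.
    Returns the context vector as an attention-weighted sum."""
    scores = enc_states @ W_a(h_t)           # (S,) bilinear scores h_s^T W_a h_t
    weights = torch.softmax(scores, dim=0)   # global: normalize over every source position
    return weights @ enc_states              # (DIM,) context vector
```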
Character-level NMT
  • Deep LSTM model used to generate on-the-fly representations of rare words (using the final hidden state of the top layer).
  • Advantages:
    • Simplified architecture.
    • Efficiency through precomputation: representations for rare source words can be computed all at once before each mini-batch (a sketch follows this list).
    • The model can be trained easily in an end-to-end fashion.
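Because the character-LSTM always starts from a zero state, the representations of all distinct rare source words in a mini-batch can be computed in one padded, packed pass before the word-level model runs. A PyTorch-style sketch under those assumptions:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

CHAR_VOCAB, DIM = 100, 512                               # hypothetical sizes
char_emb = nn.Embedding(CHAR_VOCAB, DIM, padding_idx=0)
char_rnn = nn.LSTM(DIM, DIM, batch_first=True)

def precompute_rare_word_vectors(char_id_lists):
    """char_id_lists: one list of character ids per distinct rare word.
    Returns a (num_rare_words, DIM) tensor of word representations."""
    lengths = torch.tensor([len(ids) for ids in char_id_lists])
    padded = pad_sequence([torch.tensor(ids) for ids in char_id_lists],
                          batch_first=True, padding_value=0)
    packed = pack_padded_sequence(char_emb(padded), lengths,
                                  batch_first=True, enforce_sorted=False)
    _, (h_n, _) = char_rnn(packed)   # zero initial state by default
    return h_n[-1]                   # final top-layer states, one row per rare word
```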
Hidden-state Initialization
  • For source representation, layers of the LSTM are initialized with zero hidden states and cell values.
  • For target representation, the same strategy is followed, except for the hidden state of the first layer, where one of the following approaches is used (both sketched below this list):
    • same-path target generation approach
      • Reuse the attentional vector computed just before the softmax of the word-level NMT.
    • separate-path target generation approach
      • Learn a new weight matrix W that produces its own counterpart vector for initializing the character-level decoder.
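A sketch of the two options (names and sizes are assumptions): h_t is the word decoder's hidden state and c_t its attention context; same-path reuses the attentional vector already fed to the word-level softmax, while separate-path learns its own matrix to produce a counterpart vector used only to seed the character decoder.

```python
import torch
import torch.nn as nn

DIM = 512
# same-path: reuse h~_t = tanh(W_c [c_t; h_t]), the vector the word-level
# decoder already feeds to its softmax.
W_c = nn.Linear(2 * DIM, DIM, bias=False)
# separate-path: a second, independently learned matrix producing a
# counterpart vector used only to initialize the character decoder.
W_c_sep = nn.Linear(2 * DIM, DIM, bias=False)

def char_decoder_init(h_t, c_t, separate_path=True):
    """Hidden state used to initialize the first layer of the char decoder."""
    combined = torch.cat([c_t, h_t], dim=-1)
    return torch.tanh(W_c_sep(combined) if separate_path else W_c(combined))
```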
Training Objective
  • J = J<sub>w</sub> + αJ<sub>c</sub>
  • J - total loss
  • J<sub>w</sub> - loss of the regular word-level NMT
  • J<sub>c</sub> - loss of the character-level NMT, weighted by α (a minimal sketch follows this list)
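A minimal sketch of the joint objective with placeholder losses (the value of α here is just an assumption):

```python
import torch

alpha = 1.0                                        # assumed weight on the character loss
word_nll = torch.tensor(2.3, requires_grad=True)   # placeholder word-level cross-entropy J_w
char_nll = torch.tensor(1.1, requires_grad=True)   # placeholder character-level cross-entropy J_c

loss = word_nll + alpha * char_nll                 # J = J_w + alpha * J_c
loss.backward()                                    # both models are trained jointly, end to end
```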
Word Character Generation Strategy
  • The final hidden state of the character-level decoder could be fed back as the representation of the unk token, but this approach would not be efficient.
  • Instead, unk is fed to the word-level decoder as-is, which decouples the character-level model so it can be executed as soon as the word-level model finishes.
  • During testing, a beam search decoder is run at the word level to find the best translation using the word NMT alone.
  • Next, a character-level decoder is run to generate a word in place of each unk so as to minimise the combined loss (a sketch follows this list).
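A sketch of this two-stage decoding, with hypothetical `word_beam_search` and `char_beam_search` helpers standing in for the word- and character-level decoders:

```python
def hybrid_decode(source, word_model, char_model, beam=12):
    """Stage 1: word-level beam search over the closed vocabulary (may emit <unk>).
    Stage 2: for every <unk>, run a character-level decoder seeded from the word
    decoder's state at that position to spell out an actual word."""
    tokens, decoder_states = word_beam_search(word_model, source, beam_size=beam)
    output = []
    for token, state in zip(tokens, decoder_states):
        if token == "<unk>":
            output.append(char_beam_search(char_model, init_state=state, beam_size=beam))
        else:
            output.append(token)
    return output
```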

Experiments

Data
  • WMT’15 translation task from English into Czech with newstest2013 (3000 sentences) as dev set and newstest2015 (2656 sentences) as a test set.
Metrics
  • Case-sensitive NIST BLEU.
  • chrF3 (character n-gram F-score; a sketch follows this list).
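chrF3 is a character n-gram F-score with recall weighted three times as heavily as precision (β = 3). A hand-rolled sketch for a single sentence pair (whitespace stripped here for simplicity; evaluation toolkits handle the details differently):

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Average character n-gram F_beta over n = 1..max_n."""
    hyp, ref = hypothesis.replace(" ", ""), reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        precision = overlap / sum(hyp_ngrams.values())
        recall = overlap / sum(ref_ngrams.values())
        if precision + recall == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta ** 2) * precision * recall
                            / (beta ** 2 * precision + recall))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0

print(chrf("the cat sat", "the cat sat on the mat"))
```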
Models
  • Purely word based
  • Purely character based
  • Hybrid (proposed model)
Observations
  • Hybrid model surpasses all the other systems (neural/non-neural) and establishes a new state-of-the-art result for English-Czech translation in WMT’15 with 19.9 BLEU.
  • Character-level models, when used as a replacement for the standard unk replacement technique in NMT, yield an improvement of up to +7.9 BLEU points.
  • Attention is very important for character-based models as the non-attentional character models perform poorly.
  • Character models trained with shorter backpropagation-through-time windows perform worse than ones trained with longer windows.
  • Separate-path strategy outperforms same-path strategy.
Rare word embeddings
  • The character-level model is also used to obtain representations for rare words.
  • Compute the Spearman correlation between word-similarity scores assigned by humans and by the model.
  • The hybrid model outperforms the recursive neural network model (which also uses a morphological analyser) on this task (a sketch follows this list).
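A sketch of that evaluation with made-up embeddings and human scores: cosine similarities from the model's rare-word vectors are rank-correlated against human judgments via Spearman's ρ.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical rare-word embeddings and human similarity judgments.
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(512)
              for w in ("unobtainable", "unachievable", "table")}
pairs = [("unobtainable", "unachievable", 8.5),   # (word1, word2, human score)
         ("unobtainable", "table",        1.0),
         ("unachievable", "table",        1.2)]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

model_scores = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in pairs]
human_scores = [score for _, _, score in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```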