


Summary by Shagun Sodhani

Improving Word Representations via Global Context and Multiple Word Prototypes

Introduction

  • This paper pre-dated models like GloVe and Word2Vec and proposed an architecture that
    • combines local and global context while learning word embeddings to capture the word semantics.
    • learns multiple embeddings per word to account for homonymy and polysemy.
  • Link to the paper

Global Context-Aware Neural Language Model

Training Objective
  • Given a word sequence s (local context) and a document d in which the sequence occurs (global context), learn word representations while learning to discriminate the correct last word of s from other words.
  • g(s, d) - scoring function giving the likelihood of the correct sequence s occurring in document d.
  • g(s<sup>w</sup>, d) - scoring function giving the likelihood of s with its last word replaced by a word w.
  • Objective - g(s, d) > g(s<sup>w</sup>, d) + 1 for any other word w, i.e., the correct sequence should outscore every corrupted sequence by a margin of at least 1 (a hinge-loss sketch follows this list).
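
A minimal sketch of this ranking criterion written as a margin-1 hinge loss; the names `score_fn` and `vocabulary` are illustrative rather than from the paper, and in practice the sum over the vocabulary would be approximated by sampling a few corrupted words:

```python
def ranking_loss(score_fn, s, d, vocabulary):
    """Margin-1 ranking loss: the correct sequence s should score at least
    1 higher than any sequence whose last word is replaced by another word w."""
    correct = score_fn(s, d)                                       # g(s, d)
    loss = 0.0
    for w in vocabulary:
        if w == s[-1]:
            continue
        corrupted = s[:-1] + [w]                                   # last word of s replaced by w
        loss += max(0.0, 1.0 - correct + score_fn(corrupted, d))   # hinge term
    return loss
```
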
Architecture
  • Two scoring components (neural networks) to capture:

    • Local Context
      • Map word sequence s into an ordered list of vectors x = [x<sub>1</sub>, ..., x<sub>m</sub>].
      • x<sub>i</sub> - embedding corresponding to i<sup>th</sup> word in the sequence.
      • Compute local score score<sub>l</sub> by using a neural network (with one hidden layer) over x.
      • Preserves word order and syntactic information.
    • Global Context
      • Map document d to an ordered list of word embeddings, d = (d<sub>1</sub>, ..., d<sub>k</sub>).
      • Compute c, the weighted average of all word vectors in document.
      • The paper uses idf scores as the weights for the word vectors.
      • x = concatenation of c and vector of the last word in s.
      • Compute global score score<sub>g</sub> by using a neural network (with two hidden layers) over x.
      • Captures information similar to bag-of-words features.
      • Final score: score = score<sub>l</sub> + score<sub>g</sub> (a sketch of both scorers follows this list).
    • Train the weights of the hidden layers and the word embeddings.
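
A minimal sketch of the two scorers, assuming tanh hidden layers; the weight names and shapes (W1, b1, V1, ...) are illustrative rather than taken from the paper, and the final score fed to the ranking objective is simply their sum:

```python
import numpy as np

def local_score(window_vecs, W1, b1, w2, b2):
    """score_l: one-hidden-layer network over the concatenated window
    embeddings, so the word order of the sequence is preserved."""
    x = np.concatenate(window_vecs)              # [x_1; ...; x_m]
    h = np.tanh(W1 @ x + b1)                     # single hidden layer
    return float(w2 @ h + b2)

def global_score(doc_vecs, idf_weights, last_word_vec, V1, c1, V2, c2, v3, c3):
    """score_g: two-hidden-layer network over the idf-weighted average of all
    document word vectors concatenated with the last word's vector."""
    w = np.asarray(idf_weights, dtype=float)
    c = (w[:, None] * np.stack(doc_vecs)).sum(axis=0) / w.sum()   # weighted average of document
    x = np.concatenate([c, last_word_vec])
    h1 = np.tanh(V1 @ x + c1)
    h2 = np.tanh(V2 @ h1 + c2)
    return float(v3 @ h2 + c3)

# score = score_l + score_g
```
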
Multi-Prototype Neural Language Model
  • Words can have different meanings in different contexts which are difficult to capture when we train only one vector per word.
  • Solution - train multiple vectors per word to capture the different meanings.
  • Approach

    • Gather all the fixed-sized context windows for all occurrences of a given word.
    • Compute the context vector as a weighted average of the word vectors in the context window.
    • Cluster the context vectors using spherical k-means.
    • Each word occurrence in the corpus is re-labeled to its associated cluster.
    • To find similarity between a pair of words (w, w'):
      • For each pair of clusters (i, j) corresponding to w and w', compute the distance between the cluster centers of i and j and weight it by the product of the probability of w belonging to i and the probability of w' belonging to j, given their respective contexts.
      • Average this value over the k<sup>2</sup> cluster pairs (a sketch follows this list).
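
A minimal sketch of the clustering step and the context-weighted similarity. Spherical k-means is approximated here by running standard scikit-learn k-means on L2-normalised context vectors, cosine similarity is assumed as the distance between prototypes, and the inputs `probs_w` / `probs_v` stand for the context-conditioned cluster probabilities (all assumptions, not details from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_contexts(context_vecs, k=10):
    """Cluster the per-occurrence context vectors of one word; each cluster
    centre acts as one prototype (sense vector) of the word. Spherical k-means
    is approximated by normalising the vectors before standard k-means."""
    X = np.asarray(context_vecs, dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    return km.cluster_centers_, km.labels_       # prototypes, per-occurrence cluster ids

def contextual_similarity(centers_w, probs_w, centers_v, probs_v):
    """Weighted average over all k^2 prototype pairs: cosine similarity of the
    pair (i, j), weighted by p(i | w, context) * p(j | w', context)."""
    sim = 0.0
    for i, mu_i in enumerate(centers_w):
        for j, mu_j in enumerate(centers_v):
            cos = mu_i @ mu_j / (np.linalg.norm(mu_i) * np.linalg.norm(mu_j))
            sim += probs_w[i] * probs_v[j] * cos
    return sim
```
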

Training

  • Dataset

    • Wikipedia corpus
  • Parameters

    • 10-word windows
    • 100 hidden units
    • No weight regularization
    • 10 different word embeddings learnt for words having multiple meanings.

Evaluation

  • Dataset

    • WordSim-353
      • 353 pairs of nouns
      • words represented without context
      • contains human similarity judgements on pairs of words
    • The paper contributed a new dataset
      • captures human similarity judgements on pairs of words in the context of a sentence
      • consists of verbs and adjectives along with nouns
      • for details on how the dataset is constructed, refer to the paper
  • Performance

    • Proposed model achieves higher correlation to human scores than models using only the local or global context.
    • Performance can be improved by removing stop words.
    • Using multi-prototype approach (multiple vectors for the same word) benefits the model on the tasks where the context is also given.

Comments

  • This work predated the more general word embedding models like Word2Vec and GloVe. While this model performs well on intrinsic evaluation tasks like word similarity, it is outperformed by the more general and recent models on downstream tasks like NER.