


Summary by Shagun Sodhani

Improving Word Representations via Global Context and Multiple Word Prototypes

Introduction

  • This paper pre-dated models like GloVe and Word2Vec and proposed an architecture that
    • combines local and global context while learning word embeddings to capture the word semantics.
    • learns multiple embeddings per word to account for homonymy and polysemy.
  • Link to the paper

Global Context-Aware Neural Language Model

Training Objective
  • Given a word sequence s (local context) and a document d in which the sequence occurs (global context), learn word representations while learning to discriminate the correct last word of s from other words.
  • g(s, d) - scoring function giving the likelihood of the correct sequence s occurring in document d.
  • g(s<sup>w</sup>, d) - scoring function giving the likelihood of s with its last word replaced by a word w.
  • Objective - g(s, d) > g(s<sup>w</sup>, d) + 1 for any other word w, i.e., the correct sequence should outscore every corrupted sequence by a margin of at least 1 (a hinge-loss sketch follows this list).
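
A minimal sketch of this ranking criterion written as a margin-1 hinge loss; the names `score_fn` and `vocabulary` are illustrative rather than from the paper, and in practice the sum over the vocabulary would be approximated by sampling a few corrupted words:

```python
def ranking_loss(score_fn, s, d, vocabulary):
    """Margin-1 ranking loss: the correct sequence s should score at least
    1 higher than any sequence whose last word is replaced by another word w."""
    correct = score_fn(s, d)                                       # g(s, d)
    loss = 0.0
    for w in vocabulary:
        if w == s[-1]:
            continue
        corrupted = s[:-1] + [w]                                   # last word of s replaced by w
        loss += max(0.0, 1.0 - correct + score_fn(corrupted, d))   # hinge term
    return loss
```
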
Architecture
  • Two scoring components (neural networks) to capture:

    • Local Context
      • Map word sequence s into an ordered list of vectors x = [x<sub>1</sub>, ..., x<sub>m</sub>].
      • x<sub>i</sub> - embedding corresponding to i<sup>th</sup> word in the sequence.
      • Compute local score score<sub>l</sub> by using a neural network (with one hidden layer) over x.
      • Preserves word order and syntactic information.
    • Global Context
      • Map document d to an ordered list of word embeddings, d = (d<sub>1</sub>, ..., d<sub>k</sub>).
      • Compute c, the weighted average of all word vectors in document.
      • The paper uses idf scores as the weights for the word vectors.
      • x = concatenation of c and vector of the last word in s.
      • Compute global score score<sub>g</sub> by using a neural network (with two hidden layers) over x.
      • Captures information similar to bag-of-words features.
      • Final score: score = score<sub>l</sub> + score<sub>g</sub> (a sketch of both scorers follows this list).
    • Train the weights of the hidden layers and the word embeddings.
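
A minimal sketch of the two scorers, assuming tanh hidden layers; the weight names and shapes (W1, b1, V1, ...) are illustrative rather than taken from the paper, and the final score fed to the ranking objective is simply their sum:

```python
import numpy as np

def local_score(window_vecs, W1, b1, w2, b2):
    """score_l: one-hidden-layer network over the concatenated window
    embeddings, so the word order of the sequence is preserved."""
    x = np.concatenate(window_vecs)              # [x_1; ...; x_m]
    h = np.tanh(W1 @ x + b1)                     # single hidden layer
    return float(w2 @ h + b2)

def global_score(doc_vecs, idf_weights, last_word_vec, V1, c1, V2, c2, v3, c3):
    """score_g: two-hidden-layer network over the idf-weighted average of all
    document word vectors concatenated with the last word's vector."""
    w = np.asarray(idf_weights, dtype=float)
    c = (w[:, None] * np.stack(doc_vecs)).sum(axis=0) / w.sum()   # weighted average of document
    x = np.concatenate([c, last_word_vec])
    h1 = np.tanh(V1 @ x + c1)
    h2 = np.tanh(V2 @ h1 + c2)
    return float(v3 @ h2 + c3)

# score = score_l + score_g
```
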
Multi-Prototype Neural Language Model
  • Words can have different meanings in different contexts which are difficult to capture when we train only one vector per word.
  • Solution - train multiple vectors per word to capture the different meanings.
  • Approach

    • Gather all the fixed-sized context windows for all occurrences of a given word.
    • Compute the context vector as a weighted average of the word vectors in the context window.
    • Cluster the context vectors using spherical k-means.
    • Each word occurrence in the corpus is re-labeled to its associated cluster.
    • To find similarity between a pair of words (w, w'):
      • For each pair of clusters (i, j) corresponding to w and w', compute the distance between the cluster centers of i and j and weight it by the product of the probability of w belonging to i and the probability of w' belonging to j, given their respective contexts.
      • Average this value over the k<sup>2</sup> cluster pairs (a sketch follows this list).
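
A minimal sketch of the clustering step and the context-weighted similarity. Spherical k-means is approximated here by running standard scikit-learn k-means on L2-normalised context vectors, cosine similarity is assumed as the distance between prototypes, and the inputs `probs_w` / `probs_v` stand for the context-conditioned cluster probabilities (all assumptions, not details from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_contexts(context_vecs, k=10):
    """Cluster the per-occurrence context vectors of one word; each cluster
    centre acts as one prototype (sense vector) of the word. Spherical k-means
    is approximated by normalising the vectors before standard k-means."""
    X = np.asarray(context_vecs, dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    return km.cluster_centers_, km.labels_       # prototypes, per-occurrence cluster ids

def contextual_similarity(centers_w, probs_w, centers_v, probs_v):
    """Weighted average over all k^2 prototype pairs: cosine similarity of the
    pair (i, j), weighted by p(i | w, context) * p(j | w', context)."""
    sim = 0.0
    for i, mu_i in enumerate(centers_w):
        for j, mu_j in enumerate(centers_v):
            cos = mu_i @ mu_j / (np.linalg.norm(mu_i) * np.linalg.norm(mu_j))
            sim += probs_w[i] * probs_v[j] * cos
    return sim
```
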

Training

  • Dataset

    • Wikipedia corpus
  • Parameters

    • 10-word windows
    • 100 hidden units
    • No weight regularization
    • 10 different word embeddings learnt for words having multiple meanings.

Evaluation

  • Dataset

    • WordSim-353
      • 353 pairs of nouns
      • words represented without context
      • contains human similarity judgements on pairs of words
    • The paper contributed a new dataset
      • captures human similarity judgements on pairs of words in the context of a sentence
      • consists of verbs and adjectives along with nouns
      • for details on how the dataset is constructed, refer to the paper
  • Performance

    • Proposed model achieves higher correlation to human scores than models using only the local or global context.
    • Performance can be improved by removing stop words.
    • Using multi-prototype approach (multiple vectors for the same word) benefits the model on the tasks where the context is also given.

Comments

  • This work predated the more general word embedding models like Word2Vec and GloVe. While this model performs well on intrinsic evaluation tasks like word similarity, it is outperformed by the more general and recent models on downstream tasks like NER.