[paper review] Efficient Estimation of Word Representations in Vector Space (Word2Vec)

Abstract

Propose two model architrectures for computing continuous word vector (CBOW, Skip-Gram)
Propose two modified NNLM model using binary tree
Introduce new method to measure word similarity
Show that newly introduced models can be trained on very large model (reduced time complexity)

Previous Works

기존에 있었던 word vector 학습 모델은 크게 두 가지로 나눌 수 있었다.

1. SVD(Singular Value Decomposition) 기반 모델

LSD(Latent Semantic Analysis) 등이 해당한다. Singular value decomposition을 DTM에 적용하면 $A=U \Sigma V^T$ 형태의 식을 얻을 수 있는데, 이 가운데 행렬 $\Sigma$는 diagonal term에 singular value를 내림차순으로 가지는 diagonal matrix이다. 이 중 몇 개의 singular value만 남겨두고 나머지는 제거하여 noise를 제거하고 차원을 축소할 수 있는데, 이를 Truncated SVD라고 한다. DTM이나 TF-IDF 행렬에 truncated SVD를 적용하여 의미를 가진 vector를 얻어낸다.

2. Neural network 기반 모델

2.1 NNLM

반면 이 논문에서 관심을 가지는 대상은 neural network 기반 모델이다. 이 모델은 크게 NNLM 모델과 RNNLM 모델이 있는데, 이 두 가지를 나누어 설명해보고자 한다.

NNLM(feedforward Neural Network Language Model)은 Bengio(2003)에 의해서 제안된 모델이고, neural network를 사용해 실수 범위 내에 있는 continuous word vector를 만드는 방식이다.

NNLM은 크게 네 개의 layer로 구성된다.

input layer
projection layer
hidden layer
output layer

input layer의 input으로는 각 단어의 one-hot encoding vector가 들어간다. center word 이전과 이후 $N$개의 단어를 뽑아서 총 $N$개의 one-hot vector$(1 \times v)$가 common weight matrix$(v \times m)$와 inner product되어 $(1 \times m)$ 사이즈 벡터가 총 $N$개 얻어진다. 이 row vector를 vertical하게 concatenation시켜 $n \times m$ 크기의 matrix를 하나 얻게 되는데 이게 projection layer이다.

이를 수식으로 나타내면 다음과 같다.

$$p^{layer}= (lookup(x_{t-n};, ... lookup(x_{t-1}) $$

projection layer가 여타 hidden layer와 비교되는 점은, input vector의 concatenation으로만 이루어지기도 하지만, activation function이 매개하지 않고 단순히 lookup table의 한 row 값이 그대로 전달된다는 점에 있다. input vector가 one-hot vector이기 때문이다.

이 뒤 hidden layer와 output layer 사이에는 각각 $tanh$ function과 $softmax$ function이 매개하여 word vector를 만들어낸다.

이때 computation complexity는 다음과 같이 계산된다.

$$ Q = N \times D + N \times D \times H + H \times V $$

첫 term은 projection인데, 원래는 $N \times V \times D$지만 one-hot vector이므로 $V$가 사라진다. 결국 time complexity에서 대부분을 차지하는 term은 $ H \times V$이다. softmax를 계산하기 위해서 매우 많은 양의 inner product 연산을 해야하기 때문이다.

이 논문에서는 softmax function을 hierarchical softmax를 사용하여 $H \log{V}$로 크게 줄인다. 이 방법으로 큰 corpus에 대해서도 모델이 학습할 수 있게 된다.

2.2 RNNLM

RNNLM(Rcurrent Neural Network Language Model)은 이름에서 볼 수 있듯이 RNN의 개념을 활용하는 방법이다. 이전 NNLM은 제한된 window size 내에 있는 단어들만 활용할 수 있었다면 RNNLM은 전체 크기의 단어를 활용할 수 있다.

자세한 내용은 https://wikidocs.net/46496에 잘 나와있으므로 참조하면 좋다.

time complexity는 다음과 같다 :

$$Q = H \times H + H \times V$$

이 경우에도 마지막 term이 전체 time complexity를 dominate하는데, 역시 hierarchical softmax로 $H \log{V}$로 크게 줄어들 수 있다.

New Log-linear Models

여기서는 새로운 model architecture 두 가지를 제안한다. CBOW 모델과 Skip-Gram 모델이 제안되는데 둘을 나누어서 알아본다.

이 두 가지 모델은 다음을 전제로 한다.

You shall know a word by the company it keeps - John Rupert Firth

한 단어를 파악하기 위해서 주변에 있는 단어들의 정보들의 분포를 벡터로 나타내면 그 단어를 표현할 수 있다는 것이다. 이 생각이 Word2Vec 모델들의 기반이다.

CBOW (Continuous Bag Of Words)

CBOW는 주변에 있는 단어(context word)로 중심 단어(center word)를 예측하는 방식으로 학습하는 neural network 모델이다.

모델의 형태는 위와 같다. NNLM과 비교해 보면, projection layer 뒤와 output layer 사이의 non-linear hidden layer가 제거되어 있다. 또한 다른 점은 NNLM은 모든 단어 벡터가 lookup table의 row값을 가져와서 concatenate하는 것이었다면, 여기서는 모든 word에 weight 내적값이 더해져서 평균값으로 얻어진다는 것이다.

자세한 설명은 다음 figure과 함께 보자.

사진 출처 : CS224n: Natural Language Processing with Deep Learning Lecture Notes: Part I Word Vectors I: Introduction, SVD and Word2Vec

1. center word 앞 뒤로 추론에 사용할 window size $m$를 정한다. 논문에서는 앞뒤로 4개의 단어를 사용했을 때 최고의 performance를 얻을 수 있었다고 한다.

2. row vector인 context word들을 세로로 concatenate한다. context word는 one-hot vector이므로 $(v \times 1)$이고, 앞뒤로 $m$개의 단어를 뽑으면 $(2m \times v)$ 크기의 matrix이다.

3. 이 matrix를 weight matrix $W^1_{(v \times M)}$와 multiplicate하여 embedded word matrix $V_{(2m \times M)}$을 얻는다.

4. $V_{(2m \times M)}$를 세로 방향으로 average하여 하나의 hidden layer vector $h_{(1 \times M)}$를 얻는다.

5. $h_{(1 \times M)}$와 두 번째 weight matrix $W^2_{(M*V)}$와 multiplicate하여 $(1 \times v)$ 크기의 vector $y'$을 얻는다.

6. $y'$를 softmax function에 넣어 output vector $y$를 얻는다. 이를 정답의 one-hot vector와 cross-entropy loss를 이용하여 $W^1$와 $W^2$를 학습한다.

time complexity

이 떄 CBOW의 한 word당 time complexity는 다음과 같다 :

$$Q = N \times D + D \times V$$

여기서도 $V$가 매우 크므로 hierarchical softmax를 사용해서 log로 줄인다. 그러면 최종적인 time complexity는 다음과 같다.

$$Q = N \times D + D \times \log{ V}$$

Loss function

자세한 수식 전개 과정은 cs224n Lecture note 1을 인용한다 :

Skip-Gram

Skip-Gram model은 CBOW와 매우 유사하다. 다만 CBOW는 context word를 가지고 center word를 predict한다면, Skip-Gram model에서는 center word를 가지고 context word 모두를 predict한다.

다음 그림을 보면서 단계별로 살펴보면,

1. input vector $x_{(1*v)}$과 weight matrix $W^1_{(V \times N)}$을 곱하여 $h_{(1*N)}$를 얻는다.

2.$h_{(1*N)}$에 다시 weight matrix $W^2_{(N \times N)}$을 곱하여 $y'_{(1 \times v}$를 얻는다.

3. $y'_{(1 \times v}$에 softmax를 적용하여 $y$를 얻고 cross entropy loss로 weight를 학습한다.

Q1. 왜 $2m \times v$ matrix가 아니라 $v$ dimensional vector가 결과값인가?

여기서 내가 이해에 어려움을 겪었었는데.. 그림이나 수식 설명에서는 한 번에 $2m$개(위 figure에서는 $C$개)의 output vector를 만들어야 할 것 같은데 실제로 계산해 보면 $(1 \times v)$ 벡터밖에 안 나온다. 그래서 implement된 코드들을 여러 개 찾아봤는데도 모델에서는 $v$ dimensional vector 하나만 predict한다.

이는 실제로 구현할 때에는 한 iteration에서 $2m$($C$)개의 vector를 output으로 받아내는 것이 아니라, 한 번에 한 개의 context word만 예측하기 때문에 발생한 혼란이다. 실제로는 여러 개의 pair를 만들어 prediction을 한다.

다음과 같은 예를 생각해 보자.

"I told my father that I wanted to go to New York."

window size는 2로 설정하면 다음과 같은 pair들이 생긴다. (center word, context word)

(I, told), (I, my)

(told, I), (told my), (told, father)

...

(York, to), (York New)

만약 center word가 'father'이라면 다음 pair들에 대해 prediction과 training을 하면 된다.

(father, told), (father, my), (father, that), (father, wanted)

실제 implement된 코드도 pair를 만드는 것을 볼 수 있다.

# code by Tae Hwan Jung @graykode

batch_size = 1 # mini-batch size
embedding_size = 2 # embedding size

sentences = ["apple banana fruit", "banana orange fruit", "orange banana fruit",
                 "dog cat animal", "cat monkey animal", "monkey dog animal"]

word_sequence = " ".join(sentences).split()
word_list = " ".join(sentences).split()
word_list = list(set(word_list))
word_dict = {w: i for i, w in enumerate(word_list)}
voc_size = len(word_list)

# Make skip gram of one size window
skip_grams = []
for i in range(1, len(word_sequence) - 1):
    target = word_dict[word_sequence[i]]
    context = [word_dict[word_sequence[i - 1]], word_dict[word_sequence[i + 1]]]
    for w in context:
        skip_grams.append([target, w])

전체 코드는 https://github.com/graykode/nlp-tutorial에서 확인할 수 있다.

Loss function

Q1의 형태는 왜 가능한가? Skip-Gram 모델에서는 output word들이 모두 independent하다는 Naive assuption을 가정하기 때문이다. 그렇기 때문에,

$$P(w_{c-m},...,w_{c-1},w_{c+1},...w_{c+m}|w_{c})$$

은 단순한 곱으로 표현될 수 있다.

$$P(w_{c-m}|w_{c}) \times P(w_{c-m+1}|w_{c}) \times \cdots P(w_{c+m}|w_{c})$$

이때 이들을 합으로 나타내기 위해서 -log를 씌우면 합의 형태가 된다. 따라서 여러 pair들이 각각 독립적인 iteration에서 gradient를 얻어서 update해도 문제되지 않는 것이다.

자세한 수식 전개 과정은 cs224n Lecture note 1을 인용한다 :

Word Relationship Task

논문에서 사용한 방법은 다음과 같은 단어 간 미묘한 관계를 함의한 질문을 하는 것이다.

What is the word that is similar to small in the same sense as biggest is similar to big?

이 정답은 smallest라고 생각할 수 있다. 다만, 단어 간 의미 관계를 정확하게 파악하고 있지 않다면 기계가 이를 대답하는 것은 매우 어려울 것이다.

하지만 놀랍게도 이러한 task가 간단한 대수적 연산을 통해 확인될 수 있음이 드러났다.

$X = vector(biggest) - vector(big) + vector(small)$을 계산한 뒤 $X$와 word space에서 cosine distance가 가장 가까운 word vector를 출력하면 답을 얻을 수 있다는 것이다.

이러한 방식으로 다음과 같은 질문 set을 만들 수 있다.

총 5종류의 semantic relationship pair, 9종류의 syntatic relationship pair를 가진 word list를 사람이 만든 다음에 임의의 같은 종류의 두 pair를 뽑아 위와 같은 algebraic operation으로 방대한 양의 question list를 만든다. 이 방법으로 8869개의 semantic, 10675개의 syntactic questions가 만들어졌다.

Results

Table 2 : dimensionality의 크기별 효율성

첫 번째 실험은 어느 정도의 word embedding dimensionality가 가장 효율적인지 판단하는 것이다.

training word 수가 많아질수록, dimensionality가 커질수록 accuracy가 올라감을 볼 수 있다. 다만, training word 개수가 작을 때는 dimensionality가 증가하는 게 큰 의미가 없고, 그 반대의 경우도 마찬가지라서 둘 다 크기를 늘리는 것이 중요한 것을 알 수 있다.

다만 무한정으로 dimensionality를 올리는 것은 computationally expensive하므로 연구진들은 300d 정도를 권장하고 있다.

Table 3 : 논문 제시 모델간 비교

다음은 같은 조건일 때 논문에서 제시한 네 개의 모델(NNLM, RNNLM, CBOW, Skip-Gram)을 비교해 본 결과이다.

CBOW와 Skip-Gram이 NNLM계열 모델보다 월등하게 성능이 좋고, CBOW가 Skip-Gram보다 Syntactic accuracy에서 근소하게 더 좋으나 전체적인 performance는 Skip-Gram이 우세함을 볼 수 있다.

Table 8 : Skip-Gram 모델의 정성적 결과

Table 8은 Skip-Gram model이 word pair relationship task에서 보여준 결과를 정성적으로 파악하기 위한 것이다.

위의 결과를 보면 왼쪽 relationship에 해당하는 오른쪽 example들을 볼 수 있다. 첫 번째 row의 경우에는 국가명 - 수도 pair인데, 모두 정확하게 예측하고 있음을 볼 수 있다. 세 번째 pair는 Florida라서 주도를 대답해야 하는데, 이 경우에도 정확하게 대답한 것을 볼 수 있다.

나머지 결과도 대부분 정확하게 맞춘 것을 볼 수 있다. 단어의 의미를 정확하게 파악하는 word vector를 생성하는데 성공했다고 생각할 수 있는 대목이다.

References

Singh, M. (2020, July 5). Word embedding - Data Science Group, IITR. Medium. Retrieved July 6, 2022, from https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285

Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick. (2016, April 19). Chris McCormick. Retrieved July 6, 2022, from http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

유원준. (2022, February 27). 02) 워드투벡터(Word2Vec). 위키독스. Retrieved July 6, 2022, from https://wikidocs.net/22660

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

G. (n.d.). GitHub - graykode/nlp-tutorial: Natural Language Processing Tutorial for Deep Learning Researchers. GitHub. Retrieved July 6, 2022, from https://github.com/graykode/nlp-tutorial

Chaubard, F., Fang, M., Genthial, G., Mundra, R., & Socher, R. (2019). CS224n: Natural Language Processing with Deep Learning. Stanford University. https://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf

Footnotes

'DL·ML' 카테고리의 다른 글

[GAN] Minibatch Discrimination (0)	2023.01.21
[GAN] mode collapse (0)	2023.01.10
[DL] Hierarchical Softmax (1)	2022.07.01
[ML] Reduced Error Pruning (0)	2022.03.18
[DL] RNN(Recursive Neural Network)의 이해 (1)	2021.12.26

Abstract

Previous Works

1. SVD(Singular Value Decomposition) 기반 모델

2. Neural network 기반 모델

2.1 NNLM

2.2 RNNLM

New Log-linear Models

CBOW (Continuous Bag Of Words)

time complexity

Loss function

Skip-Gram

Q1. 왜 $2m \times v$ matrix가 아니라 $v$ dimensional vector가 결과값인가?

Loss function

Word Relationship Task

Results

Table 2 : dimensionality의 크기별 효율성

Table 3 : 논문 제시 모델간 비교

Table 8 : Skip-Gram 모델의 정성적 결과

References

Footnotes

'DL·ML' 카테고리의 다른 글

티스토리툴바