Word Embedding

Motivation


Summary

Much of the world's knowledge exists as human-readable text, which a computer cannot understand directly. To make use of it, we therefore need a way to "represent" that knowledge in machine-processable form.

Meaning?

Definition of "meaning" (Webster dictionary)

  • the idea that is represented by a word, phrase, etc.
  • the idea that a person wants to express by using words, signs, etc.
  • the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning: signifier (symbol) ⟺ signified (idea or thing).

In other words: denotational semantics.

How do we have usable meaning in a computer?


WordNet

NOTE

previous NLP solution: WordNet (based on synonym & hypernym)

  • synonym: a word with the same or nearly the same meaning
  • hypernym: a superordinate ("is-a") term

Example

Limits



A useful resource, but it misses nuance.

  • "proficient" can sometimes be replaced by "good" depending on context, but not always.

It misses new meanings of words.

  • wicked, badass, nifty, …
  • Keeping it up to date is not easy.

Because it is built by human annotators, it is subjective.

It cannot be used to compute graded word similarity.


One-hot vector

one-hot vector : one-hot encoding

one-hot vector: 1 for the true class, 0 for every other category



Warning

  • Representing data with one-hot vectors forces the data dimension to equal the vocabulary size, which is far too large (and sparse as well).
  • Even similar words such as motel and hotel get orthogonal one-hot representations, so their inner product is 0; there is no reasonable way to judge similarity.

→ Attempted solution: use WordNet's synonym sets! But this fails hopelessly.
대신, “learn to encode similarity in the vectors themselves!”
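A tiny numpy illustration of the problem (the toy vocabulary is made up): distinct one-hot vectors are always orthogonal, so their dot product carries no similarity information.

import numpy as np

vocab = ["motel", "hotel", "banking"]          # toy vocabulary, illustration only
one_hot = np.eye(len(vocab))                   # each row is a word's one-hot vector

motel, hotel = one_hot[0], one_hot[1]
print(motel @ hotel)                           # 0.0 -> orthogonal, no notion of similarity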

Representing words by their context


Summary

Distributional semantics: A word’s meaning is given by the words that frequently appear close-by. (Context = meaning)

  • “You shall know a word by the company it keeps” (J. R. Firth 1957)
  • One of the most successful ideas of modern statistical NLP!

Even for the same word, e.g. “banking”, the meaning depends on the context it appears in.

Word Vectors


Summary

Unlike one-hot vectors, these are dense representations: the embedding dimension does not have to equal the vocabulary size, and as a result the similarity between embeddings can be measured with metrics such as cosine similarity.
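For instance, with dense vectors (random here, purely for illustration, reusing numpy as np from the snippet above) cosine similarity gives a graded score instead of always 0 or 1:

def cosine_similarity(a, b, eps=1e-8):
	return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

v_motel = np.random.randn(64)                  # 64-dim dense embeddings (toy)
v_hotel = v_motel + 0.1 * np.random.randn(64)  # a nearby vector
print(cosine_similarity(v_motel, v_hotel))     # close to 1.0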

Word2vec


Word2vec

NOTE

  • Proposed in word2vec (Mikolov et al. 2013).
  • Trained on a large corpus of text.
  • Every word in a fixed vocabulary is represented by a vector.
  • Notation:
    • $c$: center word
    • $o$: outside word, i.e. a context word within a fixed window size
  • The similarity of the word vectors for $c$ and $o$ is used to compute the conditional probability $P(o \mid c)$.
  • Keep adjusting the word vectors to maximize this probability.

Learning


Overview


Example


Loss definition


Summary

  • Even a single word gets two different vector representations during training: one used when it appears as the center word and one used when it appears as a context word.

  • The probability formula looks like a softmax (written out below).
  • “max”: it amplifies the probability of the largest values.
  • “soft”: it still assigns some probability to smaller values.
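For reference, the standard skip-gram formulation these notes build on (a reconstruction of the formulas that belong here, using the $u$/$v$ notation above):

$$ J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t;\ \theta), \qquad P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} $$

where $v_c$ is the vector of the center word and $u_o$ the vector of the outside word.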

Training


Summary

The optimization target, i.e. the parameters, are the word vector embeddings themselves → for each of the V vocabulary words, one d-dimensional vector × 2 (a center version and a context version).
That is,

$\theta \in \mathbb{R}^{2dV}$, where $d$: vector dimension, $V$: vocabulary size

Question

Why 2 vectors?

  • easy to optimize.
  • average both at the end.

Summary

Computing the gradient of this objective with respect to the word vectors:
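The result of this derivative (reconstructing the standard expression that belongs here) has the "observed minus expected" form:

$$ \frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c)\, u_w $$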

Check

  • The result above is the derivative with respect to $v$; try working out the derivative with respect to $u$ as well.

Tip

Negative sampling helps training: it is efficient compared to the naive softmax.

Model Variants(Type)


Skip-gram (SG)


  • Predicts the context words from the center word.
  • The course focuses on this variant.

Continuous Bag of Words(CBOW)


  • Predicts the center word from the context words.

Negative Sampling


Summary

The probability $P(o \mid c)$ in the objective above has the softmax form shown earlier,

and the sum in its denominator runs over the entire vocabulary, so computing it is expensive.

Therefore, instead of the naive softmax, we use negative sampling.

Important

Main idea: train binary logistic regression for a true pair (center word and a word in its context window) vs. several “noise” pairs (the center word paired with a random word)

Quote

From “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)

  • Take $k$ negative samples (drawn using word occurrence probabilities).
  • Aim: maximize the probability of words actually observed in the context window, and minimize it for the negative-sample terms (the objective is written out below).
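In the paper's standard form, with $\sigma$ the sigmoid and the $k$ negative words drawn from a noise distribution $P(w)$ (a reconstruction for reference):

$$ J_{\text{neg-sample}}(u_o, v_c) = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log \sigma(-u_{w_i}^\top v_c), \qquad w_i \sim P(w) $$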

Tip

$U(w)$, the unigram (occurrence) distribution, is used to draw negative samples, but empirically it works better to sample from $P(w) \propto U(w)^{3/4}$ than from $U(w)$ itself (a small demo follows).

  • Raising to the 3/4 power slightly increases the chance of sampling low-frequency words.
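A quick numpy illustration with made-up unigram counts of how the 3/4 power reshapes the distribution:

import numpy as np

counts = np.array([900., 90., 9., 1.])         # toy unigram counts
U = counts / counts.sum()                      # raw unigram distribution
P = counts ** 0.75
P /= P.sum()                                   # U(w)^(3/4), renormalized
print(U)                                       # [0.9, 0.09, 0.009, 0.001]
print(P)                                       # rare words get noticeably more probability mass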

Stochastic gradients (SGD) with negative sampling


NOTE

There are as many vector representations to update as there are vocabulary words, but a single stochastic update only needs to touch the words currently involved (the window and the negative samples); the rest can be left alone. → the gradient of each update is very sparse.

We might only update the word vectors that “actually appear”!

Solution:

  1. either you need sparse matrix update operations to only update certain rows of full embedding matrices U and V,
  2. or you need to keep around a hash for word vectors

If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!


Count-based methods


Question

Why not use the full corpus statistics directly?

  • Instead of values learned iteratively as in Word2Vec, couldn't we use, for example, word co-occurrence counts over the entire corpus? (the more traditional approach)

Building a co-occurrence matrix X: two options, windows vs. full documents

  • Window: Similar to word2vec, use window around each word → captures some syntactic and semantic information (“word space”)
  • Word-document co-occurrence matrix will give general topics (all sports terms will have similar entries) leading to “Latent Semantic Analysis” (“document space”)
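A minimal sketch of the window-based option (toy corpus, window size 1; all names are placeholders for illustration):

import numpy as np

corpus = "i like deep learning . i like nlp . i enjoy flying .".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=np.int32)   # co-occurrence counts
for pos, word in enumerate(corpus):
	for offset in range(-window, window + 1):
		ctx = pos + offset
		if offset != 0 and 0 <= ctx < len(corpus):
			X[word_to_id[word], word_to_id[corpus[ctx]]] += 1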

Example

Co-occurrence Vectors


NOTE

Simple count co-occurrence vectors

  • Vectors increase in size with vocabulary

    • M has shape (vocab_size, vocab_size)
  • Very high dimensional: require a lot of storage (though sparse)

  • Subsequent classification models have sparsity issues → Models are less robust

Low-dimensional vectors

  • Idea: store “most” of the important information in a fixed, small number of dimensions → a dense vector
  • Usually 25–1000 dimensions, similar to word2vec
  • How to reduce the dimensionality?
    • SVD (Singular Value Decomposition) works, but it is still expensive (see the sketch below).
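A minimal truncated-SVD sketch, assuming the co-occurrence matrix X built in the snippet above (counts are used raw here; as noted just below, rescaling them first usually works better):

k = 2                                           # target dimensionality
Uv, S, Vt = np.linalg.svd(X.astype(np.float64))
word_vecs_svd = Uv[:, :k] * S[:k]               # k-dimensional dense word vectors
print(word_vecs_svd.shape)                      # (vocab_size, k)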

Example

  • Running SVD on the raw counts gives rather poor results, hence the tip below.

Tip

Check

GloVe


GloVe

Summary

A model that combines the global statistics of count-based methods with the training efficiency of Word2Vec.

Count-based vs Direct prediction


Count-based: LSA, HAL (Lund & Burgess), COALS, Hellinger-PCA (Rohde et al.; Lebret & Collobert)

  • Fast training
  • Efficient usage of statistics
  • Primarily used to capture word similarity
  • Disproportionate importance given to large counts

Direct prediction: Skip-gram/CBOW (Mikolov et al.), NNLM, HLBL, RNN (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton)

  • Scales with corpus size
  • Inefficient usage of statistics
  • Generate improved performance on other tasks
  • Can capture complex patterns beyond word similarity

Important

Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

  • Looking at the ratio $P(x \mid \text{ice}) / P(x \mid \text{steam})$, a word $x$ with a large ratio is more related to ice. Modeling such ratios of co-occurrence probabilities in the vector space is what GloVe does.

How do we design a model that captures this?

Q: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space (i.e., embedding ratios of co-occurrence probabilities)?

A: Log-bilinear model: $w_i \cdot \tilde{w}_j = \log P(i \mid j)$

with vector differences: $w_x \cdot (w_a - w_b) = \log \dfrac{P(x \mid a)}{P(x \mid b)}$
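This leads to GloVe's weighted least-squares objective over the co-occurrence counts $X_{ij}$ (standard form from the paper), where the weighting function $f$ caps the influence of very large counts:

$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$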

GloVe


Summary

Combining the count-based and direct-prediction approaches gives:

  • fast training
  • scalable to huge copora
  • good performance even with small corpus and small vectors

Evaluating word vectors


Summary

How can we tell whether a word embedding is good? → This is the general question of evaluation in NLP.

Intrinsic:

  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless correlation to real task is established

Extrinsic:

  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear if the subsystem is the problem, or its interaction with other subsystems

If replacing exactly one subsystem with another improves accuracy ⇒ Winning!

Intrinsic Evaluation


Example

Word vector analogies: evaluate how well vector arithmetic captures relations such as man : woman :: king : queen, i.e. pick the word whose vector is most similar (by cosine similarity) to $x_b - x_a + x_c$.
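A minimal sketch of this analogy test, assuming a word_vecs matrix of shape (vocab_size, dim) and word_to_id / id_to_word mappings (all names are placeholders):

import numpy as np

def analogy(a, b, c, word_vecs, word_to_id, id_to_word, top=1):
	# answer "a : b = c : ?" by maximizing cosine similarity to (x_b - x_a + x_c)
	query = word_vecs[word_to_id[b]] - word_vecs[word_to_id[a]] + word_vecs[word_to_id[c]]
	sims = word_vecs @ query / (np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(query) + 1e-8)
	for w in (a, b, c):
		sims[word_to_id[w]] = -np.inf          # exclude the query words themselves
	return [id_to_word[i] for i in np.argsort(-sims)[:top]]

# e.g. analogy('man', 'woman', 'king', ...) should ideally return ['queen']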

Warning

What if the relationship between the pieces of information is non-linear? (The analogy test assumes a linear structure in the vector space.)

Example

Extrinsic Evaluation


Word Senses


Summary

Even a single word can have many different senses.
All the approaches so far assign one fixed embedding per word, which is not enough.

Example

To cover the multi-sense problem, Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012) assigns several embeddings to a word such as “bank”, as follows.

Algorithm outline (summary by GPT); a rough code sketch follows below.

Step 1: learn base embeddings

  • Generate initial embeddings with plain Word2Vec.

Step 2: collect and cluster context vectors

  • For each occurrence of a word, collect the surrounding window vector (= context mean vector) and cluster these vectors.

Step 3: sense assignment

  • Assign each occurrence a sense ID according to the cluster of its context (e.g. bank₁, bank₂).

Step 4: retraining

  • Re-train Word2Vec on the sense-tagged corpus → multi-sense embeddings.

This is reportedly extended later by the Multi-Sense Skip-gram model of Neelakantan et al., 2014.
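A rough sketch of steps 2 and 3 under stated assumptions: corpus is a plain Python list of word ids, word_vecs is a pretrained embedding matrix, and scikit-learn is available; every name here is a placeholder, not part of the original paper's code.

import numpy as np
from sklearn.cluster import KMeans

def sense_tag(corpus, word_vecs, target_id, window=5, n_senses=2):
	# step 2: context mean vector for every occurrence of the target word
	positions = [i for i, w in enumerate(corpus) if w == target_id]
	ctx_vecs = []
	for pos in positions:
		ctx = corpus[max(0, pos - window):pos] + corpus[pos + 1:pos + 1 + window]
		ctx_vecs.append(np.mean([word_vecs[w] for w in ctx], axis=0))
	# step 3: cluster the context vectors; the cluster id becomes the sense id
	senses = KMeans(n_clusters=n_senses, n_init=10).fit_predict(np.array(ctx_vecs))
	return list(zip(positions, senses))        # e.g. bank -> bank_0 / bank_1 per occurrence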

Implementation


Naive Implementation


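The naive models below rely on numpy plus MatMul and SoftmaxWithLoss layers that are not defined in this note; here is a minimal sketch of them, written only to match the interface the classes below use (params/grads lists, forward/backward):

import numpy as np

class MatMul:
	def __init__(self, W):
		self.params = [W]
		self.grads = [np.zeros_like(W)]
		self.x = None

	def forward(self, x):
		W, = self.params
		self.x = x
		return x @ W

	def backward(self, dout):
		W, = self.params
		self.grads[0][...] = self.x.T @ dout   # write into the shared grad buffer
		return dout @ W.T

def softmax(x):
	x = x - x.max(axis=-1, keepdims=True)
	e = np.exp(x)
	return e / e.sum(axis=-1, keepdims=True)

class SoftmaxWithLoss:
	def __init__(self):
		self.params, self.grads = [], []
		self.y = None                          # softmax output
		self.t = None                          # target labels (indices)

	def forward(self, x, t):
		if t.ndim == 2:                        # accept one-hot targets as well
			t = t.argmax(axis=1)
		self.t = t
		self.y = softmax(x)
		n = x.shape[0]
		return -np.sum(np.log(self.y[np.arange(n), t] + 1e-7)) / n

	def backward(self, dout=1):
		n = self.t.shape[0]
		dx = self.y.copy()
		dx[np.arange(n), self.t] -= 1
		return dx * dout / n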

SimpleCBOW

class SimpleCBOW:
	def __init__(self, vocab_size, hidden_size):
		V, H = vocab_size, hidden_size
		
		# weight init
		W_in = 0.01 * np.random.randn(V, H).astype('f')
		W_out = 0.01 * np.random.randn(H, V).astype('f')
		
		# layer creation
		self.in_layer0 = MatMul(W_in)
		self.in_layer1 = MatMul(W_in)
		self.out_layer = MatMul(W_out)
		self.loss_layer = SoftmaxWithLoss()
		
		# collect parameters and gradients
		layers = [self.in_layer0, self.in_layer1, self.out_layer]
		self.params, self.grads = [], []
		for layer in layers:
			self.params += layer.params
			self.grads += layer.grads
		# for external usage
		self.word_vecs = W_in
		
	def forward(self, contexts, target):
		h0 = self.in_layer0.forward(contexts[:, 0])
		h1 = self.in_layer1.forward(contexts[:, 1])
		h = (h0 + h1) * 0.5
		score = self.out_layer.forward(h)
		loss = self.loss_layer.forward(score, target)
		return loss
		
	def backward(self, dout=1):
		# 'ppp' exercise
		ds = self.loss_layer.backward(dout)
		da = self.out_layer.backward(ds)
		da *= 0.5
		self.in_layer1.backward(da)
		self.in_layer0.backward(da)
		return None

Example

class SimpleSkipGram:
	def __init__(self, vocab_size, hidden_size):
		V, H = vocab_size, hidden_size
		
		# weight init
		W_in = 0.01 * np.random.randn(V, H).astype('f')
		W_out = 0.01 * np.random.randn(H, V).astype('f')
		
		# layer creation
		self.in_layer = MatMul(W_in)
		self.out_layer = MatMul(W_out)
		self.loss_layer1 = SoftmaxWithLoss()
		self.loss_layer2 = SoftmaxWithLoss()
		
		# collect parameters and gradients
		layers = [self.in_layer, self.out_layer]
		self.params, self.grads = [], []
		for layer in layers:
			self.params += layer.params
			self.grads += layer.grads
		
		# for external usage
		self.word_vecs = W_in
		
	def forward(self, contexts, target):
		# 'ppp' exercise
		h = self.in_layer.forward(target)
		s = self.out_layer.forward(h)
		
		# [0,0,0,1,0,1,0,0,0,0]
		l1 = self.loss_layer1.forward(s, contexts[:, 0])
		l2 = self.loss_layer2.forward(s, contexts[:, 1])
		loss = l1 + l2
		return loss
		
	def backward(self, dout=1):
		# 'ppp' exercise
		dl1 = self.loss_layer1.backward(dout)
		dl2 = self.loss_layer2.backward(dout)
		ds = dl1 + dl2
		dh = self.out_layer.backward(ds)
		self.in_layer.backward(dh)
		return None

Summary

CBOW : predict “center” by using context
SG : predict “context” by using center

  • These are naive softmax-based implementations: as the data size, i.e. the vocabulary size, grows, the computation required by the softmax grows in proportion, so in practice some optimization is needed.

Trainer


Example

import time

import numpy as np
import matplotlib.pyplot as plt
 
class Trainer:
	def __init__(self, model, optimizer):
		self.model = model
		self.optimizer = optimizer
		self.loss_list = []
		self.eval_interval = None
		self.current_epoch = 0
		
	def fit(self, x, t, max_epoch=10, batch_size=32, max_grad=None, eval_interval=20):
		data_size = len(x)
		max_iters = data_size // batch_size
		self.eval_interval = eval_interval
		model, optimizer = self.model, self.optimizer
		total_loss, loss_count = 0, 0
		start_time = time.time()
	
		for epoch in range(max_epoch):
			idx = np.random.permutation(np.arange(data_size))
			x, t = x[idx], t[idx]
		
			for iters in range(max_iters):
				batch_x = x[iters*batch_size : (iters+1)*batch_size]
				batch_t = t[iters*batch_size : (iters+1)*batch_size]

				loss = model.forward(batch_x, batch_t)
				model.backward()
				params, grads = remove_duplicate(model.params, model.grads)
				if max_grad is not None:
					clip_grads(grads, max_grad)
				optimizer.update(params, grads)
				total_loss += loss
				loss_count += 1

				# evaluation
				if (eval_interval is not None) and (iters % eval_interval) == 0:
					avg_loss = total_loss / loss_count
					elapsed_time = time.time() - start_time
					print(f'| epoch {self.current_epoch + 1} | iter {iters + 1}/{max_iters} | time {elapsed_time:.2f}s | loss {avg_loss:.4f}')
					self.loss_list.append(float(avg_loss))
					total_loss, loss_count = 0, 0

			self.current_epoch += 1
			
	def plot(self, ylim=None):
		x = np.arange(len(self.loss_list))
		if ylim is not None:
			plt.ylim(*ylim)
		plt.plot(x, self.loss_list, label='train')
		plt.xlabel('iter (x' + str(self.eval_interval) + ')')
		plt.ylabel('loss')
		plt.show()
		
def remove_duplicate(params, grads):
	'''
	Gather shared parameters and add up their gradients.
	'''
	params, grads = params[:], grads[:]  # copy lists

	while True:
		find_fig = False
		L = len(params)

		for i in range(0, L - 1):
			for j in range(i + 1, L):
				# two layers share the exact same parameter object: merge gradients
				if params[i] is params[j]:
					grads[i] += grads[j]
					find_fig = True
					params.pop(j)
					grads.pop(j)
				# transposed weight sharing (e.g. tied input/output weights)
				elif params[i].ndim == 2 and params[j].ndim == 2 and \
						params[i].T.shape == params[j].shape and np.all(params[i].T == params[j]):
					grads[i] += grads[j].T
					find_fig = True
					params.pop(j)
					grads.pop(j)

				if find_fig: break
			if find_fig: break
		if not find_fig: break

	return params, grads
		
def clip_grads(grads, max_norm):
	total_norm = 0
	for grad in grads:
		total_norm += np.sum(grad ** 2)
	total_norm = np.sqrt(total_norm)

	rate = max_norm / (total_norm + 1e-6)
	if rate < 1:
		for grad in grads:
			grad *= rate

Summary

Libraries such as PyTorch and PyTorch Lightning provide this kind of repetitive training-loop code out of the box.
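A minimal usage sketch of the Trainer with SimpleCBOW above; the tiny SGD optimizer, the toy corpus, and the one-hot conversion are assumptions added here purely for illustration:

class SGD:
	def __init__(self, lr=0.1):
		self.lr = lr

	def update(self, params, grads):
		for p, g in zip(params, grads):
			p -= self.lr * g                   # in-place update of the shared weights

# toy corpus of word ids and (left, right) context pairs with window size 1
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
vocab_size = 7
contexts = np.stack([corpus[:-2], corpus[2:]], axis=1)
target = corpus[1:-1]

one_hot = np.eye(vocab_size, dtype='f')        # SimpleCBOW's MatMul layers expect one-hot input
model = SimpleCBOW(vocab_size, hidden_size=5)
trainer = Trainer(model, SGD(lr=0.1))
trainer.fit(one_hot[contexts], one_hot[target], max_epoch=50, batch_size=3, eval_interval=1)
trainer.plot()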

Embedding


Embedding

class Embedding:
	def __init__(self, W):
		self.params = [W]
		self.grads = [np.zeros_like(W)]
		self.idx = None
		
	def forward(self, idx):
		W, = self.params
		self.idx = idx
		out = W[idx]
		return out
		
	def backward(self, dout):
		dW, = self.grads
		dW[...] = 0
		np.add.at(dW, self.idx, dout)
		return None
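A quick check of why backward uses np.add.at: when the same index appears more than once in a mini-batch, its gradient contributions must accumulate (toy numbers, illustration only):

W = np.arange(12, dtype='f').reshape(4, 3)     # 4-word vocab, 3-dim embeddings
embed = Embedding(W)
out = embed.forward(np.array([0, 2, 0]))       # word 0 appears twice in the batch
embed.backward(np.ones_like(out))
print(embed.grads[0])                          # row 0 accumulated gradient 2, row 2 got 1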

Embedding Dot

import collections
 
class EmbeddingDot:
	def __init__(self, W):
		self.embed = Embedding(W)
		self.params = self.embed.params
		self.grads = self.embed.grads
		self.cache = None
		
	def forward(self, h, idx):
		target_W = self.embed.forward(idx)
		out = np.sum(target_W * h, axis=1) # dot product
		self.cache = (h, target_W)
		return out
		
	def backward(self, dout):
		h, target_W = self.cache
		dout = dout.reshape(dout.shape[0], 1)
		dtarget_W = dout * h
		self.embed.backward(dtarget_W)
		dh = dout * target_W
		return dh

Summary

Word2Vec is built on top of these Embedding layers.

Sampler


UnigramSampler

class UnigramSampler:
	def __init__(self, corpus, power, sample_size):
		self.sample_size = sample_size
		self.vocab_size = None
		self.word_p = None
		
		counts = collections.Counter()
		for word_id in corpus:
			counts[word_id] += 1
			
		vocab_size = len(counts)
		self.vocab_size = vocab_size
		
		self.word_p = np.zeros(vocab_size)	
		for i in range(vocab_size):
			self.word_p[i] = counts[i]
			
		self.word_p = np.power(self.word_p, power)
		self.word_p /= np.sum(self.word_p)
		
	def get_negative_sample(self, target):
		batch_size = target.shape[0]
		negative_sample = np.zeros((batch_size, self.sample_size), dtype=np.int32)
		
		for i in range(batch_size):
			# 'ppp' exercise
			p = self.word_p.copy()
			target_idx = target[i]
			p[target_idx] = 0                  # never sample the positive target itself
			p /= p.sum()

			negative_sample[i, :] = np.random.choice(self.vocab_size, size=self.sample_size, replace=False, p=p)
		return negative_sample
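NegativeSamplingLoss below also relies on a SigmoidWithLoss layer that is not defined in this note; a minimal sketch matching the interface used (binary labels, mean loss over the batch):

class SigmoidWithLoss:
	def __init__(self):
		self.params, self.grads = [], []
		self.y = None                          # sigmoid output
		self.t = None                          # binary labels (1 = real pair, 0 = negative sample)

	def forward(self, x, t):
		self.t = t
		self.y = 1 / (1 + np.exp(-x))
		eps = 1e-7
		loss = -np.sum(t * np.log(self.y + eps) + (1 - t) * np.log(1 - self.y + eps))
		return loss / len(t)

	def backward(self, dout=1):
		return (self.y - self.t) * dout / len(self.t)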

Negative Sampling Loss

class NegativeSamplingLoss:
	def __init__(self, W, corpus, power=0.75, sample_size=5):
		self.sample_size = sample_size
		self.sampler = UnigramSampler(corpus, power, sample_size)
		self.loss_layers = [SigmoidWithLoss() for _ in range(sample_size + 1)]
		self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]
		self.params, self.grads = [], []
		
		for layer in self.embed_dot_layers:
			self.params += layer.params
			self.grads += layer.grads
			
	def forward(self, h, target):
		batch_size = target.shape[0]
		negative_sample = self.sampler.get_negative_sample(target)
		
		# forward for the positive pair
		score = self.embed_dot_layers[0].forward(h, target)
		correct_label = np.ones(batch_size, dtype=np.int32)
		loss = self.loss_layers[0].forward(score, correct_label)
		
		# forward for negative pairs
		negative_label = np.zeros(batch_size, dtype=np.int32)
		
		for i in range(self.sample_size):
			negative_target = negative_sample[:, i]
			score = self.embed_dot_layers[1 + i].forward(h, negative_target)
			loss += self.loss_layers[1 + i].forward(score, negative_label)
			
		return loss
		
	def backward(self, dout=1):
		dh = 0
		for l0, l1 in zip(self.loss_layers, self.embed_dot_layers):
			dscore = l0.backward(dout)
			dh += l1.backward(dscore)
			
		return dh

Summary

Using negative sampling, the loss is computed with far less work than the full softmax would require.

Optimized W2V



CBOW

class CBOW:
	def __init__(self, vocab_size, hidden_size, window_size, corpus):
		V, H = vocab_size, hidden_size
		
		# weight init
		W_in = 0.01 * np.random.randn(V, H).astype('f')
		W_out = 0.01 * np.random.randn(V, H).astype('f')
		
		# layer creation
		self.in_layers = []
		for i in range(2 * window_size):
			# row lookup into W_in (equivalent to one-hot (1, vocab_size) x W_in (vocab_size, hidden_size))
			layer = Embedding(W_in)
			self.in_layers.append(layer)
		self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)
		
		# collect parameters and gradients
		layers = self.in_layers + [self.ns_loss]
		self.params, self.grads = [], []
		for layer in layers:
			self.params += layer.params
			self.grads += layer.grads
			
		# for external usage
		self.word_vecs = W_in
		
	def forward(self, contexts, target):
		h = 0
		for i, layer in enumerate(self.in_layers):
			h += layer.forward(contexts[:, i])
		h *= 1 / len(self.in_layers)
		loss = self.ns_loss.forward(h, target)
		return loss
		
	def backward(self, dout=1):
		dout = self.ns_loss.backward(dout)
		dout *= 1 / len(self.in_layers)
		for layer in self.in_layers:
			layer.backward(dout)
		return None

SG

class SkipGram:
# 'ppp' exercise
	def __init__(self, vocab_size, hidden_size, window_size, corpus):
		V, H = vocab_size, hidden_size
		
		# weight init
		W_in = 0.01 * np.random.randn(V, H).astype('f')
		W_out = 0.01 * np.random.randn(V, H).astype('f')
		
		# layer creation
		self.in_layer = Embedding(W_in)
		self.loss_layers = []
		
		for i in range(2 * window_size):
			layer = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)
			self.loss_layers.append(layer)
			
		# collect parameters and grads
		layers = [self.in_layer] + self.loss_layers
		self.params, self.grads = [], []
		for layer in layers:
			self.params += layer.params
			self.grads += layer.grads
			
		# for external usage
		self.word_vecs = W_in
		
	def forward(self, contexts, target):
		h = self.in_layer.forward(target)
		loss = 0
		for i, layer in enumerate(self.loss_layers):
			loss += layer.forward(h, contexts[:, i])
		return loss
		
	def backward(self, dout = 1):
		dh = 0
		for i, layer in enumerate(self.loss_layers):
			dh += layer.backward(dout)
		self.in_layer.backward(dh)
		return None
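Once trained, the learned word_vecs can be queried with cosine similarity; a small sketch, where id_to_word is an assumed mapping produced during preprocessing:

def most_similar(query_id, word_vecs, id_to_word, top=5):
	q = word_vecs[query_id]
	sims = word_vecs @ q / (np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(q) + 1e-8)
	best = [i for i in np.argsort(-sims) if i != query_id][:top]
	return [(id_to_word[i], float(sims[i])) for i in best]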