Word Embedding

Motivation


Summary

Much of the world's knowledge exists as human-readable text, which a computer cannot understand directly. To make use of it, we therefore need a way to "represent" that knowledge in machine-processable form.

Meaning?

Definition of "meaning" (Webster dictionary)

  • the idea that is represented by a word, phrase, etc.
  • the idea that a person wants to express by using words, signs, etc.
  • the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning: signifier (symbol) ⟺ signified (idea or thing).

In other words: denotational semantics.

How do we have usable meaning in a computer?


WordNet

NOTE

previous NLP solution: WordNet (based on synonym & hypernym)

  • synonym: a word with the same or nearly the same meaning
  • hypernym: a superordinate ("is-a") term

Example

Limits



A useful resource, but it misses nuance.

  • "proficient" can sometimes be replaced by "good" depending on context, but not always.

It misses new meanings of words.

  • wicked, badass, nifty, …
  • Keeping it up to date is not easy.

Because it is built by human annotators, it is subjective.

It cannot be used to compute graded word similarity.


One-hot vector

one-hot vector : one-hot encoding

one-hot vector: 1 for the true class, 0 for every other category



Warning

  • Representing data with one-hot vectors forces the data dimension to equal the vocabulary size, which is far too large (and sparse as well).
  • Even similar words such as motel and hotel get orthogonal one-hot representations, so their inner product is 0; there is no reasonable way to judge similarity.

→ Attempted solution: use WordNet's synonym sets! But this fails hopelessly.
대신, “learn to encode similarity in the vectors themselves!”
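A tiny numpy illustration of the problem (the toy vocabulary is made up): distinct one-hot vectors are always orthogonal, so their dot product carries no similarity information.

import numpy as np

vocab = ["motel", "hotel", "banking"]          # toy vocabulary, illustration only
one_hot = np.eye(len(vocab))                   # each row is a word's one-hot vector

motel, hotel = one_hot[0], one_hot[1]
print(motel @ hotel)                           # 0.0 -> orthogonal, no notion of similarity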

Representing words by their context


Summary

Distributional semantics: A word’s meaning is given by the words that frequently appear close-by. (Context = meaning)

  • “You shall know a word by the company it keeps” (J. R. Firth 1957)
  • One of the most successful ideas of modern statistical NLP!

Even for the same word, e.g. “banking”, the meaning depends on the context it appears in.

Word Vectors


Summary

Unlike one-hot vectors, these are dense representations: the embedding dimension does not have to equal the vocabulary size, and as a result the similarity between embeddings can be measured with metrics such as cosine similarity.
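For instance, with dense vectors (random here, purely for illustration, reusing numpy as np from the snippet above) cosine similarity gives a graded score instead of always 0 or 1:

def cosine_similarity(a, b, eps=1e-8):
	return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

v_motel = np.random.randn(64)                  # 64-dim dense embeddings (toy)
v_hotel = v_motel + 0.1 * np.random.randn(64)  # a nearby vector
print(cosine_similarity(v_motel, v_hotel))     # close to 1.0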

Word2vec


Word2vec

NOTE

  • Proposed in word2vec (Mikolov et al. 2013).
  • Trained on a large corpus of text.
  • Every word in a fixed vocabulary is represented by a vector.
  • Notation:
    • $c$: center word
    • $o$: outside word, i.e. a context word within a fixed window size
  • The similarity of the word vectors for $c$ and $o$ is used to compute the conditional probability $P(o \mid c)$.
  • Keep adjusting the word vectors to maximize this probability.

Learning


Overview


Example


Loss definition


Summary

  • Even a single word gets two different vector representations during training: one used when it appears as the center word and one used when it appears as a context word.

  • The probability formula looks like a softmax (written out below).
  • “max”: it amplifies the probability of the largest values.
  • “soft”: it still assigns some probability to smaller values.
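For reference, the standard skip-gram formulation these notes build on (a reconstruction of the formulas that belong here, using the $u$/$v$ notation above):

$$ J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t;\ \theta), \qquad P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} $$

where $v_c$ is the vector of the center word and $u_o$ the vector of the outside word.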

Training


Summary

The optimization target, i.e. the parameters, are the word vector embeddings themselves → for each of the V vocabulary words, one d-dimensional vector × 2 (a center version and a context version).
That is,

$\theta \in \mathbb{R}^{2dV}$, where $d$: vector dimension, $V$: vocabulary size

Question

Why 2 vectors?

  • easy to optimize.
  • average both at the end.

Summary

Computing the gradient of this objective with respect to the word vectors:
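The result of this derivative (reconstructing the standard expression that belongs here) has the "observed minus expected" form:

$$ \frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c)\, u_w $$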

Check

  • The result above is the derivative with respect to $v$; try working out the derivative with respect to $u$ as well.

Tip

Negative sampling helps training: it is efficient compared to the naive softmax.

Model Variants(Type)


Skip-gram (SG)


  • Predicts the context words from the center word.
  • The course focuses on this variant.

Continuous Bag of Words(CBOW)


  • Predicts the center word from the context words.

Negative Sampling


Summary

The probability $P(o \mid c)$ in the objective above has the softmax form shown earlier,

and the sum in its denominator runs over the entire vocabulary, so computing it is expensive.

Therefore, instead of the naive softmax, we use negative sampling.

Important

Main idea: train binary logistic regression for a true pair (center word and a word in its context window) vs. several “noise” pairs (the center word paired with a random word)

Quote

From “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)

  • Take $k$ negative samples (drawn using word occurrence probabilities).
  • Aim: maximize the probability of words actually observed in the context window, and minimize it for the negative-sample terms (the objective is written out below).
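In the paper's standard form, with $\sigma$ the sigmoid and the $k$ negative words drawn from a noise distribution $P(w)$ (a reconstruction for reference):

$$ J_{\text{neg-sample}}(u_o, v_c) = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log \sigma(-u_{w_i}^\top v_c), \qquad w_i \sim P(w) $$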

Tip

$U(w)$, the unigram (occurrence) distribution, is used to draw negative samples, but empirically it works better to sample from $P(w) \propto U(w)^{3/4}$ than from $U(w)$ itself (a small demo follows).

  • Raising to the 3/4 power slightly increases the chance of sampling low-frequency words.
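A quick numpy illustration with made-up unigram counts of how the 3/4 power reshapes the distribution:

import numpy as np

counts = np.array([900., 90., 9., 1.])         # toy unigram counts
U = counts / counts.sum()                      # raw unigram distribution
P = counts ** 0.75
P /= P.sum()                                   # U(w)^(3/4), renormalized
print(U)                                       # [0.9, 0.09, 0.009, 0.001]
print(P)                                       # rare words get noticeably more probability mass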

Stochastic gradients (SGD) with negative sampling


NOTE

There are as many vector representations to update as there are vocabulary words, but a single stochastic update only needs to touch the words currently involved (the window and the negative samples); the rest can be left alone. → the gradient of each update is very sparse.

We might only update the word vectors that “actually appear”!

Solution:

  1. either you need sparse matrix update operations to only update certain rows of full embedding matrices U and V,
  2. or you need to keep around a hash for word vectors

If you have millions of word vectors and do distributed computing, it is important to not have to send gigantic updates around!


Count-based methods


Question

Why not use the full corpus statistics directly?

  • Instead of values learned iteratively as in Word2Vec, couldn't we use, for example, word co-occurrence counts over the entire corpus? (the more traditional approach)

Building a co-occurrence matrix X: two options, windows vs. full documents

  • Window: Similar to word2vec, use window around each word → captures some syntactic and semantic information (“word space”)
  • Word-document co-occurrence matrix will give general topics (all sports terms will have similar entries) leading to “Latent Semantic Analysis” (“document space”)
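A minimal sketch of the window-based option (toy corpus, window size 1; all names are placeholders for illustration):

import numpy as np

corpus = "i like deep learning . i like nlp . i enjoy flying .".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=np.int32)   # co-occurrence counts
for pos, word in enumerate(corpus):
	for offset in range(-window, window + 1):
		ctx = pos + offset
		if offset != 0 and 0 <= ctx < len(corpus):
			X[word_to_id[word], word_to_id[corpus[ctx]]] += 1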

Example

Co-occurrence Vectors


NOTE

Simple count co-occurrence vectors

  • Vectors increase in size with vocabulary

    • M has shape (vocab_size, vocab_size)
  • Very high dimensional: require a lot of storage (though sparse)

  • Subsequent classification models have sparsity issues → Models are less robust

Low-dimensional vectors

  • Idea: store “most” of the important information in a fixed, small number of dimensions → a dense vector
  • Usually 25–1000 dimensions, similar to word2vec
  • How to reduce the dimensionality?
    • SVD (Singular Value Decomposition) works, but it is still expensive (see the sketch below).
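A minimal truncated-SVD sketch, assuming the co-occurrence matrix X built in the snippet above (counts are used raw here; as noted just below, rescaling them first usually works better):

k = 2                                           # target dimensionality
Uv, S, Vt = np.linalg.svd(X.astype(np.float64))
word_vecs_svd = Uv[:, :k] * S[:k]               # k-dimensional dense word vectors
print(word_vecs_svd.shape)                      # (vocab_size, k)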

Example

  • Running SVD on the raw counts gives rather poor results, hence the tip below.

Tip

Check

GloVe


GloVe

Summary

A model that combines the global statistics of count-based methods with the training efficiency of Word2Vec.

Count-based vs Direct prediction


Count-based: LSA, HAL (Lund & Burgess), COALS, Hellinger-PCA (Rohde et al.; Lebret & Collobert)

  • Fast training
  • Efficient usage of statistics
  • Primarily used to capture word similarity
  • Disproportionate importance given to large counts

Direct prediction: Skip-gram/CBOW (Mikolov et al.), NNLM, HLBL, RNN (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton)

  • Scales with corpus size
  • Inefficient usage of statistics
  • Generate improved performance on other tasks
  • Can capture complex patterns beyond word similarity

Important

Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

  • Looking at the ratio $P(x \mid \text{ice}) / P(x \mid \text{steam})$, a word $x$ with a large ratio is more related to ice. Modeling such ratios of co-occurrence probabilities in the vector space is what GloVe does.

How do we design a model that captures this?

Q: How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space (i.e., embedding ratios of co-occurrence probabilities)?

A: Log-bilinear model: $w_i \cdot \tilde{w}_j = \log P(i \mid j)$

with vector differences: $w_x \cdot (w_a - w_b) = \log \dfrac{P(x \mid a)}{P(x \mid b)}$
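This leads to GloVe's weighted least-squares objective over the co-occurrence counts $X_{ij}$ (standard form from the paper), where the weighting function $f$ caps the influence of very large counts:

$$ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 $$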

GloVe


Summary

Combining the count-based and direct-prediction approaches gives:

  • fast training
  • scalable to huge copora
  • good performance even with small corpus and small vectors

Evaluating word vectors


Summary

How can we tell whether a word embedding is good? → This is the general question of evaluation in NLP.

Intrinsic:

  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless correlation to real task is established

Extrinsic:

  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear if the subsystem is the problem, or its interaction with other subsystems

If replacing exactly one subsystem with another improves accuracy ⇒ Winning!

Intrinsic Evaluation


Example

Word vector analogies: evaluate how well vector arithmetic captures relations such as man : woman :: king : queen, i.e. pick the word whose vector is most similar (by cosine similarity) to $x_b - x_a + x_c$.
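A minimal sketch of this analogy test, assuming a word_vecs matrix of shape (vocab_size, dim) and word_to_id / id_to_word mappings (all names are placeholders):

import numpy as np

def analogy(a, b, c, word_vecs, word_to_id, id_to_word, top=1):
	# answer "a : b = c : ?" by maximizing cosine similarity to (x_b - x_a + x_c)
	query = word_vecs[word_to_id[b]] - word_vecs[word_to_id[a]] + word_vecs[word_to_id[c]]
	sims = word_vecs @ query / (np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(query) + 1e-8)
	for w in (a, b, c):
		sims[word_to_id[w]] = -np.inf          # exclude the query words themselves
	return [id_to_word[i] for i in np.argsort(-sims)[:top]]

# e.g. analogy('man', 'woman', 'king', ...) should ideally return ['queen']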

Warning

What if the relationship between the pieces of information is non-linear? (The analogy test assumes a linear structure in the vector space.)

Example

Extrinsic Evaluation


Word Senses


Summary

Even a single word can have many different senses.
All the approaches so far assign one fixed embedding per word, which is not enough.

Example

To cover the multi-sense problem, Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012) assigns several embeddings to a word such as “bank”, as follows.

Algorithm outline (summary by GPT); a rough code sketch follows below.

Step 1: learn base embeddings

  • Generate initial embeddings with plain Word2Vec.

Step 2: collect and cluster context vectors

  • For each occurrence of a word, collect the surrounding window vector (= context mean vector) and cluster these vectors.

Step 3: sense assignment

  • Assign each occurrence a sense ID according to the cluster of its context (e.g. bank₁, bank₂).

Step 4: retraining

  • Re-train Word2Vec on the sense-tagged corpus → multi-sense embeddings.

This is reportedly extended later by the Multi-Sense Skip-gram model of Neelakantan et al., 2014.
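A rough sketch of steps 2 and 3 under stated assumptions: corpus is a plain Python list of word ids, word_vecs is a pretrained embedding matrix, and scikit-learn is available; every name here is a placeholder, not part of the original paper's code.

import numpy as np
from sklearn.cluster import KMeans

def sense_tag(corpus, word_vecs, target_id, window=5, n_senses=2):
	# step 2: context mean vector for every occurrence of the target word
	positions = [i for i, w in enumerate(corpus) if w == target_id]
	ctx_vecs = []
	for pos in positions:
		ctx = corpus[max(0, pos - window):pos] + corpus[pos + 1:pos + 1 + window]
		ctx_vecs.append(np.mean([word_vecs[w] for w in ctx], axis=0))
	# step 3: cluster the context vectors; the cluster id becomes the sense id
	senses = KMeans(n_clusters=n_senses, n_init=10).fit_predict(np.array(ctx_vecs))
	return list(zip(positions, senses))        # e.g. bank -> bank_0 / bank_1 per occurrence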

Implementation


Naive Implementation


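The naive models below rely on numpy plus MatMul and SoftmaxWithLoss layers that are not defined in this note; here is a minimal sketch of them, written only to match the interface the classes below use (params/grads lists, forward/backward):

import numpy as np

class MatMul:
	def __init__(self, W):
		self.params = [W]
		self.grads = [np.zeros_like(W)]
		self.x = None

	def forward(self, x):
		W, = self.params
		self.x = x
		return x @ W

	def backward(self, dout):
		W, = self.params
		self.grads[0][...] = self.x.T @ dout   # write into the shared grad buffer
		return dout @ W.T

def softmax(x):
	x = x - x.max(axis=-1, keepdims=True)
	e = np.exp(x)
	return e / e.sum(axis=-1, keepdims=True)

class SoftmaxWithLoss:
	def __init__(self):
		self.params, self.grads = [], []
		self.y = None                          # softmax output
		self.t = None                          # target labels (indices)

	def forward(self, x, t):
		if t.ndim == 2:                        # accept one-hot targets as well
			t = t.argmax(axis=1)
		self.t = t
		self.y = softmax(x)
		n = x.shape[0]
		return -np.sum(np.log(self.y[np.arange(n), t] + 1e-7)) / n

	def backward(self, dout=1):
		n = self.t.shape[0]
		dx = self.y.copy()
		dx[np.arange(n), self.t] -= 1
		return dx * dout / n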

SimpleCBOW

class SimpleCBOW:
	def __init__(self, vocab_size, hidden_size):
		V, H = vocab_size, hidden_size
		
		# weight init
		W_in = 0.01 * np.random.randn(V, H).astype('f')
		W_out = 0.01 * np.random.randn(H, V).astype('f')
		
		# layer creation
		self.in_layer0 = MatMul(W_in)
		self.in_layer1 = MatMul(W_in)
		self.out_layer = MatMul(W_out)
		self.loss_layer = SoftmaxWithLoss()
		
		# collect parameters and gradients
		layers = [self.in_layer0, self.in_layer1, self.out_layer]
		self.params, self.grads = [], []
		for layer in layers:
			self.params += layer.params
			self.grads += layer.grads
		# for external usage
		self.word_vecs = W_in
		
	def forward(self, contexts, target):
		h0 = self.in_layer0.forward(contexts[:, 0])
		h1 = self.in_layer1.forward(contexts[:, 1])
		h = (h0 + h1) * 0.5
		score = self.out_layer.forward(h)
		loss = self.loss_layer.forward(score, target)
		return loss
		
	def backward(self, dout=1):
		# 'ppp' exercise
		ds = self.loss_layer.backward(dout)
		da = self.out_layer.backward(ds)
		da *= 0.5
		self.in_layer1.backward(da)
		self.in_layer0.backward(da)
		return None

Example

class SimpleSkipGram:
	def __init__(self, vocab_size, hidden_size):
		V, H = vocab_size, hidden_size
		
		# weight init
		W_in = 0.01 * np.random.randn(V, H).astype('f')
		W_out = 0.01 * np.random.randn(H, V).astype('f')
		
		# layer creation
		self.in_layer = MatMul(W_in)
		self.out_layer = MatMul(W_out)
		self.loss_layer1 = SoftmaxWithLoss()
		self.loss_layer2 = SoftmaxWithLoss()
		
		# collect parameters and gradients
		layers = [self.in_layer, self.out_layer]
		self.params, self.grads = [], []
		for layer in layers:
			self.params += layer.params
			self.grads += layer.grads
		
		# for external usage
		self.word_vecs = W_in
		
	def forward(self, contexts, target):
		# 'ppp' exercise
		h = self.in_layer.forward(target)
		s = self.out_layer.forward(h)
		
		# [0,0,0,1,0,1,0,0,0,0]
		l1 = self.loss_layer1.forward(s, contexts[:, 0])
		l2 = self.loss_layer2.forward(s, contexts[:, 1])
		loss = l1 + l2
		return loss
		
	def backward(self, dout=1):
		# 'ppp' exercise
		dl1 = self.loss_layer1.backward(dout)
		dl2 = self.loss_layer2.backward(dout)
		ds = dl1 + dl2
		dh = self.out_layer.backward(ds)
		self.in_layer.backward(dh)
		return None

Summary

CBOW : predict “center” by using context
SG : predict “context” by using center

  • These are naive softmax-based implementations: as the data size, i.e. the vocabulary size, grows, the computation required by the softmax grows in proportion, so in practice some optimization is needed.

Trainer


Example

import time

import numpy as np
import matplotlib.pyplot as plt
 
class Trainer:
	def __init__(self, model, optimizer):
		self.model = model
		self.optimizer = optimizer
		self.loss_list = []
		self.eval_interval = None
		self.current_epoch = 0
		
	def fit(self, x, t, max_epoch=10, batch_size=32, max_grad=None, eval_interval=20):
		data_size = len(x)
		max_iters = data_size // batch_size
		self.eval_interval = eval_interval
		model, optimizer = self.model, self.optimizer
		total_loss, loss_count = 0, 0
		start_time = time.time()
	
		for epoch in range(max_epoch):
			idx = np.random.permutation(np.arange(data_size))
			x, t = x[idx], t[idx]
		
			for iters in range(max_iters):
				batch_x = x[iters*batch_size : (iters+1)*batch_size]
				batch_t = t[iters*batch_size : (iters+1)*batch_size]

				loss = model.forward(batch_x, batch_t)
				model.backward()
				params, grads = remove_duplicate(model.params, model.grads)
				if max_grad is not None:
					clip_grads(grads, max_grad)
				optimizer.update(params, grads)
				total_loss += loss
				loss_count += 1

				# evaluation
				if (eval_interval is not None) and (iters % eval_interval) == 0:
					avg_loss = total_loss / loss_count
					elapsed_time = time.time() - start_time
					print(f'| epoch {self.current_epoch + 1} | iter {iters + 1}/{max_iters} | time {elapsed_time:.2f}s | loss {avg_loss:.4f}')
					self.loss_list.append(float(avg_loss))
					total_loss, loss_count = 0, 0

			self.current_epoch += 1
			
	def plot(self, ylim=None):
		x = np.arange(len(self.loss_list))
		if ylim is not None:
			plt.ylim(*ylim)
		plt.plot(x, self.loss_list, label='train')
		plt.xlabel('iter (x' + str(self.eval_interval) + ')')
		plt.ylabel('loss')
		plt.show()
		
def remove_duplicate(params, grads):
	'''
	Gather shared parameters and add up their gradients.
	'''
	params, grads = params[:], grads[:]  # copy lists

	while True:
		find_fig = False
		L = len(params)

		for i in range(0, L - 1):
			for j in range(i + 1, L):
				# two layers share the exact same parameter object: merge gradients
				if params[i] is params[j]:
					grads[i] += grads[j]
					find_fig = True
					params.pop(j)
					grads.pop(j)
				# transposed weight sharing (e.g. tied input/output weights)
				elif params[i].ndim == 2 and params[j].ndim == 2 and \
						params[i].T.shape == params[j].shape and np.all(params[i].T == params[j]):
					grads[i] += grads[j].T
					find_fig = True
					params.pop(j)
					grads.pop(j)

				if find_fig: break
			if find_fig: break
		if not find_fig: break

	return params, grads
		
def clip_grads(grads, max_norm):
	total_norm = 0
	for grad in grads:
		total_norm += np.sum(grad ** 2)
	total_norm = np.sqrt(total_norm)

	rate = max_norm / (total_norm + 1e-6)
	if rate < 1:
		for grad in grads:
			grad *= rate

Summary

Libraries such as PyTorch and PyTorch Lightning provide this kind of repetitive training-loop code out of the box.
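A minimal usage sketch of the Trainer with SimpleCBOW above; the tiny SGD optimizer, the toy corpus, and the one-hot conversion are assumptions added here purely for illustration:

class SGD:
	def __init__(self, lr=0.1):
		self.lr = lr

	def update(self, params, grads):
		for p, g in zip(params, grads):
			p -= self.lr * g                   # in-place update of the shared weights

# toy corpus of word ids and (left, right) context pairs with window size 1
corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
vocab_size = 7
contexts = np.stack([corpus[:-2], corpus[2:]], axis=1)
target = corpus[1:-1]

one_hot = np.eye(vocab_size, dtype='f')        # SimpleCBOW's MatMul layers expect one-hot input
model = SimpleCBOW(vocab_size, hidden_size=5)
trainer = Trainer(model, SGD(lr=0.1))
trainer.fit(one_hot[contexts], one_hot[target], max_epoch=50, batch_size=3, eval_interval=1)
trainer.plot()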

Embedding


Embedding

class Embedding:
	def __init__(self, W):
		self.params = [W]
		self.grads = [np.zeros_like(W)]
		self.idx = None
		
	def forward(self, idx):
		W, = self.params
		self.idx = idx
		out = W[idx]
		return out
		
	def backward(self, dout):
		dW, = self.grads
		dW[...] = 0
		np.add.at(dW, self.idx, dout)
		return None
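A quick check of why backward uses np.add.at: when the same index appears more than once in a mini-batch, its gradient contributions must accumulate (toy numbers, illustration only):

W = np.arange(12, dtype='f').reshape(4, 3)     # 4-word vocab, 3-dim embeddings
embed = Embedding(W)
out = embed.forward(np.array([0, 2, 0]))       # word 0 appears twice in the batch
embed.backward(np.ones_like(out))
print(embed.grads[0])                          # row 0 accumulated gradient 2, row 2 got 1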

Embedding Dot

import collections
 
class EmbeddingDot:
	def __init__(self, W):
		self.embed = Embedding(W)
		self.params = self.embed.params
		self.grads = self.embed.grads
		self.cache = None
		
	def forward(self, h, idx):
		target_W = self.embed.forward(idx)
		out = np.sum(target_W * h, axis=1) # dot product
		self.cache = (h, target_W)
		return out
		
	def backward(self, dout):
		h, target_W = self.cache
		dout = dout.reshape(dout.shape[0], 1)
		dtarget_W = dout * h
		self.embed.backward(dtarget_W)
		dh = dout * target_W
		return dh

Summary

Word2Vec is built on top of these Embedding layers.

Sampler


UnigramSampler

class UnigramSampler:
	def __init__(self, corpus, power, sample_size):
		self.sample_size = sample_size
		self.vocab_size = None
		self.word_p = None
		
		counts = collections.Counter()
		for word_id in corpus:
			counts[word_id] += 1
			
		vocab_size = len(counts)
		self.vocab_size = vocab_size
		
		self.word_p = np.zeros(vocab_size)	
		for i in range(vocab_size):
			self.word_p[i] = counts[i]
			
		self.word_p = np.power(self.word_p, power)
		self.word_p /= np.sum(self.word_p)
		
	def get_negative_sample(self, target):
		batch_size = target.shape[0]
		negative_sample = np.zeros((batch_size, self.sample_size), dtype=np.int32)
		
		for i in range(batch_size):
			# 'ppp' exercise
			p = self.word_p.copy()
			target_idx = target[i]
			p[target_idx] = 0                  # never sample the positive target itself
			p /= p.sum()

			negative_sample[i, :] = np.random.choice(self.vocab_size, size=self.sample_size, replace=False, p=p)
		return negative_sample
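NegativeSamplingLoss below also relies on a SigmoidWithLoss layer that is not defined in this note; a minimal sketch matching the interface used (binary labels, mean loss over the batch):

class SigmoidWithLoss:
	def __init__(self):
		self.params, self.grads = [], []
		self.y = None                          # sigmoid output
		self.t = None                          # binary labels (1 = real pair, 0 = negative sample)

	def forward(self, x, t):
		self.t = t
		self.y = 1 / (1 + np.exp(-x))
		eps = 1e-7
		loss = -np.sum(t * np.log(self.y + eps) + (1 - t) * np.log(1 - self.y + eps))
		return loss / len(t)

	def backward(self, dout=1):
		return (self.y - self.t) * dout / len(self.t)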

Negative Sampling Loss

class NegativeSamplingLoss:
	def __init__(self, W, corpus, power=0.75, sample_size=5):
		self.sample_size = sample_size
		self.sampler = UnigramSampler(corpus, power, sample_size)
		self.loss_layers = [SigmoidWithLoss() for _ in range(sample_size + 1)]
		self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]
		self.params, self.grads = [], []
		
		for layer in self.embed_dot_layers:
			self.params += layer.params
			self.grads += layer.grads
			
	def forward(self, h, target):
		batch_size = target.shape[0]
		negative_sample = self.sampler.get_negative_sample(target)
		
		# forward for the positive pair
		score = self.embed_dot_layers[0].forward(h, target)
		correct_label = np.ones(batch_size, dtype=np.int32)
		loss = self.loss_layers[0].forward(score, correct_label)
		
		# forward for negative pairs
		negative_label = np.zeros(batch_size, dtype=np.int32)
		
		for i in range(self.sample_size):
			negative_target = negative_sample[:, i]
			score = self.embed_dot_layers[1 + i].forward(h, negative_target)
			loss += self.loss_layers[1 + i].forward(score, negative_label)
			
		return loss
		
	def backward(self, dout=1):
		dh = 0
		for l0, l1 in zip(self.loss_layers, self.embed_dot_layers):
			dscore = l0.backward(dout)
			dh += l1.backward(dscore)
			
		return dh

Summary

Using negative sampling, the loss is computed with far less work than the full softmax would require.

Optimized W2V



CBOW

class CBOW:
	def __init__(self, vocab_size, hidden_size, window_size, corpus):
		V, H = vocab_size, hidden_size
		
		# weight init
		W_in = 0.01 * np.random.randn(V, H).astype('f')
		W_out = 0.01 * np.random.randn(V, H).astype('f')
		
		# layer creation
		self.in_layers = []
		for i in range(2 * window_size):
			# row lookup into W_in (equivalent to one-hot (1, vocab_size) x W_in (vocab_size, hidden_size))
			layer = Embedding(W_in)
			self.in_layers.append(layer)
		self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)
		
		# collect parameters and gradients
		layers = self.in_layers + [self.ns_loss]
		self.params, self.grads = [], []
		for layer in layers:
			self.params += layer.params
			self.grads += layer.grads
			
		# for external usage
		self.word_vecs = W_in
		
	def forward(self, contexts, target):
		h = 0
		for i, layer in enumerate(self.in_layers):
			h += layer.forward(contexts[:, i])
		h *= 1 / len(self.in_layers)
		loss = self.ns_loss.forward(h, target)
		return loss
		
	def backward(self, dout=1):
		dout = self.ns_loss.backward(dout)
		dout *= 1 / len(self.in_layers)
		for layer in self.in_layers:
			layer.backward(dout)
		return None

SG

class SkipGram:
# 'ppp' exercise
	def __init__(self, vocab_size, hidden_size, window_size, corpus):
		V, H = vocab_size, hidden_size
		
		# weight init
		W_in = 0.01 * np.random.randn(V, H).astype('f')
		W_out = 0.01 * np.random.randn(V, H).astype('f')
		
		# layer creation
		self.in_layer = Embedding(W_in)
		self.loss_layers = []
		
		for i in range(2 * window_size):
			layer = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)
			self.loss_layers.append(layer)
			
		# collect parameters and grads
		layers = [self.in_layer] + self.loss_layers
		self.params, self.grads = [], []
		for layer in layers:
			self.params += layer.params
			self.grads += layer.grads
			
		# for external usage
		self.word_vecs = W_in
		
	def forward(self, contexts, target):
		h = self.in_layer.forward(target)
		loss = 0
		for i, layer in enumerate(self.loss_layers):
			loss += layer.forward(h, contexts[:, i])
		return loss
		
	def backward(self, dout = 1):
		dh = 0
		for i, layer in enumerate(self.loss_layers):
			dh += layer.backward(dout)
		self.in_layer.backward(dh)
		return None
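Once trained, the learned word_vecs can be queried with cosine similarity; a small sketch, where id_to_word is an assumed mapping produced during preprocessing:

def most_similar(query_id, word_vecs, id_to_word, top=5):
	q = word_vecs[query_id]
	sims = word_vecs @ q / (np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(q) + 1e-8)
	best = [i for i in np.argsort(-sims) if i != query_id][:top]
	return [(id_to_word[i], float(sims[i])) for i in best]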