General Attention (Cross-Attention)

General Attention (scaled dot-product attention)

Attention block in Image Captioning

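After passing through the alignment layer, the scores go through a softmax to obtain the attention map.

The attention map says how much attention to pay to each region, so it is multiplied element-wise with the original features and the result is summed.

The output of this step is the context vector.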
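The steps above can be sketched in NumPy. This is only an illustration: the sizes, the random features, and the use of a plain dot product in place of a learned alignment layer are all assumptions, not the captioning model's actual layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes: 9 image regions (e.g. a 3x3 feature grid), 4-dim features.
features = rng.normal(size=(9, 4))   # one feature vector per image region
hidden = rng.normal(size=(4,))       # current decoder hidden state

# Alignment scores: a dot product stands in for the alignment layer (assumption).
scores = features @ hidden           # shape (9,)

# Attention map: a probability distribution over the regions.
attn_map = softmax(scores)           # shape (9,), sums to 1

# Context vector: weight each region's features by its attention, then sum.
context = (attn_map[:, None] * features).sum(axis=0)   # shape (4,)
```

The element-wise multiply followed by a sum is the same as the matrix product `attn_map @ features`.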

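This can be generalized and drawn as below.

$f_{att}$ can simply be a dot product.

If scaling (smoothing) is applied to this dot-product attention, we get scaled dot-product attention.

That is, $f_{att}$ takes the dot product and divides it by $\sqrt{d}$, where $d$ is the dimension of the vectors being compared: $f_{att}(q, k) = \frac{q \cdot k}{\sqrt{d}}$.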

Benefits:

It keeps a single value in the probability distribution from becoming extremely large, and

it improves numerical stability in the softmax.
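A small numerical sketch of the first benefit, under assumed sizes (`d = 512`, five keys, random vectors). It compares the softmax of raw dot products with the softmax of dot products divided by $\sqrt{d}$, the usual scaled dot-product choice.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 512                           # hypothetical vector dimension
q = rng.normal(size=(d,))         # one query
keys = rng.normal(size=(5, d))    # five keys

raw = keys @ q                    # dot products; their spread grows with d
scaled = raw / np.sqrt(d)         # scaling brings the spread back to about 1

p_raw = softmax(raw)              # tends to put almost all mass on one key
p_scaled = softmax(scaled)        # smoother distribution over the keys
```

Because dividing the logits by a constant larger than 1 can only flatten a softmax, `p_scaled.max()` is never larger than `p_raw.max()`.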

General Attention (toward QKV attention)

Query expansion

The same input vectors are used both when computing the alignment and when computing the attention output. Do they really have to be the same?
→ Decompose them into K and V matrices.

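Key-Value expansion

The input vectors used for the context computation and for the alignment computation (the one that produces the attention map) are each passed through an MLP first. This compresses them and lets the two roles differ, which increases expressivity.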
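A minimal sketch of this key-value split, assuming a single linear layer per projection instead of a full MLP. The function name `qkv_attention`, the weight names `W_k` and `W_v`, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def qkv_attention(queries, inputs, W_k, W_v):
    """Attend from `queries` over `inputs` with separate key/value projections."""
    K = inputs @ W_k                                  # (n, d_k): used only for alignment
    V = inputs @ W_v                                  # (n, d_v): used only for the context
    scores = queries @ K.T / np.sqrt(K.shape[-1])     # (m, n) scaled dot products
    weights = softmax(scores, axis=-1)                # attention map, one row per query
    return weights @ V, weights                       # context vectors (m, d_v)

rng = np.random.default_rng(0)
n, d_in, d_k, d_v, m = 6, 8, 4, 5, 2                  # assumed sizes
inputs = rng.normal(size=(n, d_in))
queries = rng.normal(size=(m, d_k))
W_k = rng.normal(size=(d_in, d_k))                    # compresses inputs into keys
W_v = rng.normal(size=(d_in, d_v))                    # compresses inputs into values

context, weights = qkv_attention(queries, inputs, W_k, W_v)
```

Keys and values can now have different dimensions from the input and from each other, which is the extra freedom the text refers to.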

Self-Attention

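Let's take Q directly from the input as well!

Self-attention of this kind is also indifferent to order. Strictly speaking it is permutation equivariant: permuting the input tokens permutes the outputs in the same way, so the layer itself carries no information about token order.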

Warning

The problem is that this indifference to order may be undesirable in NLP.
In general, changing the order of the words changes the meaning.
→ Introduce Positional Encoding.
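The order issue can be checked numerically. In the sketch below, shuffling the input tokens only shuffles the output rows of self-attention; once a positional encoding is added, the shuffled and unshuffled results no longer match. The sinusoidal encoding is one common choice and is used here as an assumption, as are the sizes, the fixed permutation, and the random weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Q, K, and V are all projections of the same input X.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def sinusoidal_positions(n, d):
    # Sine on even feature indices, cosine on odd ones (assumed encoding).
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([2, 0, 4, 1, 3])     # a fixed shuffle of the token order

# Without positions: shuffled input gives the same outputs, just shuffled.
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)
same_up_to_order = np.allclose(out_perm, out[perm])

# With positions: each slot gets a different encoding, so order now matters.
pe = sinusoidal_positions(n, d)
out_pe = self_attention(X + pe, W_q, W_k, W_v)
out_pe_perm = self_attention(X[perm] + pe, W_q, W_k, W_v)
order_matters = not np.allclose(out_pe_perm, out_pe[perm])
```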

Masked Attention

NOTE

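When attention is applied while training a language model, there are cases where the query at the current time step should be prevented from attending to positions that come after it.

For example, when training a model such as GPT on the next-token prediction (NTP) task, attention should use only the tokens up to the current one to predict the next token. Looking ahead at later tokens before predicting would leak the answer, and it is also somewhat different from how people process information.

Attention that applies a suitable mask for this purpose is called masked attention.

$-\infty$ is placed at the future-token positions of the scores so that those positions receive a weight of 0 in the attention map.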
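A minimal sketch of the masking step, assuming single-head attention over randomly generated Q, K, and V. The function name `causal_self_attention` and the sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # exp(-inf) evaluates to 0, so masked positions get zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (n, n)
    # True strictly above the diagonal: column j > row i means a future token.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)         # block future positions
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8                                            # assumed sizes
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out, weights = causal_self_attention(Q, K, V)
```

Every row keeps at least its own diagonal entry unmasked, so the softmax is always well defined; the first token can only attend to itself.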

Multi-head self-Attention

NOTE

It can also be advantageous for tasks such as one-to-many.

At the very least, expressivity should increase.
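One way to picture multi-head self-attention is as several smaller attention computations run in parallel on slices of the projected input, then concatenated and mixed. The sketch below assumes that layout; the names `W_q`, `W_k`, `W_v`, `W_o` and the sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    weights = softmax(scores, axis=-1)                     # one attention map per head
    heads = weights @ V                                    # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # join the heads again
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2                            # assumed sizes
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Each head produces its own attention map, so different heads are free to focus on different positions for the same token.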

Comparison between self-attention & general-attention

Summary

Self-attention generates Q, K, and V all from the same input, whereas in general attention they may come from different sources.

In the Transformer:

Encoder-decoder attention is one example of general attention.

Q: comes from the decoder

K, V: come from the encoder

Attention inside the encoder is pure self-attention.

Q, K, V: all are computed from the encoder input or from the output of the previous encoder layer.
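The difference comes down to where Q, K, and V are computed from. The sketch below runs the same attention function twice, once with everything taken from hypothetical encoder states and once with Q taken from hypothetical decoder states. Sizes and weights are assumptions, and a real Transformer layer adds multiple heads, residual connections, and normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d = 8
enc = rng.normal(size=(7, d))   # encoder states: 7 source tokens (assumed)
dec = rng.normal(size=(3, d))   # decoder states: 3 target tokens (assumed)

# Separate projection weights for the two attention layers.
Wq_s, Wk_s, Wv_s = (rng.normal(size=(d, d)) for _ in range(3))
Wq_c, Wk_c, Wv_c = (rng.normal(size=(d, d)) for _ in range(3))

# Self-attention in the encoder: Q, K, V all come from the encoder states.
self_out = attention(enc @ Wq_s, enc @ Wk_s, enc @ Wv_s)     # (7, d)

# Encoder-decoder attention: Q from the decoder, K and V from the encoder.
cross_out = attention(dec @ Wq_c, enc @ Wk_c, enc @ Wv_c)    # (3, d)
```

The output always has one row per query, which is why `cross_out` follows the decoder length while `self_out` follows the encoder length.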

Examples

Example

Juhyeon's Blog
