Softmax

Summary

일반적으로 multiclass classification의 output layer로 사용되어 각 클래스 별 확률 값을 반환하는 용도로 자주 사용된다.

softmax implementation-2(stable)
softmax는 구성 식에 exp 함수가 들어가는데, logit이 커지면 값이 폭발적으로 증가하므로, 범위를 줄여줄 필요가 있다. 따라서 아래와 같이 구현하면 좋다.
def softmax_stable(x: list[float]) -> list[float]:
	c = np.max(x)
	exp_x = np.exp(x-c)
	sum_exp_x = exp_x.sum()
	return exp_x / sum_exp_x

mini-batch version softmax

def softmax(x):
	if x.ndim == 2:
		x = x.T # transpose를 해야하는 이유 <- np.max의 return이 row vector이니,,
			x = x - np.max(x, axis=0) # for stability
		y = np.exp(x) / np.sum(np.exp(x), axis=0) # softmax
		return y.T
	x = x - np.max(x)
	return np.exp(x) / np.sum(np.exp(x))

Softmax

How can we ensure that $f_{w}^{(c)} (x)$ predicts a valid categorical(discrete) distribution?

should guarantee

$f_{w}^{(c)} (x) \in [0, 1]$

$\sum_{c = 1}^{C} f_{w}^{(c)} (x) = 1$

element-wise sigmoid as output function would ensure(1). but not (2)
→ therefore, in this case, use softmax.

Let $s$ denote the network output after the last affine layer(=scores). Then:
$f_{w}^{(c)} (x) = \frac{exp ( s _{c} )}{\sum _{k = 1}^{C} exp ( s _{k} )} ⟹ log f_{w}^{(c)} (x) = s_{c} - log \sum_{k = 1}^{C} exp (s_{k})$

Remark: $s_{c}$ is a direct contribution to the loss function, i.e., it does not saturate

Juhyeon's Blog

탐색기

Softmax

그래프 뷰

Properties

백링크