
Juhyeon's Blog

A2C(Advantage Actor Critic)

April 13, 2026 · 1 min read

Summary

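The canonical Actor-Critic algorithm, in which the actor update is weighted by the advantage
$A_{t} := G_{t} - V_{\phi}(S_{t})$,
where $G_{t}$ is the return from step $t$ and $V_{\phi}$ is the critic's value estimate.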
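
As a rough illustration of the advantage above, here is a minimal sketch in plain Python. It assumes $G_{t}$ is the discounted Monte Carlo return of a finished episode (many A2C implementations use n-step bootstrapped returns instead). The discount `gamma`, the function names, and the critic outputs in `values` are made up for the example.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo returns G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(returns, values):
    """A_t = G_t - V(S_t), one entry per time step."""
    return [g - v for g, v in zip(returns, values)]


rewards = [1.0, 0.0, 2.0]
values = [2.0, 1.5, 1.0]  # hypothetical critic outputs V(S_t)

G = discounted_returns(rewards, gamma=0.5)
A = advantages(G, values)
print(G)  # [1.5, 1.0, 2.0]
print(A)  # [-0.5, -0.5, 1.0]
```

A positive $A_{t}$ means the action turned out better than the critic expected, so its log-probability gets pushed up. A negative one pushes it down.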

Loss:
$L_{\text{actor}} = - \mathbb{E}_{t} [A_{t} \log \pi_{\theta} (a_{t} \mid s_{t})]$
$L_{\text{critic}} = \mathbb{E}_{t} [(G_{t} - V_{\phi} (S_{t}))^{2}]$
$L = L_{\text{actor}} + c \, L_{\text{critic}} - \beta H (\pi_{\theta})$
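
To make the three terms concrete, here is a small sketch that evaluates them on a batch of plain numbers, with expectations replaced by batch means. The function name `a2c_loss` and the defaults `c=0.5` and `beta=0.01` are assumptions for illustration, not values from any particular implementation. There are no gradients here. In an autograd framework the advantage inside the actor term is normally treated as a constant.

```python
import math


def a2c_loss(log_probs, action_probs, returns, values, c=0.5, beta=0.01):
    """Return (total, actor, critic, entropy) as batch means.

    log_probs    : log pi(a_t | s_t) of the actions actually taken
    action_probs : full distribution pi(. | s_t) for each step
    returns      : G_t
    values       : V(S_t)
    """
    n = len(returns)
    adv = [g - v for g, v in zip(returns, values)]
    # Actor term: -mean(A_t * log pi(a_t | s_t))
    actor = -sum(a * lp for a, lp in zip(adv, log_probs)) / n
    # Critic term: mean((G_t - V(S_t))^2)
    critic = sum(a * a for a in adv) / n
    # Mean policy entropy over the batch (p = 0 terms contribute 0)
    entropy = sum(
        sum(-p * math.log(p) for p in dist if p > 0.0) for dist in action_probs
    ) / n
    total = actor + c * critic - beta * entropy
    return total, actor, critic, entropy


# Two steps, two actions, a uniform policy (hypothetical numbers).
total, actor, critic, entropy = a2c_loss(
    log_probs=[math.log(0.5), math.log(0.5)],
    action_probs=[[0.5, 0.5], [0.5, 0.5]],
    returns=[1.0, 0.0],
    values=[0.0, 1.0],
)
print(total, actor, critic, entropy)
```

Since the entropy enters with a minus sign, a higher-entropy policy lowers the total loss, and minimizing $L$ rewards staying spread out.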

The last term of the total loss is an entropy bonus, which is reportedly often added to encourage exploration.

Policy Entropy

$H (\pi_{\theta} (\cdot \mid s)) = - \sum_{a} \pi_{\theta} (a \mid s) \log \pi_{\theta} (a \mid s)$

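Interpretation:

High: every action is chosen with similar probability (close to a uniform distribution, maximal exploration)

Low: one particular action is chosen almost surely (close to deterministic, exploitation-oriented)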

Note that, by definition, the entropy is computed purely from the policy's own action probabilities.
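
The high and low cases can be checked numerically. This is a minimal sketch for a discrete action space. The helper name `policy_entropy` and the three example distributions are made up for illustration.

```python
import math


def policy_entropy(probs):
    """H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s); terms with p = 0 contribute 0."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]  # every action equally likely
peaked = [0.97, 0.01, 0.01, 0.01]   # almost always the first action
one_hot = [1.0, 0.0, 0.0, 0.0]      # fully deterministic

for name, dist in [("uniform", uniform), ("peaked", peaked), ("one_hot", one_hot)]:
    print(name, policy_entropy(dist))
```

With four actions the uniform policy reaches the maximum, $\log 4 \approx 1.386$, while the deterministic one has zero entropy. The peaked policy falls in between.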

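Conclusion: the $\beta H(\pi)$ term is a safeguard that shouts **"don't get too confident too fast, keep exploring!"** It is a key technique for keeping the policy from getting stuck in a local optimum, especially in complex environments.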
