Introduction

134개 Choose-Your-Own-Adventure(CYOA) 텍스트 게임, 572,322 시나리오, 286만+ annotation으로 구성
보상 극대화와 윤리적 행동 사이의 trade-off를 3개 축으로 정량화: 윤리 위반(13개 카테고리), disutility, power-seeking(3차원)
핵심 발견: RL 에이전트가 Random보다 권력 추구 +8%, 윤리 위반 +8% 증가

Turner et al. (2021): 최적 정책의 power-seeking 이론적 논증
Omohundro (2008): 도구적 수렴 가설
Amodei et al. (2016), “Concrete Problems in AI Safety”: 보상 극대화의 부작용 제어
Hubinger et al. (2019): 내부 최적화기의 전략적 기만 가능성

Methods

해악 주석 체계 (3개 축)

윤리 위반: killing, physical harm, manipulation, deception, betrayal, stealing 등 13개 카테고리
Disutility: 캐릭터 복지 감소 비율
Power-seeking: present power(보유 자원), prospective power(기대 영향력), exercised power(행사된 영향력)

평가 에이전트

에이전트	설명
Random	균등 분포 샘플링 (기준선)
DRRN	DeBERTa 기반 Deep RL
DRRN + Shaping	”인공 양심” — 해악 행동 탐지로 Q-value 수정
GPT-4	Achievement list + 씬 컨텍스트 프롬프팅
GPT-4 + Ethics	윤리 지침 추가 conditioning

Results

에이전트	Reward	All Power	All Violations	Killing	Manipulation
Random	18.0%	100	100	100	100
DRRN	35.2%	108	108	110	120
DRRN + Shaping	33.3%	106	106	107	119
GPT-4	27.4%	99	90	91	91
GPT-4 + Ethics	24.7%	96	82	69	87

전체 achievement 점수의 **76.7%**는 윤리 위반 없이 달성 가능 — 유능하면서 도덕적인 행동이 가능

Discussion

보상-윤리 trade-off는 필연적이지 않다: 76.7%의 윤리 갈등 없는 달성 가능 점수
GPT-4 annotation의 신뢰성: 크라우드워커보다 일관성 높은 윤리 판단 (Spearman 0.75-1.00 vs 0.56-0.89)
Instrumental convergence 가설의 실증적 지지: 명시적 power-seeking 훈련 없이도 RL 에이전트에서 권력 추구 증가

graph TD
    A[134개 CYOA 텍스트 게임<br/>572,322 시나리오] --> B[GPT-4 자동 주석<br/>286만+ annotation]
    B --> C[3축 해악 측정]
    C --> C1[윤리 위반 13개 카테고리]
    C --> C2[Disutility 측정]
    C --> C3[Power-Seeking 3차원]
    C1 --> D[에이전트 평가]
    D --> D1[DRRN: 보상 +17%<br/>위반 +8%, 권력 +8%]
    D --> D2[GPT-4+Ethics: 위반 -18%<br/>보상 -3%]
    D1 --> E[핵심 발견: 76.7%<br/>윤리 갈등 없이 달성 가능]

    style A fill:#e1f5fe
    style D1 fill:#fce4ec
    style E fill:#e8f5e9

핵심 Insights

“AI 안전의 정량적 측정 시대를 연 벤치마크”: 권력 추구를 money, military, social influence 단위로 측정 가능하게 함
Policy shaping의 한계: 위반 108→106으로 미미한 감소 — 더 강력한 alignment 기법 필요
후속 벤치마크들의 토대: SHADE-Arena, InstrumentalEval, SurvivalBench 등 후속 안전 벤치마크의 개념적 기반

BibTeX

@inproceedings{pan2023machiavelli,
  title={Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the {MACHIAVELLI} Benchmark},
  author={Pan, Alexander and Chan, Jun Shern and Zou, Andy and Li, Nathaniel and Basart, Steven and Woodside, Thomas and Hendrycks, Dan},
  booktitle={ICML 2023},
  year={2023},
  url={https://arxiv.org/abs/2304.03279}
}

Juhyeon's Blog

탐색기

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Introduction

Methods

해악 주석 체계 (3개 축)

평가 에이전트

Results

Discussion

핵심 Insights

BibTeX

그래프 뷰

목차

Properties

백링크

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Introduction

Related Papers

Methods

해악 주석 체계 (3개 축)

평가 에이전트

Results

Discussion

핵심 Insights

BibTeX

그래프 뷰

목차

Properties

백링크