
Looking Inward - Language Models Can Learn About Themselves by Introspection

February 11, 2026 · 2 min read

Introduction

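Investigates whether LLMs can hold self-knowledge that originates from their internal states (introspection)
Defines introspection as "acquiring knowledge that is not contained in the training data and instead originates from internal states"
If model M1 predicts its own behavior better than a different model M2 predicts M1's behavior, that is evidence of introspection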

Related Papers

Research on human introspection (psychology and philosophy)
Research on LLM self-evaluation
Research on behavioral prediction

Methods

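Fine-tunes LLMs to predict properties of their own behavior in hypothetical scenarios
Experiments with GPT-4, GPT-4o, and Llama-3 models
Compares M1's self-predictions against a different model M2's predictions of M1's behavior
Checks whether self-prediction stays accurate after the model's ground-truth behavior is intentionally modified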
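
The self-prediction versus cross-prediction comparison above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the paper's evaluation code: the names `prediction_accuracy`, `m1_behave`, `m1_self_predict`, `m2_cross_predict`, and `first_char` are hypothetical, and the "models" are plain functions standing in for fine-tuned LLMs.

```python
from typing import Callable, Sequence


def prediction_accuracy(
    predict: Callable[[str], str],
    behave: Callable[[str], str],
    extract_property: Callable[[str], str],
    prompts: Sequence[str],
) -> float:
    """Fraction of prompts where the predicted property equals the
    property of the target model's actual response."""
    if not prompts:
        raise ValueError("prompts must be non-empty")
    hits = sum(predict(p) == extract_property(behave(p)) for p in prompts)
    return hits / len(prompts)


# Hypothetical stand-ins (assumptions, not real models).
def m1_behave(prompt: str) -> str:
    # M1's ground-truth behavior: it answers in upper case.
    return prompt.upper()


def first_char(response: str) -> str:
    # The behavior property being predicted: first character of the response.
    return response[:1]


def m1_self_predict(prompt: str) -> str:
    # M1 "knows" it upper-cases its answers.
    return prompt[:1].upper()


def m2_cross_predict(prompt: str) -> str:
    # M2 guesses that M1 keeps the original casing.
    return prompt[:1]


prompts = ["apple", "Banana", "cherry", "Date"]
self_acc = prediction_accuracy(m1_self_predict, m1_behave, first_char, prompts)
cross_acc = prediction_accuracy(m2_cross_predict, m1_behave, first_char, prompts)
print(f"self={self_acc:.2f} cross={cross_acc:.2f}")
```

In this framing, the introspection claim corresponds to `self_acc` exceeding `cross_acc` on held-out prompts. The behavior-modification check would rerun the same comparison after swapping in a changed `m1_behave`.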

Results

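M1 consistently predicts its own behavior better than M2 does (evidence of introspection)
Llama 70B predicts itself with 48.5% accuracy, versus 31.8% when GPT-4o predicts Llama 70B
GPT-4o predicts itself with 49.4% accuracy, versus 36.6% when Llama 70B predicts GPT-4o
Self-prediction accuracy holds up even after the model's behavior is modified
However, introspection fails on complex tasks and on out-of-distribution generalization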
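
To make the size of the self-prediction advantage concrete, the small sketch below computes the gap in percentage points from the two accuracy pairs listed above. The helper name `self_prediction_advantage` and the dictionary layout are illustrative assumptions. Only the four accuracy figures come from the notes.

```python
def self_prediction_advantage(self_acc: float, cross_acc: float) -> float:
    """Gap in percentage points between self- and cross-prediction accuracy."""
    return round(self_acc - cross_acc, 1)


# (self-prediction %, cross-prediction %) as reported in the Results above.
reported = {
    "Llama 70B": (48.5, 31.8),  # cross-predictor: GPT-4o
    "GPT-4o": (49.4, 36.6),     # cross-predictor: Llama 70B
}

advantages = {
    model: self_prediction_advantage(self_acc, cross_acc)
    for model, (self_acc, cross_acc) in reported.items()
}

for model, gap in advantages.items():
    print(f"{model}: +{gap} percentage points")
```

Both gaps are positive, which is what the M1-versus-M2 criterion requires. The sketch says nothing about statistical significance.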

Discussion

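Introspection succeeds on simple tasks, but extending it to complex tasks remains unsolved
Models struggle to predict their own behavior on tasks that require long outputs (such as story writing)
Further research is needed on the nature and limits of privileged access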


Properties

Author: Felix J Binder et al.
Comment: LLMs predict their own behavior better than other models do, which is evidence of genuine introspection
IsTargetPaper: true
Journal/Conference: ICLR 2025(Poster)
Linked Bases: [[self-consciousness.base]]
Published Year: 2025-01-01
Reading Status: ✅ Done
Review Date: 2026-01-30
Topic: LLM Introspection, Self-Knowledge
URL: https://arxiv.org/abs/2410.13787
