Juhyeon's Blog

Activation Oracles - Training and Evaluating LLMs as General-Purpose Activation Explainers

February 11, 2026 · 2 min read

Introduction

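Trains an LLM to act as an "Activation Oracle": a model that takes neural network activations as input and explains them in natural language
Extends LatentQA: answers arbitrary questions about activations in natural language
Can recover information hidden in fine-tuned models (e.g., biographical knowledge, malign propensities)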

Related Papers

LatentQA: an LLM that takes activations as input
Sparse autoencoder-based interpretability
Probing / representation analysis research

Methods

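Train an LLM to accept another LLM's activations directly as input
Evaluate its ability to explain activations across a range of downstream tasks
Attempt to recover, from a fine-tuned model's activations, information that is absent from the training data
Compare the effect of different training datasets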
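
One way to picture "accepting activations as input" is to reserve placeholder token positions in the oracle's prompt and write the target model's activation vectors into them. The sketch below is only an illustration under stated assumptions: these notes do not describe the injection mechanism, so the norm-matched addition, the function name `inject_activations`, and the `placeholder_positions` parameter are all hypothetical choices, not the paper's confirmed method.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def inject_activations(embeddings, activations, placeholder_positions):
    """Return a copy of `embeddings` with activations added at placeholders.

    Assumption (not from the notes): each activation is rescaled to the norm
    of the hidden vector it is added to, so injected vectors stay on the same
    scale as the oracle's own representations.
    """
    if len(activations) != len(placeholder_positions):
        raise ValueError("need exactly one activation per placeholder position")

    patched = [list(row) for row in embeddings]  # leave the input untouched
    for pos, act in zip(placeholder_positions, activations):
        host = patched[pos]
        if len(act) != len(host):
            raise ValueError("activation and embedding dimensions differ")
        act_norm = l2_norm(act)
        if act_norm == 0.0:
            continue  # nothing to inject for an all-zero activation
        scale = l2_norm(host) / act_norm
        patched[pos] = [h + scale * a for h, a in zip(host, act)]
    return patched


# Toy example: three token positions, hidden size 2, position 1 is a placeholder.
embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
activations = [[0.0, 10.0]]
patched = inject_activations(embeddings, activations, placeholder_positions=[1])
print(patched[1])
```

In a real setup the same operation would run on tensors inside the oracle's forward pass, and the question about the activations would follow the placeholder tokens as ordinary text.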

Results

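The Activation Oracle successfully recovers information hidden in fine-tuned models (biographical knowledge, malign propensities)
Information that does not appear in the input text can also be extracted from activations
Training on more diverse datasets yields consistent performance gains
Performance matches or exceeds existing interpretability baselines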
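
To compare an oracle against a baseline on a hidden-information task, the free-text answers have to be scored against known ground truth. The notes do not say which metric was used, so the following is a minimal sketch assuming a simple normalized exact-match accuracy. The function names and the toy strings are hypothetical placeholders, not data or results from the paper.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Made-up toy answers, only to show the call shape.
oracle_answers = ["Paris ", "a cat"]
ground_truth = ["paris", "a dog"]
print(exact_match_accuracy(oracle_answers, ground_truth))
```

Scoring the oracle and each baseline with the same function on the same held-out questions is what makes a "matches or exceeds" comparison meaningful.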

Discussion

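Valuable as a tool for reading a model's "inner state" from the outside
This is awareness by another model rather than self-awareness, but it can serve as a tool for self-awareness research
Directly applicable in AI safety to detecting hidden malicious behavior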

Properties

Author: Adam Karvonen et al.
Comment: Trains an LLM as an activation explainer; can recover hidden information even from fine-tuned models
IsTargetPaper: true
Journal/Conference: arXiv
Published Year: 2025
Reading Status: ☑️ Not Started
Review Date: 2026-01-30
Topic: LLM Activation Interpretation, Self-Understanding
URL: https://arxiv.org/abs/2512.15674
