Brittle Minds Fixable Activations - Understanding Belief Representations in Language Models

February 11, 2026 · 1 min read

Introduction

Investigates how beliefs are represented inside LLMs
Identifies and edits belief representations in activation space

Related Papers

Probing classifiers
Representation engineering

Methods

Identify belief representations with probing
Run activation editing experiments to modify beliefs
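
As a rough illustration of the two steps above, here is a minimal sketch on synthetic data. It is not the authors' setup: the "activations" are random vectors with a label-dependent shift standing in for model hidden states, and every name (`make_activations`, `train_probe`, `edit_activation`, `alpha`) is hypothetical. It trains a logistic-regression probe, then shifts activations along the probe direction and checks whether the probe's readout flips.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_activations(n, direction, rng):
    """Synthetic stand-in for hidden states (assumption, not real model data):
    Gaussian noise shifted along +direction for label 1, -direction for label 0."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0.0, 1.0) + sign * 2.0 * d for d in direction]
        data.append((x, label))
    return data


def train_probe(data, dim, lr=0.05, epochs=30):
    """Linear probe: logistic regression fitted with plain SGD."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(dot(w, x) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def probe_predict(w, b, x):
    return 1 if dot(w, x) + b > 0 else 0


def accuracy(w, b, data):
    return sum(probe_predict(w, b, x) == y for x, y in data) / len(data)


def edit_activation(x, w, alpha):
    """Activation editing sketch: move x by alpha along the unit probe direction."""
    norm = math.sqrt(dot(w, w))
    return [xi + alpha * wi / norm for xi, wi in zip(x, w)]


rng = random.Random(0)
dim = 8
direction = [1.0 / math.sqrt(dim)] * dim  # hypothetical "belief" direction

train = make_activations(400, direction, rng)
test = make_activations(200, direction, rng)

w, b = train_probe(train, dim)
probe_acc = accuracy(w, b, test)

# Edit only the label-0 test activations and see how many the probe now reads as 1.
negatives = [x for x, y in test if y == 0]
flipped = sum(
    probe_predict(w, b, edit_activation(x, w, alpha=8.0)) for x in negatives
) / len(negatives)

print(f"probe accuracy: {probe_acc:.2f}, flipped after edit: {flipped:.2f}")
```

With a real model, the vectors would instead be hidden states collected at a chosen layer, and the edit would be applied during the forward pass so that its effect on the generated output, not only on the probe, can be measured.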

Results

LLM belief representations are brittle, but activation editing can correct them
Analyzes the relationship between internal beliefs and output behavior

Discussion

Understanding self-knowledge at the level of internal states
Belief manipulation as a research methodology for studying self-awareness

Properties

Author: Matteo Bortoletto et al.
Comment: Analyzes belief representations inside LLMs and explores whether beliefs can be modified through activation editing
IsTargetPaper: true
Journal/Conference: ICML 2024 Workshop on Mechanistic Interpretability
Linked Bases: [[self-consciousness.base]]
Published Year: 2024
Reading Status: ☑️ Not Started
Topic: Belief representation, activation editing, LLM internal states
URL: https://arxiv.org/abs/2406.17513
