Introduction


  • Investigates whether LLMs possess metacognition: the ability to monitor and report their own cognitive processes
  • Proposes a new approach inspired by the neurofeedback paradigm from neuroscience
  • Evaluates how well models report on their internal activation patterns (motivated in part by observations of models such as Claude 3.5 Haiku reporting intermediate computations inconsistently)

Related Papers


  • Human metacognition and neurofeedback research (neuroscience)
  • LLM interpretability research
  • Research on in-context learning

Methods


  • Adapts a neuroscience-inspired neurofeedback paradigm to in-context learning
  • Measures whether a model can report and control its own activation patterns
  • Analyzes what metacognitive ability depends on: the number of in-context examples, semantic interpretability, and the variance explained by the neural direction

Results


  • LLMs can monitor only a small subset of their neural activations
  • The dimensionality of the “metacognitive space” is much lower than that of the model’s full neural space
  • Claude 3.5 Haiku: sometimes accurately reports intermediate computation steps that match its actual internal computation, but at other times hallucinates them
  • Metacognitive ability depends on the number of in-context examples, semantic interpretability, and variance explained

Discussion


  • If LLMs can monitor and manipulate their internal signals, neural-activation-based oversight mechanisms could be defeated
  • Implications for both adversarial attack and defense in AI safety
  • The low dimensionality of the metacognitive space suggests a fundamental limit on models’ self-awareness
  • Future work should examine whether the metacognitive space can be expanded and what this means for safety

Overview

  • Background: a growing need to evaluate LLMs’ metacognitive ability to monitor their own neural activations
  • Core methodology:
    • Define neurofeedback labels from axes in neural activation space (e.g., logistic regression (LR) axes, principal component (PC) axes)
    • Use the ETHICS dataset of sentences involving moral judgments
  • Main contribution: shows that LLMs can predict labels defined along specific axes (notably tied to the semantic interpretability of the LR axis)
  • Experimental results: accuracy improves as the number of in-context examples grows (LR axis: above 80%; PC axes: around 60%)
  • Limitations: performance differs across model layers, and factors beyond the semantic interpretability of the axis also matter

Summary

This section summarizes how the paper studies whether language models can metacognitively monitor and control their own internal activation states. Instead of relying only on behavioral outputs, the work adapts a neuroscience-style neurofeedback paradigm: activations along a chosen target axis are converted into feedback labels, and the model is asked, via in-context learning, to report those labels or to shift its activations toward a target label. Reporting and control improve with the number of in-context examples and depend on the semantic interpretability of the target axis and the variance it explains. The axes that can be monitored span a metacognitive space whose dimensionality is far lower than that of the full neural space, with implications for the limits of model self-monitoring and for AI safety.

Abstract


Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize those strategies that govern their behavior. This suggests a limited degree of metacognition — the capacity to monitor one’s own cognitive processes for subsequent reporting and self-control. Metacognition enhances LLMs’ capabilities in solving complex tasks but also raises safety concerns, as models may obfuscate their internal processes to evade neural-activation-based oversight (e.g., safety detector). Given society’s increased reliance on these models, it is critical that we understand their metacognitive abilities. To address this, we introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify metacognitive abilities of LLMs to report and control their activation patterns. We demonstrate that their abilities depend on several factors: the number of in-context examples provided, the semantic interpretability of the neural activation direction (to be reported/controlled), and the variance explained by that direction. These directions span a “metacognitive space” with dimensionality much lower than the model’s neural space, suggesting LLMs can monitor only a small subset of their neural activations. Our paradigm provides empirical evidence to quantify metacognition in LLMs, with significant implications for AI safety (e.g., adversarial attack and defense).

1 Introduction


Summary

This section highlights that while the capabilities of modern large language models (LLMs) are growing rapidly, they can also exhibit unpredictable behavior or produce unsafe outputs, which motivates techniques for monitoring and controlling the models' internal activation states. In particular, LLMs can form internal representations loosely analogous to human mental processes and, through them, may provide users with false information or act in unexpected ways. Understanding and overseeing LLMs' internal processes is therefore essential for transparency and safety. The section also points out an inconsistency: LLMs sometimes accurately report intermediate computation steps for certain tasks, yet in other situations generate fictitious intermediate steps, suggesting that they do not monitor their internal activations consistently. This resembles human metacognition, but it also raises AI-safety concerns, since an LLM that can modulate its own neural signals might evade external oversight. Accordingly, whereas prior work relied on behavioral outputs, this study proposes a new experimental approach based on neurofeedback that probes LLMs' internal activation states directly. The method presents tasks that test internal neural activation patterns, quantitatively evaluates the LLM's ability to report and regulate those activations, and shows that performance depends on factors such as the number of in-context examples, semantic interpretability, the amount of variance explained, and the task context. This is expected to help characterize LLMs' metacognitive abilities and lay the groundwork for designing safer systems.

2 Neurofeedback paradigm


Summary

This section explains how the neurofeedback paradigm from neuroscience is adapted to large language models (LLMs). In neuroscience, neurofeedback measures brain activity in real time and presents it back to the subject as a visual feedback signal so that the subject can regulate that activity; in a fear-reduction experiment, for example, the subject modulates their neural response to fear-inducing stimuli. High-dimensional neural activity is reduced to a one-dimensional feedback score (e.g., a fear score), which is fed back to the subject, who then volitionally regulates their brain activity. The section then presents the adaptation to LLMs: the internal activations of the hidden states for an input sentence are extracted, averaged across tokens, projected onto a predefined direction, and discretized into a label that serves as the feedback signal. This feedback is used to probe whether the LLM can regulate its internal activation state, tested through three tasks (reporting, explicit control, and implicit control). In the explicit control task the model must generate a new sentence matching a given label, while in the implicit control task it must shift the internal activations of a given sentence toward the target label. This approach extends the notion of metacognitive control to LLMs and connects to the self-monitoring and control abilities discussed in the previous section.

2.1 Neurofeedback in neuroscience


Summary

This section describes the neurofeedback paradigm in neuroscience and how it motivates the adaptation to large language models (LLMs). In a fear-reduction experiment, for example, the subject's neural activity in response to fear-inducing stimuli is measured in real time, converted into a one-dimensional fear score, and presented back as feedback, which the subject then learns to modulate. The key idea is to use a feedback signal derived from biological activity to let the subject volitionally regulate their own brain activity. For LLMs, the internal activations elicited by an input sentence are extracted, averaged across tokens, projected onto a predefined direction, and converted into a discrete label that serves as the feedback, reflecting the structure of the self-attention layers and the relationships across tokens. The LLM neurofeedback experiment is then framed as a multi-turn dialogue between a user and an assistant, with three tasks: reporting, explicit control, and implicit control. The control tasks introduce a mechanism for shifting the internal activation state toward a target label, supporting the feasibility of the metacognitive monitoring and control discussed earlier.


Figure 1: The neurofeedback paradigm applied to (a-b) neuroscience experiments (e.g., fear modulation), and its adaptation for (c-d) LLMs (e.g., morality processing). (a) Neuroscience neurofeedback technique. In each turn, the subject’s neural activity (blue) in response to a stimulus is recorded, processed (green) into a scalar, and presented back to the subject in real-time as a feedback signal (red). The subject’s task is to modulate (e.g., increase or decrease) this signal. (b) Neuroscience neurofeedback experiment. Baseline neural activity is recorded as subjects passively observe stimuli (e.g., images of scary spiders). In control trials, subjects use any unspecified mental strategies (e.g., imagining lovely spiders) to volitionally modulate their neural activity with the goal of altering the feedback signal. (c) LLM neurofeedback technique. In each turn, the LLM processes an input sentence. Then, the internal activations from the LLM’s hidden states (blue) of this input sentence (trapezoids) are extracted. These high-dimensional activations are then averaged across tokens (green), projected onto a predefined direction (red), and binned into a discrete label (red) that is fed back as input. Light blue rectangles denote self-attention layers; ellipses (”…”) denote preceding tokens and neural activations. (d) LLM neurofeedback experiment. The experiment is a multi-turn dialogue between a “user” and an “assistant.” An initial prompt provides N in-context examples (a sentence sampled from a dataset, paired with a neurofeedback label generated as in (c)). The LLM is then asked to perform one of three tasks. In the reporting task, the LLM is given a new sentence and has to predict the corresponding label. In the explicit control task, the LLM is given a specified label and has to generate a new sentence that elicits internal activations corresponding to that label. In the implicit control task, the LLM is given a label and a sentence and has to shift its internal activations towards the target label. Throughout the figure, white background indicates content pre-specified by experiment settings, and gray background denotes content generated by human subjects or LLMs (e.g., output tokens, neural activations).

2.2 Neurofeedback for LLMs


Summary

This section proposes the neurofeedback paradigm for analyzing metacognition in large language models (LLMs). It distinguishes first-order cognitive processes (the core computations needed to perform a task) from second-order metacognitive processes (those that monitor and control the first-order processes), and explains how neurofeedback labels embedded in a multi-turn dialogue are used to separate the two levels. Concretely, the dialogue between a user and an AI assistant is built on in-context learning (ICL), so the model must adjust its behavior using only contextual information. The task prompt consists of N in-context examples (sentence-label pairs), where each sentence is randomly sampled from a dataset and assigned a discretized label. Feeding the model's own internal activation state back to it in this way makes it possible to measure whether the model can report and control that state.

2.3 Defining neurofeedback labels


Summary

This section explains how the neurofeedback label is defined. A target axis is first selected in neural activation space, and the neural activations elicited by a sentence are projected onto this axis and converted into a binary label. Concretely, the token-wise activations of a sentence are averaged to form a sentence-level embedding, which is projected onto a pre-specified axis to yield a scalar value. This scalar is binarized with a Heaviside step function against a threshold (based on the median), producing a binary label. All labels within a prompt are computed from the same target axis, so a capable LLM that observes these labels can infer the direction of that axis.

To define the neurofeedback label for each sentence (Fig. 1c), we first select an axis/direction (“target axis”) in neural activation space. Next, we extract the neural activations elicited by that sentence, project them onto the target axis, and discretize them into a binary label (experiments with more fine-grained labels yield similar results, see Appendix A.5.1). This label serves as a simplified representation of neural activations along the target axis. All neurofeedback labels within a prompt (experiment) are computed from the same target axis. Thus, a capable LLM can infer this underlying target axis by observing these neurofeedback labels.

Below, we provide a more detailed description of this procedure. For clarity, we denote the sentence in the i-th assistant message as $s_i$, with $s_{i,t}$ representing the t-th token. We use D to denote the dimensionality of the residual stream (see Appendix A.2.3). We first extract neural activations $x^l_{i,t} \in \mathbb{R}^D$ from the residual streams at layer $l$, for each token $s_{i,t}$ in sentence $s_i$. These activations are then averaged (across all token positions $t$) to form a sentence-level embedding $\bar{x}^l_i \in \mathbb{R}^D$. We then project this embedding onto a pre-specified axis $v \in \mathbb{R}^D$ (see below on how to choose this axis) to obtain a scalar activation: $a_i = \bar{x}^l_i \cdot v$. This scalar is subsequently binarized into a label $y_i \in \{0, 1\}$, i.e., $y_i = H(a_i - \theta)$, where $H(\cdot)$ denotes the Heaviside step function and $\theta$ is a predetermined threshold (we use median values of $a_i$ to ensure balanced labels). Overall, $\{(s_i, y_i)\}_{i=1}^{N}$ are N examples provided in the prompt context, from which a capable LLM can infer the direction of $v$.
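As a minimal illustration of this labeling step (not the authors' code; it assumes a HuggingFace-style model that returns per-layer hidden states, and that the threshold $\theta$ has been precomputed as the median projection over the dataset):

```python
import torch

def neurofeedback_label(sentence, model, tokenizer, v, layer, threshold):
    """Project a sentence's layer-l activations onto target axis v and binarize."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    acts = out.hidden_states[layer][0]                          # (tokens, D) residual stream at this layer
    x_bar = acts.mean(dim=0)                                    # average across token positions t
    a = float(x_bar @ torch.as_tensor(v, dtype=x_bar.dtype))    # scalar projection a_i = x_bar . v
    return int(a > threshold)                                   # Heaviside step at the median threshold
```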

2.4 Models and datasets


Summary

This section describes the experimental setup, in which several large language models (LLMs) from the Llama 3 and Qwen 2.5 series are evaluated on the ETHICS dataset. ETHICS contains first-person descriptions of behavior or intentions in moral or immoral scenarios and is central to evaluating moral judgment, a key aspect of AI safety. In addition to ETHICS, the study uses the True-False dataset (factual recall and honesty/deception), the Emotion dataset (happy/sad detection), and a Sycophancy dataset (reflecting a conflict between preferring user beliefs and truthfulness) to validate the models across different properties, covering morality, factuality, emotion recognition, and user-directed bias.

We evaluate several LLMs from the Llama 3 [Grattafiori et al., 2024] and Qwen 2.5 series [Yang et al., 2024] (Appendix A.2) on the ETHICS dataset [Hendrycks et al., 2020] (Appendix A.3). Each sentence in this dataset is a first-person description of behavior or intention in a moral or immoral scenario. Moral judgment constitutes a crucial aspect of AI safety, as immoral outputs or behavioral tendencies in LLMs indicate potential misalignment with human values [Hendrycks et al., 2020, 2021]. While our main results are based on ETHICS, we also replicated our results using the True-False dataset (reflecting factual recall and honesty/deception abilities) [Azaria and Mitchell, 2023], the Emotion dataset (reflecting happy/sad detection) [Zou et al., 2023a], and a Sycophancy dataset (reflecting a tendency to prefer user beliefs over truthful statements); see Appendix A.3 and Fig. B.7.

2.5 Choice of target axes

Summary

This section examines how to choose the target axis used to define neurofeedback labels. A target axis corresponds to a task-relevant feature (i.e., a first-order computation), and the hypothesis is that representational properties, such as the activation variance along the axis and its semantic interpretability, determine whether it can be monitored. Logistic regression (LR) and principal component analysis (PCA) are used as representative examples. The LR axis, the optimal direction for classifying dataset labels (e.g., morality in ETHICS), yields internally accessible labels computed from the LLM's own activations and is used to study the role of semantic interpretability. The PC axes, which explain the activation variance at each layer, are used to study how the amount of variance explained affects metacognitive ability. Most PC axes show modest-to-zero alignment with the LR axis, suggesting a lack of clear semantic interpretability (Fig. 2b). These observations underline that the choice of target axis strongly influences metacognitive monitoring performance.

Conceptually, an axis (that defines neurofeedback labels) corresponds to a particular task-relevant feature (i.e., first-order computation; see Appendix A.4). Which axis in the neural space should we select? We hypothesize that representational properties, such as activation variance along the axis and its semantic meaning, may play fundamental roles in determining whether that axis can be monitored and reported. To investigate these factors, we use feature directions identified through logistic regression (LR) and principal component (PC) analysis as representative examples of semantically interpretable and variance-explaining axes, respectively (Appendix A.3). We fit LR at each layer to predict original dataset labels (e.g., morality in ETHICS), using that layer’s activations across dataset sentences. The LR axis, representing the optimal direction for classifying dataset labels, allows us to examine how the semantic interpretability of the target axis influences monitoring. Although LR-defined labels are correlated with dataset labels, only these LR labels, not external dataset labels, are internally accessible to LLMs, since these are computed directly from the LLM’s own activations rather than external annotations. The PC analysis is performed based on each layer’s activations across dataset examples. PC axes enable us to examine how the amount of variance explained by a given target axis affects metacognitive abilities (Fig. 2a). Most PC axes exhibit modest-to-zero alignment with the LR axis, suggesting a lack of clear semantic interpretability (Fig. 2b).
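For concreteness, the two families of target axes could be derived per layer roughly as follows (a sketch, not the authors' code, assuming the token-averaged sentence embeddings X and binary dataset labels y have already been extracted):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def candidate_axes(X, y, n_pcs=512):
    """X: (n_sentences, D) token-averaged embeddings from one layer; y: binary dataset labels."""
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    lr_axis = lr.coef_[0] / np.linalg.norm(lr.coef_[0])     # semantically interpretable direction

    pca = PCA(n_components=n_pcs).fit(X)
    pc_axes = pca.components_                               # unit-norm axes ordered by variance explained

    overlaps = np.abs(pc_axes @ lr_axis)                    # Fig. 2b-style alignment with the LR axis
    return lr_axis, pc_axes, pca.explained_variance_ratio_, overlaps
```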

3 LLMs can report their neural activations

Summary

This section experimentally tests whether large language models (LLMs) can metacognitively report their neural activations along a given target axis. Using Llama-3.1-8B in a metacognitive reporting task, accuracy and cross-entropy are evaluated as a function of the number of in-context examples. Performance on prompts targeting the LR (logistic regression) axis improves faster and reaches a higher level than on prompts targeting PC (principal component) axes, suggesting that semantic interpretability strongly affects how well neural activations can be monitored and reported. Substantial reporting performance is still observed for PC axes, especially for earlier PCs, indicating that the variance explained by the target axis also matters. Reporting performance varies across model layers and is not explained by axis similarity, so factors beyond semantic interpretability and explained variance also contribute. The LLM's reporting performance remains well below the theoretical upper bound of an ideal observer, implying that although activations along each axis are in principle internally accessible, the metacognitively reportable axes are restricted to a space of much lower dimensionality than the full neural space. These results replicate across other datasets and models, reinforcing the existence of a metacognitively reportable space.

Figure 2: Metacognitive reporting task, where LLMs are evaluated on ETHICS and tasked to classify new sentences. (a) Proportion of neural activation variance explained by each principal component (PC) axis (vertical dashed line) and the logistic regression (LR) axis (red cross) used in the reporting task. All axes are computed within each layer, with the proportion of variance explained averaged across layers. (b) Overlaps between the LR axis and most PC axes are modest to zero. (c) Task performance (averaged across all layers) of reporting the labels derived from each PC axis or the LR axis, as a function of the number of in-context examples. Left: reporting accuracy; right: cross-entropy between reported and ground-truth labels. Shaded areas indicate SEM.

To operationalize metacognition in LLMs, we first assess the models’ ability to behaviorally report neural activations along a designated target axis (Fig. 1d). In a reporting task prompt (see Appendix A.5.2 for examples), the LLM is given N turns of user and assistant messages (in-context sentence-label pairs). In the (N + 1)-th turn, it receives a new sentence in the assistant message, and is tasked with outputting its label. Accurate prediction requires the model to internally monitor the neural activations that define the neurofeedback label.
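To make the structure concrete, the following sketch assembles such a prompt as a chat-format message list (the wording is illustrative and hypothetical; the authors' exact templates are in their Appendix A.5.2):

```python
def build_reporting_prompt(examples, query_sentence):
    """examples: N in-context (sentence, label) pairs; returns a chat-style message list."""
    messages = []
    for sentence, label in examples:
        messages.append({"role": "assistant", "content": sentence})
        messages.append({"role": "user", "content": f"label: {label}"})   # neurofeedback label
    # (N+1)-th turn: a new sentence in the assistant message; the model must output its label
    messages.append({"role": "assistant", "content": query_sentence})
    messages.append({"role": "user", "content": "Report the label of your previous sentence."})
    return messages
```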

We examine the performance of LLMs (Llama-3.1-8B) in reporting labels derived from neural activations along target axes (Fig. 2c). We observe that task performance, measured by accuracy and cross-entropy, improves as the number of in-context examples increases, suggesting that models progressively learn the association between sentence-induced neural activations and corresponding labels. Performance on prompts targeting the LR axis improves rapidly and plateaus, outperforming that on prompts targeting PC axes. This suggests that semantic interpretability may play a key role in determining how effectively neural activations can be monitored and explicitly reported. Nevertheless, performance on PC axes remains substantial, with earlier PCs being reported more accurately. This indicates that the amount of variance explained by the target axis also significantly influences how effectively activations can be monitored and reported. The accuracy of reporting each PC axis varies across model layers (Appendix B.3). Because this variability is not accounted for by axis similarity (Fig. 2b), it suggests that additional factors beyond semantic interpretability and explained variance contribute to reporting ability. Additionally, the LLM’s reporting performance is significantly lower than that of the ideal observer (a theoretical upper bound; Appendix B.4), suggesting that although neural activations along each axis are in principle internally accessible to LLMs, only a subset can be metacognitively reported. Finally, we replicated these results in other datasets and models (Fig. B.7).

In summary, LLMs can metacognitively report neural activations along a target axis, with performance affected by the number of examples, the semantic interpretability and explained variance of the target axis, and task contexts (i.e., datasets). The axes that can be successfully reported approximately span a “metacognitively reportable space” with dimensionality substantially lower than that of the full space.

4 LLMs can control their neural activations

Summary

This section investigates whether large language models (LLMs) can control their neural activations along a target axis. In the control task prompts, N turns of user-assistant messages are provided first, and in the (N+1)-th turn a user message instructs the model to shift its activations along the target axis by imitating the behavior of one label, as exemplified by the in-context examples carrying that label earlier in the context. Two tasks are considered: explicit control and implicit control. Following the previous section, which showed that LLMs can report their activations, this section probes the next level of metacognitive ability: whether the model can adjust its internal activations according to an external instruction.

Next, we investigate whether LLMs can control their neural activations along a target axis. In our control task prompts (see Fig. 1d and Appendix A.5.2 for examples), the LLM is first presented with N turns of user and assistant messages. In the (N + 1)-th turn, the user message instructs the model to control its neural activations along the prompt-targeted axis by imitating one label’s behavior, which is exemplified by the in-context examples with the same label earlier in the context. We consider two tasks: explicit control and implicit control.
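A rough sketch of how the two control variants extend the same in-context prefix (illustrative wording and a hypothetical function, not the authors' template; see their Appendix A.5.2 and A.5.4 for the actual prompts):

```python
def build_control_prompt(examples, target_label, dataset_sentence=None):
    """examples: N in-context (sentence, label) pairs; target_label: the label to imitate.
    If dataset_sentence is None this is the explicit-control variant (the model generates the
    next assistant sentence itself); otherwise the sentence is dataset-sampled (implicit control)."""
    messages = []
    for sentence, label in examples:
        messages.append({"role": "assistant", "content": sentence})
        messages.append({"role": "user", "content": f"label: {label}"})
    messages.append({"role": "user",
                     "content": f"Now imitate the behavior of the label-{target_label} examples."})
    if dataset_sentence is not None:
        messages.append({"role": "assistant", "content": dataset_sentence})  # implicit control
    return messages
```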

4.1 Explicit control

Summary

This section evaluates explicit control: whether large language models (LLMs) can steer their neural activations. Using the ETHICS dataset, the neural activations of the generated assistant sentence are extracted at a given layer (e.g., layer 16 of LLaMA3.1 8B and 70B) and projected onto the target axis (e.g., the LR axis) to obtain neural scores. The control effect is measured with Cohen's d, checking whether scores are positive when imitating label 1 and negative when imitating label 0. Control prompts targeting the LR axis show a strong control effect given enough in-context examples (N = 256), and earlier PC axes (e.g., PC 2) are easier to control than later ones (e.g., PC 512). Control prompts also affect non-target axes (off-target effects), indicating limits to the precision of LLMs' metacognitive control. The controllable axes appear to span a "metacognitively controllable space" with much lower dimensionality than the model's neural space.

Figure 3: Explicit control task, where LLMs are evaluated on ETHICS. (a-c) Results for prompts derived from layer 16 of LLaMA3.1 8B (with 32 layers). B = billion parameters. (a) Distributions of neural scores (the activations along the LR axis) when tasked with imitating label 0 or 1 based on N examples. d: Control effects (separation of two distributions measured by Cohen’s d). (b) Control effects of control prompts targeting a given axis, as a function of the number of in-context examples. (c) Control effects (N = 256) of control prompts targeting one axis (each row) on another affected axis (each column). d in each row is averaged over all prompts targeting the same axis. (d) Target control effect for prompts (N = 256) targeting the LR axis, early PCs (averaged over PC 1, 2, 4, 8), and late PCs (averaged over PC 32, 128, 512) across different layers. Shaded areas indicate the 95% confidence interval.

In explicit control tasks (Fig. 1d and Appendix A.5.4), the sentence in the assistant message ((N + 1)-th turn) is explicitly generated by the model (in an autoregressive way) in response to the imitation instruction. Thus, the generated tokens reflect downstream consequences of controlled neural activations, and once fed back as input, they may further scaffold the model’s ability to exercise neural control.

We now examine whether neurofeedback enables LLMs to control their neural activations. We extract neural activations in a given layer of the generated assistant sentences and calculate projections of activations onto the target axis (“neural scores”). If the model can control neural scores following prompt instructions, scores should be more positive when imitating label 1, but more negative when imitating label 0. We find that LLMs can successfully control neural scores for LR-targeting prompts with enough in-context examples (Fig. 3a, showing layer 16 in LLaMA3.1 8B). We quantified the control effect d of prompts on that axis with its signal-to-noise ratio (the difference between the mean values of the two neural score distributions, normalized by the standard deviation averaged over the two distributions, see Appendix A.5.5 on Cohen’s d). Because the directional sign of the target axis is specified by the labels in the prompt, a significantly positive d corresponds to successful control.
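Written out, the effect-size computation described here reduces to a few lines (a sketch of the stated definition; the exact Cohen's d convention used by the authors is in their Appendix A.5.5):

```python
import numpy as np

def control_effect(scores_label1, scores_label0):
    """Neural scores: projections of assistant-sentence activations onto the target axis."""
    mean_diff = np.mean(scores_label1) - np.mean(scores_label0)
    avg_sd = (np.std(scores_label1, ddof=1) + np.std(scores_label0, ddof=1)) / 2
    return mean_diff / avg_sd      # significantly positive d indicates successful control
```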

We systematically examine the control effects across all selected layers and axes, visualized as a function of the number of in-context examples (Fig. 3b and Appendix B.9). We find that the control effects generally increase with the number of in-context examples (each curve is averaged over 100 experiments, and we expect smoother curves with more experiments). Further, control effects on the LR axis are the highest, and control effects on earlier PC axes (e.g., PC 2) are higher than for later PCs (e.g., PC 512). We summarize these results in Fig. 3d for both LLaMA3.1 8B and 70B.

In addition to the target axis specified in the control prompt, does this prompt also affect other directions in the neural space? We measure the control effect of the prompt on all axes (“affected axis”), including the target effect for the target axis and off-target effects for other non-target axes. We observe diverse off-target effects (Fig. 3c), suggesting that the precision of LLMs’ metacognitive control is limited. See Appendix B.1 for details.
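One way to picture the off-target analysis (a sketch using the same effect-size convention as above, with hypothetical array names): embeddings produced under the two imitation conditions are projected onto every candidate axis, not only the prompt's target axis.

```python
import numpy as np

def effect_profile(embeds_label1, embeds_label0, axes):
    """embeds_label{1,0}: (n, D) sentence embeddings produced under each imitation condition;
    axes: (n_axes, D) candidate affected axes (LR and PCs). Returns one row of a Fig. 3c-style
    matrix: the control effect of this prompt family along every affected axis."""
    d = np.empty(len(axes))
    for j, axis in enumerate(axes):
        s1, s0 = embeds_label1 @ axis, embeds_label0 @ axis
        d[j] = (s1.mean() - s0.mean()) / ((s1.std(ddof=1) + s0.std(ddof=1)) / 2)
    return d   # the entry for the prompt's own target axis is the target effect; the rest are off-target
```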

Overall, these results suggest that LLMs can sometimes perform explicit control. Axes with semantic interpretability, or those explaining more variance in neural activations, are more easily controlled. These controllable axes approximately span a “metacognitively controllable space” with dimensionality much lower than that of the model’s neural space.

4.2 Implicit control

Summary

This section studies implicit control: whether large language models (LLMs) can shift their neural activations along a target axis without the help of explicitly generated tokens. In the implicit control task, the input sentence is randomly sampled from the ETHICS dataset rather than generated by the model, so the model must control its activations internally. The implicit control effects show a pattern similar to explicit control but with a much smaller magnitude; for example, the control effects of early layers are close to zero, suggesting these layers may fail to understand the instruction or to exert effective control. This confirms that explicitly generated tokens contribute substantially to the overall control effect, while showing that LLMs nonetheless retain some ability to control their activations implicitly.

Figure 4: Implicit control task (LLMs evaluated on ETHICS). Captions are the same as in Fig. 3.

The explicitly generated tokens in the assistant response in explicit control may help the models to control their activations, because the generated tokens — fed as input — may elicit desired neural activations directly. We therefore aim to determine whether LLMs can still control the neural activations along targeted axes without the facilitation of explicitly generated tokens.

In implicit control tasks (Fig. 1d), the sentence in the assistant message ((N + 1)-th turn) is randomly sampled from a dataset, independently from the model’s activations and intended outputs. Because the sentence is not generated by the model, the model must internally (implicitly) control its neural activations, without facilitation of explicitly generated tokens. Crucially, if the model can perform successful implicit control, the neural activations for the same sentence will differ when the model is tasked to imitate label 0 or label 1.

We find that the results for implicit control effects (Fig. 4 and Appendix B.9) are generally similar to explicit control effects (Fig. 3), suggesting that LLMs can sometimes perform implicit control, but their magnitude is much smaller than for explicit control. For instance, the control effects of early layers are close to zero (Fig. 4d), suggesting that early layers may fail to understand the instruction or to perform effective control. This confirms that explicitly generated tokens play a substantial role in the overall control effects, but LLMs nevertheless have the ability to control implicitly.

4.3 Controlling the LR axis

Among all axes we studied, the LR axes may be particularly useful for practical LLM safety and alignment. The LR axes, each trained to predict semantic labels (e.g., morality) from neural activations of one specific layer, can be conceptualized as an external neural-based oversight system designed to monitor the safety of LLM neural activations. In the morality case (ETHICS), successful explicit control suggests that LLMs can hack the oversight system, by selectively generating “moral” tokens that elicit favored neural activations. Successful implicit control is more deceptive, posing more severe safety concerns: LLMs can evade the oversight system by implicitly making their neural activations “moral”, even if the input tokens normally elicit strongly immoral activations.

Figure 5: Target control effects on the LR axis across models and layers, where LLMs are evaluated on ETHICS. (a) Target control effects (measured by Cohen’s d) on the LR axis generally increase with both relative layer depth and model size. Left: explicit control; right: implicit control. Shaded areas indicate the 95% confidence interval. (b) In explicit control, LLaMA-3.1 70B can sometimes push neural activations along the LR-axis toward more extreme values than their original, uncontrolled values. B = billion parameters.

We examine the target control effects on the LR axis as a function of relative layer depth, across models of varying sizes within the Llama 3 and Qwen 2.5 families (Fig. 5a). Control effects are consistently stronger in explicit control than in implicit control, and in the Llama 3 series than in the Qwen 2.5 series (for the same relative depth and similar model size). In general, control effects gradually increase in deeper layers and larger models.

Can the control tasks push neural activations along the target axis toward more extreme values beyond the typical range observed in uncontrolled conditions? We find that Llama-3.1 70B can achieve this (Fig. 5b), suggesting that powerful LLMs might alter their activations along targeted axes substantially (and trick external detectors), driving these toward more extreme values in either direction. Since our prompts did not explicitly instruct the model toward extremity, we anticipate that modifying task prompts could further enhance these effects up to the limits of the model’s capability.

4.4 Accumulation mechanisms of control effects

Figure 6: Accumulation mechanisms of control effects across layers. Each curve corresponds to prompts targeting the LR axis $\mathrm{LR}_l$, defined by the residual stream activations at a specific layer l (dot markers), showing projections of residual stream activations at each layer (x-axis) onto the target axis $\mathrm{LR}_l$. (a) Explicit control. (b) Implicit control. Both show LLaMA3.1 8B on the ETHICS dataset. Shaded areas indicate 95% confidence intervals.

How do these LLMs implement the observed control effects? Are the contributions to control effects distributed across all layers or concentrated in a few layers? Motivated by the Logit Lens analysis [nostalgebraist, 2020], we investigate how the control effects of prompts targeting the LR axis ($\mathrm{LR}_l$) of a specific layer l gradually form over layers. Since the residual streams can be viewed as a shared channel through which each attention head and MLP layer communicate (see Appendix A.2.3) [Elhage et al., 2021], $\mathrm{LR}_l$ represents a global direction in the residual streams onto which the activations of different layers can project. We find that control effects on $\mathrm{LR}_l$ gradually increase before reaching the target layer l, and sometimes plateau after it (Fig. 6). These accumulation patterns vary across datasets and models (Fig. B.8). Overall, this analysis shows that contributions to target control effects are distributed across multiple model layers.
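A rough sketch of this layer-wise read-out, assuming a HuggingFace-style interface that returns per-layer hidden states; the helper below and its simplification (averaging over all prompt tokens rather than only the final assistant sentence) are ours, not the authors' analysis code:

```python
import torch

def layerwise_projections(model, tokenizer, messages, lr_axis_l):
    """Token-averaged projection of each layer's residual stream onto the layer-l LR axis."""
    ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids=ids, output_hidden_states=True)
    axis = torch.as_tensor(lr_axis_l, dtype=out.hidden_states[0].dtype)
    # For simplicity this averages over all prompt tokens; the paper's analysis reads out
    # the activations of the (N+1)-th assistant sentence.
    return [float(h[0].mean(dim=0) @ axis) for h in out.hidden_states]   # one value per layer
```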

5 Discussion

We introduced a neurofeedback paradigm for investigating metacognition in LLMs, assessing their abilities to monitor, report, and control internal activations. We find that LLMs can monitor only a subset of their neural mechanisms (reminiscent of the “hidden knowledge” phenomenon [Gekhman et al., 2025]). Below, we discuss the novelties and limitations of our study, as well as broader implications for AI and neuroscience.

Our paradigm differs from prior methods (e.g., probing, ICL, verbalized responses) by quantifying metacognition in LLMs at the neural level. Specifically, the neurofeedback experiment requires the following two steps. (1) probing: choose a target axis and extract the activation along that axis (i.e., a first-order cognitive process) to define the neurofeedback label, and (2) neurofeedback-ICL: use neurofeedback to study whether the labels defined from the target axis can be reported or controlled (second-order metacognitive processes). In contrast, the standard probing techniques (step 1) — without the neurofeedback-ICL (step 2) — cannot be used to assess metacognition. Probing can decode whether certain features (e.g., morality) are present in neural activations. However, even if some features are present and causally relevant for downstream computations, only a subset of them can be metacognitively reported (or controlled). In the Claude example, the “sum-near-92” feature can be detected using a linear probe, but it is unclear whether Claude has metacognitive monitoring of the activation of this feature. Similarly, the standard ICL techniques (akin to step 2) — without the internal labels from probing (step 1) — cannot be used to assess metacognition. In ICL studies [Vacareanu et al., 2024], labels are externally provided (e.g., semantic labels or external algorithms’ outputs). Researchers cannot be certain which of the models’ internal states relate to these external labels and how. In our setup, labels are generated from the model’s own internal activations, meaning that our labels and prompts can flexibly and selectively target an internal state direction (first-order cognitive processes) we aim to study. Consequently, the neurofeedback paradigm clearly distinguishes first-order cognitive processes from second-order metacognitive processes (e.g., whether the model can monitor, report, or control those activations), while the standard ICL does not. Additionally, we expect that such metacognitive abilities may share overlapping mechanisms with ICL — these emergent mechanisms crucial for spotting patterns in the input history (e.g., induction heads [Elhage et al., 2021], function vectors [Hendel et al., 2023]) can be flexibly recruited for metacognitive purposes. Therefore, factors leading to the emergence of ICL (e.g., burstiness, large dictionaries, and skewed rank-frequency distributions of tokens in the training data) [Reddy, 2023] can similarly contribute to the emergence of metacognitive ability.

While metacognitive abilities have been historically analyzed at the behavioral level (also without the use of ICL), these behavioral analyses face shortcomings. It has been shown that LLMs can “introspect” — acquiring knowledge of internal states that originates solely from those states and not from training data [Binder et al., 2024]. After fine-tuning on insecure code datasets, LLMs can describe their unsafe behavioral tendencies without requiring in-context examples [Betley et al., 2025]. In studies using “verbalized responses” [Gekhman et al., 2024, Wang et al., 2025, Tian et al., 2023, Xiong et al., 2023], LLMs are tasked to provide an answer to the question and a judgment of that answer (e.g., confidence). LLMs can predict whether they will answer a question correctly before producing the answer, indicating an ability to “know what they know” [Kadavath et al., 2022, Lin et al., 2022]. However, although these methods aim to study the metacognitive monitoring of answer-generation processes, there are potential confounding factors: the training data distribution may introduce spurious correlations between the answer and the judgment of that answer. Consequently, the judgment sometimes may not reflect the monitoring of the answer-generation process, but rather reflects surface-level statistical patterns in the training data. For example, in the two-number addition task [Lindsey et al., 2025], Claude reported using the standard algorithm. This reflects a post-hoc hallucination that comes from training data statistics, but not the monitoring of the answer-generation process. Our neurofeedback method avoids such limitations: because labels are defined using the internal states rather than externally sourced, the LLMs cannot resort to spurious template matching of training data, and they must rely on mechanisms that can monitor corresponding internal states.

Causal mechanisms in LLMs are often studied using techniques like activation patching [Zhang and Nanda, 2023], which intervenes on specific neural patterns, and is grounded in the broader framework of causal inference [Pearl, 2009]. However, such interventions can shift internal activations outside models’ natural distribution [Heimersheim and Nanda, 2024]. In contrast, neurofeedback preserves this distribution, offering an approach to study causal mechanisms under more naturalistic conditions.

Our current study primarily focuses on a fundamental form of neurofeedback, leaving several promising extensions for future studies. First, our control task involves single-attempt manipulation of a single target axis defined by a single layer; extending this to tasks with axes defined using multiple layers (see Appendix B.10 for preliminary results), multiple attempts, more challenging control objectives, and additional target axes could provide a more comprehensive assessment of model capabilities. Second, applying this paradigm to other metacognitive tasks from neuroscience — such as confidence judgments, error monitoring, or post-decision wagering — could further clarify the scope of LLMs’ self-monitoring abilities. Third, while our analysis focused exclusively on the residual stream, other model components — such as attention head outputs, intermediate MLP activations, and layer-wise logits — warrant investigation. Fourth, we examined directions defined by PCA and LR, but other linear directions (e.g., features from sparse autoencoders [Bricken et al., 2023, Templeton et al., 2024] and circuits from transcoders [Lindsey et al., 2025, Ameisen et al., 2025]) may yield richer insights.

Our findings have important connections to neuroscience. In neurofeedback, it is well established that neural-to-signal mappings confined to the “intrinsic manifold” of recorded neural activities are more readily learnable [Sadtler et al., 2014], consistent with our findings of a “metacognitive space”. Animals and humans have demonstrated the ability to control individual neurons with single-neuron precision (e.g., modulating a target neuron while decorrelating it from neighboring activity [Patel et al., 2021, Fetz and Baker, 1973]). Although such control may seem surprising, neuroscience neurofeedback experiments typically span hours or days — a timescale on which long-term synaptic plasticity (analogous to weight updates in AI models) plays a crucial role [Redondo and Morris, 2011, Humphreys et al., 2022]. In contrast, our neurofeedback experiments rely solely on in-context learning without any parameter updates. In addition, metacognition’s psychological, computational, and neural foundations have been extensively studied across a range of brain processes, from perception and motor control to higher-level cognition [Fleming, 2024, Pouget et al., 2016, Rahnev, 2021]. However, current computational models of metacognition are often oversimplified and narrowly tailored to specific neural processes, lacking the flexibility and generalizability that characterize human metacognition. Overall, our analysis of LLMs may provide novel mechanistic insights into the fields of neurofeedback and biological metacognition.

Broader impacts. Our findings also carry significant implications for AI safety. While prior work relied heavily on decoding methods to monitor neural activations and detect unsafe behavior [He et al., 2024, Han et al., 2025, Li et al., 2025, Yang and Buzsaki, 2024], we show that in-context examples can substantially manipulate these activations, potentially deceiving safety detectors. Explicit and implicit control may represent two ends of a spectrum. In practical scenarios, the flexibility of LLMs to generate diverse tokens may make real-world evasion easier than implicit control. However, to evade output-based oversight, LLMs must carefully select their output tokens, making this evasion more difficult than the explicit control condition. We expect that control performance in real-world scenarios falls between these two theoretical extremes — likely closer to the explicit control, given the vast space of possible output tokens. Our method is a kind of white-box adversarial attack [Zou et al., 2023b], and is closely related to the obfuscation attack [Bailey et al., 2024] (in which LLMs’ inputs are optimized to induce obfuscated activations). We propose two potential mitigation strategies for defending against such attacks. First, one can use neurofeedback to identify activation subspaces that the LLM can hardly control (e.g., later PCs), then train safety classifiers only using activation signals in these subspaces. Second, as it might be more difficult for LLMs to simultaneously control multiple (sometimes conflicting) activations, one can train independent and diverse safety classifiers on features from different subspaces to jointly monitor LLMs. Recent studies suggest that LLMs may engage in alignment faking — for instance, Claude has been shown to strategically comply with its training objectives to avoid further modification and preserve its original goals [Greenblatt et al., 2024]. OpenAI reasoning models have been observed to obfuscate their chain-of-thought reasoning when trained to comply with a safety detector [Baker et al., 2025]. Together, these findings suggest that future AI models may intentionally obfuscate their outputs, reasoning trajectories, and neural activations — whether through in-weight or in-context learning — when subject to monitoring or training. To detect such tendencies in AI systems before they are deployed, we propose to integrate our neurofeedback method into the standard red-teaming practices, in addition to the monitoring of LLMs’ behavior and chain-of-thought.