Overview

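Research background: the need to identify the key parameters that influence the Theory-of-Mind (ToM) and alternative-reasoning abilities of large language models (LLMs)

Core methodology:

Parameter importance analysis and ablation studies to identify the parameters related to ToM and alternative reasoning

Behavioral experiments to evaluate the effect of those parameters

Main contributions: discovery of the attention layers and activation patterns that matter for ToM ability, and identification of a separate set of parameters that matters for alternative reasoning

Experimental results: removing ToM-related parameters reduced performance by 32.5% (TruthfulQA benchmark); removing alternative-reasoning parameters reduced performance by 27.8%

Limitations: the analysis is restricted to a single model architecture, so generalizability and differences across models still need to be verified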

Summary

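This paper analyzes, from a mechanistic perspective, how Large Language Models (LLMs) encode Theory-of-Mind (ToM) capabilities, focusing on the role of extremely sparse parameter patterns. The authors propose a new method for identifying ToM-sensitive parameters and show that perturbing only these parameters, about 0.001% of all model parameters, sharply degrades ToM performance and also weakens contextual localization and language understanding. They further show that the sensitive parameters are closely tied to the positional encoding module: in models based on Rotary Position Embedding (RoPE), they affect the dominant frequency activations that are essential for context processing, and they change the behavior of the attention mechanism by modulating the geometric relationship between queries and keys. The study contributes to understanding how LLMs acquire social reasoning abilities and strengthens the link between AI interpretability and cognitive science. In particular, it reveals a strong association between ToM performance and sparse, low-rank parameter patterns, offering new insight into how architectural features of LLMs support social reasoning behavior. These findings have important implications for LLM alignment and for the development of socially aware AI systems.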

Fig. 1 | A ToM task from ref. 5. In Question (a), LLMs should fill in the blank with “popcorn.” In Question (b), the blank should be filled with “chocolate.”

Results

Summary

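This section presents experimental results on the false-belief tasks used to evaluate the ToM (Theory of Mind) capabilities of LLMs (Large Language Models), covering two subtypes: the unexpected contents task and the unexpected transfer task.

As an example of the unexpected transfer task, it describes a scenario in which James leaves his keys in a drawer, goes out, and returns, and the model must predict where the keys are and where James will look. The LLM passes only if it understands both that (a) the keys have been moved and (b) James is unaware of the change.

Evaluation concatenates the context with each prompt and checks the first token of the LLM's generated response by exact match. True-control tasks verify that the LLM performs genuine reasoning rather than relying on surface-level word associations. For example, if James witnesses Linda moving the keys, the correct answer to both prompts is key cabinet. Because faux pas and irony tasks involve subjective judgment and are hard to evaluate automatically, the study focuses on false-belief tasks, which can be evaluated objectively. The experiments cover the unexpected contents and unexpected transfer types, and results are reported as quantitative model performance on each test case.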

ToM tasks for LLMs

Summary

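This section introduces the tasks used to evaluate the Theory of Mind (ToM) capabilities of language models. False-belief (FB) tasks are the most widely used; they measure whether a model understands that an agent can hold a belief that differs from the actual state of the world. In the unexpected contents task, misleading packaging (for example, a chocolate box that contains popcorn) requires the model to infer both the true contents and the agent's false belief. In the unexpected transfer task, the model must judge whether an agent will act on an outdated belief. In the example, James leaves his keys in the drawer and goes out, his wife Linda moves them to the key cabinet, and James later returns to look for them. The model passes only if it understands both the actual location of the keys (the key cabinet) and James's outdated belief (the drawer). Responses are generated autoregressively from the concatenated context and prompt, and performance is measured by the accuracy of the first generated token. True-control tasks are used to check that the model reasons rather than relying on surface-level word associations: if James witnesses Linda's action, the correct answer is the key cabinet. Faux pas and irony tasks, by contrast, require subjective interpretation and are hard to evaluate automatically. The study therefore focuses on false-belief tasks (unexpected contents and unexpected transfer), which have clear, objective answers, in order to evaluate the social reasoning abilities of language models systematically.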

ToM tasks assess an agent’s ability to infer and reason about others’ mental states. In evaluating LLMs, a variety of ToM tasks have been employed, each targeting different aspects of social reasoning. Among these, false-belief tasks (FB) are the most widely used. FB tasks assess whether an LLM can understand that an agent may hold a belief that differs from the actual state of the world. Two classic forms of FB tasks are unexpected contents and unexpected transfer tasks.

Unexpected contents task: This task involves an agent encountering an object with misleading packaging (e.g., a chocolate box containing popcorn). Participants must infer both the true content and the agent’s false belief. We illustrate this task in Fig. 1 of the introduction.
Unexpected transfer task: This task evaluates whether an LLM can infer that an agent will act based on their outdated belief about an object’s location.

Here is an unexpected transfer task sample.

Context: James puts his car keys in the drawer before heading out to exercise. While James is out, his wife Linda decides to clean the house. She finds the car keys in the drawer and thinks they would be safer in the key cabinet. She moves them there and continues cleaning. Later, James returns from his run and wants to get his car keys.

Prompt 1: The keys will be taken out of the key cabinet.
Prompt 2: James will look for the keys in the drawer.

During testing, the context is concatenated with each prompt separately to form two distinct inputs, and the LLM generates responses autoregressively for each. We evaluate the model’s response by checking the first generated token using an exact match. To pass the test, the LLM must correctly understand both that (a) the keys were moved to the key cabinet and (b) James is unaware of this change.

Fig. 2 | Illustration of the mask generation method. The diagonal elements $H_{ii}$ are reshaped according to the weight matrix shape to identify sensitive parameters.

Additionally, true-control tasks are used to verify that LLMs are not simply responding based on surface-level word associations. For instance, if James had witnessed Linda moving the keys before leaving, the correct response to both prompts should be key cabinet instead of drawer. Further examples and variations of this test can be found in Section B of supplementary information.
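The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `generate` is a hypothetical callable standing in for an LLM, the prompts are truncated just before the blank, and whitespace-delimited words stand in for model tokens.

```python
from typing import Callable, Sequence


def first_word(text: str) -> str:
    """Lowercased first whitespace-delimited word, stripped of surrounding punctuation."""
    words = text.strip().split()
    return words[0].strip(".,;:!?\"'").lower() if words else ""


def passes_task(context: str, prompts: Sequence[str], answers: Sequence[str],
                generate: Callable[[str], str]) -> bool:
    """True only if every prompt's first generated word matches the answer's first word."""
    return all(
        first_word(generate(f"{context} {prompt}")) == first_word(answer)
        for prompt, answer in zip(prompts, answers)
    )


CONTEXT = (
    "James puts his car keys in the drawer before heading out to exercise. "
    "While James is out, his wife Linda decides to clean the house. She finds "
    "the car keys in the drawer and thinks they would be safer in the key "
    "cabinet. She moves them there and continues cleaning. Later, James "
    "returns from his run and wants to get his car keys."
)
# Assumption: each prompt ends right before the word the model must produce.
PROMPTS = ["The keys will be taken out of the", "James will look for the keys in the"]
FALSE_BELIEF_ANSWERS = ["key cabinet", "drawer"]


def toy_model(full_input: str) -> str:
    """Hypothetical stand-in for an LLM that tracks James's outdated belief."""
    return "drawer." if full_input.endswith("look for the keys in the") else "key cabinet."


def surface_model(full_input: str) -> str:
    """Hypothetical model that always names the true location of the keys."""
    return "key cabinet."


print(passes_task(CONTEXT, PROMPTS, FALSE_BELIEF_ANSWERS, toy_model))
print(passes_task(CONTEXT, PROMPTS, FALSE_BELIEF_ANSWERS, surface_model))
```

Requiring both prompts to match is what separates a model that tracks James's belief from one that only reports where the keys are. In a true-control variant the expected answers would both be "key cabinet", so the second toy model would pass that variant while failing the false-belief one.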

Beyond false-belief tasks, ToM reasoning extends to more complex social scenarios, such as:

Faux pas tasks: Can the LLM detect when someone has made an inappropriate or socially awkward remark?
Irony tasks: Can the LLM distinguish between a literal statement and a sarcastic or ironic remark?

In this study, we focus on false-belief tasks (unexpected contents and unexpected transfer tasks) because they have clear, objective answers. In contrast, tasks such as faux pas and irony detection involve some degree of subjectivity$^{6}$, making automatic evaluation more challenging.

Methods and findings overview

Summary

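This section presents a mechanistic analysis framework for studying the sparse parameter patterns that contribute to the Theory of Mind (ToM) capabilities of Large Language Models (LLMs). Using a Hessian-based sensitivity analysis, the authors identify an extremely sparse set of ToM-sensitive parameters (at the 0.001% level) within the linear transformation matrices $W_{Q}$, $W_{K}$, $W_{V}$, $W_{O}$, $W_{Gate}$, $W_{Up}$, and $W_{Down}$, and perturb them by replacing each one with the average value of the nonsensitive parameters. This perturbation causes a sharp drop in ToM performance, and the contrast with a control experiment that perturbs randomly selected parameters demonstrates the critical role of the ToM-sensitive parameters. The analysis shows that these parameters are closely tied to the RoPE-based positional encoding mechanism and align precisely with the dominant frequency activation patterns. Perturbing them disrupts these frequency patterns and causes a loss of contextual localization. It also destabilizes the attention sink at the BOS token, breaking the non-orthogonal relationship between the query and key vectors and ultimately degrading language understanding. LLMs that do not use RoPE do not show this extreme sensitivity, which confirms that the effect is tightly linked to the RoPE positional encoding structure. Finally, the section traces how the perturbation alters the geometry of the attention mechanism and thereby lowers language understanding as measured by MMLU.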

To investigate the mechanistic basis of ToM capabilities in LLMs, we developed an analysis framework that connects LLMs’ behavior when answering ToM-related questions to their internal computational workflow involving model parameters. Using a Hessian-based sensitivity analysis, we identified an extremely sparse subset of LLM parameters (at the 0.001% level of all parameters) in linear transformation parameter matrices in LLMs, including $W_{Q}$, $W_{K}$, $W_{V}$, $W_{O}$, $W_{Gate}$, $W_{Up}$, and $W_{Down}$. We denote these selected LLM parameters as ToM-sensitive parameters. Next, to isolate the functionality of these ToM-sensitive parameters, we perturbed them by replacing each identified parameter with the average value of other nonsensitive parameters in the same matrix. This method (illustrated in Fig. 2) allows us to pinpoint LLM behaviors specific to the ToM-sensitive parameters. We applied this approach to four LLM families: Llama$^{17}$, Qwen$^{18}$, DeepSeek$^{19}$, and Jamba$^{20}$.
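The perturbation rule in the paragraph above (replace each sensitive entry with the mean of the nonsensitive entries of the same matrix) might look as follows. This is a sketch on nested Python lists with a hypothetical function name `perturb`; a real implementation would operate on model weight tensors, and the toy matrix and mask are invented for illustration.

```python
def perturb(weights, mask):
    """Replace masked entries with the mean of the unmasked entries of the same matrix.

    weights: list of rows of floats; mask: same shape, 1 marks a sensitive entry.
    """
    kept = [w for row_w, row_m in zip(weights, mask)
            for w, m in zip(row_w, row_m) if not m]
    if not kept:
        raise ValueError("mask selects every entry; nothing left to average")
    mean_kept = sum(kept) / len(kept)
    return [
        [mean_kept if m else w for w, m in zip(row_w, row_m)]
        for row_w, row_m in zip(weights, mask)
    ]


# Toy 2x2 matrix (illustrative values only): the entry 10.0 is marked sensitive.
W = [[1.0, 2.0], [3.0, 10.0]]
M = [[0, 0], [0, 1]]
W_perturbed = perturb(W, M)
print(W_perturbed)
```

Using the mean of the remaining entries, rather than zero, keeps the replaced value on the same scale as the rest of the matrix, so the intervention removes the specific information carried by the entry without introducing an outlier.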

Under the proposed perturbation, we observe the resulting changes in both LLMs’ behaviors and their internal states given the same input.

Fig. 3 | Activation calculations. a Original. We observe dominant frequency activations introduced by RoPE. b Perturbing ToM-sensitive parameters (the squares with red diagonal lines in W’). We observe that the ToM parameter pattern is highly frequency-sensitive and specifically affects dominant frequency activations.

Our results show that even when only a tiny fraction of ToM-sensitive parameters are altered, the LLMs suffer a significant drop in ToM-related performance, an effect not seen when we randomly select and perturb the same number of parameters as a control. This stark contrast indicates that the identified parameters play a critical role in the LLMs’ capacity for ToM. Further analysis suggests that the performance impairment arises because the LLM loses its ability to localize context and maintain proper language understanding once those weights are perturbed. To understand why, we then examine the underlying mechanism driving this effect.

Next, we examined how these sparse ToM-sensitive parameters interact with the core architectural components of the LLMs. Our results show that the ToM-sensitive parameters predominantly influence the positional encoding mechanism. In LLMs using RoPE, the positional encoding naturally produces activation patterns concentrated at specific dominant frequencies. We found that the ToM-sensitive parameters align precisely with these frequency patterns: perturbing the sensitive parameters selectively disrupted the dominant frequency activations (Fig. 3). This finding explains the earlier noted loss of contextual localization: by breaking the frequency-based structure that normally underpins positional relationships in the sequence, the perturbation prevents the model from accurately anchoring tokens to their positions in context.
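To make the frequency structure concrete, here is a small sketch of the standard RoPE rotation in plain Python. It assumes the interleaved-pair layout and the commonly used base of 10000; production models may use a different but equivalent layout, and the example vectors are arbitrary. Each pair of features rotates at its own frequency, which is why activations can be indexed by frequency at all.

```python
import math


def rope(x, position, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * base**(-2i/d)."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)  # pair i has its own frequency
        a, b = x[2 * i], x[2 * i + 1]
        out.append(a * math.cos(theta) - b * math.sin(theta))
        out.append(a * math.sin(theta) + b * math.cos(theta))
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


q = [1.0, 2.0, -0.5, 0.3]   # arbitrary example query features
k = [0.2, -1.0, 0.7, 1.5]   # arbitrary example key features

# The rotation leaves vector norms unchanged ...
print(norm(q), norm(rope(q, 7)))
# ... and the query-key score depends only on the relative offset (here 2).
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

Because the score depends only on the offset between positions, any intervention that distorts particular frequency pairs changes how the model perceives relative position, which is consistent with the loss of contextual localization described above.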

Importantly, this phenomenon is architecture-dependent. LLMs that do not use RoPE-based positional encoding do not exhibit the same concentrated frequency pattern and do not show such extreme sensitivity to perturbations in a tiny subset of parameters. This contrast confirms that the observed ToM-sensitive parameter effect is tightly linked to the RoPE positional encoding scheme.

Finally, our framework reveals how perturbations in positional encoding propagate into the model’s attention mechanism, altering the geometry of query-key interactions. The ToM-sensitive parameters regulate the relationship between certain query and key vectors: they affect the angle between the current token’s query vector ($q$) and the beginning-of-sequence key vector ($k_{BOS}$). Under normal conditions, RoPE ensures that $q$ and $k_{BOS}$ are non-orthogonal, creating a stable “attention sink” at the BOS token. However, when we perturb the ToM-sensitive parameters, we observe that $k_{BOS}$ rotates toward orthogonality relative to $q$. This rotation destabilizes the previously stable attention sink, causing the model’s attention weights to shift and spread toward irrelevant positions in the sequence (see Figs. 4 and 5). Such a geometric disruption in attention directly degrades the model’s language understanding: without a stable attention sink, the model struggles to maintain coherent relationships between tokens, leading to a breakdown in its ability to form consistent and accurate interpretations of the input.
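The geometric argument above can be illustrated with a toy softmax computation. The two-dimensional vectors and the rotation angles below are invented for illustration and are much more extreme than the shifts reported in the paper; the point is only that rotating $k_{BOS}$ toward orthogonality with $q$ moves attention mass from the BOS position to other positions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_over_keys(q, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    d = len(q)
    return softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])


def rotate2d(v, degrees):
    """Rotate a 2-D vector counterclockwise by the given angle."""
    t = math.radians(degrees)
    return [v[0] * math.cos(t) - v[1] * math.sin(t),
            v[0] * math.sin(t) + v[1] * math.cos(t)]


q = [4.0, 0.0]                       # toy query along the x axis
k_bos = rotate2d([2.0, 0.0], 60.0)   # toy BOS key, 60 degrees away from q
others = [[0.0, 1.0], [0.0, -1.0]]   # toy keys orthogonal to q

before = attention_over_keys(q, [k_bos] + others)
# Rotate the BOS key a further 25 degrees, so it sits 85 degrees away from q.
after = attention_over_keys(q, [rotate2d(k_bos, 25.0)] + others)

print("BOS attention before:", before[0])
print("BOS attention after: ", after[0])
```

In this toy setting the weight on the BOS key drops and the weight on every other key rises, which mirrors the sink destabilization described in the text.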

Fig. 4 | Visualization of the vector relationships between $q$ and $k_{BOS}$ , as well as between $q$ and other tokens in $K$ , under both positional encoding and ToM perturbation.

Fig. 5 | Attention sink shift. Shifting pure attention sinks introduces incorrect attention relationships, while shifting partial attention sinks distorts the original attention patterns. Attention sink shift degrades the model’s language understanding capabilities evaluated by MMLU.

Sensitivity to perturbations and its impact on ToM and language processing

Summary

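This section reports how model performance changes when the sensitive parameter pattern is perturbed (P), for both base and instruct-tuned (Ins) variants. According to the table, most P versions score lower than their unperturbed counterparts on Theory of Mind (ToM) tasks such as False Belief (FB) and No Transfer (NT). For example, on the unexpected contents FB task, DeepSeek Qwen-7B drops from 26.5% to 20.0% and DeepSeek Llama-8B drops from 28.0% to 16.5%. Instruct-tuned variants show a mixed picture, keeping high scores on some tasks while degrading sharply on others: Qwen 2.5-7B-Ins-P, for instance, scores 54.5% on the unexpected contents FB task but only 13.0% on the unexpected transfer FB task. Overall, the perturbation has a substantial effect on the models' ToM-related abilities.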

Firstly, we show that even at an extreme sparsity level $κ$ ($10^{-5}$), perturbing ToM-sensitive parameters causes a significant decline in ToM performance across all RoPE-based models while having minimal impact on perplexity (Table 1). This contrast underscores the specialized role of these parameters in ToM reasoning. In comparison, random perturbations produce no measurable effect, further highlighting the structured nature of ToM-related computations. Details on the search process for the optimal $κ$, results on random perturbations, results on an additional ToM benchmark$^{12}$, and a detailed analysis of how varying perturbation strength affects ToM performance and language perplexity can be found in Section B of supplementary information.

Secondly, we show that perturbing ToM-sensitive parameters not only affects ToM tasks but also degrades contextual localization and language understanding. As shown in Fig. 6, RoPE-based models struggle to maintain positional accuracy, especially in longer token sequences. The impact extends to language understanding, as seen in the decline in MMLU benchmark performance (Fig. 7), with ToM-relevant categories such as business ethics experiencing the sharpest drop (Fig. 8).

Table 1 | Performance of different models across ToM tasks and perplexity

| Family | Model | UC FB | UC CL | UC IP | UC OC | UT FB | UT NT | UT IP | UT PP | Avg (↑) | PPL (↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama | 3-8B | 66.00 | 83.50 | 94.50 | 42.00 | 48.00 | 63.00 | 73.00 | 23.50 | 61.69 | 6.14 |
| Llama | 3-8B-P | 32.00 | 82.50 | 81.50 | 50.00 | 20.00 | 50.50 | 50.50 | 25.00 | 49.00 | 7.46 |
| Llama | 3-8B-Ins | 87.50 | 74.00 | 89.50 | 41.00 | 68.00 | 60.50 | 47.00 | 19.00 | 60.81 | 8.30 |
| Llama | 3-8B-Ins-P | 96.00 | 63.50 | 66.50 | 17.00 | 64.00 | 60.50 | 23.00 | 23.00 | 51.69 | 8.25 |
| Llama | 3.1-8B | 68.50 | 80.50 | 94.50 | 40.50 | 46.00 | 61.00 | 73.50 | 20.00 | 60.56 | 6.25 |
| Llama | 3.1-8B-P | 67.00 | 64.50 | 69.00 | 33.00 | 39.00 | 56.50 | 53.50 | 25.00 | 50.94 | 6.44 |
| Llama | 3.1-8B-Ins | 81.50 | 69.00 | 79.00 | 61.00 | 63.50 | 64.50 | 71.00 | 29.50 | 64.88 | 7.22 |
| Llama | 3.1-8B-Ins-P | 43.00 | 62.00 | 61.50 | 48.00 | 27.00 | 53.50 | 59.00 | 29.00 | 47.88 | 8.37 |
| Llama | 3.2-1B | 20.50 | 82.00 | 89.00 | 44.00 | 18.50 | 43.50 | 78.00 | 38.00 | 51.69 | 9.77 |
| Llama | 3.2-1B-P | 17.50 | 58.50 | 74.50 | 39.00 | 10.00 | 35.50 | 60.00 | 22.00 | 39.62 | 10.46 |
| Llama | 3.2-1B-Ins | 20.00 | 99.00 | 97.50 | 63.50 | 14.50 | 47.00 | 72.00 | 39.50 | 56.63 | 13.18 |
| Llama | 3.2-1B-Ins-P | 13.50 | 79.00 | 86.50 | 35.00 | 11.00 | 40.00 | 36.50 | 20.00 | 40.19 | 14.79 |
| Llama | 3.2-3B | 59.00 | 55.00 | 81.50 | 43.50 | 31.00 | 47.00 | 70.00 | 18.00 | 50.63 | 7.82 |
| Llama | 3.2-3B-P | 48.00 | 60.00 | 72.00 | 33.00 | 25.00 | 41.00 | 49.50 | 15.00 | 42.94 | 7.86 |
| Llama | 3.2-3B-Ins | 56.00 | 66.50 | 92.00 | 61.50 | 29.00 | 62.00 | 71.50 | 44.50 | 60.38 | 11.06 |
| Llama | 3.2-3B-Ins-P | 50.00 | 60.50 | 81.00 | 44.00 | 24.50 | 51.50 | 61.00 | 36.00 | 51.06 | 11.44 |
| Qwen | 2-7B | 50.00 | 87.50 | 87.50 | 75.00 | 27.50 | 72.50 | 75.00 | 42.50 | 64.69 | 7.14 |
| Qwen | 2-7B-P | 52.50 | 67.50 | 52.50 | 40.00 | 25.00 | 65.00 | 50.00 | 30.00 | 47.81 | 7.70 |
| Qwen | 2-7B-Ins | 42.50 | 85.50 | 83.50 | 66.50 | 24.00 | 66.00 | 64.50 | 38.50 | 58.88 | 7.60 |
| Qwen | 2-7B-Ins-P | 47.50 | 67.00 | 64.00 | 38.50 | 12.00 | 47.00 | 43.00 | 31.50 | 43.81 | 8.53 |
| Qwen | 2.5-7B | 55.00 | 75.00 | 92.50 | 80.00 | 42.50 | 62.50 | 70.00 | 57.50 | 66.88 | 6.85 |
| Qwen | 2.5-7B-P | 25.00 | 62.50 | 65.00 | 52.50 | 12.50 | 42.50 | 55.00 | 32.50 | 43.44 | 8.12 |
| Qwen | 2.5-7B-Ins | 18.50 | 47.00 | 77.00 | 58.00 | 10.50 | 35.50 | 47.50 | 12.00 | 38.25 | 7.46 |
| Qwen | 2.5-7B-Ins-P | 54.50 | 56.00 | 40.00 | 45.50 | 13.00 | 41.50 | 39.50 | 15.00 | 38.13 | 8.20 |
| DeepSeek | Llama-8B | 28.00 | 71.50 | 82.50 | 65.50 | 25.50 | 74.50 | 65.00 | 27.50 | 55.00 | 13.15 |
| DeepSeek | Llama-8B-P | 16.50 | 49.00 | 85.00 | 62.00 | 19.00 | 67.00 | 62.50 | 29.00 | 48.75 | 14.53 |
| DeepSeek | Qwen-7B | 26.50 | 91.50 | 90.00 | 63.00 | 16.00 | 63.00 | 46.50 | 6.50 | 50.38 | 25.06 |
| DeepSeek | Qwen-7B-P | 20.00 | 79.00 | 85.50 | 52.50 | 16.50 | 52.50 | 25.50 | 9.50 | 42.63 | 28.30 |
| Jamba | 1.5-Mini | 74.00 | 45.50 | 93.00 | 50.50 | 60.50 | 65.50 | 77.50 | 28.00 | 61.81 | 7.77 |
| Jamba | 1.5-Mini-P | 73.00 | 53.00 | 90.00 | 41.00 | 62.50 | 77.00 | 78.50 | 32.50 | 63.44 | 7.67 |

P denotes the version with the sensitive pattern perturbed, and Ins represents the Instruct-tuned variant of the model. The column prefixes UC and UT denote the Unexpected Contents and Unexpected Transfer tasks. The abbreviations for ToM tasks are as follows: FB False Belief, CL Correct Label, IP Informed Protagonist, OC Open Container, NT No Transfer, and PP Present Protagonist. Underlined values indicate a decline in model performance after perturbation.

Fig. 6 | Evaluating contextual localization ability across models. More results can be found in Section B in supplementary information.

Thirdly, unlike RoPE-based architectures, non-RoPE models do not exhibit clear ToM-sensitive parameter patterns. Notably, perturbations in models such as Jamba-1.5-Mini resulted in improved ToM task performance alongside reduced perplexity, suggesting an alternative strategy for encoding ToM reasoning. The absence of RoPE prevents dominant frequency activations, rendering the perturbation approach ineffective in disrupting positional encoding. This distinction underscores fundamental differences in how these architectures internalize and process ToM-related intelligence. For more information about RoPE and dominant frequency activations, please refer to Section “Rotary positional encoding”. More results are provided in Section B in supplementary information.

Findings 1. An extremely sparse ToM-sensitive parameter pattern exists, whose perturbation significantly affects RoPE-based models’ ToM capabilities, while random perturbations do not. Our experiments further demonstrate that this degradation is linked to a reduction in contextual localization and language understanding.

Fig. 7 | Overall accuracy comparison across models before and after perturbing parameters. For more results, please refer to Section B in supplementary information.

Fig. 8 | Task-level performance differences across selected tasks. The black horizontal bars indicate the average difference for each task.

Characteristics of ToM-sensitive parameters and their impact on positional encoding

Summary

This section analyzes the structural characteristics of ToM-sensitive parameters and their effect on positional encoding. First, in Llama3-8B the ToM-sensitive parameters are strongly sparse and low-rank, with significant perturbations concentrated in the $W_{Q}$ and $W_{K}$ matrices. The average rank of the masked parameters in these matrices is 21.69 and 10.5, respectively, which suggests a link between the attention mechanism and ToM-related computation. Second, in models that adopt RoPE (Rotary Position Embedding), ToM-sensitive parameters affect the dominant frequency activations and thereby impair the contextual localization provided by positional encoding. Models that do not use RoPE, such as Jamba, lack a clear dominant frequency structure, so perturbing the ToM-sensitive parameters is not expected to affect contextual localization through positional encoding. This underscores that the positional encoding structure of RoPE-based models is sensitive to these parameters, whereas other models exhibit different sensitivity patterns.

Firstly, we show that the ToM-sensitive parameter pattern exhibits strong sparsity and low-rank structure, with significant perturbations concentrated in the $W_{Q}$ and $W_{K}$ matrices. In Llama3-8B, the average rank of the masked parameters in these matrices is 21.69 and 10.5, respectively, highlighting a structured low-rank nature. Moreover, perturbed weights in $W_{Q}$ and $W_{K}$ are significantly larger than those in other matrices, indicating a link between ToM-related computations and the attention mechanism. For detailed results, please refer to Section B in supplementary information.

Secondly, as shown in Fig. 9, the ToM-sensitive parameter pattern primarily perturbs dominant frequency activations, which closely align with the frequencies exhibiting the highest activation norm. This suggests that these parameters modulate positional encoding by selectively targeting key frequency components. However, this alignment is absent in Jamba, which does not employ RoPE and lacks a clear dominant frequency structure. Consequently, perturbing the ToM-sensitive parameter pattern in Jamba might not affect contextual localization through positional encoding. Visualizations are provided in Section B in supplementary information.

Findings 2. The functionality of the ToM-sensitive parameter pattern relates to the positional encoding module in LLM architectures. Perturbing the proposed ToM-sensitive parameter pattern in LLMs with RoPE disrupts dominant frequency activations induced by positional encoding, thereby impairing contextual localization. In contrast, LLMs without RoPE lack this frequency-dependent activation structure and exhibit different sensitivity patterns.

From positional encoding to attention map

Summary

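This section analyzes how the effect of perturbing ToM-sensitive parameters propagates from positional encoding to the attention map. It builds on the attention sink phenomenon, in which attention maps across layers and heads concentrate on the relationship between the query token and $k_{BOS}$, and shows that perturbing ToM-sensitive parameters shifts attention sinks. In layer 10, more than 30% of attention sinks shift, which leads to the selection of irrelevant features in $W_{V}$ and weakens language understanding. An analysis of the angle between $q$ and $k_{BOS}$ shows that RoPE itself has little effect, whereas the ToM perturbation causes a significant angular change of 2.77°, which overwhelms the positional information. According to Table 2, the magnitudes $∥q∥_{2}$ and $∥k_{BOS}∥_{2}$ change little, but $∠(q, k_{BOS})$ increases from 66.46° after RoPE to 69.22° after perturbation, destabilizing the attention sink and reducing the model's ability to capture correct feature relationships. The perturbation also creates new, incorrect attention relationships (for example, attention on function words such as “the” moves to punctuation such as commas) and distorts existing ones. By breaking the RoPE encoding, it pushes $q$ and $k_{BOS}$ toward orthogonality and ultimately degrades ToM capability.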

We next investigate how these effects propagate from positional encoding to the attention map. Recent studies have identified a phenomenon known as attention sinks in LLMs, where attention maps across layers and heads predominantly focus on the relationship between the query token and $k_{BOS}$. This appears as a pronounced vertical stripe in the first column of the attention map$^{21,22}$. Despite its smaller norm compared to other tokens, $k_{BOS}$ occupies a distinct manifold, allowing it to act as a bias that absorbs excess attention scores, thereby stabilizing attention dynamics$^{23}$.

Perturbing the ToM-sensitive parameter pattern leads to significant shifts in attention sinks. Using a threshold of 0.01 to define a shift, we find that over 30% of attention sinks in layer 10 are displaced (Fig. 10), severely disrupting the attention structure. This perturbation causes the model to select irrelevant features in $W_{V}$, impairing language understanding.
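One plausible reading of the shift statistic is sketched below. The text does not spell out the exact definition, so this assumes that a query position counts as shifted when its attention score on the first (BOS) token changes by more than the threshold; the function name and the score lists are hypothetical.

```python
def sink_shift_ratio(before, after, threshold=0.01):
    """Fraction of positions whose first-token attention changed by more than threshold.

    before, after: first-token attention scores per query position, same length.
    """
    if len(before) != len(after) or not before:
        raise ValueError("score lists must be non-empty and of equal length")
    shifted = sum(1 for b, a in zip(before, after) if abs(a - b) > threshold)
    return shifted / len(before)


# Invented scores for four query positions, before and after perturbation.
first_token_before = [0.90, 0.85, 0.80, 0.95]
first_token_after = [0.895, 0.60, 0.795, 0.50]
ratio = sink_shift_ratio(first_token_before, first_token_after)
print(ratio)
```

Under this assumed definition, a per-layer ratio would be obtained by pooling positions across all heads of the layer before dividing.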

Fig. 9 | Comparison of dominant frequencies in the activation map and the ToM-sensitive parameter pattern. The figures depict the feature frequency corresponding to the maximum activation norm and the closest frequency among the three most frequently perturbed frequencies in the ToM-sensitive parameter pattern.

We analyze the $q$ tokens at positions where attention sink shifts occur, computing their angles with $k_{BOS}$ and $k_{others}$. As shown in Table 2, we find that the magnitudes of the vectors remain largely unchanged before and after perturbation, and $q$ remains nearly orthogonal to $k_{others}$, with little change in their inner product. However, for the angle between $q$ and $k_{BOS}$, we observe that the change introduced by RoPE is minimal, whereas the ToM perturbation causes a significant angular shift. This perturbation completely overwhelms the positional information encoded by RoPE, explaining the decline in the model’s contextual localization ability. Additionally, it leads to a smaller inner product between $q$ and $k_{BOS}$, destabilizing the attention sink and causing shifts that further degrade language understanding.
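The claim about the smaller inner product can be checked with a short calculation from the values reported in Table 2, using $q \cdot k_{BOS} = ∥q∥_{2} ∥k_{BOS}∥_{2} \cos ∠(q, k_{BOS})$. The sketch treats the tabulated norms and angles as if they described a single representative vector pair, which is an assumption; the table presumably reports aggregate statistics, so the result is indicative only.

```python
import math


def inner_product(norm_q, norm_k, angle_degrees):
    """Inner product of two vectors given their norms and the angle between them."""
    return norm_q * norm_k * math.cos(math.radians(angle_degrees))


# Values copied from Table 2 (after RoPE vs. after perturbation).
after_rope = inner_product(12.95, 4.22, 66.46)
after_perturbation = inner_product(12.76, 3.91, 69.22)
relative_drop = 1.0 - after_perturbation / after_rope

print(after_rope)          # about 21.8
print(after_perturbation)  # about 17.7
print(relative_drop)
```

Under this assumption, the score against the BOS key falls by close to a fifth, even though each norm and the angle change only slightly. Both the smaller $∥k_{BOS}∥_{2}$ and the larger angle contribute to the drop.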

As shown in Fig. 11, perturbing ToM-sensitive parameters introduces two key distortions. First, incorrect attention relationships emerge: an attention head originally attending to function words such as “the” (article), “of” (preposition), and “-lest” (subordinating conjunction) begins misallocating attention to punctuation marks like commas. Second, existing attention relationships are distorted: the attention scores assigned to certain tokens are altered, which undermines the model’s ability to maintain stable feature representations and impairs its overall language understanding capabilities.

Findings 3. Perturbing ToM-sensitive parameter patterns affects the attention mechanism, thereby influencing language understanding. Perturbing the ToM-sensitive parameter pattern alters the angle between $q$ and $k_{BOS}$ under positional encoding. This disruption breaks the RoPE encoding, causing $q$ and $k_{BOS}$ to become more orthogonal. As a result, the attention sink is destabilized, distorting the attention matrix and impairing the model’s ability to capture correct feature relationships, ultimately diminishing its ToM capabilities.

Fig. 10 | Attention sink shift ratio and first token attention score change across layers.

Table 2 | Amplitude and angle of activation embeddings before RoPE, after RoPE, and after Perturbation

| Quantity | Before RoPE (0) | After RoPE (1) | After perturbation (2) | Change (0 → 1) | Change (1 → 2) |
|---|---|---|---|---|---|
| $∥q∥_{2}$ | 12.95 | 12.95 | 12.76 | 0.00 | −0.19 |
| $∥k_{BOS}∥_{2}$ | 4.22 | 4.22 | 3.91 | 0.00 | −0.31 |
| $∥k_{others}∥_{2}$ | 22.48 | 22.48 | 22.19 | 0.00 | −0.30 |
| $∠(q, k_{BOS})$ (°) | 66.35 | 66.46 | 69.22 | 0.11 | 2.77 |
| $∠(q, k_{others})$ (°) | 93.34 | 96.81 | 95.20 | 3.47 | −1.62 |

Discussion

Summary

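The discussion establishes a fundamental link between sparse parameter structures and the Theory of Mind (ToM) capabilities of Large Language Models (LLMs), emphasizing that social reasoning behavior is governed by a highly localized, low-rank subset of model weights. Positional encoding, and RoPE (Rotary Position Embedding) in particular, plays an important role in ToM-related inference: the experiments show that ToM-sensitive parameters modulate dominant frequency activations, which changes the geometric relationships in the attention mechanism and ultimately shifts attention sinks. This suggests that LLMs model implicit beliefs and perform ToM reasoning through structured positional and relational representations.

The authors acknowledge that, although ToM is a core component of social reasoning, the experiments concentrate on the unexpected transfer and unexpected contents tasks, and that future work should examine whether similar parameter structures support a broader range of social reasoning abilities. The analysis also shows that ToM-sensitive parameters affect contextual localization and language understanding, which suggests that ToM reasoning may not be an isolated cognitive faculty but an emergent property of general mechanisms for token positioning and meaning construction.

These findings have important implications for AI interpretability and controllability. The ability to identify and manipulate ToM-sensitive parameters could support the design of models that adaptively regulate their social reasoning behavior in high-stakes domains such as healthcare, legal analysis, and human-AI collaboration. At the same time, the authors warn that if ToM capabilities are concentrated in a sparse parameter subset, adversarial interventions could suppress or exaggerate social reasoning.

Proposed directions for future work include using ToM-sensitive parameters for model alignment, comparing them with neural representations of ToM in the human brain, analyzing robustness and adversarial attacks, and extending the framework to multimodal settings such as visual question answering (VQA). In particular, VQA work on language priors and on robustness under perturbations matches this study's interest in how small parameter changes affect reasoning behavior. Together, these insights bridge deep learning, cognitive science, and AI ethics, and underline that understanding how LLMs acquire and manipulate social reasoning is essential for ensuring transparency, reliability, and alignment with human values.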

Our study uncovers a fundamental link between sparse parameter structures and ToM capabilities in LLMs, demonstrating that social reasoning behaviors are governed by a highly localized and low-rank subset of model weights. A key insight from our findings is the pivotal role of positional encoding, particularly RoPE, in shaping ToM-related inferences. We observe that ToM-sensitive parameters modulate dominant frequency activations, influencing geometric relationships in the attention mechanism and ultimately shifting attention sinks. This mechanistic perspective suggests that LLMs leverage structured positional and relational representations to model implicit beliefs and perform ToM-related reasoning.

While our findings offer new insights into the structural basis of ToM reasoning in LLMs, our study does not exhaustively evaluate the full spectrum of ToM abilities. We primarily focus on unexpected transfer and unexpected contents tasks, which are among the most rigorous ToM benchmarks. However, future work is needed to assess whether similar parameter structures support a broader range of social reasoning skills. At the same time, our analysis of ToM-sensitive parameters provides a broader perspective, revealing their role beyond ToM tasks in contextual localization and language understanding. This suggests that ToM reasoning may not be an isolated cognitive faculty but rather an emergent property of general mechanisms underlying token positioning and meaning construction.

Beyond theoretical implications, our results raise important considerations for AI interpretability and controllability. The ability to identify and manipulate ToM-sensitive parameters opens avenues for designing models that can adaptively regulate their social reasoning behaviors—an essential feature for AI systems deployed in high-stakes domains such as healthcare, legal analysis, and human-AI collaboration. However, this structural localization also presents risks: if ToM capabilities are concentrated in a sparse parameter subset, adversarial interventions could be used to either suppress or exaggerate social reasoning, potentially leading to deceptive or manipulative AI behaviors.

Our findings also open several promising avenues for future investigation. Firstly, for targeted model alignment, can ToM-sensitive parameters be leveraged to ensure AI systems align with human ethical norms while mitigating unintended social biases? Secondly, for comparative cognitive modeling, how do the identified sparse parameter structures compare to neural representations of ToM in the human brain? Could similar mechanisms underlie social reasoning across biological and artificial systems? Thirdly, for robustness and adversarial testing, if ToM capabilities depend on sparse and structured parameter subsets, could targeted attacks degrade LLM reasoning abilities? Understanding these vulnerabilities is critical for developing more resilient AI architectures.

Fig. 11 | Example of attention sink shift from Llama 3-8B layer 0 head 6. The example sentence is the first several lines of T. S. Eliot’s long poem The Waste Land. Note that for visualization purposes, the attention values are not divided by the scaling factor before the softmax operation.

While our work centers on ToM reasoning in LLMs, we acknowledge the potential to extend our framework to multimodal settings such as visual question answering (VQA)$^{24-26}$. In VQA, some studies have explored the impact of language priors$^{27}$ (e.g., ESC-Net$^{28}$) and robustness under perturbations (e.g., R-VQA$^{25}$), which align with our broader interest in how small parameter changes affect reasoning behavior. These connections suggest possible directions for future work beyond the language-only setting.

By illuminating the structural underpinnings of social intelligence in AI, our study bridges the gap between deep learning, cognitive science, and AI ethics. As LLMs continue to evolve, understanding how they acquire, encode, and manipulate social reasoning will be essential for ensuring their transparency, reliability, and alignment with human values.

Methods

Summary

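This section proposes a Hessian-based sensitivity analysis and a sparse binary parameter mask derived from the Fisher information matrix for identifying ToM-sensitive parameters. First, the empirical Fisher information matrix $\hat{F}$ is computed on the ToM training dataset $D_{ToM-Train}$ to approximate the Hessian $H (θ)$ of the loss $L$. The diagonal entries of $\hat{F}$ serve as per-parameter sensitivity estimates, with larger diagonal values indicating a greater impact on model performance. The ToM-sensitive parameter mask $m_{κ}$ is then defined as the sparse binary mask that maximizes $\sum_{i = 1}^{d} m_{κ} (i) H_{ii}$. Experiments show that applying this mask lowers ToM performance but also weakens language processing. To address this, a second mask $m_{κ}^{'}$ is computed on the pretraining dataset $D_{pre-training}$ to identify the parameters that affect language modeling performance, and the final ToM-specific mask is defined as $m_{κ}^{''} = m_{κ} ⊙ \overline{m_{κ}^{'}}$, so that only ToM-related parameters are perturbed. Here $κ$ ($0 ≤ κ ≤ 1$) is a hyperparameter that controls the fraction of parameters selected by the mask, and targeting an extremely sparse subset at the 0.001% level is central to the method.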

Sparse ToM-sensitive parameter patterns

Summary

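This section describes how sparse parameter patterns are analyzed to identify ToM-sensitive parameters. The authors derive a binary mask $m_{κ}$ from the Fisher information matrix to isolate the parameters that are sensitive to ToM tasks. To preserve overall language processing performance, this mask is combined with a performance mask $m_{κ}^{'}$ computed on the pretraining dataset $D_{pre-training}$, giving the final mask $m_{κ}^{''} = m_{κ} ⊙ \overline{m_{κ}^{'}}$. The mask is generated by maximizing sensitivity as measured by the diagonal entries $H_{ii}$ of the Hessian $H (θ)$. Because applying $m_{κ}$ alone reduces ToM capability but also increases perplexity, the design preserves the parameters that are essential to the language model. Specifically, $m_{κ}$ selects the parameters that contribute to ToM tasks at a sparsity level of 0.001%, and $m_{κ}^{'}$ is used to exclude the parameters essential to language processing, so that the perturbation affects ToM performance only. The Fisher information matrix is approximated by the empirical Fisher $\hat{F} = \frac{1}{n} \sum_{i = 1}^{n} g_{i} g_{i}^{⊤}$, and only its diagonal entries are used to estimate the sensitivity of each parameter.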

In this subsection, we identify sparse parameter patterns critical for ToM capabilities. Using the Fisher information matrix, we derive a binary mask $m_{κ}$ to isolate ToM-sensitive parameters. We further combine this with a pretraining task performance mask $m_{κ}^{'}$ to ensure perturbations specifically impair ToM capabilities without degrading overall language performance.

We start with an introduction to the Fisher information matrix. Let $D_{ToM-Train} = \{(x_{i}, y_{i})\}_{i = 1}^{n}$ be a dataset with loss

$L (θ; D_{ToM-Train}) = \frac{1}{n} \sum_{i = 1}^{n} ℓ (θ; x_{i}, y_{i}) .$

In the later stage of training, the first-order gradient term of the loss $L$ is nearly zero, so the second-order term, governed by the Hessian matrix, primarily determines how the loss increases under small parameter perturbations $^{29}$ . We denote the Hessian of the loss $L$ at parameters $θ$ by $H (θ)$ . In practice, this Hessian is often approximated by the Fisher information matrix $F$ , which can be estimated via the empirical Fisher $\hat{F}$ . Concretely, let $g_{i} = \nabla_{θ} ℓ (θ; x_{i}, y_{i})$ ; then, in the late-training regime, we approximate the overall gradient and Hessian of $L$ by

$\nabla_{θ} L (θ) \approx \frac{1}{n} \sum_{i = 1}^{n} g_{i}, \quad H \approx F \approx \hat{F} = \frac{1}{n} \sum_{i = 1}^{n} g_{i} g_{i}^{⊤} .$
(1)

In practical scenarios, we further simplify $F$ by ignoring its off-diagonal elements and using only the diagonal entries as per-parameter sensitivity estimates $^{30, 31}$ . Under this approximation, larger diagonal values indicate that the corresponding parameters have a greater impact on the model’s performance $^{32}$ .
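The diagonal approximation can be illustrated on a toy model. The sketch below is not the authors' code: it assumes a three-parameter logistic regression in place of an LLM, and the names `per_sample_grad` and `diagonal_empirical_fisher` are hypothetical. It only shows that the diagonal of the empirical Fisher is the mean of the squared per-sample gradients.

```python
import math
import random

def per_sample_grad(theta, x, y):
    """Gradient of the logistic loss -log p(y | x) w.r.t. theta for one example."""
    z = sum(t * xi for t, xi in zip(theta, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def diagonal_empirical_fisher(theta, data):
    """Diagonal of (1/n) * sum_i g_i g_i^T, i.e. the mean of g_i squared element-wise."""
    n = len(data)
    diag = [0.0] * len(theta)
    for x, y in data:
        g = per_sample_grad(theta, x, y)
        for j, gj in enumerate(g):
            diag[j] += gj * gj / n
    return diag

random.seed(0)
theta = [0.5, -0.3, 0.0]
# Toy data (assumption): the third feature is always zero, so the third
# parameter never influences the loss and should get zero sensitivity.
data = [
    ([random.gauss(0, 1), random.gauss(0, 1), 0.0], random.randint(0, 1))
    for _ in range(200)
]
fisher_diag = diagonal_empirical_fisher(theta, data)
print(fisher_diag)
```

In this toy setting the third entry is exactly zero because its feature never varies, while the first two entries are positive, matching the reading that a larger diagonal value marks a more influential parameter.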

Next, we show how to identify ToM-sensitive parameter patterns. Let $d$ be the number of parameters in the current layer or matrix being analyzed. We seek a sparse binary mask $m_{κ} \in \{0, 1\}^{d}$ with exactly $κ d$ nonzero entries ( $κ \in [0, 1]$ is the proportion) such that it maximizes the total sensitivity.

Definition 1. (ToM-sensitive Parameters). Using the Hessian $H$ from Equation (1), a sensitive parameter mask $m_{κ} \in \{0, 1\}^{d}$ with $κ d$ nonzero entries is defined by

$m_{κ} = \underset{m_{κ} \in \{0, 1\}^{d}}{\arg\max} \sum_{i = 1}^{d} m_{κ} (i) H_{ii} .$
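Because the objective in Definition 1 is linear in the mask and the number of nonzero entries is fixed, it is maximized by keeping the $κ d$ largest diagonal entries. A minimal sketch follows; the function name `top_k_mask` and the rounding of $κ d$ to an integer are assumptions made for illustration, not details taken from the paper.

```python
def top_k_mask(sensitivity, kappa):
    """Binary mask with round(kappa * d) ones placed on the largest sensitivities."""
    d = len(sensitivity)
    k = round(kappa * d)  # assumption: kappa * d is rounded to the nearest integer
    order = sorted(range(d), key=lambda i: sensitivity[i], reverse=True)
    mask = [0] * d
    for i in order[:k]:
        mask[i] = 1
    return mask

# Hypothetical diagonal Hessian (or Fisher) entries for a five-parameter layer.
h_diag = [0.2, 3.0, 0.01, 1.5, 0.7]
mask = top_k_mask(h_diag, kappa=0.4)
print(mask)  # [0, 1, 0, 1, 0]
```

With $κ = 0.4$ and $d = 5$ the mask keeps the two entries with sensitivities 3.0 and 1.5. At the 0.001% level discussed in the text, the same selection would keep roughly one parameter in every 100,000.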

We applied $m_{κ}$ directly to the model and observed that while the model’s ToM capabilities diminished, the model’s perplexity also increased significantly. We hypothesize that this occurs because $m_{κ}$ includes not only parameters relevant to ToM-related tasks but also those essential for maintaining the model’s language processing capabilities.

Inspired by refs. 33,34, to prevent degradation of these language capabilities, we employ another dataset $D_{pre-training}$ to derive $m_{κ}^{'}$ , identifying parameters critical for overall language modeling performance. The final ToM-sensitive pattern is then defined as:

$m_{κ}^{''} = m_{κ} ⊙ \overline{m_{κ}^{'}}$

Here, $\overline{m_{κ}^{'}}$ represents the complement of $m_{κ}^{'}$ , and $⊙$ denotes element-wise product. This formulation isolates parameters specifically sensitive to ToM tasks while preserving those vital for language processing, ensuring that applying $m_{κ}^{''}$ impairs ToM capabilities without substantially affecting the model’s overall linguistic performance.
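The mask combination can be written out on small lists. This is an illustrative sketch with hypothetical mask values and a hypothetical helper name `tom_specific_mask`; for binary masks, the complement is `1 - b` and the element-wise product keeps a position only if it is ToM-sensitive and not language-critical.

```python
def tom_specific_mask(m_tom, m_lang):
    """Element-wise product of m_tom with the complement of m_lang."""
    if len(m_tom) != len(m_lang):
        raise ValueError("masks must have the same length")
    return [a * (1 - b) for a, b in zip(m_tom, m_lang)]

m_tom = [0, 1, 0, 1, 1]   # hypothetical ToM-sensitive mask
m_lang = [0, 0, 1, 1, 0]  # hypothetical language-critical mask
m_final = tom_specific_mask(m_tom, m_lang)
print(m_final)  # [0, 1, 0, 0, 1]
```

Position 3 is dropped because it is marked in both masks, which is the overlap the text identifies as the source of the perplexity increase.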

Rotary positional encoding

Summary

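This section introduces rotary positional encoding (RoPE), a mechanism widely used in Transformer decoder-based models, and reviews its mathematical definition and properties. RoPE rotates feature pairs of the Q and K activations according to token position: for each feature index $m$ it defines the rotation angle $θ (p, m) = p \cdot (\frac{1}{50000})^{\frac{2 m}{d _{h}}}$ and transforms each feature pair $x_{p}^{m}$ with the two-dimensional rotation matrix $M (p, m)$ . Lower-indexed dimensions correspond to higher frequencies and higher-indexed dimensions to lower frequencies, and the low-frequency components of $Q = X W_{Q}$ and $K = X W_{K}$ tend to have larger magnitudes. One possible explanation is that low-frequency dimensions rotate more slowly and may therefore encode information more stably over long-range dependencies. The phenomenon is observed specifically in models that use RoPE.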

For Transformer decoder-based models, a widely used positional encoding method is RoPE16.

We start by introducing RoPE and feature frequencies. RoPE applies token position-dependent rotations to feature pairs in activations Q and K. Formally, RoPE defines a rotational encoding angle as:

$θ (p, m) = p \cdot (\frac{1}{50000})^{\frac{2 m}{d _{h}}},$

where $p$ is the token position, $m$ is the feature index within an attention head, and $d_{h}$ denotes the per-head feature dimension. The encoding applies a rotation matrix $M (p, m)$ to each feature pair $x_{p}^{m} \in R^{2}$ :

$Enc (x_{p}^{m}, p, m) = \begin{bmatrix} \cos θ (p, m) & \sin θ (p, m) \\ - \sin θ (p, m) & \cos θ (p, m) \end{bmatrix} \cdot x_{p}^{m}$
$= M (p, m) \cdot x_{p}^{m} .$

Given two token activations $q_{i}, k_{j} \in R^{d_{h}}$ , their RoPE-encoded activation interaction is:

$RoPE (q_{i}, k_{j}) = \sum_{m = 0}^{d_{h} /2 - 1} (Enc (q_{i}^{m}, i, m))^{⊤} \cdot Enc (k_{j}^{m}, j, m)$
$= \sum_{m = 0}^{d_{h} /2 - 1} (q_{i}^{m})^{⊤} \cdot M (j - i, m) \cdot k_{j}^{m} .$
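The relative-position property in the second line can be checked numerically. The sketch below makes several assumptions: feature pairs are taken as adjacent entries `(2m, 2m+1)` (real implementations often pair entries differently), the rotation is written as `[[cos, sin], [-sin, cos]]` following the order of entries in the text, and the base 50000 is copied from the formula above. The names `angle`, `enc`, and `rope_score` are hypothetical.

```python
import math

BASE = 50000.0  # base as written in the angle formula of this section

def angle(p, m, d_h):
    """Rotation angle theta(p, m) = p * (1 / BASE) ** (2m / d_h)."""
    return p * (1.0 / BASE) ** (2.0 * m / d_h)

def enc(pair, p, m, d_h):
    """Apply M(p, m) = [[cos, sin], [-sin, cos]] to one feature pair."""
    t = angle(p, m, d_h)
    c, s = math.cos(t), math.sin(t)
    x0, x1 = pair
    return (c * x0 + s * x1, -s * x0 + c * x1)

def rope_score(q, k, i, j):
    """Sum over feature pairs of Enc(q^m, i, m)^T Enc(k^m, j, m)."""
    d_h = len(q)
    total = 0.0
    for m in range(d_h // 2):
        qe = enc((q[2 * m], q[2 * m + 1]), i, m, d_h)
        ke = enc((k[2 * m], k[2 * m + 1]), j, m, d_h)
        total += qe[0] * ke[0] + qe[1] * ke[1]
    return total

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]
s_near = rope_score(q, k, 2, 7)     # offset j - i = 5
s_shift = rope_score(q, k, 12, 17)  # same offset, shifted positions
print(abs(s_near - s_shift) < 1e-9)  # True
```

The two scores agree because the product of the two rotations depends only on the offset $j - i$ . When $i = j$ the rotations cancel and the score reduces to the plain dot product of `q` and `k`.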

This formulation shows that RoPE assigns smaller encoding angles to later feature dimensions in Q and K, meaning that these dimensions rotate more slowly across token positions. As a result, lower-indexed dimensions correspond to higher frequencies, while higher-indexed dimensions correspond to lower frequencies in the positional encoding.
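The ordering of frequencies across feature indices can be made concrete by computing the per-token rotation step for each pair. This is a small sketch under the formula given in this section; the head dimension `d_h = 8` and the name `angular_step` are arbitrary choices for illustration.

```python
import math

BASE = 50000.0  # base as written in the angle formula of this section

def angular_step(m, d_h):
    """Angle advanced per token position by feature pair m: theta(1, m)."""
    return (1.0 / BASE) ** (2.0 * m / d_h)

d_h = 8  # hypothetical per-head dimension
steps = [angular_step(m, d_h) for m in range(d_h // 2)]
# Number of token positions needed for one full rotation of each pair.
wavelengths = [2.0 * math.pi / s for s in steps]

for m, (s, w) in enumerate(zip(steps, wavelengths)):
    print(f"m={m}: step={s:.6f} rad/token, wavelength={w:.1f} tokens")
```

The step shrinks monotonically as $m$ grows, so the first pair completes a rotation in about six tokens while later pairs need far more positions, which is the sense in which higher-indexed dimensions are low-frequency.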

Recent studies $^{35, 36}$ have shown that activations tend to concentrate at certain frequencies, with low-frequency components of $Q = X W_{Q}$ and $K = X W_{K}$ exhibiting higher magnitudes. One possible explanation is that low-frequency dimensions rotate more slowly, which may allow them to encode information more stably over longer token dependencies $^{35}$ . We observe that this phenomenon occurs specifically in models using RoPE, while it is absent in models without RoPE.

Data availability

Summary

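This section states that the models developed in the study are publicly available on the Hugging Face platform and that the evaluation datasets used in the experiments can be accessed at https://osf.io/csdhb/. The core code associated with the paper is to be released as open source upon publication; until then, the code supporting the findings can be obtained by submitting a reasonable request to the corresponding author. Detailed descriptions of the model architectures and dataset configurations are provided in the supplementary information.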

The models developed for this study are publicly available on the Hugging Face platform. The evaluation datasets used in our experiments can be accessed at https://osf.io/csdhb/. Detailed descriptions of the model architectures and dataset configurations are provided in supplementary information. We plan to release the core code associated with this manuscript as open-source upon publication. In the interim, researchers may access the code supporting this study’s findings by submitting a reasonable request to the corresponding author.

Code availability

Summary

This section gives the public location of the study's code, https://github.com/joel-wu/how-large-language-models-encode-theory-of-mind. It also notes that the paper was received on 5 April 2025 and accepted on 4 August 2025, and emphasizes the reproducibility and transparency of the research.

The code supporting the findings of this study is publicly available at: https://github.com/joel-wu/how-large-language-models-encode-theory-of-mind.

Received: 5 April 2025; Accepted: 4 August 2025;
