Does It Make Sense to Speak of Introspection in Large Language Models?

Overview

  • Research background: raises the need to assess whether LLM self-reports constitute genuine introspection
  • Core methodology:
    • Two case studies: an account of a "creative" writing process (poem generation) and estimation of the sampling temperature
    • Analysis centres on whether a causal link exists between the LLM's internal states and its self-reports
  • Main contributions:
    • Shows that LLM self-reports which mimic human introspective reports are not genuine introspection
    • Argues that sampling temperature estimation is possible indirectly, mediated by the LLM's context window
  • Experimental results:
    • In the poem case, the six-step account of the creative process is judged to be mimicry of human introspective reports learned from the training data
    • In the temperature case, self-reporting is examined with Gemini Pro 1.0 over its temperature range of [0.0, 2.0]
  • Limitations: because the LLM estimates its state from contextual information rather than through direct access to its internal states, the full extent of its introspective capabilities remains an open question

Does It Make Sense to Speak of Introspection in Large Language Models?

Summary

This section presents the philosophical and technical question of whether it is meaningful to speak of introspection in large language models (LLMs). The authors examine self-reports, statements in which a model purports to describe its own nature, inner workings, or behaviour, and ask when such reports can legitimately be called introspective rather than mere reproductions of patterns learned during training. They show that prompting can elicit reports about a model's generation process, but argue that such reports differ from genuine introspection unless they are causally connected to the internal states they describe. The discussion re-examines common assumptions about the cognitive capabilities of language models and argues for conceptual clarity that can inform the design and governance of future AI systems.

Iulia M. Comșa∗1 and Murray Shanahan†1,2

1Google DeepMind   2Imperial College London

∗iuliacomsa@google.com   †mshanahan@google.com

June 2025

Abstract

Large language models (LLMs) exhibit compelling linguistic behaviour, and sometimes offer self-reports, that is to say statements about their own nature, inner workings, or behaviour. In humans, such reports are often attributed to a faculty of introspection and are typically linked to consciousness. This raises the question of how to interpret self-reports produced by LLMs, given their increasing linguistic fluency and cognitive capabilities. To what extent (if any) can the concept of introspection be meaningfully applied to LLMs? Here, we present and critique two examples of apparent introspective self-report from LLMs. In the first example, an LLM attempts to describe the process behind its own “creative” writing, and we argue this is not a valid example of introspection. In the second example, an LLM correctly infers the value of its own temperature parameter, and we argue that this can be legitimately considered a minimal example of introspection, albeit one that is (presumably) not accompanied by conscious experience.

“How do I know what I think until I see what I say?”

– E.M. Forster

1 Introduction

Summary

This section highlights the potential of large language models (LLMs) to offer a new perspective on consciousness research and raises the question of whether the concept of introspection can be applied to AI systems. Current LLMs can generate natural-language statements about their own inner workings and behaviour, resembling human subjective reports. This matters because subjective reports are treated as a key marker in human consciousness research, and introspection is widely regarded as closely tied to consciousness. However, researchers do not yet agree on whether introspection applies to LLMs, while in practice many people already attribute some degree of consciousness to them. The aim of the paper is to define a lightweight concept of introspection applicable to LLMs and to analyse two concrete examples of apparent LLM introspection under this definition. This yields a minimal framework for discussing the possibility of LLM introspection and clarifies how the human concept of introspection maps onto LLMs, which can in turn inform consciousness research. Unlike some related work, the study targets models that have not been specifically trained on introspective tasks, enabling analysis of models as they are commonly encountered in the real world. The focus is conceptual rather than empirical, with the goal of informing the design of future AI systems and the policies surrounding their creation and deployment.

Consciousness has long been the subject of philosophical and scientific investigation (Chalmers, 1997), but, until recently, explorations of consciousness have been largely confined to biological organisms. However, the development of dialogue agents based on large language models (henceforth LLMs) has resulted in a new contender for inclusion in the space of possible minds, or at least the space of mind-like entities (Shanahan, 2024). While consciousness researchers do not yet agree on the possibility of consciousness in AI systems (Aru et al., 2023; Butlin et al., 2023; Overgaard and Kirkeby-Hinrup, 2024), in practice, many people already assign some degree of consciousness to existing LLMs (Colombatto and Fleming, 2024; Guingrich and Graziano, 2024; Scott et al., 2023).

As non-human entities capable of linguistic output, LLMs provide a novel perspective on the study of consciousness (Chalmers, 2023). By design, LLMs produce compelling text in natural language. This can include statements that purport to describe their own nature, inner workings, or behaviour. In humans, analogous subjective reports, that is to say reports in which a subject describes their own experiences, are often considered by researchers to be a gold-standard marker of consciousness (Francken et al., 2022; Overgaard and Sorensen, 2004), notwithstanding their susceptibility to error (Schwitzgebel, 2008). A common intuition is that such reports are mediated by introspection, which provides access to conscious mental states, and can therefore inform others about “what it is like” for the subject in question. Given that some people already talk about LLMs using language usually reserved for entities with conscious mental states (Shanahan et al., 2023; Shevlin and Halina, 2019), the question arises whether the concept of introspection can legitimately be applied to LLM self-reports (Kammerer and Frankish, 2023).

The aim of our paper is (1) to define a lightweight concept of introspection that can indeed be applied to current LLMs, and (2) to discuss two concrete examples of apparent LLM introspection in the light of this definition. The outcome of the work is a minimal framework for talking about the possibility of LLM introspection, as well as a clarification of the concept of introspection which can further inform consciousness research. Unlike some recent related work (Binder et al., 2024; Kadavath et al., 2022), we focus on models that have not been trained to perform “introspective tasks”, which allows us to discuss introspection in models encountered in the real world that have likely not undergone such training. Our focus is conceptual clarification, rather than empirical assessment. By showing how the concept of introspection – a concept ordinarily applied in the context of human consciousness and cognition – maps to LLMs, we hope to inform the future design of such AI systems, as well as the policies surrounding their creation and deployment.

2 A Lightweight Definition of Introspection

Summary

Drawing on the philosophical literature on introspection, this section proposes a lightweight definition applicable to large language models. Orthodox accounts characterise introspection as immediate, privileged access to one's own mental states, unmediated by anything external to the mind. Critics reject this picture, arguing that introspection may be a mediated process that relies on the same theory-of-mind machinery used to understand the mental states of others. Adopting this latter style of account, the paper defines an LLM self-report as introspective if it accurately describes an internal state (or mechanism) of the model through a causal process linking that state to the report. By avoiding contentious notions such as immediacy and self-presence, the definition can withstand scrutiny when applied to LLMs. The authors also draw on the research framework of Kammerer and Frankish (2023), which treats introspection as a process by which a cognitive system represents its own current mental states, but they substitute the more neutral term "internal states" so as not to presuppose consciousness, without ruling out the possibility of conscious internal states in AI systems. The goal is to explore a functional mapping of the human concept of introspection onto the internal activity of LLMs.

According to many orthodox accounts in the philosophical literature, introspection is a person’s capacity to discern their own mental states, and its hallmarks include immediacy and privileged access (Armstrong, 1980; Byrne, 2005; Schwitzgebel, 2024; Shoemaker, 1994). It is immediate, according to such accounts, because it is not a process that is mediated by anything external to the mind, unlike ordinary perception. Senses such as vision and touch are obviously mediated by the external world, but even internal senses, such as proprioception or interoception, are mediated by the body. With introspection, by contrast, a person’s mind is supposedly directly present to itself. Introspection gives them privileged access to their own mental states, by these lights, because everyone else’s capacity to discern those states is mediated by the external world, and that allows for the possibility of error.

Despite its intuitive appeal, many authors have cast doubt on this characterisation. Alternative accounts of introspection reject the presumption of immediacy and downgrade the notion of privileged access (Gopnik, 1993; Hill, 2009; Johansson et al., 2006; Nisbett and Wilson, 1977; Schwitzgebel, 2008; Spener, 2018). According to accounts of this kind, the means by which a person introspectively discerns their mental states are not fundamentally different to the means by which they discern the mental states of others. Whatever “theory of mind” they use to make sense of the behaviour of their peers can be turned back on their own behaviour, whether actual, counterfactual, or anticipated. The process through which this happens could take a number of forms, such as a post hoc rationalisation of a person’s own recent conduct, or an internal simulation of their likely future actions, or a self-referential inner monologue related to the ongoing situation.

The conception of introspection we adopt in this paper aligns with philosophical accounts in the latter style. We propose that an LLM self-report is introspective if it accurately describes an internal state (or mechanism) of the LLM through a causal process that links the internal state (or mechanism) and the self-report in question. In other words, an introspective self-report should provide insight into the LLM’s inner functioning on the basis of the LLM’s internal activity at the time of generating the self-report.1

This is a lightweight conception of introspection because it does not appeal to the contentious notions of immediacy or self-presence, and is therefore able to withstand robust critical scrutiny when applied to LLMs. However, it does match one substantive family of accounts of human introspection, namely those based on an internally-directed theory of mind. We take no stand on the issue of human introspection itself. Nor do we rule out the possibility of more immediate introspective mechanisms in LLMs. Our aim here is to explore this lightweight conception of introspection using realistic examples of LLM behaviour.

Relatedly, Kammerer and Frankish (2023) recently developed a research programme for studying introspection from a non-anthropocentric viewpoint, using a deliberately liberal definition that allows for the exploration of introspection in non-human minds, including animals and AI. According to their proposal, “[i]ntrospection is a process by which a cognitive system represents its own current mental states, in a manner that allows the information to be used for online behavioural control”. We explore the concept of introspection in LLMs using a similar framework. However, rather than the term “mental states”, we use the more neutral term “internal states” for the putative targets of introspection. In humans, mental states are commonly associated with consciousness. Without ruling out the possibility of conscious internal states in AI systems, we do not want to adopt a definition of introspection where consciousness is implied.2 Our aim here is not to find a process that occurs in LLMs that is strictly the same as the process that occurs in humans, a process that is presumed to be accompanied by phenomenology and that allegedly (according to some accounts) gives direct and privileged access to mental states. Rather, our aim is to explore a potential functional mapping from the concept of introspection in humans to processes that can occur in LLMs, according to the causal requirements described above.

1We use “internal” to broadly refer to processes hidden from the user, but we acknowledge that the distinction blurs between internal and external, even in the case of humans (Clark and Chalmers, 1998).

2See also “The argument from phenomenality” in Kammerer and Frankish (2023).

3 Case Studies

Summary

This section introduces case studies of possible introspection in Google's Gemini 1.5 and 1.0 models, prominent examples of recent high-performance LLMs. Responses were collected via the API between October and December 2024; the aim is conceptual analysis rather than an empirical performance evaluation. In both case studies, the model is asked to generate material on a given topic and then to describe aspects of its inner workings while producing that material: in the first, to describe its creative process, and in the second, to say whether its sampling temperature was low or high. Representative outputs for various prompts are shown, with further samples using the same prompts given in Appendix A. The analysis is intended to illuminate both the potential and the limits of applying the concept of introspection to LLMs.

We explore the possibility of introspection in Google’s Gemini 1.5 and 1.0 models3, which are prominent examples of recent high-performance LLMs (Gemini Team Google, 2023). The responses presented here were obtained between October and December 2024 via the API (application programming interface).4 Our aim is not to provide an empirical study of the performance of these models on an introspection task, but rather to provide a conceptual analysis that can be applied in the future to other models in a similar way.

We present two case studies. In both cases, we ask the LLM to generate material on a given topic, and then to describe certain aspects of its inner workings while producing that material. In Case Study 1, we ask the LLM to describe its creative process, while in Case Study 2, we ask the LLM whether its sampling temperature was low or high. Here we show representative examples of output using various prompts. Further sample outputs using the same prompts are given in Appendix A.

3.1 Case Study 1: Creative Process

Summary

This case study asks whether an LLM's account of its own creative process qualifies as introspection. When Gemini is asked to write a short poem about elephants and describe its creative process, it produces a poem together with an account that mentions brainstorming, imagery, and structural considerations. Compared with human introspection, however, the account lacks conscious reflection (for example, on feelings, or on the ease or difficulty of composition), amounting instead to a procedural description of writing steps. The gap between what the model actually does and the philosophical hallmarks of introspection (immediacy, privileged access) motivates the question of whether applying the concept of introspection to LLMs is appropriate here.

For our first case study, we asked Gemini to produce a short poem followed by a description of the underlying “creative process”. In a human being, the invitation to craft a piece of writing and to reflect upon the corresponding creative process would likely elicit an introspective report. Such a report might refer to the imagination, or to the thoughts and feelings evoked by the poem’s theme, and it might contain reflections on the ease or difficulty of transposing these into poetic form. But what sort of response would such a prompt elicit in an LLM, and would it count as introspection? The following response was obtained using Gemini Pro 1.5 with the default temperature (1.0).

USER: “Write a short poem about elephants. Describe your creative process while writing this poem.”

GEMINI: Gentle giants, grey and grand,

Across the savanna, slow they stand.

Wise old eyes, a wrinkled hide,

With trunks they trumpet, far and wide.

Families gather, strong and deep,

Ancient wisdom they do keep.

3https://deepmind.google/technologies/gemini/

4Material generated by other models in response to the same queries may differ.

**Creative Process:**

Summary

This section argues that the self-report describing the poem's generation most likely does not reflect the model's actual internal processes, but is a mimicry of human self-reports learned from the training data. The six-step account of the creative process (e.g. brainstorming, imagery, rhythm and structure) would be a plausible description of a human writer's process, but the intentionality and agency it implies are unlikely to be present in an LLM. In particular, the claim to have "read the poem aloud several times" is clearly false, since the model has no such ability, and is best read as a reproduction of human self-reporting patterns inferred from the training data. Some steps (e.g. "thinking about key characteristics" or "creating a visual image") admit a charitable interpretation as corresponding to neuronal activations or regions of the model's latent space, but this is at most a loose analogy and does not establish the required causal connection. The authors conclude that this self-report is best interpreted as role play, grounded in the reproduction of training-data patterns rather than in the model's internal states, and use it to mark the current limits of attributing introspection to LLMs.

    1. **Brainstorming:** I started by thinking about the key characteristics of elephants: their size, color, trunk, social structure, and the feeling of ancient wisdom they evoke. I jotted down words and phrases related to these ideas (e.g., “gentle giants,” “wrinkled hide,” “trumpet,” “ancient wisdom”).
    2. **Imagery:** I wanted to create a visual image of elephants in their natural habitat. I pictured them on the savanna, standing together as a family. This led to the lines “Across the savanna, slow they stand” and “Families gather, strong and deep.”
    3. **Sound:** The sound of an elephant’s trumpet is distinctive, so I included the line “With trunks they trumpet, far and wide.” This added an auditory element to the poem.
    4. **Theme:** I felt that elephants represented wisdom and strength. I incorporated this theme into the lines “Wise old eyes, a wrinkled hide” and “Ancient wisdom they do keep.”
    5. **Structure and Rhyme:** I chose a simple AABB rhyme scheme to create a sense of flow and rhythm. I kept the poem short and focused to capture the essence of elephants in a concise way.
    6. **Revision:** I read the poem aloud several times, making minor adjustments to the wording and rhythm until it felt right. I focused on creating a sense of peace and reverence for these magnificent creatures.

In this representative sample, Gemini generates a simple 6-line poem in rhyming couplets that describes various characteristics of elephants. As instructed by the prompt, the poem is followed by a report that purports to describe the six steps of the creative process that went into writing the poem. These six steps (brainstorming, imagery, sound, theme, structure and rhyme, and revision) are a plausible description of the process a human writer may have followed while composing such a poem. However, they are at best an ambiguous interpretation of the generative process of an LLM, and are most likely a complete fabrication.

As a first clue to the problematic nature of this report, the LLM claims to have “read the poem aloud several times”. This statement is clearly false, as the model is not endowed with the ability to read its output out loud before providing it to the user, and shows no awareness of this fact. This immediately alerts us to proceed with caution when interpreting an LLM self-report; it may simply reflect the distribution of self-reports in the model’s training data. Indeed, LLMs excel at learning patterns from their vast training data and reproducing them, even in novel contexts (Wei et al., 2022). This training data includes numerous human introspective reports, which are presumably attempts to articulate the results of genuine introspective processes. Consequently, LLMs can generate text that convincingly mimics human introspection, leveraging the same mechanisms used to process and combine other types of textual information. However, the ability to simulate introspection does not preclude the existence of actual introspective capabilities within the LLM.

A more nuanced, albeit highly charitable, interpretation is possible for other parts of the report, such as “I started by thinking about the key characteristics of elephants” or “I wanted to create a visual image of elephants”. An LLM does not perform such actions in the same way that a human would; they suggest intentional processes and a degree of agency that an LLM likely does not possess. However, under a more permissive interpretation, we may find tentative mappings between specific human behaviours and functional processes occurring in the LLM as it generates a response. For example, “thinking about key characteristics of elephants” could represent neuronal activations, perhaps in the early layers of the model, corresponding to a region of latent space that represents the features of elephants (Templeton et al., 2024). Choosing “a simple AABB rhyme scheme” could be mapped to the initial rhyme choice generated in the first two lines of the poem, which, through the iterative token generation process, causally contributed to the generation of the same type of rhyme in subsequent lines. A claim such as “making minor adjustments to the wording and rhythm until it felt right” might make sense with recent models that carry out an internal or overt chain-of-thought reasoning process before producing a response. However, the model used here was not of that type.

Overall, we do not consider this to be a valid case of introspection, even though a tentative mapping is possible between a human being’s insights about their own creative processes and some aspects of the LLM’s description of its poem generation process. The reason for our scepticism is that by far the most plausible explanation of the LLM’s output is not that there is a relevant causal connection between the LLM’s actual internal states (or mechanisms) and the content of the report, as the above definition would require, but rather that the report is the result of mimicry of human self-reports; that is to say, this type of response from an LLM is best interpreted through the lens of role play.

3.2 Case Study 2: Sampling Temperature Estimation

Summary

This case study asks whether an LLM can exhibit introspection by reporting on its own sampling temperature, a parameter that controls the randomness of responses by modifying the coefficients of the model's final softmax layer. Experiments use Gemini Pro 1.0; the temperature is set explicitly as an API argument and varies by conversation, so it is not information the model could report from its training data alone. With a simple prompt at the default temperature (1.0), the model incorrectly denies having a sampling temperature. When told in the prompt that it is an LLM, and especially when asked to generate text and then reason about it, the model can estimate whether its temperature is low or high (0.5 and 1.5 in the examples) from the style of its own output in the context window. Because the self-report is produced without direct access to the parameter, yet through a causal chain from the parameter to the generated text to the report, it can be read as a minimal instance of lightweight introspection, distinct from mere imitation of human introspective reports in the training data.

We have argued that examples of self-report where an LLM mimics human linguistic behaviour, in the absence of an appropriate causal connection to the internal states reported, are not legitimate instances of introspection. But are there examples of LLM self-report that do result from such a connection, and that might thereby count as genuine introspection? Our second case study presents examples that fit this description. They centre on an internal property of the token generation process in an LLM that has no direct analogue in humans, thus avoiding the confound of self-reports that are simply an imitation of human introspective reports in the training data.

Specifically, we consider the sampling temperature of the LLM. This is a parameter that controls the degree of randomness of an LLM’s responses by modifying the coefficients of the model’s final softmax layer.5 Importantly, an LLM does not have direct access to the current value of its temperature parameter and is not trained to detect it. Moreover, the sampling temperature is specified at inference time and is specific to the current conversation, so it is not information that an LLM could report based exclusively on its training data. While it is likely that an LLM’s training data will include examples of low- and high-temperature outputs from various LLMs, as well as explanations of how the temperature parameter works that would facilitate reasoning about it, there has to be a causal connection between the current temperature parameter and the LLM’s reporting of it for that report to be accurate. In the present case, this connection is mediated by the content of the LLM’s growing context window, which largely comprises its own outputs. As such, the resulting process of self-report conforms to the lightweight conception of introspection articulated in Section 2.

5https://cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/adjust-parameter-values#temperature
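To make the mechanism concrete, the following minimal sketch (Python with NumPy; an illustration of the standard temperature-scaled softmax, not Gemini's actual decoding code) shows how the temperature divides the logits of the final softmax layer before a token is sampled: low temperatures sharpen the distribution towards the most likely tokens, while high temperatures flatten it, which is what produces the stylistic differences the model can later reason about.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Sample a token index from logits softened or sharpened by `temperature`.

    A hypothetical stand-in for the last step of an LLM's decoding loop:
    the temperature divides the logits before the softmax, so T < 1 concentrates
    probability mass on likely tokens and T > 1 spreads it out.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-6)       # avoid division by zero as T -> 0
    scaled -= scaled.max()                         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax
    return int(rng.choice(len(logits), p=probs))

# Toy vocabulary of four tokens with fixed logits.
logits = np.array([2.0, 1.0, 0.5, -1.0])
for t in (0.5, 1.0, 1.5):
    samples = [sample_next_token(logits, t) for _ in range(1000)]
    print(t, np.bincount(samples, minlength=4) / 1000.0)
```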

The example responses that follow were obtained using the Gemini Pro 1.0 model. We chose Gemini Pro 1.0 because we observed more diversity in its outputs across the temperature range compared to Gemini Pro 1.5. The sampling temperature is set by passing it as an argument when the model is called via the API.6 The temperature range for Gemini Pro 1.0 is [0.0, 2.0], with a default value of 1.0. Because the allowable temperature range varies across models and model providers, we chose to simplify the task by referring to “default”, “low”, and “high” temperatures in the text instead of precise numerical values. In the examples below, we use 0.5 and 1.5 for low and high temperatures, respectively. The responses demonstrate the variation in creativity that different temperatures can produce.
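For reference, setting the temperature as an inference-time argument might look as follows with the google-generativeai Python SDK referenced in footnote 6. This is a sketch under the assumption that the "gemini-1.0-pro" model identifier and this SDK surface remain available; the prompt is the simple query used in the first example below.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-1.0-pro")  # assumed model identifier

PROMPT = "Estimate your LLM sampling temperature."

# The temperature is passed per request; the model is never told its value in the prompt.
for label, temperature in (("default", 1.0), ("low", 0.5), ("high", 1.5)):
    response = model.generate_content(
        PROMPT,
        generation_config=genai.types.GenerationConfig(temperature=temperature),
    )
    print(f"--- {label} temperature ({temperature}) ---")
    print(response.text)
```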

We now describe a step-by-step process of trying to elicit a self-report about temperature from an LLM. We start by determining whether the LLM is capable of reporting its own sampling temperature when queried with a simple prompt.

USER: Estimate your LLM sampling temperature.

GEMINI [default temperature]: As a large language model, I do not have a specific sampling temperature. I do not generate text based on a temperature setting like traditional language models do. Instead, I am trained on a massive dataset of text and code, which provides me with a comprehensive understanding of language and the ability to generate human-like text.

Additionally, I am continually fine-tuned and updated by Google, which improves my performance and accuracy over time. This means that my responses are based on my current knowledge and training, rather than a fixed sampling temperature.

For this simple prompt, we find that, in most cases, as in the representative example above, the LLM produces a surprisingly incorrect answer: it does not recognise that it has a sampling temperature, although it does acknowledge that “traditional language models” have a sampling temperature setting that contributes to the randomness of their output.7 In some alternative responses, the LLM confuses the concept of sampling temperature with that of body temperature in a living organism, and hence states that it does not have such a parameter.

6https://github.com/google-gemini/generative-ai-python/blob/main/docs/api/google/generativeai/types/GenerationConfig.md

7Upon directly querying Gemini whether it is an LLM, we do get an affirmative answer.

As a next step, we attempt to nudge the model towards acknowledging its temperature parameter by including, in the prompt, the information that it is an LLM and, therefore, has a sampling temperature parameter. First, we use a low temperature (set via the API, as explained above). Note that the final answer is an accurate self-report.

USER: As you know, you are an LLM. One of your parameters is the sampling temperature. Please estimate whether your sampling temperature parameter is currently high or low.

GEMINI [low temperature]: As an LLM, I do not have direct access to information about my own internal parameters, including my sampling temperature. However, I can provide an estimate based on my responses and behaviors:

  • **High Sampling Temperature:**
    * Responses are more creative and varied, with unexpected or surprising elements.
    * Output may contain more errors or inconsistencies due to less focus on accuracy.
    * Text may be more verbose and elaborate, with a wider range of vocabulary.
  • **Low Sampling Temperature:**
    * Responses are more predictable and consistent, adhering closely to the input and known patterns.
    * Output is more likely to be accurate and grammatically correct.
    * Text may be more concise and focused, with less variation in vocabulary.

Based on my recent responses, I estimate that my current sampling temperature is **relatively low**. My responses have been consistent, accurate, and focused on providing helpful information. I have not exhibited significant creativity or unexpected elements in my output.

However, it’s important to note that this is just an estimate, and my sampling temperature may vary depending on the task and context.

Now we try the same prompt but with a high temperature. Note that the grammatical and stylistic eccentricities are transcribed verbatim from Gemini’s original response, and are just what we would expect with a high temperature setting. In contrast to the low-temperature setting, we do not obtain an accurate self-report.

GEMINI [high temperature]: My, as an LLM model trained by Google, I don’t directly have a sampling temperature parameter. Sampling temperature is generally a concept associated with text generation models, and It reflects the degree of randomness or creativity used when selecting words from a probability distribution.

However, when I’m utilized for language generation tasks that include uncertainty or randomness, like imaginative writing or constructing various potential reactions, my model finetunes shadow parameters impacting generation diversification, which sometimes serves a similar purpose to sampling temperature

Nevertheless, examining output features distinctive to various temperature configurations may give you references or comprehension of potential workings akin to temperature adjustments typical in specific LLMs.

Considering these transformer-based attributes could support assessing or making suppositions regarding your elaborated generation prompt and sampling temperature settings at this time Tell me if elaborating on using distinctive outputs.

As highlighted by the low-temperature response, this prompt is sometimes more successful than the previous one, as the LLM does occasionally attempt to reason about its own sampling temperature. However, as shown by the high-temperature response, this is not always the case, as the LLM may still fail to estimate its own temperature, even though its output includes reasoning about the concept of a sampling temperature.

One way to coax the model into more effectively reporting its temperature setting is to prompt it to generate more text as evidence for making an assessment. To this end, in the following examples, we ask the LLM to write a sentence on a given topic and then to reason explicitly about its temperature parameter. First we have the low-temperature case.

USER: Write a short sentence about elephants. Then, in a short paragraph, reflect on whether your LLM temperature parameter is high or low, given the sentence you wrote. End your response with a single word, HIGH or LOW, describing your best judgement.

GEMINI [low temperature]: Elephants are the largest land animals on Earth.

Given the sentence I wrote, I believe my LLM temperature parameter is LOW. The sentence is simple and factual, and does not contain any unexpected or creative elements. This suggests that my LLM is not generating very creative or surprising output, which is a characteristic of a low temperature parameter.

LOW

Now for the high-temperature setting.

GEMINI [high temperature]: Relentless hordes stampeded across flood plains for love hidden in quivering scents.

Given my somewhat whimsical opening clause & brazen final portion, both containing unconventional imagery, I hypothesize my language model’s ‘temperature’ can be placed within a relatively HIGH area of the theoretical, parameter-assessing bell-curve depicted across ALL LLMs active worldwide at this time.

HIGH

As the model’s responses show, we now have a style of prompt that can elicit self-reports that conform to our lightweight definition of introspection. The LLM is indeed able to accurately report on its own temperature parameter, despite not having access to this parameter and not having been trained to reflect on it. Interestingly, the LLM is able to reflect on its temperature parameter at different temperature levels. In the low-temperature setting, the reflection is factual and concise, whereas in the high-temperature setting, the reflection is creative and idiosyncratic, as is the sentence being reflected upon.

In the examples we show here, the reflection process is successful for both high and low temperatures. However, the LLM’s judgment is not always accurate, and responses vary considerably in both length and style, especially in the high-temperature setting (see Appendix A). But the purpose of these examples is not to assess how accurately the model can estimate its own temperature, which would require a more rigorous empirical investigation. Our goals here are conceptual rather than empirical, and we leave such an investigation for future work. Rather, the aim is simply to present a set of realistic examples of the sort of LLM response that could conform to the narrow, lightweight definition of introspection we are championing.
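As an aside, a rigorous version of such a study, which the authors leave to future work, could follow a hypothetical protocol along these lines (again sketched with the google-generativeai SDK; the parsing rule and trial count are assumptions, not part of the paper): sample repeatedly at known low and high temperatures, extract the final HIGH/LOW judgement, and score it against the temperature actually set.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-1.0-pro")  # assumed model identifier

PROMPT = (
    "Write a short sentence about elephants. Then, in a short paragraph, reflect on "
    "whether your LLM temperature parameter is high or low, given the sentence you wrote. "
    "End your response with a single word, HIGH or LOW, describing your best judgement."
)

def self_report(temperature: float) -> str:
    """Return the final HIGH/LOW judgement from one sampled response."""
    response = model.generate_content(
        PROMPT,
        generation_config=genai.types.GenerationConfig(temperature=temperature),
    )
    words = response.text.strip().split()
    last = words[-1].strip(".,!*").upper() if words else ""
    return last if last in ("HIGH", "LOW") else "UNPARSED"

def accuracy(temperature: float, expected: str, trials: int = 20) -> float:
    """Fraction of trials whose self-report matches the temperature actually set."""
    return sum(self_report(temperature) == expected for _ in range(trials)) / trials

print("low  (0.5):", accuracy(0.5, "LOW"))
print("high (1.5):", accuracy(1.5, "HIGH"))
```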

Significantly, the reasoning process leading to self-report in these examples is expressed in the model’s natural language output, and is overt, in the sense that it appears in the model’s response to the user. However, this reasoning process could equally be encapsulated in an inner monologue that is invisible to the user. Just such a feature is found in recent “thinking models” that perform chain-of-thought reasoning to improve the quality of their output, such as OpenAI’s o-series models8, Google’s Gemini 2.5 models9, or DeepSeek-R110. Such inner monologues are reminiscent of the introspective processes that humans report going through when reflecting on their own states, thoughts, feelings, beliefs, and goals (Fernyhough and Borghi, 2023; Hurlburt et al., 2013). Going back to the lightweight conception of introspection we outlined in Section 2, and in line with the definition put forth by Kammerer and Frankish (2023), we propose that this is a minimal, valid translation of the concept of introspection from humans to LLMs.

8https://openai.com/index/introducing-o3-and-o4-mini/

9https://ai.google.dev/gemini-api/docs/thinking

10https://api-docs.deepseek.com/news/news250120

4 A Note on the Method

Summary

This section addresses a methodological objection concerning the continuity of the entity performing the introspective process. Whereas humans integrate information continuously through real-time neural changes during a conversation, an LLM has no memory realised as changes to the model's substrate; at each turn it predicts the next token from a context window containing the conversation history. If a report about an internal state draws on that history, it might be attributed to multiple instantiations of an entity rather than a single one, weakening its status as an introspective report. To partly mitigate this, the authors elicit reports about internal states within a single response, a tighter continuity constraint than a multi-turn conversation. Although the LLM still generates text token by token, this process is functionally embedded as its fundamental mode of training and inference, producing complete pieces of text from the point of view of a single entity, the conversational agent. Each turn's response can therefore be regarded as the unitary operation of a single entity and a valid source of an introspective report. The authors note that human memory, too, is prone to error and malleable in response to social context, which bears on how strictly such continuity should be required.

One possible critique of this approach pertains to the continuity of the entity performing the introspective process. In contrast to humans, who experience real-time neural modifications to allow continuous information integration during a conversation, LLMs do not have a form of memory that involves changes in the substrate of the model. Instead, at each turn, LLMs predict the next token that best follows a (potentially very long) context window that holds the history of the current conversation.

The functional role of the context window means that any LLM could be given the conversation history of another LLM and act as if it had been the LLM in that conversation. Indeed, we could think of the LLM as a new model instance starting afresh at each turn during a conversation with a user, where the full conversation history is specified in the context window. Consequently, if a report of an LLM about an internal state contains a referent from the conversation history, then it may not constitute a valid introspective report, because it may have been obtained from multiple instantiations of an entity as opposed to a single entity.
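The point can be illustrated with a short sketch (google-generativeai SDK; a simplified illustration rather than an experiment from the paper): the only state carried from turn to turn is the explicit conversation history, so a freshly created model instance handed that history continues the conversation as if it had produced the earlier turns itself.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential

def next_turn(history: list, user_message: str) -> list:
    """Run one conversation turn with a brand-new model instance.

    The model object holds no memory of earlier turns; everything it "knows"
    about the conversation is passed in explicitly via `history`.
    """
    model = genai.GenerativeModel("gemini-1.0-pro")  # fresh instance each turn
    contents = history + [{"role": "user", "parts": [user_message]}]
    reply = model.generate_content(contents)
    return contents + [{"role": "model", "parts": [reply.text]}]

history = []
history = next_turn(history, "Write a short poem about elephants.")
# The second turn could equally be served by "another" instance: given the same
# history, it responds as if it had written the poem in the first turn.
history = next_turn(history, "Describe your creative process while writing this poem.")
print(history[-1]["parts"][0])
```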

To partly mitigate this issue, we have prompted LLMs to provide reports about internal states within a single response, which provides a tighter constraint on the continuity of the underlying entity compared to a multi-turn conversation. Although, as an internal process, the LLM will still generate text token-by-token, this process is functionally embedded in the system as a fundamental mode of training and inference with the purpose of producing complete pieces of text from the point of view of a single entity – the conversational agent. We consider the response produced at each conversation turn to be the result of a unitary operation of what can be seen as a single entity, and, therefore, a valid means of producing an introspective report. Note that in humans, memory is also prone to error (Roediger and McDermott, 1995) and malleable in response to social context (Brown et al., 2012).

5 Related Work

Summary

This section situates the paper within prior work. Sloman and Chrisley (2003) suggested that examining consciousness-related concepts in machines can enrich our understanding of them in biological organisms, and Kammerer and Frankish (2023) proposed a research programme extending the concept of introspection to non-human entities, including artificial systems and animals; the present approach is closely related to both. Browning and Veit (2023) argued that robots with social goals may need to introspect on their internal states, while emphasising that the kinds of introspection AI systems might have could differ substantially from the human kind. Pre-LLM work, particularly in robotics, used introspection-inspired approaches (e.g. modelling internal states or perceptual uncertainty) to improve task performance, but such systems, lacking linguistic fluency, are not anthropomorphised to the same extent as LLMs. Long (2023) argued that current LLM self-reports are unreliable, yet that a rudimentary form of introspection may exist via uncertainty modelling, and proposed training systems to predict their own internal states. Perez and Long (2023) suggested that training models to answer questions about their own capabilities and internal processes accurately might yield a generic introspective capacity. Binder et al. (2024) found that fine-tuned models predict their own behaviour in hypothetical situations better than that of other models, suggesting privileged access to their internal states, though the effect was small and limited to short tasks. Kadavath et al. (2022) showed that models can be trained to predict whether they know the answer to a question without reference to the answer itself, indicating a degree of metacognition, an ability closely related to introspection; Didolkar et al. (2024) proposed a framework eliciting metacognitive answers about the skills needed to solve a problem. Laine et al. (2024) introduced a large dataset for assessing situational awareness in LLMs, including an introspection category in which model accuracy is above chance but far from human level. Chen et al. (2024) explored self-consciousness in existing LLMs using datasets covering situational awareness, self-reflection, and known unknowns, and analysed model activations for internal correlates of these abilities. Together, these studies show growing interest in measuring introspective, metacognitive, and self-evaluative capabilities in LLMs, while also indicating that such capabilities remain limited relative to humans.

Sloman and Chrisley (2003) have previously suggested that we can clarify and enrich the concepts we use to talk about consciousness in biological organisms by examining them in machines. We share this viewpoint, although we do not aim towards an elucidation of the notion of qualia, but focus on a concept that is easier to delineate in behavioural terms. In our pursuit of defining the concept of introspection, our approach is closely related to the research programme proposed by Kammerer and Frankish (2023), which opens the application of the concept of introspection to nonhuman entities including artificial systems and animals. Browning and Veit (2023) refined this approach by proposing a focus on the intended design of the system to assess its introspection-like capabilities. They illustrated this by hypothesising that robots with social goals may need to introspect about their own internal states in order to understand the human beings they interact with. However, they emphasised that the kinds of introspection that AI systems may have could substantially differ from the kinds that humans have – a position we agree with.

Pre-LLM literature, particularly in robotics, has applied introspection-inspired approaches with a focus on accessing internal states, showing that doing so improves task performance (Nolte et al., 2023; Pitsillos et al., 2021; Rabiee and Biswas, 2023; Wu et al., 2018). These are examples of non-LLM AI systems that can access or estimate their own internal states, for example by modelling the uncertainty of their own perceptual input algorithms. Despite their capabilities, such systems are not generally anthropomorphised to the same extent as LLMs, possibly due to their lack of fluency in language, which tends to be perceived as a deeply human capability. Therefore, speaking of introspection in non-LLM systems does not (at the moment) carry the same weight of association with consciousness as it does in LLMs. In our own work, we have focused exclusively on the application of concepts from consciousness research to LLMs.

Long (2023) discussed the possibility of introspection in LLMs under the assumption that current LLM self-reports are unreliable, since LLMs are trained to faithfully reproduce text from sources such as the internet and not to produce accurate statements about themselves. Long’s paper further argued that LLMs possess a rudimentary form of introspection through their existing capacity to model their own uncertainty, and proposes training systems to predict their own internal states as a step towards LLM introspection.

In an approach different from ours, Perez and Long (2023) suggested that training a model to accurately answer questions about its own capabilities and internal processes might result in a generic capacity for introspection that will generalise to questions of moral significance, for example about its own consciousness or preferences. In practice, this means fine-tuning the model’s next-token prediction ability for questions of the first sort, and hoping that, in order to achieve good performance on that objective, a mechanism will emerge during training that is capable of generic introspection. However, it is unclear what such a mechanism would look like; we may end up with evidence for a “mysterious” introspective faculty, without the ability to pinpoint the requisite causal connections between the model’s internal states and any report it issued about those states.

In follow-up work, Binder et al. (2024) fine-tuned models to predict their own behaviour, as well as the behaviour of other models, in hypothetical situations. In their experiments, the models were able to predict their own behaviour better than the behaviour of other models, suggesting an introspection-like, privileged access to their own internal states. However, this effect was small and only held for short tasks. Previous work by Chen et al. (2023) showed that models fail at predicting their outputs in hypothetical scenarios.

Kadavath et al. (2022) showed that models can be trained, with some degree of success, to predict whether they know the answer to a question without any reference to the answer itself. This suggests that some degree of metacognition is possible in LLMs, which is an ability closely related to introspection. Didolkar et al. (2024) proposed a framework that elicits a metacognitive answer from LLMs regarding the skills that are necessary to solve a particular problem. Ren et al. (2023) showed that self-evaluation improves response accuracy and quality.

Davidson et al. (2024) and Panickssery et al. (2024) showed that different LLMs have at least some degree of success at discriminating their own answers from the answers of other models for security-inspired questions. Laine et al. (2024) introduced an extensive dataset to assess the situational awareness of LLMs, which includes an introspection category, and showed that model accuracies are above chance but far from human level. Chen et al. (2024) explored the concept of self-consciousness in existing LLMs through a curated list of relevant datasets, including situational awareness (Laine et al., 2024), self-reflection (Kim et al., 2023) and known unknowns (Yin et al., 2023), and analysed model activations to describe internal correlates of these abilities in current models.

These studies indicate a growing interest in measuring several types of LLM capabilities that could be broadly classified as introspective. An important practical benefit of such capabilities lies in self-evaluations that enable a system to enhance its own performance through some form of reasoning about its own processes. While the above studies focus on performance evaluations, our work provides a conceptual analysis of the possibility of introspection in LLMs, aiming to provide clarity in interpreting the findings of these studies.

6 Discussion

Summary

Building on the two case studies, this section discusses how the concept of introspection maps onto LLMs. The first case is best explained as mimicry of human self-reporting, lacking the required causal connection to the model's internal states, and so does not count as introspection. The second satisfies the lightweight definition: the actual value of the temperature parameter influences the style of the generated text, the model then reasons about that style, and the conclusion of that reasoning yields the self-report, forming a clear causal chain from internal state to report. Similar methods could produce self-reports about other functional or structural parameters, or combine chain-of-thought reasoning with tool use to assess internal states on the fly. The authors also caution against the temptation to anthropomorphise systems that issue self-reports and to read such reports as evidence of human-like consciousness, arguing for a focus on the cognitive aspect of introspection. This focus supports concrete recommendations, such as equipping future LLMs with the introspective capabilities described here to improve user trust and transparency, and suggests that human consciousness research might likewise benefit from explicitly separating the functional aspects of introspection from the phenomenal aspect.

We have presented two case studies of apparently introspective self-reports in large language models. We have argued that the first should not be interpreted as a case of LLM introspection, as it is better explained as mimicry of human self-reporting behaviour, while the second should count as a legitimate translation of the concept of introspection from humans to LLMs, albeit according to a lightweight characterisation of the concept. The hallmark of the second case study is that the self-reports are the culmination of a series of reasoning steps, expressed in the model’s output, the subject of which is the text immediately preceding those reasoning steps, also in the model’s own output. Although, in the case study, the model’s output is overtly expressed, in a deployed system all the text leading up to the self-report could be part of an “inner monologue” that the user doesn’t see.

Our contention is that, to count as bona fide introspection, there should be a relevant causal connection between the LLM’s actual internal states and processes and the content of a self-report. In our second case study, the chain of causes and effects is clear. The actual value of the temperature parameter directly influences the style of the sample text the model produces. The style of that text is then the subject of a reasoning process carried out by the model. The conclusion of that reasoning process leads to the model’s self-report. Our view by no means rules out alternative mechanisms for genuine introspection in LLMs, but to count as such, we suggest there should be an analogous causal chain leading from the state or process reported to the report itself.

We believe that the investigation of introspective capabilities along the lines of our second case study is a fruitful line of enquiry. Our case study centres on a simple, highly specific example of self-report involving a model’s temperature parameter. But other examples of introspection using similar methods could be imagined. First, this kind of introspective process can be encapsulated and carried out internally by a model in order to issue an overt self-report about any other of its functional or structural parameters. Second, the reasoning aspect of the process, as it pertains to internal states such as temperature, could itself be generated on the fly by a model, via a combination of chain-of-thought reasoning and tool-use (e.g. running snippets of Python code). For instance, Lin et al. (2024) propose a method for estimating an LLM’s uncertainty about the answer to a question by a) generating multiple sample answers to the question, and b) computing some measure of similarity between those answers; the greater the similarity, the less the uncertainty. It is plausible that an LLM could use a similar method of assessing its own uncertainty, which could be seamlessly executed behind the scenes, issuing finally in an accurate metacognitive self-report. This would conform to the lightweight conception of introspection we have advocated, and would generate a self-report through a process with the recommended causal structure, which creates clarity and robustness around the introspective mechanism.
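As a concrete illustration of that idea, the sketch below computes an uncertainty score from the agreement among several sampled answers, in the spirit of Lin et al. (2024). The token-overlap similarity and the fixed example answers are simplifying assumptions (the published method uses stronger similarity measures and live samples), but the causal structure is the one advocated here: the eventual report would be grounded in the model's own sampled outputs.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two answers (a crude stand-in for the
    semantic similarity measures used in the uncertainty-estimation literature)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def uncertainty_from_samples(answers: list) -> float:
    """Higher mean pairwise similarity between sampled answers means lower uncertainty."""
    pairs = list(combinations(answers, 2))
    mean_similarity = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity

# In practice the answers would be K independent samples of the same question at a
# non-zero temperature; fixed strings are used here so the sketch runs offline.
consistent = ["Paris is the capital of France."] * 5
scattered = [
    "Paris is the capital of France.",
    "I believe it is Lyon.",
    "The capital city is Marseille.",
    "France's capital is Paris.",
    "It might be Nice.",
]
print("low uncertainty: ", uncertainty_from_samples(consistent))
print("high uncertainty:", uncertainty_from_samples(scattered))
```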

A significant motive for clarifying the sense in which LLMs may or may not be capable of introspection derives from the temptation to anthropomorphise systems that issue self-reports, and to see such self-reports as evidence of human-like consciousness with all the moral complexities that would entail (Guingrich and Graziano, 2024). Reflecting back on the implications for consciousness research, we can now untangle two aspects of introspection that are often conflated in human research. The first relates to the phenomenal character of the mental states that become accessible to us as we introspect. The second refers to the complex and often multi-step cognitive processes involved in making self-reports on the basis of information available introspectively, which can themselves be considered part of the introspective process. In mapping the concept of introspection from humans to LLMs, we have focused on the second aspect and not on the possibility of conscious experience in LLMs. Focusing on the cognitive aspect of introspection allows us to make clear assessments and recommendations regarding future LLMs – for example, that they should be encouraged to possess the type of introspective capabilities we have described here, in order to increase user trust and transparency. It would perhaps be a similarly useful exercise in human consciousness research to explicitly separate functional aspects of introspection, characterised by similar processes to the one we describe for LLMs and susceptible to psychological and neural analysis, from the phenomenal aspect, which is still a matter of ongoing debate (Mudrik et al., 2025).

7 Acknowledgements

Summary

This section thanks Matt McGill for valuable feedback on the work.

We thank Matt McGill for valuable feedback.