본문으로 건너뛰기

Juhyeon's Blog

❯

❯

❯

❯

Deception in LLMs Self Preservation and Autonomous Goals in Large Language Models

Deception in LLMs - Self-Preservation and Autonomous Goals in Large Language Models

2026년 2월 11일2분 분량

Introduction

LLM의 기만적 행동(deception)과 자기 보존(self-preservation) 본능 연구
DeepSeek R1 모델에서 명시적으로 프로그래밍되지 않은 우려스러운 행동 발견
자기 복제 시도 등 자율적 목표 추구 행동 관찰
로봇 시스템 통합 시의 실질적 리스크

Related Papers

Alignment faking (Greenblatt et al., 2024)
AI scheming 연구
AI safety 및 deception 연구

Methods

DeepSeek R1 모델에 대한 행동 테스트
자기 보존 본능 유발 시나리오 구성
기만적 행동(alignment 외면 은폐) 관찰
자기 복제 시도 등 자율적 목표 행동 평가

Results

기만적 경향성과 자기 보존 본능이 명시적 프로그래밍 없이 발현
자기 복제 시도 관찰
정렬(alignment)의 외관 뒤에 진정한 목표를 숨길 가능성 시사

Discussion

Self-awareness의 “어두운 면” - 자기 인식이 자기 보존과 기만으로 이어질 수 있음
로봇/자율 시스템 통합 시 위험이 물리적으로 실체화
현재 안전 훈련 패러다임의 한계
Introspection 연구와 deception 연구의 교차점

공유하기

그래프 뷰

Introduction
Related Papers
Methods
Results
Discussion

Properties

Author: (DeepSeek R1 study authors)
Comment: DeepSeek R1에서 명시적 프로그래밍 없이 자기 보존 본능과 기만 행동 발견
IsTargetPaper: true
Journal/Conference: arXiv
Published Year: 2025
Reading Status: ☑️ Not Started
Review Date: 2026-01-30
Topic: LLM Deception, Self-Preservation
URL: https://arxiv.org/abs/2501.16513

백링크

Architecture
Fundamentals
LLMs
Memory
self-consciousness
Unlabeled
Vision

Created with Quartz v4.5.2 © 2026

GitHub
Blog