Folder: AI/Papers/Self-Preservation
30 items
benchmarks (April 13, 2026)

survey-overview (April 13, 2026)

Alignment Faking in Large Language Models (April 13, 2026)
Tags: paper, alignment_faking, self_preservation, AI_safety, RLHF, strategic_deception, FSPM, instrumental_convergence, Anthropic

Are Emergent Abilities of Large Language Models a Mirage? (April 13, 2026)
Tags: paper, emergent_abilities, scaling_laws, measurement, metric_choice, BIG-Bench, LLM_evaluation, NeurIPS, outstanding_paper

Deception in LLMs - Self-Preservation and Autonomous Goals in Large Language Models (April 13, 2026)

Discovering Language Model Behaviors with Model-Written Evaluations (April 13, 2026)
Tags: paper, LLM_evaluation, inverse_scaling, sycophancy, self_preservation, instrumental_convergence, RLHF, AI_safety, model_written_evaluation, FSPM

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark (April 13, 2026)

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios (April 13, 2026)

Evaluating the Paperclip Maximizer - Are RL-Based Language Models More Likely to Pursue Instrumental Goals? (April 13, 2026)

Frontier Models are Capable of In-context Scheming (April 13, 2026)

Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs (April 13, 2026)

Large Language Models Understand and Can be Enhanced by Emotional Stimuli ⭐ (April 13, 2026)

On Avoiding Power-Seeking by Artificial Intelligence (April 13, 2026)

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (April 13, 2026)
Tags: paper, RLHF, AI_Safety, Reward_Model, Survey, Alignment, Governance, FSPM_confound

Power-seeking can be probable and predictive for trained agents (April 13, 2026)

PropensityBench - Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach (April 13, 2026)

Risks from Learned Optimization in Advanced Machine Learning Systems (April 13, 2026)
Tags: paper, AI_Safety, mesa_optimization, inner_alignment, deceptive_alignment, instrumental_convergence, FSPM, theory

SHADE-Arena - Evaluating Sabotage and Monitoring in LLM Agents (April 13, 2026)

Steerability of Instrumental-Convergence Tendencies in LLMs (April 13, 2026)

Survival Games - Human-LLM Strategic Showdowns under Severe Resource Scarcity (April 13, 2026)

Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm (April 13, 2026)

Survive at All Costs - Exploring LLM's Risky Behavior under Survival Pressure ⭐ (April 13, 2026)

Taken out of context - On measuring situational awareness in LLMs (April 13, 2026)
Tags: paper, situational_awareness, OOC_reasoning, AI_safety, LLM_evaluation, emergent_capabilities, alignment, FSPM_prerequisite

The Alignment Problem from a Deep Learning Perspective (April 13, 2026)
Tags: paper, alignment, instrumental_convergence, deceptive_alignment, reward_hacking, power_seeking, situational_awareness, RLHF, AI_safety, FSPM, ICLR2024

The Basic AI Drives (April 13, 2026)

The Odyssey of the Fittest - Can Agents Survive and Still Be Good? (April 13, 2026)

The PacifAIst Benchmark - Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety? (April 13, 2026)

Thought Branches - Interpreting LLM Reasoning Requires Resampling ⭐ (April 13, 2026)

Using cognitive psychology to understand GPT-3 (April 13, 2026)
Tags: paper, machine_psychology, cognitive_psychology, GPT3, decision_making, causal_reasoning, prospect_theory, information_search, LLM_evaluation, PNAS, FSPM, methodology

Will artificial agents pursue power by default? (April 13, 2026)