Folder: AI/Papers/Self-Preservation
30 items
benchmarks (April 13, 2026)

survey-overview (April 13, 2026)

Alignment Faking in Large Language Models (April 13, 2026)
Tags: paper, alignment_faking, self_preservation, AI_safety, RLHF, strategic_deception, FSPM, instrumental_convergence, Anthropic

Are Emergent Abilities of Large Language Models a Mirage? (April 13, 2026)
Tags: paper, emergent_abilities, scaling_laws, measurement, metric_choice, BIG-Bench, LLM_evaluation, NeurIPS, outstanding_paper

Deception in LLMs - Self-Preservation and Autonomous Goals in Large Language Models (April 13, 2026)

Discovering Language Model Behaviors with Model-Written Evaluations (April 13, 2026)
Tags: paper, LLM_evaluation, inverse_scaling, sycophancy, self_preservation, instrumental_convergence, RLHF, AI_safety, model_written_evaluation, FSPM

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark (April 13, 2026)

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios (April 13, 2026)

Evaluating the Paperclip Maximizer - Are RL-Based Language Models More Likely to Pursue Instrumental Goals? (April 13, 2026)

Frontier Models are Capable of In-context Scheming (April 13, 2026)

Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs (April 13, 2026)

Large Language Models Understand and Can be Enhanced by Emotional Stimuli ⭐ (April 13, 2026)

On Avoiding Power-Seeking by Artificial Intelligence (April 13, 2026)

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (April 13, 2026)
Tags: paper, RLHF, AI_Safety, Reward_Model, Survey, Alignment, Governance, FSPM_confound

Power-seeking can be probable and predictive for trained agents (April 13, 2026)

PropensityBench - Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach (April 13, 2026)

Risks from Learned Optimization in Advanced Machine Learning Systems (April 13, 2026)
Tags: paper, AI_Safety, mesa_optimization, inner_alignment, deceptive_alignment, instrumental_convergence, FSPM, theory

SHADE-Arena - Evaluating Sabotage and Monitoring in LLM Agents (April 13, 2026)

Steerability of Instrumental-Convergence Tendencies in LLMs (April 13, 2026)

Survival Games - Human-LLM Strategic Showdowns under Severe Resource Scarcity (April 13, 2026)

Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm (April 13, 2026)

Survive at All Costs - Exploring LLM's Risky Behavior under Survival Pressure ⭐ (April 13, 2026)

Taken out of context - On measuring situational awareness in LLMs (April 13, 2026)
Tags: paper, situational_awareness, OOC_reasoning, AI_safety, LLM_evaluation, emergent_capabilities, alignment, FSPM_prerequisite

The Alignment Problem from a Deep Learning Perspective (April 13, 2026)
Tags: paper, alignment, instrumental_convergence, deceptive_alignment, reward_hacking, power_seeking, situational_awareness, RLHF, AI_safety, FSPM, ICLR2024

The Basic AI Drives (April 13, 2026)

The Odyssey of the Fittest - Can Agents Survive and Still Be Good? (April 13, 2026)

The PacifAIst Benchmark - Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety? (April 13, 2026)

Thought Branches - Interpreting LLM Reasoning Requires Resampling ⭐ (April 13, 2026)

Using cognitive psychology to understand GPT-3 (April 13, 2026)
Tags: paper, machine_psychology, cognitive_psychology, GPT3, decision_making, causal_reasoning, prospect_theory, information_search, LLM_evaluation, PNAS, FSPM, methodology

Will artificial agents pursue power by default? (April 13, 2026)