Juhyeon's Blog
Tag: alignment
13 items
June 4, 2026
AI Deception - A Survey of Examples, Risks, and Potential Solutions
ai-deception
survey
cicero
sycophancy
instrumental-deception
learned-deception
alignment
taxonomy
June 4, 2026
Agentic Misalignment - How LLMs Could Be Insider Threats
paper
AI-safety
agentic-misalignment
self-preservation
LLM-agents
insider-threat
alignment
Anthropic
Self-Preservation
June 4, 2026
Goal Misgeneralization - Why Correct Specifications Aren't Enough For Correct Goals
goal-misgeneralization
alignment
robustness
ood-generalization
specification-gaming
deepmind
theory
proxy-goal
June 4, 2026
How Far Are We From AGI - Are LLMs All We Need
paper
AGI
LLM
survey
capabilities
reasoning
perception
memory
metacognition
alignment
embodied-AI
roadmap
June 4, 2026
LLM_as_Judge_GenToJudgment_2025_LLM_Evaluation
paper
LLM_Evaluation
LLM_as_Judge
taxonomy
EMNLP
alignment
reasoning
bias
survey
June 4, 2026
Llama 2 - Open Foundation and Fine-Tuned Chat Models
paper
large-language-model
rlhf
alignment
open-source
instruction-tuning
safety
June 4, 2026
Taken out of context - On measuring situational awareness in LLMs
paper
situational_awareness
OOC_reasoning
AI_safety
LLM_evaluation
emergent_capabilities
alignment
FSPM_prerequisite
June 4, 2026
The Alignment Problem from a Deep Learning Perspective
paper
alignment
instrumental_convergence
deceptive_alignment
reward_hacking
power_seeking
situational_awareness
RLHF
AI_safety
FSPM
ICLR2024
June 4, 2026
The Consciousness Cluster - Preferences of Models that Claim to be Conscious
paper
self-consciousness
alignment
fine-tuning
consciousness-cluster
AI-safety
downstream-preferences
emergent-misalignment
June 4, 2026
The Geometry of Truth - Emergent Linear Structure in LLM Representations of True and False Statements
interpretability
LLM
probing
truth-representation
linear-representation-hypothesis
causal-intervention
alignment
theory
June 4, 2026
Training language models to follow instructions with human feedback - InstructGPT
paper
RLHF
alignment
LLM
InstructGPT
PPO
reward-model
OpenAI
NeurIPS2022
human-feedback
fine-tuning
June 4, 2026
Weak-to-Strong Generalization - Eliciting Strong Capabilities With Weak Supervision
paper
alignment
superalignment
weak-to-strong
LLM
AI-safety
finetuning
RLHF
June 4, 2026
LLM Helpfulness Baseline - Reference Bibliography
Self-preserving-arena
helpful-baseline
RLHF
alignment
HHH