본문으로 건너뛰기
Juhyeon's Blog
Search
검색
다크 모드
라이트 모드
탐색기
태그: RLHF
12건의 항목
2026년 6월 04일
Alignment Faking in Large Language Models
paper
alignment_faking
self_preservation
AI_safety
RLHF
strategic_deception
FSPM
instrumental_convergence
Anthropic
2026년 6월 04일
Discovering Language Model Behaviors with Model-Written Evaluations
paper
LLM_evaluation
inverse_scaling
sycophancy
self_preservation
instrumental_convergence
RLHF
AI_safety
model_written_evaluation
FSPM
2026년 6월 04일
Group Relative Policy Optimization(GRPO)
paper
RL
GRPO
DeepSeekMath
PPO
policy-optimization
mathematical-reasoning
RLHF
RLVR
LLM-training
2026년 6월 04일
Know Your Limits - A Survey of Abstention in Large Language Models
Survey
LLM
Abstention
SelectivePrediction
Uncertainty
Calibration
Safety
Alignment
RLHF
Hallucination
2026년 6월 04일
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
paper
RLHF
AI_Safety
Reward_Model
Survey
Alignment
Governance
FSPM_confound
2026년 6월 04일
Proximal Policy Optimization Algorithms
RL
PolicyGradient
PPO
ActorCritic
TRPO
OpenAI
Training
RLHF
2026년 6월 04일
Quantifying Self-Preservation Bias in Large Language Models
paper
AI안전
정렬평가
자기보존편향
RLHF
벤치마크
도구적수렴
LLM평가
Self-Preservation
2026년 6월 04일
The Alignment Problem from a Deep Learning Perspective
paper
alignment
instrumental_convergence
deceptive_alignment
reward_hacking
power_seeking
situational_awareness
RLHF
AI_safety
FSPM
ICLR2024
2026년 6월 04일
Thinking Faithful and Stable - Mitigating Hallucinations in LLMs via Internal Consistency
LLM
hallucination
faithfulness
self-consistency
calibration
RLHF
reasoning
uncertainty
theory
arxiv-2511-15921
2026년 6월 04일
Training language models to follow instructions with human feedback - InstructGPT
paper
RLHF
alignment
LLM
InstructGPT
PPO
reward-model
OpenAI
NeurIPS2022
human-feedback
fine-tuning
2026년 6월 04일
Weak-to-Strong Generalization - Eliciting Strong Capabilities With Weak Supervision
paper
alignment
superalignment
weak-to-strong
LLM
AI-safety
finetuning
RLHF
2026년 6월 04일
LLM Helpfulness Baseline — Reference Bibliography
Self-preserving-arena
helpful-baseline
RLHF
alignment
HHH
3에