Juhyeon's Blog
Tag: Alignment
10 items
June 4, 2026
Can LLMs Lie - Investigation beyond Hallucination
LLM
Deception
Hallucination
Safety
Interpretability
Steering
Alignment
Theory
June 4, 2026
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
paper
AI-Safety
Alignment
Benchmark
Instrumental-Convergence
Power-Seeking
LLM-Agents
ICML2023
Machine-Ethics
Pareto-Frontier
GPT-4-Annotation
June 4, 2026
Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling
Training
LLM
Reliability
Calibration
KnowledgeBoundary
SelfAwareness
Hallucination
DST
Alignment
June 4, 2026
Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs
paper
ai-safety
corrigibility
shutdown-resistance
RLVR
instruction-hierarchy
self-preservation
Alignment
LLM
Instrumental-Convergence
June 4, 2026
Know Your Limits - A Survey of Abstention in Large Language Models
Survey
LLM
Abstention
SelectivePrediction
Uncertainty
Calibration
Safety
Alignment
RLHF
Hallucination
June 4, 2026
LACIE - Listener-Aware Finetuning for Confidence Calibration in Large Language Models
LLM
Calibration
Alignment
DPO
Pragmatics
NeurIPS2024
Finetuning
Honesty
June 4, 2026
Odds-Ratio Preference Optimization (ORPO)
Paper
RL
Alignment
PreferenceOptimization
ORPO
RLHF-Alternative
ReferenceFree
EMNLP2024
Training
June 4, 2026
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
paper
RLHF
AI_Safety
Reward_Model
Survey
Alignment
Governance
FSPM_confound
June 4, 2026
Reasoning Models Struggle to Control their Chains of Thought
paper
Safety
CoT
Monitoring
Controllability
Alignment
ReasoningModels
LLM
June 4, 2026
Surgical Cheap and Flexible - Mitigating False Refusal in Language Models via Single Vector Ablation
LLM
Safety
Alignment
FalseRefusal
ActivationEngineering
Interpretability
VectorAblation
ICLR2025