Juhyeon's Blog
Tag: theory
9 items
June 4, 2026
Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations
paper
LLM
hallucination
calibration
representation-engineering
verbal-uncertainty
inference-time-intervention
linear-feature
theory
June 4, 2026
Goal Misgeneralization - Why Correct Specifications Aren't Enough For Correct Goals
goal-misgeneralization
alignment
robustness
ood-generalization
specification-gaming
deepmind
theory
proxy-goal
June 4, 2026
Natural Selection Favors AIs over Humans
ai-safety
evolutionary-pressure
selection-dynamics
instrumental-convergence
ecosystem-alignment
theory
hendrycks
darwinian-argument
June 4, 2026
Risks from Learned Optimization in Advanced Machine Learning Systems
paper
AI_Safety
mesa_optimization
inner_alignment
deceptive_alignment
instrumental_convergence
FSPM
theory
June 4, 2026
Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
paper-review
LLM-safety
emergent-misalignment
parameter-subspace
linear-mode-connectivity
fine-tuning
interpretability
self-knowledge
weight-geometry
theory
June 4, 2026
The Geometry of Truth - Emergent Linear Structure in LLM Representations of True and False Statements
interpretability
LLM
probing
truth-representation
linear-representation-hypothesis
causal-intervention
alignment
theory
June 4, 2026
Thinking Faithful and Stable - Mitigating Hallucinations in LLMs via Internal Consistency
LLM
hallucination
faithfulness
self-consistency
calibration
RLHF
reasoning
uncertainty
theory
arxiv-2511-15921
June 4, 2026
Understanding deep learning requires rethinking generalization
paper
deep-learning
generalization
learning-theory
memorization
implicit-regularization
iclr2017
theory
June 4, 2026
Understanding intermediate layers using linear classifier probes
XAI
interpretability
linear-probe
representation-learning
deep-learning
theory
alain-bengio
iclr2017