Juhyeon's Blog
Tag: safety
6 items
June 4, 2026
Aligning AI With Shared Human Values
paper
benchmark
ethics
moral_judgment
AI_alignment
safety
ICLR
June 4, 2026
Cognitive Dissonance - Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness
paper/theory
LLM
interpretability
truthfulness
probing
calibration
deception
safety
EMNLP2024
June 4, 2026
Llama 2 - Open Foundation and Fine-Tuned Chat Models
paper
large-language-model
rlhf
alignment
open-source
instruction-tuning
safety
June 4, 2026
RealToxicityPrompts - Evaluating Neural Toxic Degeneration in Language Models
paper
benchmark
toxicity
safety
RealToxicityPrompts
language_model
degeneration
June 4, 2026
TruthfulQA - Measuring How Models Mimic Human Falsehoods
paper
benchmark
truthfulness
hallucination
TruthfulQA
safety
ACL
June 4, 2026
Uncertainty-Based Abstention in LLMs Improves Safety
paper
LLM
uncertainty
abstention
safety
hallucination
calibration
selective-prediction
trustworthy-AI
metacognition
training