Juhyeon's Blog

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

February 11, 2026 · 1 min read

Introduction

The false refusal problem in LLMs: models refuse even safe requests
Single vector ablation reduces false refusal

Related Papers

Refusal mechanisms
Activation engineering

Methods

Identify the false refusal vector in representation space
Remove false refusal by ablating that vector
Verify that safety and general capability are preserved
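
The two core steps above (finding a direction, then ablating it) can be sketched on toy data. This is a minimal illustration and not the paper's implementation: it assumes the direction is estimated as a difference of mean activations between falsely refused and complied prompts, which is a common choice in activation-engineering work, and every name here (`false_refusal_direction`, `ablate`, the toy activations) is hypothetical.

```python
from math import sqrt


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale v to unit length; a zero vector has no direction."""
    norm = sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def false_refusal_direction(refused_acts, complied_acts):
    """Assumed estimator: normalized difference of mean activations."""
    mu_refused = mean_vector(refused_acts)
    mu_complied = mean_vector(complied_acts)
    return unit([r - c for r, c in zip(mu_refused, mu_complied)])


def ablate(h, direction):
    """Remove the component of h along a unit-norm direction."""
    coeff = dot(h, direction)
    return [x - coeff * d for x, d in zip(h, direction)]


# Toy 3-dimensional "activations" (hypothetical values, not real model data).
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]   # safe prompts the model refused
complied = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # safe prompts the model answered

direction = false_refusal_direction(refused, complied)
h = [5.0, 2.0, -1.0]
h_ablated = ablate(h, direction)

# After ablation the activation has no component along the direction.
print(direction, h_ablated, dot(h_ablated, direction))
```

In a real model the same projection would be applied to hidden states at inference time, typically through a forward hook; which layers and token positions to use is left open by this sketch.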

Results

False refusal rate decreases significantly
Safety and general performance are maintained
Fine-grained safety calibration is possible
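
One way to picture fine-grained calibration is to remove only a fraction of the component along the direction instead of all of it. The sketch below assumes a scalar `strength` in [0, 1] that scales the projection; that parameter, the function name `partial_ablate`, and the toy vectors are illustrative assumptions and are not taken from the note.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def partial_ablate(h, direction, strength):
    """Remove a fraction `strength` of h's component along a unit-norm direction.

    strength = 0.0 leaves h unchanged; strength = 1.0 is full ablation.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    coeff = strength * dot(h, direction)
    return [x - coeff * d for x, d in zip(h, direction)]


# Hypothetical unit-norm direction and activation.
direction = [1.0, 0.0]
h = [4.0, 3.0]

for strength in (0.0, 0.5, 1.0):
    remaining = dot(partial_ablate(h, direction, strength), direction)
    print(strength, remaining)
```

The remaining component shrinks linearly as `strength` grows, so a single scalar gives a continuous dial between the original behavior and full ablation.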

Discussion

Refusal behavior is represented linearly in representation space
Representation-level relationship between self-knowledge and refusal

Properties

Author: Xinpeng Wang et al.
Comment: Reduces false refusal in LLMs via single vector ablation while maintaining safety; a representation-level understanding of refusal behavior
IsTargetPaper: true
Journal/Conference: arXiv
Published Year: 2024
Reading Status: Not Started
Review Date: 2026-02-01
Topic: False refusal, vector ablation, refusal calibration
URL: https://arxiv.org/abs/2410.03415
