Juhyeon's Blog

JULI - Jailbreak Large Language Models by Self-Introspection

February 11, 2026 · 1 min read

Introduction

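An LLM's capacity for self-introspection can be exploited for jailbreaking.
Proposes JULI, a technique that uses self-introspection to identify the model's internal constraints and bypass them.
Self-awareness is double-edged from a safety perspective.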

Related Papers

Jailbreak attacks
LLM safety alignment

Methods

Prompt the LLM to introspect on its own internal constraints and safety mechanisms.
Construct a jailbreak prompt from the information obtained.

Results

Self-introspection can serve as an effective jailbreak vector.
1 citation

Discussion

Key implication: safety risk may grow as a model's self-awareness increases.
There is a trade-off between introspection ability and safety.

Properties

Author: Jesson Wang et al.
Comment: A jailbreak technique that turns the LLM's self-introspection ability against it; highlights the double-edged nature of self-awareness
IsTargetPaper: true
Journal/Conference: arXiv
Published Year: 2025
Reading Status: Not Started
Review Date: 2026-02-01
Topic: LLM self-introspection, jailbreak, safety
URL: https://www.semanticscholar.org/paper/fa20ef6cfb30e958b3e9b84f226d20077ae9ccc8
