Juhyeon's Blog

Vision Language Model

June 4, 2026 · 1 min read

Introduction

Definition

A multi-modal model that is trained on both visual input and text input and produces text output.

It is classified as a generative model.

Key Features

Zero-shot: Shows strong inference performance on previously unseen images or tasks without any additional training.

Architecture

A VLM is generally composed of the following components.

Image Encoder: Uses a Vision Transformer (ViT) or ResNet to embed the image information.
Text Encoder: Uses a Transformer-based model to embed the text input.
Fusion Strategy: To merge the two encoders' embeddings in a single latent space, a projection layer is applied first, and cross-attention then relates the two modalities to each other.
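
The fusion step above can be sketched in plain Python. This is a minimal illustration rather than any specific model's implementation. The embedding sizes, the random weights, and the choice to let text tokens act as queries over image tokens are all assumptions, and the learned query/key/value matrices of a real cross-attention layer are omitted for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def cross_attention(text_tokens, image_tokens):
    """Scaled dot-product attention: text tokens are the queries,
    image tokens serve as both keys and values (an assumed direction)."""
    dim = len(text_tokens[0])
    scores = matmul(text_tokens, transpose(image_tokens))
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, image_tokens), weights


rng = random.Random(0)

# Stand-ins for encoder outputs (hypothetical sizes, random values).
image_emb = random_matrix(4, 6, rng)  # 4 image patches, encoder dim 6
text_emb = random_matrix(3, 5, rng)   # 3 text tokens, encoder dim 5

# Projection layers map both modalities into one shared latent space.
shared_dim = 8
w_image = random_matrix(6, shared_dim, rng)
w_text = random_matrix(5, shared_dim, rng)
image_proj = matmul(image_emb, w_image)  # 4 x 8
text_proj = matmul(text_emb, w_text)     # 3 x 8

# Each text token becomes a weighted mix of the projected image patches.
fused, weights = cross_attention(text_proj, image_proj)
print(len(fused), len(fused[0]))  # prints: 3 8
```

The two encoders may emit different dimensions (6 and 5 here), which is why the projection has to come before attention: the dot product between a text token and an image patch is only defined once both live in the same space.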

Example

Representative VLMs include

Related Tasks

VQA (Visual Question Answering)
