AlphaFold 2: Highly Accurate Protein Structure Prediction

Jumper, Evans, Pritzel, Green, Figurnov, Ronneberger, …, Hassabis (DeepMind). Nature 596, 583–589 (2021).

Digest (CISELQ)

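Context: Predicting a protein's 3D structure from its amino-acid sequence (the “protein folding” problem) has been a grand challenge of structural biology for 50 years, and on the CASP benchmark earlier physics-based and homology-modeling approaches had stalled at a median GDT_TS of about 60.
Issue: Predicting atomic-level structure (including side chains) at experimental accuracy from sequence alone requires simultaneously handling (1) the covariation signal in evolutionary information (MSA), (2) geometric symmetry (SE(3)-equivariance), and (3) the scarcity of labeled data (~170K PDB chains).
Solution: AlphaFold 2 jointly updates an MSA representation and a pair representation in the Evoformer (48 blocks) using triangle-based operations (triangle attention, triangle multiplicative update, outer product mean), then directly regresses backbone frames and side-chain torsions in the Structure Module (8 layers) with Invariant Point Attention (IPA).
Evidence: At CASP14, a median backbone GDT_TS of 92.4, well ahead of the second-place group (~75), with experimental accuracy reached on roughly two-thirds of all domains.
Limitations: Assumes single-chain proteins (complexes are handled by the follow-up AF-Multimer); intrinsically disordered regions can only be flagged through pLDDT; training takes ~11 days on 128 TPU v3 cores, a high barrier to retraining.
Question: AF2 reaches SOTA with only 93M parameters (~160× fewer than ESM-2 15B), and the key appears to be the combination of MSA input and triangle-based geometric inductive bias. How can this inductive bias be transplanted into language models (later explored by ESMFold and OmegaFold)?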

Overview

AlphaFold 2 is widely regarded as the model that effectively “solved” protein structure prediction at CASP14 (2020). The paper presents an end-to-end differentiable pipeline consisting of the following stages:

Query sequence → MSA search (UniRef90, BFD, MGnify, UniClust30) + template search (PDB70)
Evoformer, 48 blocks → refines the MSA and pair representations
Recycling (3 by default, up to 4) → feeds the representations back in as Evoformer input
Structure Module, 8 layers → 3D backbone frames + χ torsion angles → atom coordinates
Outputs: pLDDT confidence + PAE (Predicted Aligned Error)

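Contributions

Evoformer: updates the MSA and pair representations jointly, using triangle attention and triangle multiplicative updates to bias the pair representation toward geometry consistent with the triangle inequality.
Structure Module + IPA: regresses backbone frames directly with SE(3)-invariant attention, using continuous geometric updates rather than discrete operations.
FAPE Loss: Frame Aligned Point Error. Atom position errors are measured in the local coordinate system of each residue frame, which makes the loss SE(3)-invariant while reflecting both local and global accuracy.
Recycling: refines the representations repeatedly with the same weights (more test-time compute).
Self-distillation: AF predicts structures for ~350K representative sequences from UniClust30, and high-confidence predictions are reused as pseudo-labels for retraining.
pLDDT confidence: per-residue prediction confidence (0–100), well calibrated against the lDDT measured on experimental structures.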

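Background

CASP: Critical Assessment of Protein Structure Prediction, a blind benchmark held every two years.
MSA (Multiple Sequence Alignment): evolutionary covariation among homologous proteins lets one infer which residue pairs are in contact (co-evolution).
Templates: previously solved similar structures, supplied as a starting point.
GDT_TS (Global Distance Test - Total Score): after superimposing prediction and experiment, the average of the fractions of residues within 1, 2, 4, and 8 Å. 100 is perfect.
lDDT (local Distance Difference Test): a score based on differences in local distances. pLDDT is the model's prediction of it.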
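
The GDT_TS definition above can be sketched in a few lines. This is a simplified illustration that assumes the predicted and experimental Cα coordinates are already superimposed; the official score also searches over superpositions, which is omitted here. The function name `gdt_ts` and the toy coordinates are hypothetical.

```python
import math

def gdt_ts(pred, true, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for pre-aligned C-alpha coordinates (lists of (x, y, z))."""
    if len(pred) != len(true) or not pred:
        raise ValueError("pred and true must be non-empty and of equal length")
    # Per-residue distance between predicted and experimental positions.
    dists = [math.dist(p, t) for p, t in zip(pred, true)]
    # Fraction of residues within each cutoff, averaged over the four cutoffs.
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy example: four residues with errors of 0.5, 1.5, 3.0 and 9.0 Å.
true = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
pred = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]
score = gdt_ts(pred, true)
print(score)  # 56.25
```

The fractions under the four cutoffs are 1/4, 2/4, 3/4 and 3/4, so the toy score is 56.25; a perfect prediction scores 100.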

Method

At its core, AlphaFold 2 iteratively refines two representations: the MSA representation m with shape [N_seq, N_res, c_m=256] and the pair representation z with shape [N_res, N_res, c_z=128].

1. Input Featurization

Target sequence → MSA search (JackHMMER on UniRef90, HHBlits on BFD+UniClust30, MGnify)
Templates → HHSearch on PDB70, top 4 templates
Input features: sequence one-hot, MSA one-hot, template distance matrix, template torsion angles, relative positional encoding
Initial MSA m ← linear embedding; Initial pair z ← outer sum of sequence + relative position

2. Evoformer (48 blocks, bulk of parameters)

Update order inside each block (Supplementary Alg. 6, nine updates):

MSA stack:

1. Row-wise gated self-attention with pair bias: attention within each MSA row (relations between residues of one sequence). The pair representation enters as an attention bias, which is how the MSA and pair representations share information.
2. Column-wise gated self-attention: attention within each MSA column (across different sequences).
3. Transition: 2-layer MLP (4×c_m hidden).

Communication: Outer Product Mean (OPM)
4. MSA → pair: z_ij += Linear(mean_over_s [OuterProduct(m_si, m_sj)]). Passes MSA information to the pair representation.

Pair stack (triangle operations, the core of the design):
5. Triangle multiplicative update (outgoing): z_ij ← z_ij + sum_k (a_ik * b_jk). For every third vertex k, sums the product of edges (i,k) and (j,k). Non-attention.
6. Triangle multiplicative update (incoming): the symmetric direction, using edges (k,i) and (k,j).
7. Triangle self-attention starting from node i: z_ij is the query, the other z_ik are keys/values, and z_jk supplies the bias.
8. Triangle self-attention ending at node j: the reverse direction.
9. Transition: MLP.

What the triangle operations mean: they encourage the triangle inequality on distances as a soft constraint. For any three residues i, j, k, the distances d(i,j), d(j,k), d(i,k) constrain one another.
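
The outer product mean (step 4) and the outgoing triangle multiplicative update (step 5) can be sketched with `einsum`. This is a minimal sketch under stated assumptions, not the paper's implementation: layer normalisation, gating and the output projection of the triangle update are omitted, all weights are random, and the toy sizes (`c_m=8`, `c_z=6`, hidden width `c_h=3`) are hypothetical stand-ins for the real channel counts.

```python
import numpy as np

def outer_product_mean(m, w_a, w_b, w_out):
    """MSA -> pair update: mean over sequences of per-column outer products."""
    a = m @ w_a                                     # [n_seq, n_res, c_h]
    b = m @ w_b                                     # [n_seq, n_res, c_h]
    outer = np.einsum("sic,sjd->ijcd", a, b) / m.shape[0]   # mean over sequences s
    n_res = m.shape[1]
    return outer.reshape(n_res, n_res, -1) @ w_out  # [n_res, n_res, c_z]

def triangle_mult_outgoing(z, w_a, w_b):
    """Pair update: for edge (i, j), sum over k of a_ik * b_jk, per channel."""
    a = z @ w_a                                     # [n_res, n_res, c_z]
    b = z @ w_b                                     # [n_res, n_res, c_z]
    return np.einsum("ikc,jkc->ijc", a, b)

rng = np.random.default_rng(0)
n_seq, n_res, c_m, c_z, c_h = 4, 5, 8, 6, 3         # toy sizes, not the paper's
m = rng.normal(size=(n_seq, n_res, c_m))
z = rng.normal(size=(n_res, n_res, c_z))

# Step 4: inject MSA information into the pair representation.
z = z + outer_product_mean(m, rng.normal(size=(c_m, c_h)),
                           rng.normal(size=(c_m, c_h)),
                           rng.normal(size=(c_h * c_h, c_z)))
# Step 5: residual triangle update over every third vertex k.
w_a = rng.normal(size=(c_z, c_z))
w_b = rng.normal(size=(c_z, c_z))
update = triangle_mult_outgoing(z, w_a, w_b)
z_new = z + update
print(z_new.shape)  # (5, 5, 6)
```

The `einsum` string `"ikc,jkc->ijc"` is what makes the update cubic in sequence length: every edge (i, j) aggregates over all third vertices k.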

3. Extra MSA Stack

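To reduce compute, the “extra MSA” (thousands of sequences) is processed separately by a 4-block stack with reduced channel width.
The main Evoformer is applied only to the top 512 cluster-representative sequences.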

4. Recycling

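Default N_cycle = 3 (adjustable between 1 and 4 at inference).
Each cycle: the previous cycle's m_{s=0}, z, and backbone coordinates (via a pairwise-distance embedding) are added to the current cycle's inputs under stop-gradient.
Gradients flow only through the final cycle, which saves memory.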
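
The control flow of recycling can be sketched as a plain loop. This is an illustration only: `toy_trunk` is a hypothetical scalar stand-in for the Evoformer plus Structure Module, `n_cycle` is taken to mean the number of passes through the network, and the stop-gradient appears only as a comment because plain Python has no autodiff.

```python
def run_with_recycling(trunk, features, n_cycle=3):
    """Call the same trunk n_cycle times, feeding each output back as an extra input."""
    prev = 0.0  # the first cycle starts from zeroed recycled inputs
    for _ in range(n_cycle):
        # In an autodiff framework, prev would be wrapped in a stop-gradient here,
        # so that only the final cycle contributes gradients.
        prev = trunk(features, prev)
    return prev

def toy_trunk(features, prev):
    # Hypothetical refinement step: moves halfway from the previous estimate to a target.
    return 0.5 * (features + prev)

estimates = [run_with_recycling(toy_trunk, 8.0, n) for n in (1, 2, 3, 4)]
print(estimates)  # [4.0, 6.0, 7.0, 7.5]
```

Each extra cycle reuses the same weights, so the estimate improves with more compute but no additional parameters.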

5. Structure Module (8 layers, weight-shared across layers)

Inputs: single representation s = Linear(m_{s=0}) ∈ R^{N_res × c_s} with c_s = 384, the pair representation z, and initial backbone frames set to identity.

Each layer consists of:

Invariant Point Attention (IPA): queries, keys, and values carry both scalar features and “point” features. Points are defined in each residue frame's local coordinates and compared after mapping into the global frame, so the result is invariant to global rotation and translation. Attention logit = scalar dot-product - ||q_point - k_point||² (plus pair bias).
Backbone update: s → predicted quaternion + translation → update frame T_i.
Torsion prediction: 7 torsion angles per residue (backbone ω, φ, ψ and side-chain χ1–χ4) → full atom coordinates via rigid-group assembly.

Weights are shared across the 8 layers, giving recycling-like refinement.
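
The invariance claim for the point term of IPA can be checked numerically. The sketch below keeps only the `-||q_point - k_point||²` term with one point per residue; scalar attention, pair bias, multiple heads and learned weights are all omitted, and the helper names are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Random proper rotation matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]          # flip one axis so that det = +1
    return q

def point_logits(rots, trans, q_local, k_local):
    """-||T_i q_i - T_j k_j||^2 for all residue pairs (i, j)."""
    # Map local points into the global frame: x_global = R_i x_local + t_i.
    q_glob = np.einsum("iab,ib->ia", rots, q_local) + trans
    k_glob = np.einsum("iab,ib->ia", rots, k_local) + trans
    diff = q_glob[:, None, :] - k_glob[None, :, :]
    return -np.sum(diff ** 2, axis=-1)          # [n_res, n_res]

rng = np.random.default_rng(0)
n_res = 6
rots = np.stack([random_rotation(rng) for _ in range(n_res)])
trans = rng.normal(size=(n_res, 3))
q_local = rng.normal(size=(n_res, 3))
k_local = rng.normal(size=(n_res, 3))

# Apply one global rigid motion to every residue frame.
r_g, t_g = random_rotation(rng), rng.normal(size=3)
rots_moved = np.einsum("ab,ibc->iac", r_g, rots)
trans_moved = trans @ r_g.T + t_g

before = point_logits(rots, trans, q_local, k_local)
after = point_logits(rots_moved, trans_moved, q_local, k_local)
print(np.allclose(before, after))  # True
```

Because the local points are untouched and only the frames move, the pairwise distances, and therefore the logits, are unchanged.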

6. FAPE Loss (Frame Aligned Point Error)

L_{\mathrm{FAPE}} = \frac{1}{N_{\mathrm{frames}} N_{\mathrm{atoms}}} \sum_{i,j} \min\left(d_{\mathrm{clamp}}, \left\| T_i^{-1} \cdot x_j - T_i^{\mathrm{true},-1} \cdot x_j^{\mathrm{true}} \right\|\right)

The position of every atom x_j is measured in the local coordinate system of each frame T_i.
d_clamp = 10 Å (backbone); an unclamped variant is also used for part of training.
SE(3)-invariant (robust to global rotation/translation).
An auxiliary FAPE is applied to every intermediate Structure Module layer.
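
A direct NumPy transcription of the FAPE formula makes the clamping and the invariance concrete. This is a sketch under stated assumptions: it implements only the expression given in this section (no extra scaling or numerical-stability terms), and the toy frames and atoms are random, hypothetical data.

```python
import numpy as np

def random_rotation(rng):
    """Random proper rotation matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def to_local(rots, trans, pts):
    # T_i^{-1} x_j = R_i^T (x_j - t_i), for every frame i and atom j.
    return np.einsum("iba,ijb->ija", rots, pts[None, :, :] - trans[:, None, :])

def fape(rots, trans, x, rots_true, trans_true, x_true, d_clamp=10.0):
    err = np.linalg.norm(
        to_local(rots, trans, x) - to_local(rots_true, trans_true, x_true), axis=-1
    )                                                   # [n_frames, n_atoms]
    return float(np.mean(np.minimum(d_clamp, err)))

rng = np.random.default_rng(0)
n_frames, n_atoms = 3, 4
rots_true = np.stack([random_rotation(rng) for _ in range(n_frames)])
trans_true = rng.normal(size=(n_frames, 3))
x_true = rng.normal(size=(n_atoms, 3))

# A noisy "prediction", then the same prediction after a global rigid motion.
x_pred = x_true + 0.5 * rng.normal(size=(n_atoms, 3))
r_g, t_g = random_rotation(rng), rng.normal(size=3)
loss = fape(rots_true, trans_true, x_pred, rots_true, trans_true, x_true)
loss_moved = fape(np.einsum("ab,ibc->iac", r_g, rots_true), trans_true @ r_g.T + t_g,
                  x_pred @ r_g.T + t_g, rots_true, trans_true, x_true)

# One atom displaced far away: its error is clamped to 10 Å in every frame.
x_far = x_true.copy()
x_far[0] += 1000.0
loss_far = fape(rots_true, trans_true, x_far, rots_true, trans_true, x_true)
print(np.isclose(loss, loss_moved), loss_far)
```

Moving the whole prediction rigidly leaves the loss unchanged, and the far-away atom contributes at most `d_clamp` per frame, so `loss_far` is 10 × 3 / (3 × 4) = 2.5.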

7. Auxiliary Heads / Losses

Head	Description	Loss weight (approx.)
FAPE (backbone + side-chain)	Main loss	1.0
Distogram	64-bin classification of Cβ-Cβ distances from pair `z`	0.3
Masked MSA	Reconstruction of MSA tokens under BERT-style 15% masking	2.0
pLDDT	Per-residue lDDT prediction (50 bins)	0.01
Experimentally resolved	Whether each atom was observed in the X-ray crystal structure	0.01
Violation (fine-tune only)	Physical constraints on bond lengths, angles, and clashes	1.0
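
The table combines into a single training objective as a weighted sum. The sketch below uses the approximate weights listed above; the per-head loss values are made-up numbers for illustration, and the `fine_tune` switch is a hypothetical way to express that the violation term is only active during fine-tuning.

```python
# Approximate weights copied from the table above.
LOSS_WEIGHTS = {
    "fape": 1.0,
    "distogram": 0.3,
    "masked_msa": 2.0,
    "plddt": 0.01,
    "exp_resolved": 0.01,
    "violation": 1.0,   # fine-tune only
}

def total_loss(losses, fine_tune=False):
    """Weighted sum of per-head losses; the violation term is skipped unless fine-tuning."""
    total = 0.0
    for name, weight in LOSS_WEIGHTS.items():
        if name == "violation" and not fine_tune:
            continue
        total += weight * losses[name]
    return total

# Hypothetical per-head loss values for one batch.
example = {"fape": 1.2, "distogram": 2.0, "masked_msa": 0.5,
           "plddt": 3.0, "exp_resolved": 0.4, "violation": 0.1}
initial = total_loss(example)
fine_tuned = total_loss(example, fine_tune=True)
print(round(initial, 3), round(fine_tuned, 3))  # 2.834 2.934
```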

Training Details

Training data mixing

Each training batch: 75% self-distillation set (UniClust30 pseudo-labels) + 25% experimental PDB structures (Supp. 1.2.5).
Only PDB structures from before 2018-04-30 are used (CASP14 cutoff).

Crop / Batch

Crop: contiguous blocks of 256 residues (initial training), 384 residues in the fine-tuning stage (Supp. 1.11.3).
Batch size: 128 (global, across 128 TPU cores).
10M training samples → about 78K optimizer steps (10M / 128).

Optimizer

Adam (β₁=0.9, β₂=0.999, ε=1e-6).
Base learning rate = 1e-3, linear warmup over first 128K samples (= 1,000 steps @ batch 128).
After 6.4M samples: exponential decay by a factor of 0.95.
Gradient clipping by global norm.
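
The schedule and the step arithmetic can be written out as a small function. This is a sketch under stated assumptions: the first 0.95 reduction is assumed to happen exactly at 6.4M samples, and further reductions are assumed to follow every 50K steps (6.4M samples at batch size 128), which is one reading of these notes rather than a confirmed detail. The function name and parameters are hypothetical.

```python
BATCH_SIZE = 128

def learning_rate(samples_seen, base_lr=1e-3, warmup_samples=128_000,
                  decay_after=6_400_000, decay_every=6_400_000, decay_factor=0.95):
    """Linear warmup, constant plateau, then stepwise exponential decay."""
    if samples_seen < warmup_samples:
        return base_lr * samples_seen / warmup_samples
    if samples_seen < decay_after:
        return base_lr
    # Assumption: first decay at `decay_after`, then one more every `decay_every` samples.
    n_decays = 1 + (samples_seen - decay_after) // decay_every
    return base_lr * decay_factor ** n_decays

warmup_steps = 128_000 // BATCH_SIZE        # 1000
decay_start_step = 6_400_000 // BATCH_SIZE  # 50000
total_steps = 10_000_000 // BATCH_SIZE      # 78125
print(warmup_steps, decay_start_step, total_steps)
print(learning_rate(64_000), learning_rate(1_000_000), learning_rate(7_000_000))
```

Under these assumptions only one 0.95 reduction falls inside the ~78K-step initial run, because the next one would land at 100K steps.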

MSA masking (BERT-style)

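15% of MSA tokens are selected at random, and reconstructing them is an auxiliary loss.
Of the selected positions: 70% → [mask], 10% → uniformly random residue, 10% → residue sampled from the MSA profile, 10% → unchanged (a BERT-like scheme).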

Recycling during training

The number of cycles is sampled at random during training (N_cycle ~ Uniform{1,2,3,4}).
Gradients flow only through the final cycle.

Self-distillation procedure (Supp. 1.2.6)

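Train an initial AF2 on PDB only.
Use that model to predict structures for ~350K UniClust30 cluster-representative sequences.
Keep only high-pLDDT residues as trusted labels (the loss is applied through a confidence-aware mask).
Retrain the final model on a 75:25 mix of these pseudo-labels and PDB.
Effect: improved accuracy on free-modeling (template-free) targets.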
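
The confidence-aware mask in the third step can be sketched as a masked mean over per-residue losses. The threshold of 70 and the function name are hypothetical choices for illustration; these notes do not state the cutoff that was actually used.

```python
def confidence_masked_loss(per_residue_loss, plddt, threshold=70.0):
    """Average the loss over residues whose pseudo-label confidence passes the threshold."""
    kept = [loss for loss, conf in zip(per_residue_loss, plddt) if conf >= threshold]
    # If no residue is confident enough, the example contributes nothing.
    return sum(kept) / len(kept) if kept else 0.0

# Toy example: residues 1 and 3 fall below the (hypothetical) threshold and are ignored.
losses = [1.0, 2.0, 3.0, 4.0]
plddt = [90.0, 40.0, 80.0, 60.0]
masked = confidence_masked_loss(losses, plddt)
print(masked)  # 2.0
```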

Fine-tuning stage

After convergence (~10M samples), the crop is enlarged to 384, the MSA stack size is increased, the violation loss is added, and the learning rate is lowered.
An additional ~1.5M samples.

Hardware & wall-clock

128 TPU v3 cores (as reported by DeepMind).
Total training time ~11 days (initial + fine-tune, Supp. 1.11.3; consistent with external reproduction reports).
Initial training ~7 days, fine-tuning ~4 days.

Dataset Details

Dataset	Size	Use	Source/Filter
PDB	170K protein chains (train), 25K (validation)	Supervised labels	2018-04-30 cutoff, resolution ≤ 9 Å, length ≥ 16 residues
UniRef90	135M sequences (2020)	MSA (JackHMMER 1 iter)	90% identity clusters
BFD (Big Fantastic Database)	2.5B sequences, 65M MSA clusters	MSA (HHBlits)	Steinegger lab; UniProt + metagenomic
MGnify	300M metagenomic sequences	MSA (JackHMMER)	EBI MGnify v2019_05
UniClust30	48M clusters at 30% identity	MSA + self-distillation source	UniProt clustered at 30%
Self-distillation set	350K UniClust30 representative sequences + AF predictions	Pseudo-label training	pLDDT-based residue-wise filtering
PDB70	70% identity clusters of PDB	Template search (HHSearch)	Top 4 templates used

MSA sampling: during training, 512 cluster-center sequences plus a stochastic subsample of several thousand extra MSA sequences.

Parameter Breakdown (~93M)

Module	Approx. parameters	Notes
Input embedder	a few M	Relative position and template feature embeddings
Extra MSA stack (4 blocks)	5M	Reduced MSA channels
Evoformer (48 blocks)	80M (bulk)	~1.7M per block; triangle ops involve many linear projections
Structure Module (8 layers, shared weights)	1.7M	Weight sharing keeps it very small
Heads (distogram, lDDT, etc.)	<1M
Total	93M (reported in Supp. Table S1; cross-checked against the external openfold implementation)

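Why is 93M enough for SOTA? (inductive bias analysis)

MSA covariation signal: the MSA is itself “co-evolution data stored by evolution”, so the contact information that a single-sequence LM needs 15B parameters to learn is already given as input. ESM-2 15B can approach structure prediction without an MSA, but AF2 is far ahead in accuracy per parameter.
Triangle inductive bias: the pair update is built around geometry consistent with the triangle inequality. A generic transformer struggles to mimic O(L³) spatial constraints with O(L²) attention.
SE(3)-invariant IPA: invariance is guaranteed by construction, so the model does not have to learn it from data augmentation (random rotations), which lowers data requirements.
Weight sharing: the 8 Structure Module layers use the same weights, giving recycling-like iterative refinement with fewer parameters.
Recycling: decouples test-time compute from parameter count. More FLOPs does not require more parameters.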

Comparison Matrix

Model	Parameters	MSA?	Template?	CASP/Test accuracy (GDT_TS / TM-score)	Notes
AlphaFold 2	93M	✅	✅	CASP14 median GDT_TS 92.4	Evoformer + IPA + FAPE
RoseTTAFold (Baek 2021)	130M	✅	✅	CASP14 median ~81	3-track (1D/2D/3D) simultaneous
ESMFold (Lin 2023)	3B (ESM-2 LM) + Folding trunk	❌ (LM embedding)	❌	CAMEO TM ~0.85 (AF2 ~0.88)	MSA-free, 60× faster inference
ESM-2 (15B)	15B	❌	❌	Single-sequence LM; paired with ESMFold	Protein MLM
OmegaFold (Wu 2022)	670M	❌	❌	Comparable to AF2 on orphan proteins	MSA-free protein LM
AlphaFold 3 (Abramson 2024)	Undisclosed (~hundreds of millions)	✅ (reduced)	✅	Extends AF2 to ligands/nucleic acids/PTMs	Diffusion-based structure head

Parameter efficiency: AF2 is more accurate with ~0.6% of the parameters of ESM-2 15B, which illustrates the payoff of domain-specific inductive bias.
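
The parameter figures quoted in this digest can be cross-checked with simple arithmetic. The inputs below are the approximate values stated in these notes (93M total, ~1.7M per Evoformer block, 15B for ESM-2); the variable names are only illustrative.

```python
AF2_PARAMS = 93e6
ESM2_PARAMS = 15e9
EVOFORMER_BLOCKS = 48
PARAMS_PER_BLOCK_M = 1.7          # approximate, in millions

evoformer_m = EVOFORMER_BLOCKS * PARAMS_PER_BLOCK_M   # ~81.6M, i.e. the "~80M bulk"
fraction_pct = 100 * AF2_PARAMS / ESM2_PARAMS         # AF2 as a percentage of ESM-2 15B
size_ratio = ESM2_PARAMS / AF2_PARAMS                 # how many times larger ESM-2 15B is

print(round(evoformer_m, 1), round(fraction_pct, 2), round(size_ratio))  # 81.6 0.62 161
```

These agree with the rounded figures used in the text: an Evoformer of roughly 80M parameters, ~0.6% of ESM-2 15B, and a size gap of about 160×.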

Results

CASP14 domain GDT_TS: median 92.4 (AF2) vs. ~75 (second-place group), a gap of roughly 17 points.
Backbone accuracy: median Cα r.m.s.d.95 of 0.96 Å (Cα RMSD at 95% residue coverage), within the range of experimental error.
Side-chain χ1 accuracy: ~80% (close to the agreement between experimental structures).
Free modeling (FM) targets (no template): GDT_TS still > 85, a key effect of self-distillation.
pLDDT calibration: Pearson r ≈ 0.76 between pLDDT and the actual lDDT (per-residue).
Inference time: tens of seconds for a 400-residue protein on a single GPU (V100).

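Findings

Reaching experimental accuracy: on roughly two-thirds of CASP14 targets, accuracy is at a level indistinguishable from experimental structures. A “black-box” solution to the protein folding problem.
Backbone vs. side-chain: the backbone is nearly perfect (GDT_TS > 92), while side chains lag slightly at ~80% χ1 accuracy (flexible side chains also vary in NMR).
Strength on free modeling: even on template-free domains AF2 reaches a median GDT_TS > 85, with self-distillation contributing decisively.
Practical value of pLDDT: regions with pLDDT < 50 are mostly intrinsically disordered regions (IDRs). The confidence explicitly says “I don't know”, so it serves as a reliability filter in downstream analysis.
Effect of recycling: going from 1 to 3 cycles adds 2–4 GDT_TS points, improving accuracy without adding parameters.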
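
Using pLDDT as a downstream filter can be as simple as extracting contiguous low-confidence stretches. The sketch below applies the pLDDT < 50 cutoff mentioned above; the function name and the example scores are hypothetical, and a low score only suggests disorder rather than proving it.

```python
def low_confidence_segments(plddt, cutoff=50.0):
    """Return (start, end) index pairs, inclusive, for runs of residues with pLDDT < cutoff."""
    segments, start = [], None
    for i, score in enumerate(plddt):
        if score < cutoff and start is None:
            start = i                        # a low-confidence run begins
        elif score >= cutoff and start is not None:
            segments.append((start, i - 1))  # the run ended at the previous residue
            start = None
    if start is not None:                    # a run that extends to the last residue
        segments.append((start, len(plddt) - 1))
    return segments

scores = [92, 88, 45, 30, 49, 75, 20]
segments = low_confidence_segments(scores)
print(segments)  # [(2, 4), (6, 6)]
```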

Strengths

End-to-end differentiable pipeline.
Well-designed geometric inductive biases (triangle, IPA, FAPE).
Self-distillation overcomes label scarcity.
pLDDT gives experimentalists an intuitive confidence estimate.
AlphaFold DB (~200M proteins released) reshaped the entire biology community.

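Limitations

MSA dependence: performance degrades on orphan proteins (few homologs), which motivates MSA-free follow-ups such as ESMFold and OmegaFold.
Single chain: complexes require the extensions AF-Multimer (2022) and AF3 (2024).
No dynamics: predicts a single conformation only. Ensembles and allostery need separate techniques.
No support for ligands, PTMs, or nucleic acids: addressed in AF3.
Training cost: 128 TPU v3 cores × 11 days is a barrier to academic retraining (hence efficiency work such as FastFold, ScaleFold, and OpenFold).
Dependence on the Supplementary: key hyperparameters are scattered across Supp. tables, making reproduction hard.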

Bottom Line

With 93M parameters, AlphaFold 2 effectively solved protein structure prediction, a rare case where domain-specific geometric inductive bias (triangle/IPA/FAPE) combined with MSA and self-distillation beat brute-force scaling.

Architecture Diagram

flowchart TB
    subgraph Input["Input Featurization"]
        SEQ["Target Sequence<br/>(length L)"]
        MSA_SEARCH["MSA Search<br/>UniRef90 / BFD / MGnify / UniClust30"]
        TPL_SEARCH["Template Search<br/>PDB70 (top 4)"]
        SEQ --> MSA_SEARCH
        SEQ --> TPL_SEARCH
    end

    subgraph Embed["Initial Embeddings"]
        M0["MSA m<br/>[N_seq, L, c_m=256]"]
        Z0["Pair z<br/>[L, L, c_z=128]"]
        MSA_SEARCH --> M0
        TPL_SEARCH --> Z0
        SEQ --> Z0
    end

    subgraph Evo["Evoformer (48 blocks)"]
        direction LR
        MSA_STACK["MSA Stack<br/>- Row-wise attn (pair bias)<br/>- Col-wise attn<br/>- Transition"]
        OPM["Outer Product Mean<br/>(MSA → pair)"]
        PAIR_STACK["Pair Stack<br/>- Triangle mult. (out/in)<br/>- Triangle attn (start/end)<br/>- Transition"]
        MSA_STACK -->|communicates| OPM
        OPM --> PAIR_STACK
        PAIR_STACK -->|pair bias| MSA_STACK
    end

    M0 --> Evo
    Z0 --> Evo

    subgraph Recycle["Recycling (×3, weight shared)"]
        direction TB
        REC_NOTE["stop-gradient except final cycle<br/>m0, z, Cα-distance re-embed"]
    end

    Evo -->|if cycle < 3| Recycle
    Recycle --> Evo

    subgraph SM["Structure Module (8 layers, shared weights)"]
        direction TB
        IPA["Invariant Point Attention<br/>(SE(3)-invariant)"]
        FRAME["Backbone Frame Update<br/>(quaternion + translation)"]
        CHI["Side-chain Torsion χ1-χ4"]
        IPA --> FRAME --> CHI
        CHI -->|iterate| IPA
    end

    Evo -->|final cycle: s, z| SM

    subgraph Output["Outputs + Heads"]
        ATOMS["All-atom 3D Structure"]
        PLDDT["pLDDT Confidence<br/>(per-residue 0-100)"]
        PAE["PAE (Predicted Aligned Error)"]
        DISTO["Distogram Head (aux)"]
        MMSA["Masked MSA Head (aux)"]
    end

    SM --> ATOMS
    SM --> PLDDT
    SM --> PAE
    Evo --> DISTO
    Evo --> MMSA

    subgraph Loss["Training Loss"]
        FAPE["FAPE Loss<br/>(backbone + side-chain,<br/>clamp=10Å)"]
        AUX["Auxiliary Losses<br/>distogram / masked MSA /<br/>pLDDT / exp.resolved / violation"]
    end

    ATOMS -.-> FAPE
    DISTO -.-> AUX
    MMSA -.-> AUX

Key Numbers

Item	Value	Source
Evoformer blocks	48	Main text Fig. 3
Structure Module layers	8 (weight-shared)	Main text
MSA channel c_m	256	Supp. 1.4
Pair channel c_z	128	Supp. 1.4
Single rep channel c_s	384	Supp.
Total parameters	~93M	Supp. Table (matches external reproduction)
Training crop (initial)	256 residues	Supp. 1.11.3
Training crop (fine-tune)	384 residues	Supp. 1.11.3
Batch size	128	Supp. 1.11.3
Learning rate	1e-3 (Adam)	Supp. 1.11.3
Warmup	128K samples (~1K steps)	Supp. 1.11.3
LR decay	0.95× every 50K steps after 6.4M samples	Supp. 1.11.3
Total training samples	~10M (initial) + ~1.5M (fine-tune)	Supp.
Recycling iterations	3 (default, up to 4)	Main text
MSA masking rate	15% (BERT-style)	Supp.
FAPE clamp	10 Å (backbone)	Supp.
Self-distillation set	~350K UniClust30 representatives	Supp. 1.2.6
PDB chains	~170K (train)	Supp. 1.2.5
Training mix	75% self-distill + 25% PDB	Supp. 1.2.5
Hardware	128 TPU v3 cores	Supp. 1.11.3
Training time	~11 days (initial + fine-tune; as reported)	External reproduction, Supp.
CASP14 median GDT_TS	92.4	Main Fig. 1

Estimates: the per-module parameter breakdown (Evoformer ~80M, Structure Module ~1.7M) is not given as exact figures in the official paper; it is estimated from the openfold reference implementation.

Reference Links

Paper: https://www.nature.com/articles/s41586-021-03819-2
Supplementary: https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-021-03819-2/MediaObjects/41586_2021_3819_MOESM1_ESM.pdf
Code: https://github.com/google-deepmind/alphafold
AlphaFold DB: https://alphafold.ebi.ac.uk
Illustrated AlphaFold: https://elanapearl.github.io/blog/2024/the-illustrated-alphafold/
OpenFold (reimplementation): https://github.com/aqlaboratory/openfold
