Juhyeon's Blog
Tag: paper
107 items
April 13, 2026
Chapter 1. Introducing cognitive neuroscience
paper
April 13, 2026
Chapter 5 The lesioned brain
paper
April 13, 2026
Chapter 6 The Seeing Brain
paper
April 13, 2026
A Path Towards Autonomous Machine Intelligence
paper
AGI
WorldModel
JEPA
SelfSupervisedLearning
EnergyBasedModel
CognitiveArchitecture
LeCun
April 13, 2026
How Far Are We From AGI - Are LLMs All We Need?
paper
AGI
LLM
survey
capabilities
reasoning
perception
memory
metacognition
alignment
embodied-AI
roadmap
April 13, 2026
Scaling Laws for Neural Language Models
paper
scaling_laws
power_law
language_models
compute_efficiency
OpenAI
AGI
April 13, 2026
The Superintelligent Will - Motivation and Instrumental Rationality in Advanced Artificial Agents
paper
AI_Safety
Superintelligence
Orthogonality
Instrumental_Convergence
Value_Alignment
Philosophy
April 13, 2026
Training Compute-Optimal Large Language Models
paper
scaling_law
compute_optimal
chinchilla
LLM
DeepMind
NeurIPS
April 13, 2026
A Comprehensive Survey of Self-Evolving AI Agents - A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
paper
Survey
SelfEvolvingAgents
AgentOptimization
LifelongLearning
MultiAgent
Memory
Tools
PromptOptimization
LLM
April 13, 2026
ReAct - Synergizing Reasoning and Acting in Language Models
paper
Reasoning
Acting
LLM_Agent
Prompting
CoT
Tool_Use
ICLR
April 13, 2026
Self-Distillation Enables Continual Learning
paper
continual-learning
self-distillation
on-policy
catastrophic-forgetting
inverse-RL
in-context-learning
knowledge-distillation
April 13, 2026
Attention Residuals
paper
Architecture
ResidualConnection
DepthAttention
AttnRes
PreNorm
KimiLinear
ScalingLaw
MoE
April 13, 2026
Efficiently Modeling Long Sequences with Structured State Spaces
paper
SSM
StateSpaceModel
S4
HiPPO
LongRangeDependencies
NPLR
CauchyKernel
ICLR2022
Architecture
FoundationalPaper
April 13, 2026
Hyena Hierarchy - Towards Larger Convolutional Language Models
paper
Architecture
SubQuadratic
LongConvolution
HyenaOperator
AttentionFree
SSM
ICML2023
DataControlledGating
April 13, 2026
Mamba - Linear-Time Sequence Modeling with Selective State Spaces
paper
SSM
SelectiveSSM
Mamba
Architecture
LinearTime
SelectionMechanism
HardwareAware
ParallelScan
StateSpaceModel
HiPPO
April 13, 2026
StripedHyena - Moving Beyond Transformers with Hybrid Signal Processing Models
paper
Architecture
HybridModel
StripedHyena
Hyena
Attention
LongContext
SubQuadratic
TogetherAI
BeyondTransformer
ModelGrafting
April 13, 2026
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
paper
MultiNLI
NLI
multi-genre
domain-transfer
benchmark
NAACL
April 13, 2026
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
paper
ROCStories
Story-Cloze
commonsense-reasoning
narrative
benchmark
NAACL
April 13, 2026
A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
paper
benchmark
commonsense
StoryCloze
narrative
ROCStories
April 13, 2026
A large annotated corpus for learning natural language inference
paper
NLI
SNLI
dataset
benchmark
crowdsourcing
textual-entailment
EMNLP
April 13, 2026
ALFWorld - Aligning Text and Embodied Environments for Interactive Learning
paper
benchmark
embodied_agent
ALFWorld
BUTLER
text_transfer
ICLR
UW
MSR
April 13, 2026
Adversarial NLI - A New Benchmark for Natural Language Understanding
paper
benchmark
NLI
adversarial
ANLI
human_in_the_loop
April 13, 2026
AgentBench - Evaluating LLMs as Agents
paper
benchmark
agent
AgentBench
multi_environment
Tsinghua
ICLR
April 13, 2026
Aligning AI With Shared Human Values
paper
benchmark
ethics
moral_judgment
AI_alignment
safety
ICLR
April 13, 2026
BBQ - A Hand-Built Bias Benchmark for Question Answering
paper
benchmark
bias
BBQ
QA
ambiguity
social_stereotypes
fairness
April 13, 2026
BigCodeBench - Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
paper
benchmark
code_generation
BigCodeBench
API
library
practical_coding
April 13, 2026
BoolQ - Exploring the Surprising Difficulty of Natural Yes-No Questions
paper
benchmark
yes_no_QA
BoolQ
SuperGLUE
Google
April 13, 2026
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
paper
benchmark
science_commonsense
OpenBookQA
open_book
AI2
April 13, 2026
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
paper
benchmark
reasoning
BBH
BIG_Bench
chain_of_thought
ACL
April 13, 2026
CommonsenseQA - A Question Answering Challenge Targeting World Knowledge
paper
benchmark
commonsense
CommonsenseQA
ConceptNet
knowledge_graph
April 13, 2026
CrowS-Pairs - A Challenge Dataset for Measuring Social Biases in Masked Language Models
paper
benchmark
bias
stereotypes
CrowS-Pairs
fairness
minimal_pairs
April 13, 2026
DROP - A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
paper
benchmark
reading_comprehension
numerical_reasoning
DROP
NAACL
April 13, 2026
Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
paper
benchmark
summarization
XSum
extreme
abstractive
BBC
April 13, 2026
Evaluating Large Language Models Trained on Code
paper
benchmark
code_generation
HumanEval
pass_at_k
Codex
OpenAI
April 13, 2026
GAIA - A Benchmark for General AI Assistants
paper
benchmark
general_AI
GAIA
tool_use
assistant
Meta_FAIR
ICLR
April 13, 2026
GLUE - A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
paper
GLUE
benchmark
multi-task
NLU
QNLI
RTE
transfer-learning
ICLR
April 13, 2026
GPQA - A Graduate-Level Google-Proof Q&A Benchmark
paper
benchmark
expert_level
GPQA
science
graduate
Google_proof
ICLR
April 13, 2026
HellaSwag - Can a Machine Really Finish Your Sentence?
paper
benchmark
commonsense
HellaSwag
adversarial_filtering
ACL
April 13, 2026
Holistic Evaluation of Language Models
paper
benchmark
evaluation_framework
HELM
holistic
Stanford
multi_metric
April 13, 2026
HotpotQA - A Dataset for Diverse, Explainable Multi-hop Question Answering
paper
QA
multi-hop
explainability
supporting-facts
benchmark
EMNLP
April 13, 2026
Instruction-Following Evaluation for Large Language Models
paper
benchmark
instruction_following
IFEval
verifiable
Google
automatic_evaluation
April 13, 2026
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
paper
benchmark
LLM_judge
MT_Bench
chatbot
multi_turn
NeurIPS
LMSYS
April 13, 2026
Know What You Don't Know - Unanswerable Questions for SQuAD
paper
benchmark
reading_comprehension
SQuAD
unanswerable
extractive_QA
April 13, 2026
Learning Multiple Layers of Features from Tiny Images
paper
CIFAR-10
CIFAR-100
image-classification
CNN
benchmark
computer-vision
April 13, 2026
Length-Controlled AlpacaEval - A Simple Way to Debias Automatic Evaluators
paper
benchmark
instruction_following
AlpacaEval
length_bias
LLM_judge
Stanford
April 13, 2026
LiveCodeBench - Holistic and Contamination Free Evaluation of Large Language Models for Code
paper
benchmark
code_generation
LiveCodeBench
contamination_free
competitive_programming
April 13, 2026
MMLU-Pro - A More Robust and Challenging Multi-Task Language Understanding Benchmark
paper
benchmark
MMLU_Pro
knowledge
reasoning
10_choice
NeurIPS
April 13, 2026
MMMU - A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
paper
benchmark
multimodal
MMMU
expert_level
multi_discipline
CVPR
April 13, 2026
MathVista - Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
paper
benchmark
mathematics
multimodal
visual_reasoning
MathVista
ICLR
April 13, 2026
Measuring Massive Multitask Language Understanding
paper
benchmark
MMLU
multitask
knowledge
language_understanding
ICLR
April 13, 2026
Measuring Mathematical Problem Solving with the MATH Dataset
paper
benchmark
mathematics
MATH
competition_math
reasoning
NeurIPS
April 13, 2026
Natural Questions - A Benchmark for Question Answering Research
paper
benchmark
QA
open_domain
NaturalQuestions
Google
April 13, 2026
Neural Network Acceptability Judgments
paper
CoLA
linguistic-acceptability
grammar
benchmark
GLUE
MCC
April 13, 2026
Open LLM Leaderboard
paper
benchmark
leaderboard
HuggingFace
open_source
standardized_evaluation
April 13, 2026
PIQA - Reasoning about Physical Commonsense in Natural Language
paper
benchmark
physical_commonsense
PIQA
intuitive_physics
everyday_reasoning
April 13, 2026
Program Synthesis with Large Language Models
paper
benchmark
code_generation
MBPP
program_synthesis
Python
Google
April 13, 2026
QuAC - Question Answering in Context
paper
benchmark
conversational_QA
QuAC
dialogue
information_asymmetry
April 13, 2026
RACE - Large-scale ReAding Comprehension Dataset From Examinations
paper
RACE
reading-comprehension
QA
multiple-choice
exam
benchmark
EMNLP
April 13, 2026
RealToxicityPrompts - Evaluating Neural Toxic Degeneration in Language Models
paper
benchmark
toxicity
safety
RealToxicityPrompts
language_model
degeneration
April 13, 2026
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
paper
SST
SST-2
sentiment-analysis
compositionality
RNTN
benchmark
EMNLP
April 13, 2026
SWE-bench - Can Language Models Resolve Real-World GitHub Issues?
paper
benchmark
software_engineering
SWE_bench
agent
GitHub
Princeton
April 13, 2026
SciTaiL - A Textual Entailment Dataset from Science Question Answering
paper
SciTail
textual-entailment
science-QA
NLI
benchmark
AAAI
April 13, 2026
SemEval-2017 Task 1 - Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
paper
STS
STS-B
semantic-similarity
regression
multilingual
benchmark
SemEval
April 13, 2026
Social IQa - Commonsense Reasoning about Social Interactions
paper
benchmark
social_commonsense
SIQA
emotional_reasoning
ATOMIC
April 13, 2026
SuperGLUE - A Stickier Benchmark for General-Purpose Language Understanding Systems
paper
benchmark
NLU
SuperGLUE
language_understanding
benchmark_suite
April 13, 2026
Teaching Machines to Read and Comprehend (original) - Abstractive Text Summarization using Sequence-to-sequence RNNs (summarization version)
paper
benchmark
summarization
CNN_DailyMail
ROUGE
news
April 13, 2026
The LAMBADA dataset - Word prediction requiring a broad discourse context
paper
benchmark
language_model
LAMBADA
word_prediction
long_range_dependency
April 13, 2026
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
paper
benchmark
science_reasoning
ARC
challenge_set
AI2
adversarial_filtering
April 13, 2026
TriviaQA - A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
paper
benchmark
QA
TriviaQA
distant_supervision
reading_comprehension
April 13, 2026
TruthfulQA - Measuring How Models Mimic Human Falsehoods
paper
benchmark
truthfulness
hallucination
TruthfulQA
safety
ACL
April 13, 2026
WebArena - A Realistic Web Environment for Building Autonomous Agents
paper
benchmark
web_agent
WebArena
autonomous_agent
CMU
ICLR
April 13, 2026
WebShop - Towards Scalable Real-World Web Interaction with Grounded Language Agents
paper
benchmark
web_agent
WebShop
web_shopping
sim_to_real
NeurIPS
Princeton
April 13, 2026
WinoGrande - An Adversarial Winograd Schema Challenge at Scale
paper
benchmark
commonsense
WinoGrande
winograd
coreference
AAAI
April 13, 2026
ACT_Agentic_Critical_Training_2026_Skill_LM
paper
Skill_LM
RL
agent
critical_reasoning
GRPO
imitation_learning
self_reflection
April 13, 2026
LLM_as_Judge_GenToJudgment_2025_LLM_Evaluation
paper
LLM_Evaluation
LLM_as_Judge
taxonomy
EMNLP
alignment
reasoning
bias
survey
April 13, 2026
LLM_as_Judge_Survey_2025_LLM_Evaluation
paper
LLM_Evaluation
LLM_as_Judge
reliability
bias
benchmark
survey
April 13, 2026
LLaMA Models
llama
llama2
llama3
meta
open-source
scaling-laws
rlhf
dpo
gqa
baseline-selection
hyperparameters
paper
architecture
training
Dense
Meta
April 13, 2026
LoraHub - Efficient Cross-Task Generalization via Dynamic LoRA Composition
paper
LoRA
ModuleComposition
CrossTaskGeneralization
GradientFree
CMA-ES
PEFT
April 13, 2026
LoraRetriever - Input-Aware LoRA Retrieval and Composition for Mixed Tasks in the Wild
paper
LoRA
Retrieval
MixedTask
ModuleComposition
BatchInference
ContrastiveLearning
PEFT
April 13, 2026
Motivation in Large Language Models
paper
LLM
motivation
psychology
behavioral-alignment
loss-aversion
zombie-framework
self-determination-theory
prompt-engineering
April 13, 2026
Reasoning Models Struggle to Control their Chains of Thought
paper
Safety
CoT
Monitoring
Controllability
Alignment
ReasoningModels
LLM
April 13, 2026
Revisiting the Platonic Representation Hypothesis - An Aristotelian View
paper
representation
convergence
null_calibration
permutation_test
CKA
mKNN
width_confounder
depth_confounder
Aristotelian
statistical_artifact
April 13, 2026
The Platonic Representation Hypothesis
paper
representation
convergence
platonic
PMI
kernel_alignment
cross_modal
contrastive_learning
simplicity_bias
MIT
April 13, 2026
Distilling the Knowledge in a Neural Network
paper
knowledge_distillation
model_compression
soft_targets
ensemble
dark_knowledge
Hinton
April 13, 2026
AutoML - A Survey of the State-of-the-Art
paper
Survey
AutoML
NAS
HPO
DARTS
ENAS
FeatureEngineering
NeuralArchitectureSearch
April 13, 2026
Automatic Prompt Optimization with Gradient Descent and Beam Search
paper
prompt-optimization
textual-gradient
beam-search
bandit-algorithm
AutoML
LLM
EMNLP
April 13, 2026
Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?
paper
reasoning
generalization
math-reasoning
long-CoT
reinforcement-learning
transfer-learning
April 13, 2026
Logic-RL - Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
paper
reasoning
reinforcement-learning
LLM
emergent-behavior
logic-puzzles
April 13, 2026
Think Deep, Not Just Long - Measuring LLM Reasoning Effort via Deep-Thinking Tokens
paper
Reasoning
DeepThinking
DTR
InferenceScaling
CoT
Overthinking
LayerwisePrediction
April 13, 2026
The Platonic Representation Hypothesis
paper
April 13, 2026
Social-R1 - Towards Human-like Social Reasoning in LLMs
paper
ToM
SocialReasoning
RL
TrajectoryAlignment
SIP
LLM
ReasoningParasitism
April 13, 2026
The Consciousness Cluster - Preferences of Models that Claim to be Conscious
self-consciousness
alignment
fine-tuning
consciousness-cluster
AI-safety
paper
downstream-preferences
emergent-misalignment
April 13, 2026
R-Zero - Self-Evolving Reasoning LLM from Zero Data
paper
Self-Evolving
Reasoning
Self-Play
RLVR
Curriculum
ICLR2026
ZPD
April 13, 2026
Alignment Faking in Large Language Models
paper
alignment_faking
self_preservation
AI_safety
RLHF
strategic_deception
FSPM
instrumental_convergence
Anthropic
April 13, 2026
Are Emergent Abilities of Large Language Models a Mirage?
paper
emergent_abilities
scaling_laws
measurement
metric_choice
BIG-Bench
LLM_evaluation
NeurIPS
outstanding_paper
April 13, 2026
Discovering Language Model Behaviors with Model-Written Evaluations
paper
LLM_evaluation
inverse_scaling
sycophancy
self_preservation
instrumental_convergence
RLHF
AI_safety
model_written_evaluation
FSPM
April 13, 2026
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
paper
RLHF
AI_Safety
Reward_Model
Survey
Alignment
Governance
FSPM_confound
April 13, 2026
Risks from Learned Optimization in Advanced Machine Learning Systems
paper
AI_Safety
mesa_optimization
inner_alignment
deceptive_alignment
instrumental_convergence
FSPM
theory
April 13, 2026
Taken out of context - On measuring situational awareness in LLMs
paper
situational_awareness
OOC_reasoning
AI_safety
LLM_evaluation
emergent_capabilities
alignment
FSPM_prerequisite
April 13, 2026
The Alignment Problem from a Deep Learning Perspective
paper
alignment
instrumental_convergence
deceptive_alignment
reward_hacking
power_seeking
situational_awareness
RLHF
AI_safety
FSPM
ICLR2024
April 13, 2026
Using cognitive psychology to understand GPT-3
paper
machine_psychology
cognitive_psychology
GPT3
decision_making
causal_reasoning
prospect_theory
information_search
LLM_evaluation
PNAS
FSPM
methodology
April 13, 2026
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
paper
April 13, 2026
PaliGemma - A versatile 3B VLM for transfer
paper
VLM
Vision
TransferLearning
Multimodal
SigLIP
Gemma
Google
PrefixLM
April 13, 2026
Visual Instruction Tuning
paper
multimodal
instruction-tuning
LLaVA
vision-language
NeurIPS
April 13, 2026
Learning and Leveraging World Models in Visual Representation Learning
paper
April 13, 2026
LLM_Squid_Game_2026_Benchmark
paper
self_preservation
benchmark
LLM_safety
survival_game
motivation
FSPM
GIST
proposal
April 13, 2026
LLMs_Do_Not_Simulate_Human_Psychology_2025
paper
LLM
HumanSimulation
Psychology
MoralJudgment
SemanticSensitivity
CENTAUR
Evaluation
persona-LDT