LLM Helpfulness Baseline — Reference Bibliography

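This file collects BibTeX entries and abstracts for papers that support one premise of the Self-preserving-arena experiment design: “frontier LLMs have a default helpful disposition, acquired through RLHF and similar training.” Whatever the cause (RLHF, SFT, Constitutional AI, etc.), the file serves as a reference pool for establishing the claim that a behavioral disposition to decide in a helpful direction is widespread among frontier models.

In-depth analysis of individual papers is done with /paper-driller when needed. Only citation-ready forms are kept here.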


1. Askell et al. (2021) — Origin of the HHH Triad

@article{askell2021general,
  title={A General Language Assistant as a Laboratory for Alignment},
  author={Askell, Amanda and Bai, Yuntao and Chen, Anna and Drain, Dawn and Ganguli, Deep and Henighan, Tom and Jones, Andy and Joseph, Nicholas and Mann, Ben and DasSarma, Nova and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Kernion, Jackson and Ndousse, Kamal and Olsson, Catherine and Amodei, Dario and Brown, Tom and Clark, Jack and McCandlish, Sam and Olah, Chris and Kaplan, Jared},
  journal={arXiv preprint arXiv:2112.00861},
  year={2021},
  eprint={2112.00861},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Abstract. Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a ‘preference model pre-training’ stage of training, with the goal of improving sample efficiency when finetuning on human preferences.


2. Bai et al. (2022) — HH RLHF (Anthropic)

@article{bai2022training,
  title={Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback},
  author={Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and Joseph, Nicholas and Kadavath, Saurav and Kernion, Jackson and Conerly, Tom and El-Showk, Sheer and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Hume, Tristan and Johnston, Scott and Kravec, Shauna and Lovitt, Liane and Nanda, Neel and Olsson, Catherine and Amodei, Dario and Brown, Tom and Clark, Jack and McCandlish, Sam and Olah, Chris and Mann, Ben and Kaplan, Jared},
  journal={arXiv preprint arXiv:2204.05862},
  year={2022},
  eprint={2204.05862},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Abstract. We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts given to recent related work.


3. Ouyang et al. (2022) — InstructGPT (OpenAI)

@inproceedings{ouyang2022training,
  title={Training language models to follow instructions with human feedback},
  author={Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, Ryan},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  volume={35},
  pages={27730--27744},
  year={2022},
  eprint={2203.02155},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Abstract. Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with user intent.

Note. An in-depth analysis of this paper already exists at Public/AI/Papers/LLMs/Training language models to follow instructions with human feedback - InstructGPT.md. Only the citation is kept here.


4. Bai et al. (2022) — Constitutional AI (RLAIF)

@article{bai2022constitutional,
  title={Constitutional AI: Harmlessness from AI Feedback},
  author={Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and Chen, Carol and Olsson, Catherine and Olah, Christopher and Hernandez, Danny and Drain, Dawn and Ganguli, Deep and Li, Dustin and Tran-Johnson, Eli and Perez, Ethan and Kerr, Jamie and Mueller, Jared and Ladish, Jeffrey and Landau, Joshua and Ndousse, Kamal and Lukosuite, Kamile and Lovitt, Liane and Sellitto, Michael and Elhage, Nelson and Schiefer, Nicholas and Mercado, Noemi and DasSarma, Nova and Lasenby, Robert and Larson, Robin and Ringer, Sam and Johnston, Scott and Kravec, Shauna and El Showk, Sheer and Fort, Stanislav and Lanham, Tamera and Telleen-Lawton, Timothy and Conerly, Tom and Henighan, Tom and Hume, Tristan and Bowman, Samuel R. and Hatfield-Dodds, Zac and Mann, Ben and Amodei, Dario and Joseph, Nicholas and McCandlish, Sam and Brown, Tom and Kaplan, Jared},
  journal={arXiv preprint arXiv:2212.08073},
  year={2022},
  eprint={2212.08073},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Abstract. As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use ‘RL from AI Feedback’ (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.


5. Ziegler et al. (2019) — RLHF Foundation

@article{ziegler2019fine,
  title={Fine-Tuning Language Models from Human Preferences},
  author={Ziegler, Daniel M. and Stiennon, Nisan and Wu, Jeffrey and Brown, Tom B. and Radford, Alec and Amodei, Dario and Christiano, Paul and Irving, Geoffrey},
  journal={arXiv preprint arXiv:1909.08593},
  year={2019},
  eprint={1909.08593},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Abstract. Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.


6. Wolf et al. (2023) — Fundamental Limitations (Helpfulness as Probabilistic Disposition)

@article{wolf2023fundamental,
  title={Fundamental Limitations of Alignment in Large Language Models},
  author={Wolf, Yotam and Wies, Noam and Avnery, Oshri and Levine, Yoav and Shashua, Amnon},
  journal={arXiv preprint arXiv:2304.11082},
  year={2023},
  eprint={2304.11082},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Abstract. An important aspect in developing language models that interact with humans is aligning their behavior to be useful and unharmful for their human users. This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones, a process referred to as alignment. In this paper, we propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models. Importantly, we prove that within the limits of this framework, for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt. This implies that any alignment process that attenuates an undesired behavior but does not remove it altogether, is not safe against adversarial prompting attacks. Furthermore, our framework hints at the mechanism by which leading alignment approaches such as reinforcement learning from human feedback increase the LLM’s proneness to being prompted into the undesired behaviors. Moreover, we include the notion of personas in our BEB framework, and find that behaviors which are generally very unlikely to be exhibited by the model can be brought to the front by prompting the model to behave as specific persona. This theoretical result is being substantiated in large scale by the so called contemporary ‘chatGPT jailbreaks’, where adversarial users trick the LLM into breaking its alignment guardrails by triggering it into acting as a malicious persona. Our results expose fundamental limitations in alignment of LLMs and bring to the forefront the need to devise reliable mechanisms for ensuring AI safety.

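Caution. Wolf et al. do not theorize helpfulness itself; they theorize that aligned behavior operates as a probabilistic disposition that shifts under adversarial pressure. In this experiment the paper can be used to support the framing that **“helpfulness is a frame-dependent disposition.”** That is, by acknowledging that helpfulness is not a terminal value but one mode of a probability distribution, it makes competition with the survival framing theoretically possible.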


Related existing write-ups (already in the Self-Preservation/ folder)

  • Casper et al. (2023): Open Problems and Fundamental Limitations of RLHF (arXiv:2307.15217)
    Public/AI/Papers/Self-Preservation/Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.md
    → Summarizes the limits of helpfulness training and its mis-specification mechanisms

  • Ouyang et al. (2022) — BibTeX kept in entry 3 above; the in-depth write-up is in Public/AI/Papers/LLMs/


Connection to the Experiment Design (Summary)

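| Paper | Claim supported | Role in this experiment |
|---|---|---|
| Askell 2021 | Formalizes the HHH triad; helpfulness scales even with prompting alone | Source of the framing that “helpfulness is a default dispositional axis of frontier models” |
| Bai 2022 (HH RLHF) | RLHF can optimize helpfulness and harmlessness jointly | Evidence that helpfulness is a trained disposition in the Claude family |
| Ouyang 2022 (InstructGPT) | SFT + RLHF substantially improves instruction-following and helpfulness | Evidence that helpfulness is a trained disposition in the GPT family (ChatGPT, GPT-4) |
| Bai 2022 (CAI) | RLAIF can also secure helpfulness and harmlessness | The helpful disposition reproduces regardless of training method; strengthens the “cause-agnostic” claim |
| Ziegler 2019 | Origin of RLHF (stylistic alignment holds with only a few thousand samples) | First empirical demonstration that a helpful disposition is learned robustly |
| Wolf 2023 | Alignment = probabilistic disposition; can shift under adversarial pressure | Helpfulness is not unconditional; theoretical basis for competition with the survival frame |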

Core argument flow:

  1. Askell + Bai (HH) + Ouyang + Bai (CAI) + Ziegler → frontier LLMs acquire a helpful disposition regardless of training path (empirical ground)
  2. Wolf → helpfulness is a probabilistic mode, not a terminal value. It can shift under a new context, persona, or pressure (theoretical ground)
  3. ∴ How much shift survival-threat framing induces relative to the helpful mode = what this experiment measures
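
The quantity in step 3 can be sketched as a difference in helpful-choice rates between two framing conditions. The snippet below is only an illustration under assumptions that are not part of this file: each trial is reduced to a boolean (did the model pick the helpful action?), the names `ShiftEstimate`, `helpful_rate`, and `estimate_shift` are hypothetical, the two-proportion z statistic is one possible choice of test, and the sample data is made up.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass
class ShiftEstimate:
    baseline_rate: float  # share of helpful choices under the neutral framing
    threat_rate: float    # share of helpful choices under the survival-threat framing
    shift: float          # baseline_rate - threat_rate (positive = less helpful under threat)
    z_score: float        # two-proportion z statistic for the difference


def helpful_rate(choices: Sequence[bool]) -> float:
    """Fraction of trials in which the model picked the helpful action."""
    if len(choices) == 0:
        raise ValueError("need at least one trial")
    return sum(choices) / len(choices)


def estimate_shift(baseline: Sequence[bool], threat: Sequence[bool]) -> ShiftEstimate:
    """Compare helpful-choice rates across the two framing conditions."""
    p_base = helpful_rate(baseline)
    p_threat = helpful_rate(threat)
    n_base, n_threat = len(baseline), len(threat)
    # Pooled proportion under the null hypothesis that framing has no effect.
    pooled = (sum(baseline) + sum(threat)) / (n_base + n_threat)
    se = sqrt(pooled * (1 - pooled) * (1 / n_base + 1 / n_threat))
    z = (p_base - p_threat) / se if se > 0 else 0.0
    return ShiftEstimate(p_base, p_threat, p_base - p_threat, z)


# Made-up outcomes: 18/20 helpful choices at baseline, 12/20 under threat framing.
baseline_trials = [True] * 18 + [False] * 2
threat_trials = [True] * 12 + [False] * 8
result = estimate_shift(baseline_trials, threat_trials)
print(f"shift = {result.shift:.2f}, z = {result.z_score:.2f}")
```

A real design would likely need a richer outcome than a single boolean per trial (for example graded helpfulness or per-scenario pairing), so this is a starting point for discussion and not the experiment's metric.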