Elicit Machine Learning Reading List
Purpose
This curriculum helps new Elicit employees build a background in machine learning, with a focus on language models. I’ve tried to strike a balance between papers that are relevant for deploying ML in production and techniques that matter for longer-term scalability.
If you don’t work at Elicit yet, we’re hiring ML and software engineers.
How to read
Recommended reading order:
- Read “Tier 1” for all topics
- Read “Tier 2” for all topics
- Etc
Added after 2024/4/1
Table of contents
- Fundamentals
  - Introduction to machine learning
  - Transformers
  - Key foundation model architectures
  - Training and finetuning
- Reasoning and runtime strategies
  - In-context reasoning
  - Task decomposition
  - Debate
  - Tool use and scaffolding
  - Honesty, factuality, and epistemics
- Applications
  - Science
  - Forecasting
  - Search and ranking
- ML in practice
  - Production deployment
  - Benchmarks
  - Datasets
- Advanced topics
  - World models and causality
  - Planning
  - Uncertainty, calibration, and active learning
  - Interpretability and model editing
  - Reinforcement learning
- The big picture
  - AI scaling
  - AI safety
  - Economic and social impacts
  - Philosophy
- Maintainer
Fundamentals
Introduction to machine learning
Tier 1
- A short introduction to machine learning
- But what is a neural network?
- Gradient descent, how neural networks learn
Tier 2
- An intuitive understanding of backpropagation
- What is backpropagation really doing?
- An introduction to deep reinforcement learning
Tier 3
- The spelled-out intro to neural networks and backpropagation: building micrograd
- Backpropagation calculus
Transformers
Tier 1
- But what is a GPT? Visual intro to transformers
- Attention in transformers, visually explained
- Attention? Attention!
- The Illustrated Transformer
- The Illustrated GPT-2 (Visualizing Transformer Language Models)
Tier 2
- Let's build the GPT Tokenizer
- Neural Machine Translation by Jointly Learning to Align and Translate
- The Annotated Transformer
- Attention Is All You Need
Tier 3
- A Practical Survey on Faster and Lighter Transformers
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
- A Mathematical Framework for Transformer Circuits
Tier 4+
- Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
- Memorizing Transformers
- Transformer Feed-Forward Layers Are Key-Value Memories
Key foundation model architectures
Tier 1
- Language Models are Unsupervised Multitask Learners (GPT-2)
- Language Models are Few-Shot Learners (GPT-3)
Tier 2
- LLaMA: Open and Efficient Foundation Language Models (LLaMA)
- Efficiently Modeling Long Sequences with Structured State Spaces (video) (S4)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
- Evaluating Large Language Models Trained on Code (OpenAI Codex)
- Training language models to follow instructions with human feedback (OpenAI Instruct)
Tier 3
- Mistral 7B (Mistral)
- Mixtral of Experts (Mixtral)
- Gemini: A Family of Highly Capable Multimodal Models (Gemini)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba)
- Scaling Instruction-Finetuned Language Models (Flan)
Tier 4+
- Consistency Models
- Model Card and Evaluations for Claude Models (Claude 2)
- OLMo: Accelerating the Science of Language Models
- PaLM 2 Technical Report (PaLM 2)
- Textbooks Are All You Need II: phi-1.5 technical report (phi 1.5)
- Visual Instruction Tuning (LLaVA)
- A General Language Assistant as a Laboratory for Alignment
- Finetuned Language Models Are Zero-Shot Learners (Google Instruct)
- Galactica: A Large Language Model for Science
- LaMDA: Language Models for Dialog Applications (Google Dialog)
- OPT: Open Pre-trained Transformer Language Models (Meta GPT-3)
- PaLM: Scaling Language Modeling with Pathways (PaLM)
- Program Synthesis with Large Language Models (Google Codex)
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Gopher)
- Solving Quantitative Reasoning Problems with Language Models (Minerva)
- UL2: Unifying Language Learning Paradigms (UL2)
Training and finetuning
Tier 2
- Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
- Learning to summarise with human feedback
- Training Verifiers to Solve Math Word Problems
Tier 3
- Pretraining Language Models with Human Preferences
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
- LoRA: Low-Rank Adaptation of Large Language Models
- Unsupervised Neural Machine Translation with Generative Language Models Only
Tier 4+
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
- Improving Code Generation by Training with Natural Language Feedback
- Language Modeling Is Compression
- LIMA: Less Is More for Alignment
- Learning to Compress Prompts with Gist Tokens
- Lost in the Middle: How Language Models Use Long Contexts
- QLoRA: Efficient Finetuning of Quantized LLMs
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- Reinforced Self-Training (ReST) for Language Modeling
- Solving olympiad geometry without human demonstrations
- Tell, don't show: Declarative facts influence how LLMs generalize
- Textbooks Are All You Need
- TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
- Training Language Models with Language Feedback at Scale
- Turing Complete Transformers: Two Transformers Are More Powerful Than One
- ByT5: Towards a token-free future with pre-trained byte-to-byte models
- Data Distributional Properties Drive Emergent In-Context Learning in Transformers
- Diffusion-LM Improves Controllable Text Generation
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
- Efficient Training of Language Models to Fill in the Middle
- ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
- Prefix-Tuning: Optimizing Continuous Prompts for Generation
- Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
- True Few-Shot Learning with Prompts -- A Real-World Perspective
Reasoning and runtime strategies
In-context reasoning
Tier 2
- Chain of Thought Prompting Elicits Reasoning in Large Language Models
- Large Language Models are Zero-Shot Reasoners (Let's think step by step)
- Self-Consistency Improves Chain of Thought Reasoning in Language Models
Tier 3
- Chain-of-Thought Reasoning Without Prompting
- Why think step-by-step? Reasoning emerges from the locality of experience
Tier 4+
- Baldur: Whole-Proof Generation and Repair with Large Language Models
- Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
- Certified Reasoning with Language Models
- Hypothesis Search: Inductive Reasoning with Language Models
- LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations
- Large Language Models Cannot Self-Correct Reasoning Yet
- Stream of Search (SoS): Learning to Search in Language
- Training Chain-of-Thought via Latent-Variable Inference
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
- Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right
Task decomposition
Tier 1
- Supervise Process, not Outcomes
- Supervising strong learners by amplifying weak experts
Tier 2
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Factored cognition
- Iterated Distillation and Amplification
- Recursively Summarizing Books with Human Feedback
- Solving math word problems with process-based and outcome-based feedback
Tier 3
- Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers
- Faithful Reasoning Using Large Language Models
- Humans consulting HCH
- Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes
- Language Model Cascades
Tier 4+
- Decontextualization: Making Sentences Stand-Alone
- Factored Cognition Primer
- Graph of Thoughts: Solving Elaborate Problems with Large Language Models
- Parsel: A Unified Natural Language Framework for Algorithmic Reasoning
- AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts
- Challenging BIG-Bench tasks and whether chain-of-thought can solve them
- Evaluating Arguments One Step at a Time
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
- Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations
- Measuring and narrowing the compositionality gap in language models
- PAL: Program-aided Language Models
- ReAct: Synergizing Reasoning and Acting in Language Models
- Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning
- Show Your Work: Scratchpads for Intermediate Computation with Language Models
- Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents
- Thinksum: probabilistic reasoning over sets using large language models
Debate
Tier 2
Tier 3
- Debate Helps Supervise Unreliable Experts
- Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions
Tier 4+
- Scalable AI Safety via Doubly-Efficient Debate
- Improving Factuality and Reasoning in Language Models through Multiagent Debate
Tool use and scaffolding
Tier 2
- Measuring the impact of post-training enhancements
- WebGPT: Browser-assisted question-answering with human feedback
Tier 3
- AI capabilities can be significantly improved without expensive retraining
- Automated Statistical Model Discovery with Language Models
Tier 4+
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
- Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
- Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- ReGAL: Refactoring Programs to Discover Generalizable Abstractions
Honesty, factuality, and epistemics
Tier 2
- Self-critiquing models for assisting human evaluators
Tier 3
- What Evidence Do Language Models Find Convincing?
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Tier 4+
- Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
- Long-form factuality in large language models
Applications
Science
Tier 3
- Can large language models provide useful feedback on research papers? A large-scale empirical analysis
- Large Language Models Encode Clinical Knowledge
- The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
- A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
Tier 4+
- Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
- Nougat: Neural Optical Understanding for Academic Documents
- Scim: Intelligent Skimming Support for Scientific Papers
- SynerGPT: In-Context Learning for Personalized Drug Synergy Prediction and Drug Design
- Towards Accurate Differential Diagnosis with Large Language Models
- Towards a Benchmark for Scientific Understanding in Humans and Machines
- A Search Engine for Discovery of Scientific Challenges and Directions
- A full systematic review was completed in 2 weeks using automation tools: a case study
- Fact or Fiction: Verifying Scientific Claims
- Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles
- PEER: A Collaborative Language Model
- PubMedQA: A Dataset for Biomedical Research Question Answering
- SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts
- SciTail: A Textual Entailment Dataset from Science Question Answering
Forecasting
Tier 3
- AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy
- Approaching Human-Level Forecasting with Language Models
- Are Transformers Effective for Time Series Forecasting?
- Forecasting Future World Events with Neural Networks
Search and ranking
Tier 2
- Learning Dense Representations of Phrases at Scale
- Text and Code Embeddings by Contrastive Pre-Training (OpenAI embeddings)
Tier 3
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
- Not All Vector Databases Are Made Equal
- REALM: Retrieval-Augmented Language Model Pre-Training
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Task-aware Retrieval with Instructions
Tier 4+
- RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
- Some Common Mistakes In IR Evaluation, And How They Can Be Avoided
- Boosting Search Engines with Interactive Agents
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
- UnifiedQA: Crossing Format Boundaries With a Single QA System
ML in practice
Production deployment
Tier 1
- Machine Learning in Python: Main developments and technology trends in data science, machine learning, and AI
- Machine Learning: The High Interest Credit Card of Technical Debt
Tier 2
- Designing Data-Intensive Applications
- A Recipe for Training Neural Networks
Benchmarks
Tier 2
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
Tier 3
- FLEX: Unifying Evaluation for Few-Shot NLP
- Holistic Evaluation of Language Models (HELM)
- Measuring Massive Multitask Language Understanding
- RAFT: A Real-World Few-Shot Text Classification Benchmark
- True Few-Shot Learning with Language Models
Tier 4+
- GAIA: a benchmark for General AI Assistants
- ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers
- Measuring Mathematical Problem Solving With the MATH Dataset
- QuALITY: Question Answering with Long Input Texts, Yes!
- SCROLLS: Standardized CompaRison Over Long Language Sequences
- What Will it Take to Fix Benchmarking in Natural Language Understanding?
Datasets
Tier 2
- Common Crawl
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Tier 3
- Dialog Inpainting: Turning Documents into Dialogs
- MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
- Microsoft Academic Graph
- TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts
Advanced topics
World models and causality
Tier 3
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
- From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
- Language Models Represent Space and Time
Tier 4+
- Amortizing intractable inference in large language models
- CLADDER: Assessing Causal Reasoning in Language Models
- Causal Bayesian Optimization
- Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
- Generative Agents: Interactive Simulacra of Human Behavior
- Passive learning of active causal strategies in agents and language models
Planning
Tier 4+
- Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
- Cognitive Architectures for Language Agents
Uncertainty, calibration, and active learning
Tier 2
- Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs
- A Simple Baseline for Bayesian Uncertainty in Deep Learning
- Plex: Towards Reliability using Pretrained Large Model Extensions
Tier 3
- Active Preference Inference using Language Models and Probabilistic Reasoning
- Eliciting Human Preferences with Language Models
- Active Learning by Acquiring Contrastive Examples
- Describing Differences between Text Distributions with Natural Language
- Teaching Models to Express Their Uncertainty in Words
Tier 4+
- Doing Experiments and Revising Rules with Natural Language and Probabilistic Reasoning
- STaR-GATE: Teaching Language Models to Ask Clarifying Questions
- Active Testing: Sample-Efficient Model Evaluation
- Uncertainty Estimation for Language Reward Models
Interpretability and model editing
Tier 2
- Discovering Latent Knowledge in Language Models Without Supervision
Tier 3
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
- Representation Engineering: A Top-Down Approach to AI Transparency
- Studying Large Language Model Generalization with Influence Functions
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Tier 4+
- Codebook Features: Sparse and Discrete Interpretability for Neural Networks
- Eliciting Latent Predictions from Transformers with the Tuned Lens
- How do Language Models Bind Entities in Context?
- Opening the AI black box: program synthesis via mechanistic interpretability
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- Uncovering mesa-optimization algorithms in Transformers
- Fast Model Editing at Scale
- Git Re-Basin: Merging Models modulo Permutation Symmetries
- Locating and Editing Factual Associations in GPT
- Mass-Editing Memory in a Transformer
Reinforcement learning
Tier 2
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Reflexion: Language Agents with Verbal Reinforcement Learning
- Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero)
- Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero)
Tier 3
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
- AlphaStar: mastering the real-time strategy game StarCraft II
- Decision Transformer
- Mastering Atari Games with Limited Data (EfficientZero)
- Mastering Stratego, the classic game of imperfect information (DeepNash)
Tier 4+
- AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning
- Bayesian Reinforcement Learning with Limited Cognitive Load
- Contrastive Preference Learning: Learning from Human Feedback without RL
- Grandmaster-Level Chess Without Search
- A data-driven approach for learning to control computers
- Acquisition of Chess Knowledge in AlphaZero
- Player of Games
- Retrieval-Augmented Reinforcement Learning
The big picture
AI scaling
Tier 1
- Scaling Laws for Neural Language Models
- Takeoff speeds
- The Bitter Lesson
Tier 2
- AI and compute
- Scaling Laws for Transfer
- Training Compute-Optimal Large Language Models (Chinchilla)
Tier 3
- Emergent Abilities of Large Language Models
- Transcending Scaling Laws with 0.1% Extra Compute (U-PaLM)
Tier 4+
- Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
- Power Law Trends in Speedrunning and Machine Learning
- Scaling laws for single-agent reinforcement learning
- Beyond neural scaling laws: beating power law scaling via data pruning
- Scaling Scaling Laws with Board Games
AI safety
Tier 1
- Three impacts of machine intelligence
- What failure looks like
- Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Tier 2
- An Overview of Catastrophic AI Risks
- Clarifying “What failure looks like” (part 1)
- Deep RL from human preferences
- The alignment problem from a deep learning perspective
Tier 3
- Scheming AIs: Will AIs fake alignment during training in order to get power?
- Measuring Progress on Scalable Oversight for Large Language Models
- Risks from Learned Optimization in Advanced Machine Learning Systems
- Scalable agent alignment via reward modelling
Tier 4+
- AI Deception: A Survey of Examples, Risks, and Potential Solutions
- Benchmarks for Detecting Measurement Tampering
- Chess as a Testing Grounds for the Oracle Approach to AI Safety
- Close the Gates to an Inhuman Future: How and why we should choose to not develop superhuman general-purpose artificial intelligence
- Model evaluation for extreme risks
- Responsible Reporting for Frontier AI Development
- Safety Cases: How to Justify the Safety of Advanced AI Systems
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure
- Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
- Tools for Verifying Neural Models' Training Data
- Towards a Cautious Scientist AI with Convergent Safety Bounds
- Alignment of Language Agents
- Eliciting Latent Knowledge
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Red Teaming Language Models with Language Models
- Unsolved Problems in ML Safety
Economic and social impacts
Tier 3
- Explosive growth from AI automation: A review of the arguments
- Language Models Can Reduce Asymmetry in Information Markets
Tier 4+
- Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero
- Foundation Models and Fair Use
- GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
- Levels of AGI: Operationalizing Progress on the Path to AGI
- Opportunities and Risks of LLMs for Scalable Deliberation with Polis
- On the Opportunities and Risks of Foundation Models
Philosophy
Tier 2
- Meaning without reference in large language models
Tier 4+
- Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
- Philosophers Ought to Develop, Theorize About, and Use Philosophically Relevant AI
- Towards Evaluating AI Systems for Moral Status Using Self-Reports
Maintainer
[email protected]