Elicit Machine Learning Reading List

Purpose

The purpose of this curriculum is to help new Elicit employees build a background in machine learning, with a focus on language models. I’ve tried to strike a balance between papers that are relevant for deploying ML in production and techniques that matter for longer-term scalability.

If you don’t work at Elicit yet, we’re hiring ML and software engineers.

How to read

Fundamentals

Introduction to machine learning

Tier 1

A short introduction to machine learning
But what is a neural network?
Gradient descent, how neural networks learn

Tier 2

An intuitive understanding of backpropagation
What is backpropagation really doing?
An introduction to deep reinforcement learning

Tier 3

The spelled-out intro to neural networks and backpropagation: building micrograd
Backpropagation calculus

Transformers

Tier 1

But what is a GPT? Visual intro to transformers
Attention in transformers, visually explained
Attention? Attention!
The Illustrated Transformer
The Illustrated GPT-2 (Visualizing Transformer Language Models)

Tier 2

Let's build the GPT Tokenizer
Neural Machine Translation by Jointly Learning to Align and Translate
The Annotated Transformer
Attention Is All You Need

Tier 3

A Practical Survey on Faster and Lighter Transformers
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
A Mathematical Framework for Transformer Circuits

Tier 4+

Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
Memorizing Transformers
Transformer Feed-Forward Layers Are Key-Value Memories

Key foundation model architectures

Tier 1

Language Models are Unsupervised Multitask Learners (GPT-2)
Language Models are Few-Shot Learners (GPT-3)

Tier 2

LLaMA: Open and Efficient Foundation Language Models (LLaMA)
Efficiently Modeling Long Sequences with Structured State Spaces (video) (S4)
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
Evaluating Large Language Models Trained on Code (OpenAI Codex)
Training language models to follow instructions with human feedback (OpenAI Instruct)

Tier 3

Mistral 7B (Mistral)
Mixtral of Experts (Mixtral)
Gemini: A Family of Highly Capable Multimodal Models (Gemini)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba)
Scaling Instruction-Finetuned Language Models (Flan)

Tier 4+

Consistency Models
Model Card and Evaluations for Claude Models (Claude 2)
OLMo: Accelerating the Science of Language Models
PaLM 2 Technical Report (PaLM 2)
Textbooks Are All You Need II: phi-1.5 technical report (phi 1.5)
Visual Instruction Tuning (LLaVA)
A General Language Assistant as a Laboratory for Alignment
Finetuned Language Models Are Zero-Shot Learners (Google Instruct)
Galactica: A Large Language Model for Science
LaMDA: Language Models for Dialog Applications (Google Dialog)
OPT: Open Pre-trained Transformer Language Models (Meta GPT-3)
PaLM: Scaling Language Modeling with Pathways (PaLM)
Program Synthesis with Large Language Models (Google Codex)
Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Gopher)
Solving Quantitative Reasoning Problems with Language Models (Minerva)
UL2: Unifying Language Learning Paradigms (UL2)

Training and finetuning

Tier 2

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Learning to summarize from human feedback
Training Verifiers to Solve Math Word Problems

Tier 3

Pretraining Language Models with Human Preferences
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
LoRA: Low-Rank Adaptation of Large Language Models
Unsupervised Neural Machine Translation with Generative Language Models Only

Tier 4+

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Improving Code Generation by Training with Natural Language Feedback
Language Modeling Is Compression
LIMA: Less Is More for Alignment
Learning to Compress Prompts with Gist Tokens
Lost in the Middle: How Language Models Use Long Contexts
QLoRA: Efficient Finetuning of Quantized LLMs
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Reinforced Self-Training (ReST) for Language Modeling
Solving olympiad geometry without human demonstrations
Tell, don't show: Declarative facts influence how LLMs generalize
Textbooks Are All You Need
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Training Language Models with Language Feedback at Scale
Turing Complete Transformers: Two Transformers Are More Powerful Than One
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Data Distributional Properties Drive Emergent In-Context Learning in Transformers
Diffusion-LM Improves Controllable Text Generation
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
Efficient Training of Language Models to Fill in the Middle
ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
True Few-Shot Learning with Prompts -- A Real-World Perspective

Reasoning and runtime strategies

In-context reasoning

Tier 2

Chain of Thought Prompting Elicits Reasoning in Large Language Models
Large Language Models are Zero-Shot Reasoners (Let's think step by step)
Self-Consistency Improves Chain of Thought Reasoning in Language Models

Tier 3

Chain-of-Thought Reasoning Without Prompting
Why think step-by-step? Reasoning emerges from the locality of experience

Tier 4+

Baldur: Whole-Proof Generation and Repair with Large Language Models
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
Certified Reasoning with Language Models
Hypothesis Search: Inductive Reasoning with Language Models
LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations
Large Language Models Cannot Self-Correct Reasoning Yet
Stream of Search (SoS): Learning to Search in Language
Training Chain-of-Thought via Latent-Variable Inference
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right

Task decomposition

Tier 1

Supervise Process, not Outcomes
Supervising strong learners by amplifying weak experts

Tier 2

Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Factored cognition
Iterated Distillation and Amplification
Recursively Summarizing Books with Human Feedback
Solving math word problems with process-based and outcome-based feedback

Tier 3

Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers
Faithful Reasoning Using Large Language Models
Humans consulting HCH
Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes
Language Model Cascades

Tier 4+

Decontextualization: Making Sentences Stand-Alone
Factored Cognition Primer
Graph of Thoughts: Solving Elaborate Problems with Large Language Models
Parsel: A Unified Natural Language Framework for Algorithmic Reasoning
AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts
Challenging BIG-Bench tasks and whether chain-of-thought can solve them
Evaluating Arguments One Step at a Time
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations
Measuring and narrowing the compositionality gap in language models
PAL: Program-aided Language Models
ReAct: Synergizing Reasoning and Acting in Language Models
Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents
Thinksum: probabilistic reasoning over sets using large language models

Debate

Tier 2

AI safety via debate

Tier 3

Debate Helps Supervise Unreliable Experts
Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions

Tier 4+

Scalable AI Safety via Doubly-Efficient Debate
Improving Factuality and Reasoning in Language Models through Multiagent Debate

Tool use and scaffolding

Tier 2

Measuring the impact of post-training enhancements
WebGPT: Browser-assisted question-answering with human feedback

Tier 3

AI capabilities can be significantly improved without expensive retraining
Automated Statistical Model Discovery with Language Models

Tier 4+

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
Voyager: An Open-Ended Embodied Agent with Large Language Models
ReGAL: Refactoring Programs to Discover Generalizable Abstractions

Honesty, factuality, and epistemics

Tier 2

Self-critiquing models for assisting human evaluators

Tier 3

What Evidence Do Language Models Find Convincing?
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Tier 4+

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Long-form factuality in large language models

Applications

Science

Tier 3

Can large language models provide useful feedback on research papers? A large-scale empirical analysis
Large Language Models Encode Clinical Knowledge
The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

Tier 4+

Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Nougat: Neural Optical Understanding for Academic Documents
Scim: Intelligent Skimming Support for Scientific Papers
SynerGPT: In-Context Learning for Personalized Drug Synergy Prediction and Drug Design
Towards Accurate Differential Diagnosis with Large Language Models
Towards a Benchmark for Scientific Understanding in Humans and Machines
A Search Engine for Discovery of Scientific Challenges and Directions
A full systematic review was completed in 2 weeks using automation tools: a case study
Fact or Fiction: Verifying Scientific Claims
Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles
PEER: A Collaborative Language Model
PubMedQA: A Dataset for Biomedical Research Question Answering
SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts
SciTail: A Textual Entailment Dataset from Science Question Answering

Forecasting

Tier 3

AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy
Approaching Human-Level Forecasting with Language Models
Are Transformers Effective for Time Series Forecasting?
Forecasting Future World Events with Neural Networks

Search and ranking

Tier 2

Learning Dense Representations of Phrases at Scale
Text and Code Embeddings by Contrastive Pre-Training (OpenAI embeddings)

Tier 3

Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
Not All Vector Databases Are Made Equal
REALM: Retrieval-Augmented Language Model Pre-Training
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Task-aware Retrieval with Instructions

Tier 4+

RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
Some Common Mistakes In IR Evaluation, And How They Can Be Avoided
Boosting Search Engines with Interactive Agents
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
UnifiedQA: Crossing Format Boundaries With a Single QA System

ML in practice

Production deployment

Tier 1

Machine Learning in Python: Main developments and technology trends in data science, machine learning, and AI
Machine Learning: The High Interest Credit Card of Technical Debt

Tier 2

Designing Data-Intensive Applications
A Recipe for Training Neural Networks

Benchmarks

Tier 2

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
TruthfulQA: Measuring How Models Mimic Human Falsehoods

Tier 3

FLEX: Unifying Evaluation for Few-Shot NLP
Holistic Evaluation of Language Models (HELM)
Measuring Massive Multitask Language Understanding
RAFT: A Real-World Few-Shot Text Classification Benchmark
True Few-Shot Learning with Language Models

Tier 4+

GAIA: a benchmark for General AI Assistants
ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers
Measuring Mathematical Problem Solving With the MATH Dataset
QuALITY: Question Answering with Long Input Texts, Yes!
SCROLLS: Standardized CompaRison Over Long Language Sequences
What Will it Take to Fix Benchmarking in Natural Language Understanding?

Datasets

Tier 2

Common Crawl
The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Tier 3

Dialog Inpainting: Turning Documents into Dialogs
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
Microsoft Academic Graph
TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts

Advanced topics

World models and causality

Tier 3

Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
Language Models Represent Space and Time

Tier 4+

Amortizing intractable inference in large language models
CLADDER: Assessing Causal Reasoning in Language Models
Causal Bayesian Optimization
Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
Generative Agents: Interactive Simulacra of Human Behavior
Passive learning of active causal strategies in agents and language models

Planning

Tier 4+

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Cognitive Architectures for Language Agents

Uncertainty, calibration, and active learning

Tier 2

Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs
A Simple Baseline for Bayesian Uncertainty in Deep Learning
Plex: Towards Reliability using Pretrained Large Model Extensions

Tier 3

Active Preference Inference using Language Models and Probabilistic Reasoning
Eliciting Human Preferences with Language Models
Active Learning by Acquiring Contrastive Examples
Describing Differences between Text Distributions with Natural Language
Teaching Models to Express Their Uncertainty in Words

Tier 4+

Doing Experiments and Revising Rules with Natural Language and Probabilistic Reasoning
STaR-GATE: Teaching Language Models to Ask Clarifying Questions
Active Testing: Sample-Efficient Model Evaluation
Uncertainty Estimation for Language Reward Models

Interpretability and model editing

Tier 2

Discovering Latent Knowledge in Language Models Without Supervision

Tier 3

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Representation Engineering: A Top-Down Approach to AI Transparency
Studying Large Language Model Generalization with Influence Functions
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Tier 4+

Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Eliciting Latent Predictions from Transformers with the Tuned Lens
How do Language Models Bind Entities in Context?
Opening the AI black box: program synthesis via mechanistic interpretability
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Uncovering mesa-optimization algorithms in Transformers
Fast Model Editing at Scale
Git Re-Basin: Merging Models modulo Permutation Symmetries
Locating and Editing Factual Associations in GPT
Mass-Editing Memory in a Transformer

Reinforcement learning

Tier 2

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Reflexion: Language Agents with Verbal Reinforcement Learning
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero)
MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Tier 3

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
AlphaStar: mastering the real-time strategy game StarCraft II
Decision Transformer
Mastering Atari Games with Limited Data (EfficientZero)
Mastering Stratego, the classic game of imperfect information (DeepNash)

Tier 4+

AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning
Bayesian Reinforcement Learning with Limited Cognitive Load
Contrastive Preference Learning: Learning from Human Feedback without RL
Grandmaster-Level Chess Without Search
A data-driven approach for learning to control computers
Acquisition of Chess Knowledge in AlphaZero
Player of Games
Retrieval-Augmented Reinforcement Learning

The big picture

AI scaling

Tier 1

Scaling Laws for Neural Language Models
Takeoff speeds
The Bitter Lesson

Tier 2

AI and compute
Scaling Laws for Transfer
Training Compute-Optimal Large Language Models (Chinchilla)

Tier 3

Emergent Abilities of Large Language Models
Transcending Scaling Laws with 0.1% Extra Compute (U-PaLM)

Tier 4+

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Power Law Trends in Speedrunning and Machine Learning
Scaling laws for single-agent reinforcement learning
Beyond neural scaling laws: beating power law scaling via data pruning
Scaling Scaling Laws with Board Games

AI safety

Tier 1

Three impacts of machine intelligence
What failure looks like
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Tier 2

An Overview of Catastrophic AI Risks
Clarifying “What failure looks like” (part 1)
Deep RL from human preferences
The alignment problem from a deep learning perspective

Tier 3

Scheming AIs: Will AIs fake alignment during training in order to get power?
Measuring Progress on Scalable Oversight for Large Language Models
Risks from Learned Optimization in Advanced Machine Learning Systems
Scalable agent alignment via reward modelling

Tier 4+

AI Deception: A Survey of Examples, Risks, and Potential Solutions
Benchmarks for Detecting Measurement Tampering
Chess as a Testing Grounds for the Oracle Approach to AI Safety
Close the Gates to an Inhuman Future: How and why we should choose to not develop superhuman general-purpose artificial intelligence
Model evaluation for extreme risks
Responsible Reporting for Frontier AI Development
Safety Cases: How to Justify the Safety of Advanced AI Systems
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Tools for Verifying Neural Models' Training Data
Towards a Cautious Scientist AI with Convergent Safety Bounds
Alignment of Language Agents
Eliciting Latent Knowledge
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Red Teaming Language Models with Language Models
Unsolved Problems in ML Safety

Economic and social impacts

Tier 3

Explosive growth from AI automation: A review of the arguments
Language Models Can Reduce Asymmetry in Information Markets

Tier 4+

Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero
Foundation Models and Fair Use
GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
Levels of AGI: Operationalizing Progress on the Path to AGI
Opportunities and Risks of LLMs for Scalable Deliberation with Polis
On the Opportunities and Risks of Foundation Models

Philosophy

Tier 2

Meaning without reference in large language models

Tier 4+

Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
Philosophers Ought to Develop, Theorize About, and Use Philosophically Relevant AI
Towards Evaluating AI Systems for Moral Status Using Self-Reports